Abstract
Experimental designs such as matched-pair or longitudinal studies yield mRNA sequencing (mRNA-Seq) counts that are correlated across samples. Most of the approaches for the analysis of correlated mRNA-Seq data are restricted to a specific design and/or balanced data only (with the same number of samples in each group). We propose a model that is applicable to the analysis of correlated mRNA-Seq data of different types: paired, clustered, longitudinal, or others. Any combination of explanatory variables, as well as unbalanced data, can be processed within the proposed modeling framework. The model assumes that exon counts of a particular gene of an individual sample jointly follow a multivariate negative-binomial distribution. Additional correlation between exon counts obtained for, for example, individual samples within the same pair or cluster, is taken into account by including into the model a cluster-level normally distributed random effect. An interesting feature of the model is that it provides explicit expression for marginal correlation between exon counts at different levels. The performance of the model is evaluated by using a simulation study and an analysis of two real-life data sets: a paired mRNA-Seq experiment for 24 patients with clear-cell renal-cell carcinoma and a longitudinal mRNA-Seq experiment for 29 patients with Lyme disease.
Introduction
Matched-pair experiments or longitudinal studies may yield mRNA-Seq counts that are correlated across samples. A number of methods have been proposed to analyze data from such experiments. For instance, Pham and Jimenez (2012), Hardcastle and Kelly (2013), and Chung et al. (2013) have developed methods for matched-pair data. In contrast, for longitudinal studies and/or other types of clustered experiments, several methods have been introduced (Spies and Ciaudo, 2015), including PLNseq (Zhang et al., 2015).
In this article, we propose a hierarchical model for differential gene-expression analysis of correlated mRNA-Seq data based upon exon counts. Exons are basic units in transcription. Within a gene, expression of individual exons varies due to, among other things, differences in exon lengths and alternative isoform regulation. Conventional methods for differential gene-expression analysis use summarized gene levels and ignore exon-expression variability. Some methods assume that expression of a single exon of a gene has to necessarily lead to differential expression of the gene (Anders et al., 2012). We propose to acknowledge the variation in exon expression in inference about gene expression by using a multivariate distribution for the exon-expression levels.
The model we propose includes two types of random effects that account for the within-sample correlation (individual random effects) and between-sample correlation (cluster random effects) of exon counts. The individual effects follow a gamma distribution, whereas the cluster effects follow a normal distribution. Consequently, conditionally on the cluster effect, exon counts for the same gene in a particular sample follow a multivariate negative-binomial (MVNB) distribution (Fabio et al., 2012). Essentially, the proposed model falls in the framework developed by Molenberghs et al. (2010). An important advantage of the model is that it can be applied to data coming from various designs that may yield correlated mRNA-Seq counts, including matched pairs, clustered sampling, and longitudinal studies. Interestingly, it allows computing conditional and marginal correlation coefficients, which may offer insight into the correlation structure of the data.
Methodology
We consider per gene analysis. Thus, in what follows, we drop the index indicating the gene.
Assume that a gene consists of J exons. Exon counts are collected for a set of samples that may be correlated: we observe N clusters, each with Nc samples. Denote by
To account for the within-sample correlation between the exon counts, as well as between the samples from the same cluster, we define the following hierarchical models:
where
The model implies that, conditionally on bc,
where
Moreover, the variance of ycsj is
and, for two exon counts from the same sample,
The marginal likelihood of models (1)–(3) is given by
with P(
Maximization of Equation (8) allows obtaining estimates of the parameters. The integral involved in Equation (8) can be computed by using the adaptive Gaussian–Hermite quadrature. In our study we used 10 quadrature points. Variance–covariance matrix of the estimated parameters is obtained from the inverse of the negative Hessian of the logarithm of Equation (8).
For the hierarchical models (1)–(3), it is possible to derive (Molenberghs et al., 2010) the marginal moments. The mean and variance of ycsj are given by, respectively,
with Kcsj defined in Equation (5). The marginal covariance is given by
Thus, correlation between two exon counts from the same sample (s = t) is positive and stronger than for the same exons from different samples (s ≠ t) in the same cluster. Given that the correlation is a function of Kcsj, it differs for different pairs of exon counts, even for the same sample, unless the exons have the same length. For a fixed pair of exons, the correlation differs for different samples, unless the library sizes and sample-specific covariates are the same.
To investigate the performance of our model, we conducted simulation studies. We also applied the model to two real-life mRNA-Seq data sets.
We considered settings of a matched-pair design and of a longitudinal experiment. For each setting, we generated 10,000 data sets with exon counts for a gene that consists of three exons. Data were generated by using models (1)–(3). We also generated data from a conditional-independence model where, conditionally on the normally distributed random effect, exon counts followed independent negative-binomial distributions with overdispersion parameter ϕ.
A simulated matched-pair mRNA sequencing experiment
We assumed that samples within each pair originated from two different biological conditions, “control” and “experimental,” say. Thus,
The considered scenarios are listed in Table 1. We assumed β1 = 0, 0.12, or 1, and combined it with N = 12, 24, or 36 to study the Type-I error probability and power in function of the sample size. The value of ϕ was set to be equal to 54.6, 7.4, and 2.7 to investigate the effect of the within-sample correlation between exon counts, equal to 0.51, 0.87, and 0.95, respectively. The value of σ was set to 0.2 or 0.4 to evaluate the effect of increasing marginal between-sample correlation.
Parameters of the Simulations (10,000 Replicates Each) of a Matched-Pair mRNA Sequencing Experiment for One Gene Consisting of Three Exons
Parameters of the Simulations (10,000 Replicates Each) of a Matched-Pair mRNA Sequencing Experiment for One Gene Consisting of Three Exons
N is the number of pairs. σ (ln σ), ϕ (ln ϕ), and β1 refer to models (1)–(3). In all scenarios β0 = 0.5. ρ|b and ρ*|b are the conditional correlations between exon 1 and exon 3 for control and experimental samples, respectively, for a pair with the random effect equal to zero (bc = 0). ρs, s and ρs, t are the marginal correlations between exon 1 and exon 3 within the same control sample and for different samples within the same pair, respectively. All correlations were calculated from the true parameter values.
We assumed that samples from the same individual were obtained at three different time points. Thus,
Table 2 presents the considered simulation scenarios. We were particularly interested in the Type-I-error probability and power of the Wald test for the following two null hypotheses:
Parameters of the Simulations (10,000 Replicates Each) of a Longitudinal mRNA Sequencing Experiment with One Gene Consisting of Three Exons
Parameters of the Simulations (10,000 Replicates Each) of a Longitudinal mRNA Sequencing Experiment with One Gene Consisting of Three Exons
N is the number of clusters. σ (ln σ), ϕ (ln ϕ), β1, and β2 refer to models (1)–(3). In all scenarios β0 = 0.5. ρ|b and ρ*|b are the conditional correlations between exon 1 and exon 3 for an individual with the random effect equal to zero (bc = 0) at the first and third time points, respectively. ρs, s is the marginal correlation between exon 1 and exon 3 within the same sample at the first time point. ρs, t is the marginal correlation between exon 1 from a sample collected at the first time point and exon 3 from a sample collected at the third time point. All correlations were calculated from the true parameter values.
Rejection of
Metastases in clear cell renal-cell carcinoma (RCC) are associated with poor treatment outcomes (Capitanio and Montorsi, 2016). Recent drug discovery strategies in metastatic RCC have been directed to specific targets in a few biological pathways (Capitanio and Montorsi, 2016). It is, therefore, important to identify genes associated with differences in metastatic status.
The data set contained preprocessed data from an mRNA-Seq experiment for 12 patients with a metastatic disease (at diagnosis) matched on the tumor stage, size, grade, and necrosis (SSIGN) score (Frank et al., 2002) with 12 nonmetastatic (at diagnosis) patients. The (unpublished) data were obtained from the Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN.
For each patient, expression of 22,334 genes was quantified, yielding measurements for the total of 234,575 exons. There were 24,158 exons with zero counts across all the samples. The number of exons varied between 1 and 468 (mean 10.5, median 7) per gene.
Lyme disease longitudinal study
Lyme disease is a tick-borne infection. Some patients report lingering or recurring symptoms lasting long after antibiotic treatment (Bouquet et al., 2016). Pathogenetic molecular mechanisms behind persistent post-Lyme symptoms are not well understood (Strle et al., 2014; Bouquet et al., 2016). Identifying the genes associated with the dynamics of Lyme disease might elucidate these mechanisms.
Bouquet et al. (2016) conducted a study in which mRNA-Seq data were collected from 29 patients with tick-borne Lyme disease and from 13 healthy controls. We downloaded these freely available data from (https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP049605).
In our analysis, we only use the patient samples. From each patient, blood samples were taken three times: before the antibiotic treatment, immediately after completion of the treatment, and 6 months after treatment completion.
Eighty-seven libraries were sequenced as 100-bp paired-end runs on a HiSeq 2500 (Illumina). Three of them were discarded because of insufficient read counts and transcriptome coverage. We analyzed only single-end reads to limit the number of unaligned reads. The preprocessing steps included removal of the 5′ and 3′ adapters, and trimming low-quality ends from the reads using cutadapt (Martin, 2011). Processed reads shorter than 50 bp were discarded.
Single-end processed reads were mapped to the human genome (GRCh38) with Bowtie (Langmead et al., 2009), allowing for up to two mismatches and reporting the best mapping location for each alignment. The read counting was performed by using the R function summarizeOverlaps (Lawrence et al., 2013) according to the exon annotation (GRCh38.82). As a result, 195,785 exons with nonzero sum of counts across all samples were included in the analysis. Finally, we used 19,808 protein-coding genes from the annotation file to group exons into genes.
Results
A simulated matched-pair mRNA sequencing experiment
The results presented in Table 3 for scenarios (1)–(3) indicate that, when β1 = 0, both β0 and β1 are estimated with a negligible bias. The model-based standard errors (SEs) slightly overestimate, on average, the empirical SEs. For β1, the estimated coverage of the 95% confidence interval (CI) is slightly <95%. Given that the simulation SE is equal to about
A Simulated Matched-Pair mRNA Sequencing Experiment for Scenarios (1)–(3)
A Simulated Matched-Pair mRNA Sequencing Experiment for Scenarios (1)–(3)
No differential expression setting (see Table 1). SEmodel and SEemp are the mean model-based and empirical standard error estimates, respectively.
CI, confidence interval; NA, not available.
Table 4 shows the results for scenarios (4)–(7) in which β1 ≠ 0. Similarly to the case of no differential expression, β0 and β1 are estimated with almost no bias. The coverage of the 95% CI of β1 is slightly <95%. The power for testing β1 = 0 increases with N: for β1 = 0.12 it is equal to ∼0.5 for N = 12 and 0.8 for N = 24. For β1 = 1, the power is essentially equal to 1 even for N = 12.
A Simulated Matched-Pair mRNA Sequencing Experiment for Scenarios (4)–(7)
Differential expression setting (see Table 1). SEmodel and SEemp are the mean model-based and empirical standard error estimates, respectively.
The results for scenarios (8)–(15) are presented in the Supplementary Table S1. They lead to conclusions very similar to those already presented.
Table 5 presents results for scenarios (16)–(17), that is, for the conditional-independence model. As compared with scenarios (6) and (7), there is essentially no difference in the bias nor precision of estimation of β0 and β1. Thus, the model-based estimates of the parameters seem to be robust to this type of misspecification of the variance–covariance structure.
A Simulated Matched-Pair mRNA Sequencing Experiment for Scenarios (16) and (17)
A conditional-independence model (see Table 1). SEmodel and SEemp are the mean model-based and empirical standard error estimates, respectively.
Complete results related to the simulation study are available in Supplementary File S3.xlsx. In what follows, we summarize the main points.
For scenario (1) (see Table 2), the estimated Type-I-error probability of the Wald tests for
In scenario (3), the estimated Type-I-error probability of testing
Overall, the power of testing the hypotheses increased as the number of clusters increased. For example, for scenario (5) with N = 8 clusters, the estimated power was equal to 0.83 and 0.28 for
Table 6 presents more detailed results for scenarios (1) and (2). Similarly to the matched-pair case (see Table 3), under the null hypothesis, the fixed effects (β0, β1, and β2) are estimated with a negligible bias. The coverage of the 95% CIs is higher than the nominal level, but decreases when the number of clusters increases.
A Simulated Longitudinal mRNA Sequencing Experiment for Scenarios (1) and (2)
A Simulated Longitudinal mRNA Sequencing Experiment for Scenarios (1) and (2)
A gene without differential expression (see Table 2). SEmodel and SEemp are the mean model-based and empirical standard error estimates, respectively.
The results for scenarios (3)–(11) (see the Supplementary Table S2) also confirm the adequate performance of the model-based estimates of β0, β1, and β2.
We applied our model and PLNseq (Zhang et al., 2015) to the RCC data set. In the analyses, the Benjamini–Hochberg (BH) (Benjamini and Hochberg, 1995) procedure was applied to correct for multiple testing.
To produce gene counts for PLNseq, we summed the relevant exon counts. The current implementation of PLNseq requires gene counts to be ≥50 in each condition. As a consequence, 21% of 22,334 genes had to be excluded from the analysis. Among the remaining 17,528 genes, 133 were found to be statistically significantly differentially expressed between the nonmetastatic and metastatic samples.
In contrast to PLNseq, our model does not require any minimum value of an exon count to include the count in the analysis. However, for 6.5% (1448) of the genes, we could not obtain model-based estimates due to nonconvergence (convergence criteria are described in Section A.4.3 in Supplementary Materials). Only 8 of the remaining 21,786 genes were found to be statistically significantly differentially expressed. None of these eight genes was identified by PLNseq.
Tick-borne Lyme disease longitudinal study
We applied our model and PLNseq, both combined with the BH multiple-testing procedure, to the Lyme data set.
Only 8760 genes could be analyzed by PLNseq, because the remaining 11,048 (56%) genes had a count <50 in any of the conditions. Moreover, as PLNseq cannot handle unbalanced data, three patients with only two samples had to also be discarded.
The current implementation of PLNseq can only test null hypothesis
An advantage of our model is that it can handle unbalanced data. Hence, we could analyze the data for all the Lyme disease samples.
Our model did not converge for 1680 (8.5%) genes. Among 18,128 genes for which no convergence issues were noted,
Conclusion
We have presented a model for analyzing differential gene expression in correlated mRNA-Seq data based on exon-level counts. The model accounts for the within- and between-sample correlation between the exon counts for a particular gene. It can be applied to various designs yielding correlated mRNA-Seq data (matched pairs, longitudinal studies, and clustered sampling).
Simulation studies show that the model is able to correctly estimate differential expression, even when there is no within-sample correlation.
Performance of our model has been compared with PLNSeq using two real-life experiments. In contrast to PLNseq, our method can handle unbalanced data, and a substantially larger number of genes can be tested. In addition, our model allows testing various hypotheses related to the factors that might influence gene-expression levels.
It is possible to use our model for the analysis of differential gene expression in correlated mRNA-Seq data with gene counts as input. The user could run our model in the same way as if there were just one exon per gene.
One noticeable drawback of our model is that it assumes that, after adjusting for the exon length and library size, all exons within the same gene have the same level of expression. However, exon expression can vary due to alternative splicing. In our model, alternative splicing may be only partially accounted for by using the overdispersion parameter ϕ. An extension of the model that would explicitly adjust for multiple isoforms of a gene is a topic for further research.
Our method is computationally intensive. It required roughly 5–6 hours per 1000 genes on one core for the real-life data sets described in this study. The study was carried out on either Intel Xeon E5-2670 v3 or Intel Xeon E5-2697 v3 hardware.
The proposed model has been implemented in Python; the code is available at (https://sourceforge.net/projects/dgeee/).
Footnotes
Acknowledgments
We thank Wacław Andrzej Sokalski for enabling this collaboration. We also thank Jeanette E. Eckel Passow and Daniel J. Serie for providing RCC matched-pair experiment data and for helpful discussions. This study was supported by the Wrocław Centre for Networking and Supercomputing (Grant No. 255). D.P., T.B., and D.K. were supported by the Medical University of Białystok. D.P. was supported by the Polish National Science Centre (2014/15/B/ST6/05082) and Foundation for Polish Science (TEAM to D.P.). D.K. and K.G. were supported by BOF bilateral scientific cooperation grants R-6244 and R-7699.
Author Disclosure Statement
The authors declare they have no conflicting financial interests.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
