A Conceptual Framework for Abundance Estimation of Genomic Targets in the Presence of Ambiguous Short Sequencing Reads

Abstract

RNA sequencing (RNA-seq) is widely used to study gene-, transcript-, or exon expression. To quantify the expression level, millions of short sequenced reads need to be mapped back to a reference genome or transcriptome. Read mapping makes it possible to find a location to which a read is identical or similar. Based upon this alignment, expression summaries, that is, read counts are generated. However, reads may be matched to multiple locations. Such ambiguously mapped reads are often ignored in the analysis, which is a potential loss of information and may cause bias in expression estimation. We present the general principles underlying multiread allocation and unbiased estimation of the expression level of genes, exons, or transcripts in the presence of multiple mapped reads. The underlying principles are derived from a theoretical concept that identifies important sources of information such as the number of uniquely mapped reads, the total target length, and the length of the shared target regions. We show with simulation studies that methods incorporating some or all of the aforementioned sources of information estimate the expression levels of genes, exons, and/or transcripts with a higher precision and accuracy than methods that do not use this information. We identify important sources of information that should be taken into account by methods that estimate the abundance of genes, exons, and/or transcripts to achieve good precision and accuracy.

1. Introduction

High-throughput complementary DNA (cDNA)/RNA sequencing (RNA-seq) is a widely used technology in biological and (bio-)medical sciences. RNA-seq experiments enable investigation of expression of genes, transcripts, or exons, and allow for differential expression analysis. As such, next-generation sequencing (NGS) has become an important tool to study molecular biology in humans, animals, and plants (Lister et al., 2008; Mortazavi et al., 2008; Eswaran et al., 2013; Wickramasinghe et al., 2014). Many algorithms and applications have been already developed for RNA-seq data analysis, including data preprocessing tools (Anders et al., 2015; Andrews, 2016), mapping software (Li et al., 2008a, 2008b; Langmead et al., 2009), and statistical packages for comparing expression levels between experimental conditions (Anders and Huber, 2010; Robinson et al., 2010; Anders et al., 2012). Despite these developments, RNA-seq data analysis still faces several challenges. These include, for instance, de novo assembly (Zerbino and Birney, 2008; Grabherr et al., 2011), isoform expression analysis (Li and Jiang, 2012; Nariai et al., 2014), identification of novel transcripts (Trapnell et al., 2010; Conesa et al., 2016), or expression quantification (Li and Dewey, 2011). In this article, we focus on the issue of expression estimation in the presence of so-called multireads or ambiguous reads (Taub et al., 2010).

A typical RNA-seq experiment produces millions of short reads. Mapping of these reads to the genome or transcriptome is an essential step in expression quantification. Based upon the alignment, the expression level can be estimated based on the number of reads mapping to a genomic or transcriptomic location. In practice, due to sequence similarities, sequencing errors, and hidden nucleotide variations, it may happen that a large fraction of reads [even up to 36% of all reads according to Consiglio et al. (2016)] can be aligned to more than one target (gene or transcript) (Mortazavi et al., 2008). These reads are often ignored when establishing the expression levels. However, discarding multireads may lead to underestimation of the true expression levels (Grabherr et al., 2011). To avoid introducing bias, multireads should be taken into account (Li et al., 2008a; Taub et al., 2010). The results of RNA-seq quantification may also vary depending on other factors such as the choice of the reference sequence, mapping and quantification algorithms, and read length (RL) (Li and Jiang, 2012; Robert and Watson, 2015). It has also been shown that different sources of information, such as the number of uniquely mapped reads or the reference target length, affect the quantification results as well (Li et al., 2010; Roberts and Pachter, 2013).

Several methods to handle the issue of multireads have been proposed. The simplest approach, called the unique (or naïve) assignment method (Bentley et al., 2008; Marioni et al., 2008), discards all multireads in downstream analysis such as expression estimation. Despite the fact that, as already mentioned, discarding multiple-mapped reads may bias the estimates of expression levels, this approach is often used in the analysis of RNA-seq experiments.

An alternative approach, the so-called proportional (or rescue) assignment method, assigns multireads proportionally to the uniquely mapped reads (Mortazavi et al., 2008; Hashimoto et al., 2009). It is based on the assumption that the allocation of the multireads corresponds to the distribution of the locations of the uniquely mapped reads.

Following these two methods, a number of approaches for handling mapping uncertainty have been developed. The vast majority of these methods uses the expectation–maximization (EM) algorithm to estimate expression levels (Li et al., 2010; Taub et al., 2010; Trapnell et al., 2010; Roberts and Pachter, 2013). Besides the EM framework, the cumulative assignment method (Taub et al., 2010) and a Bayesian-based method have been proposed (Ji et al., 2011). The former uses a cumulative distribution function and incorporates a score for the alignment. The latter computes the mapping probability for each multiread based on the Bayesian model.

In this article, we present general principles of target expression estimation in the presence of multireads. In particular, we show that three sources of information, that is, the observed number of uniquely mapped reads, the total length (TL) of the target (gene, transcript, or exon), and the lengths of the shared, or overlapping, target regions, are important when estimating expression levels in the presence of multireads. The first two pieces of information are used by RNA-Seq by expectation maximization (RSEM) (Li et al., 2010; Li and Dewey, 2011) and eXpress (Roberts and Pachter, 2013).

In particular, we show, with simulations, that methods using these sources of information estimate the expression levels of genes, exons, or transcripts, in the presence of multireads, in an accurate and precise manner. In addition, we show that the assumptions underlying the unique method and the proportional assignment method are too crude and can lead to biased estimates of expression levels.

The presented conceptual framework may allow abundance estimation with a set of equations. The use of this set of equations, to determine the absolute expression level of a genomic target, removes the need of an estimation method such as the EM-algorithm.

2. Methods

2.1. Repetitive sequences and ambiguity problem

A common approach to RNA-seq consists of extracting RNA, preparing a collection of shorter fragments, converting them into cDNA, and sequencing them using an NGS-based platform, from either one (single end) or both ends (paired end) of each cDNA fragment (Oshlack et al., 2010). As a consequence, millions of short sequences (reads) are obtained. To measure the expression levels of genes or transcripts, these reads are aligned to a reference sequence and, subsequently, are summarized for genomic or transcriptomic locations. The problem with read counting occurs when some sequence locations are similar or identical. In that case, a number of reads will map to more than one region. Without correct assignment, we are no longer able to properly assess the expression levels of that region. In the remainder of the article, we will refer to a read mapped to only one target as a “unique read,” and to a read mapped to multiple targets as a “multiread” or “shared read.”

Considering reference targets (genes or transcripts) with repetitive sequences, we can distinguish the nonoverlapping (or unique) regions and the overlapping (or shared) regions. To avoid confusion, for a particular target we use the term “nonoverlapping region” to denote any segment that is not part of any other target, and “overlapping region” to denote any segment that is shared with some other targets. The term “shared segments” refers to regions that are identical in nucleotide composition, or highly similar, that is, with a limited number of mismatches.

2.2. Theoretical abundance estimation

To gain insight into the underlying principles of ambiguous read allocation and expression-level quantification, we introduce a theoretical/conceptual framework. The principles are graphically illustrated.

The underlying idea of this framework is that the number of multireads originating from a target is proportional to the number of unique reads from the same target, while accounting for the length of the shared region and the length of the unique region. In this section, we derive the proposed conceptual framework.

2.2.1. A simple approach

Consider a genomic target A which shares an identical sequence segment with one or more targets. For each target, the TL, the length of the shared or overlapping region (OL), and the length of the unique regions (UL) are known. The total number of short sequence reads that are uniquely mapped to target A, $U R_{A}$ , can be determined by a short-read alignment algorithm. Assuming that short sequence reads are uniformly distributed along the target, the total number of reads originating from target A, $T O T_{A}$ , is given by $T O T_{A} = \frac{T L_{A}}{U L_{A}} * U R_{A} .$ (1)

As a consequence, the number of multireads that truly belong to target A, $M R_{A}$ , can be determined as follows: $M R_{A} = T O T_{A} - U R_{A} = \frac{T L_{A}}{U L_{A}} * U R_{A} - U R_{A} = (\frac{T L_{A} - U L_{A}}{U L_{A}}) * U R_{A} = \frac{O L_{A}}{U L_{A}} * U R_{A} = \frac{O L_{A}}{T L_{A}} * T O T_{A} .$ (2)

2.2.2. Edge effect correction

According to Oshlack et al. (2010), short sequence reads are not uniformly distributed across a genomic target (Fig. 1). The deviation from the uniform distribution of short sequence reads at the edges of the genomic targets can be explained by the fragmentation and sequencing process. During fragmentation, the extracted messenger RNA (mRNA) is broken into smaller pieces. The start positions of the generated fragments range between the first base of the mRNA sequence and the nucleotide at position $T L - (F L - 1)$ , where FL is the fragment length. The start position of the rightmost fragment cannot be further than $T L - (F L - 1)$ from the end of the genomic target, as fragments of length FL cannot be generated from the positions $T L - (F L - 2), T L - (F L - 3), \dots, T L - 1, T L$ . For reads from the complementary strand, the same effect can be seen at the start of the genomic target. We refer to this situation as the “target-edge effect.”

FIG. 1.

Distribution of read start position as a percentage of length for a certain genomic target. The number of mapped reads is 32,018. The black line is a cubic spline fit.

If a genomic target contains one or more shared sequence segments, additional edge effects have to be taken into account. To be considered as a multiread, a short sequence read should map fully within the shared region, when ignoring the possibility of sequencing errors and not allowing for mismatches in the alignment process. Hence, the length of the overlapping regions should also be corrected with edge effects. The starting position of the multireads cannot be past $O L - (R L - 1)$ , where RL is the read length (“overlapping region edge effect”). For multireads from the complementary strand, the overlapping region has to be extended with $F L - R L$ at the left-hand side (“overlapping complementary-strand edge effect”; Fig. 2).

FIG. 2.

The possible edge effects for the genomic target and the overlapping region. The light grey segments indicate unique regions, whereas the darker grey segment is shared with one or more genomic targets. The short sequence reads below the genomic target are sequenced from fragments coming from the first strand (dark grey), while the reads above are from the complementary fragments (black). Shown are three edge effects: the target-edge effects (FL-1), the overlapping region edge effect RL-1, and the overlapping region complementary edge effect, FL-RL. FL, fragment length; RL, read length.

2.2.3. Abundance estimation using edge effects

To reliably estimate the abundance and to assess the number of truly mapped ambiguous reads, the simple approach has to incorporate the sequencing FL and the edge effect corrections. Ignoring this information, as in Equations (1) and (2), leads to biased estimates. These corrections depend on the location of the overlapping regions; that is, whether they are located at the beginning of the genomic target, at the end, or in between. We propose three different formulas to estimate the expression level, depending on the location of the overlapping region.

The total number of reads $T O T_{A}$ is the sum of the total number of reads from one strand of A, $T O T_{A}^{1}$ , and the total number of reads from the complementary strand $T O T_{A}^{c o m p l}$ . Assuming that approximately half of the total reads originate from one strand, and that the other half comes from the complementary strand, the number of unique reads can be calculated from Equation (1) as follows: $U R_{A} = T O T_{A} * \frac{U L_{A}}{T L_{A}} = T O T_{A} * (1 - \frac{O L_{A}}{T L_{A}}) = T O T_{A}^{1} * (1 - \frac{O L_{A}}{T L_{A}}) + T O T_{A}^{c o m p l} * (1 - \frac{O L_{A}}{T L_{A}})$ $\approx \frac{1}{2} * T O T_{A} * (1 - \frac{O L_{A}}{T L_{A}}) + \frac{1}{2} * T O T_{A} * (1 - \frac{O L_{A}}{T L_{A}}) .$ (3)

Then, the total number of reads can be determined: $T O T_{A} = \frac{2 * U R_{A}}{(1 - \frac{O L_{A}}{T L_{A}}) + (1 - \frac{O L_{A}}{T L_{A}})} = \frac{2 * U R_{A}}{2 - \frac{O L_{A}}{T L_{A}} - \frac{O L_{A}}{T L_{A}}} .$ (4)

When the shared segment is located at the center of target A, the total number of reads can be calculated, while incorporating edge effect corrections, as follows: $T O T_{A} = \frac{2 * U R_{A}}{2 - \frac{O L_{A} - (R L - 1)}{T L_{A} - (F L - 1)} - \frac{O L_{A} - (F L - 1) + (F L - R L)}{T L_{A} - (F L - 1)}} = \frac{U R_{A}}{\frac{T L_{A} - O L - F L + R L}{T L_{A} - (F L - 1)}} . (5)$

If target A contains a shared region at the end, the overlapping complementary-strand edge effect and the target-edge effect occur. As a consequence, the total number of reads originating from target A is equal to $T O T_{A} = \frac{2 * U R_{A}}{2 - \frac{O L_{A} - (F L - 1)}{T L - (F L - 1)} - \frac{O L_{A} - (F L - 1) + (F L - R L)}{T L - (F L - 1)}} = \frac{2 * U R_{A}}{\frac{2 * (T L_{A} - O L_{A}) - (F L - R L)}{T L_{A} - (F L - 1)}} .$ (6)

When target A has an overlapping region at the start, this region cannot be extended with the complementary-strand edge effect, but is shortened by the overlapping edge effect. The target length has to be corrected with the target-edge effect: $T O T_{A} = \frac{2 * U R_{A}}{2 - \frac{O L_{A} - (R L - 1)}{T L_{A} - (F L - 1)} - \frac{O L_{A} - (F L - 1)}{T L_{A} - (F L - 1)}} = \frac{2 * U R_{A}}{\frac{2 * (T L_{A} - O L_{A}) - (F L - R L)}{T L_{A} - (F L - 1)}} .$ (7)

In case of paired-end sequencing, the three different formulas for the edge effect correction reduce to the following formula: $T O T_{A} = \frac{U R_{A}}{1 - \frac{O L_{A} - (F L - 1)}{T L_{A} - (F L - 1)}} = \frac{U R_{A}}{\frac{T L_{A} - O L_{A}}{T L_{A} - (F L - 1)}} .$ (8)

2.2.4. Alignment mismatches

At this point, we have not considered the possibility for allowing mismatches during the alignment process. Allowing for mismatches will lead to a higher number of short sequence reads that align to more than one genomic location. As a result, the number of unique reads will decrease. A fraction of the multireads will not map equally well to multiple locations; that is, a multiread may be aligned to multiple genomic locations with a different number of mismatches. Information about the number of mismatches can be retrieved from the BAM and/or SAM-file. We propose to consider multireads with an equal number of mismatches for different mapping locations to be “true” multireads. All other ambiguous reads will be assigned as unique reads to the genomic target with the lowest mismatches. With this approach, we ignore the possibility that a mismatch is caused by a sequencing error. Given the low sequencing error probability, a negligible bias in the estimated expression level can be expected.

2.3. Expression-level estimation by the unique method, proportional method, eXpress, and RSEM

Consider two reference targets T₁ and T₂ with expression level (abundance) given by the number of mapped reads $T O T_{1}$ and $T O T_{2}$ , respectively. Denote by $U R_{1}$ and $U R_{2}$ the number of reads uniquely mapped to T₁ and T₂, respectively. Assume further that these targets share a segment. Consequently, among all mapped reads, a certain number of reads will be aligned to both targets. The number of ambiguous reads shared by these targets is denoted by $M R_{t o t} = M R_{1} + M R_{2}$ , where $M R_{1}$ is the number of multireads coming from T₁, and mapped to T₂, and $M R_{2}$ is the number of multireads coming from T₂ that are mapped to T₁.

The simplest assignment method, the so-called unique method, estimates the target expression level by using solely uniquely mapped reads. In our example, this method yields expression-level (abundance) estimates equal to $U R_{1}$ and $U R_{2}$ for T₁ and T₂, respectively. As the number of unique reads is smaller than $T O T_{1}$ and $T O T_{2}$ , it is obvious that the abundance is underestimated.

The proportional method (Mortazavi et al., 2008; Hashimoto et al., 2009) estimates the fraction of multireads assigned to each reference target based on the number of uniquely mapped reads. For target T₁, this fraction is expressed by $\frac{U R_{1}}{U R_{1} + U R_{2}} * M R_{t o t}$ (Mortazavi et al., 2008; Hashimoto et al., 2009). The overall expression level of T₁ is then estimated by $U R_{1} + \frac{U R_{1}}{U R_{1} + U R_{2}} * M R_{t o t}$ .

The underlying idea of our theoretical framework (TF), and of RSEM (Li et al., 2010; Li and Dewey, 2011) and eXpress (Roberts and Pachter, 2013) is identical; that is, shared reads can be divided across genes, isoforms, and/or exons based on the unique reads, and the length of the genomic targets. While the proposed TF suggests the possibility of using a set of equations to determine the expression levels, RSEM and eXpress estimate the expression levels iteratively.

In case of RSEM, the probability, $θ_{i}$ , that a read is derived from a genomic target i, is estimated based on the genomic target length, the FL distribution, the RL distribution, and the alignment results. This probability is estimated by using the EM algorithm: $θ_{i} = U R_{i} + M R_{i} \times τ_{i} ∕ T O T_{i},$ (9)

with $M R_{i}$ the number of shared reads aligned to target i, and $τ_{i} = \frac{θ_{i} ∕ l_{i}^{'}}{\sum_{k = 1}^{M} θ_{k} ∕ l_{k}^{'}},$ (10)

where $l_{i}^{'}$ is the effective length of the genomic target and M is the total number of genomic targets. The effective length is calculated based on the genomic target length, the RL, and FL distribution (Li and Dewey, 2011).

eXpress (Roberts and Pachter, 2013) allocates reads among targets by means of the EM algorithm. In the process, it uses the information about the unique reads, the location to which unique reads and multireads are mapped to, the FL, the target length, positional bias, and sequencing errors. The algorithm returns the target abundance for each sample, and the probability that a given multiread originates from a specific genomic target.

3. Data

Due to the absence of real-life RNA-seq data for which the true expression levels are known, we conducted a number of simulation studies to validate the performance of the unique and the proportional methods, RSEM, and eXpress, when specifying the FL distribution; that is, the mean FL and the standard deviation. In addition, we specified extra iterations of the EM-algorithm for eXpress. We also present results from the “TF” and “TF simple” (TF without edge effect corrections).

We simulated short reads using the R package Polyester (Frazee et al., 2015). Polyester allows full control over the simulation setup, the sequencing FL, and the RL. We point out that the data generation mechanism used by Polyester is fully independent from our concept. As it has been shown that the fragment–length–distribution can be approximated by a Gaussian one (Li and Dewey, 2011), we generated fragments from a normal distribution with an average length of 50 base pairs (bp) and a standard deviation equal to 3 (sd = 3). The RL was set to 35 bp. To mimic experimental data, we shuffled the obtained short sequences.

We simulated data for three different settings (Fig. 3). In each setting, two experimental conditions with two replicates per condition were generated. In Settings 1 and 3, two genomic targets share one identical region, while in Setting 2 three genomic targets share two identical regions: genomic target A shares a segment with targets B and C, and genomic targets B and C share one extra region. We generated ∼1000 reads for target A and 300 reads for target B in Setting 1 and Setting 3, and in Setting 2, 1000, 300, and 500 reads for targets A, B, and C, respectively (Table 1). We mapped the generated reads against the reference sequence using Bowtie (Langmead et al., 2009). We allowed Bowtie to report up to two valid alignments in Settings 1 and 3, and up to three in Setting 2. Subsequently, for each target, the numbers of unique reads and multireads were obtained based on the number of mapping locations.

FIG. 3.

Simulation Settings 1, 2, and 3. The light grey and black segments indicate shared regions, whereas the other shades of grey segments indicate unique regions for, respectively, genomic targets A, B, and C. Settings 1 and 3 use the same set of targets, but the experiment type is different. Single-end and paired-end experiments were used in Settings 1 and 3, respectively.

Table 1.

The Expression Levels in Conditions 1 and 2

Setting	Target	Abundance condition 1	Abundance condition 2
1	Target A	1000	3200
1	Target B	300	240
2	Target A	1000	3200
	Target B	300	240
	Target C	500	600
3	Target A	1000	3200
3	Target B	300	240

Table 2 shows the four different simulation scenarios that we considered: mapping without allowing for mismatches, mapping with allowing mismatches, in combination with reads that were generated without or with sequencing error.

Table 2.

The Different Simulation Scenarios

Scenario	Mismatches	Error
1	No	No
2	Yes	No
3	No	Yes
4	Yes	Yes

Mismatches indicate mapping without mismatches (no) and mapping allowing for mismatches (yes). Error indicates simulated reads without sequencing error (no) and allowing for sequencing error (yes).

To evaluate the impact of the compared methods on the abundance estimation, we considered the difference between the read counts estimated by each method and the true expression level (Ji et al., 2011; Pasaniuc et al., 2011). We defined the resulting error rate as follows: $E r r o r r a t e = \frac{|C o u n t s_{T R U E} - C o u n t s_{E S T I M A T E D}|}{C o u n t s_{T R U E}} .$ (11)

4. Results and Discussion

4.1. Setting 1: two genomic targets

In Setting 1, we considered two targets (A and B) for which single-end reads were generated and mapped to the reference sequence. Figure 3 shows the placement of the shared region between target A and target B.

In the first simulation scenario, the reads were generated without sequencing errors, and were aligned to a reference sequence without allowing for mismatches. Table 3 summarizes the read counts obtained by all methods of handling ambiguously mapped short sequences. Figures 4 and 5 show the error rates of the unique method, the proportional method, eXpress, RSEM, “TF simple,” and TF. Among the compared approaches, TF and RSEM achieved the highest accuracy in expression-level estimation. In every sample, for both targets, the estimated numbers of short sequence reads were almost identical to the true counts (Table 3 and Figs. 6 and 7); that is, for TF the error rate was on average equal to 0.0276 and 0.0297 for targets A and B, respectively, while for RSEM the average error was equal to 0.005 and 0.028 for targets A and B, respectively (Figs. 4 and 5).

FIG. 4.

Results for Scenario 1 in Setting 1 comparing the error rate of target abundance-level estimation (target A). Shown are the distances between read counts estimated by each method and true expression level for 1000 simulated data sets.

FIG. 5.

Results for Scenario 1 in Setting 1 comparing the error rate of target abundance-level estimation (target B). Shown are the distances between read counts estimated by each method and true expression level for 1000 simulated data sets.

FIG. 6.

Results for Scenario 1 in Setting 1 comparing the target abundance-level estimation (target A). The boxplots show the abundances estimated by each method for 1000 simulated data sets. The true expression level is represented by a black line.

FIG. 7.

Results for Scenario 1 in Setting 1 comparing the target abundance-level estimation (target B). The boxplots show the abundances estimated by each method for 1000 simulated data sets. The true expression level is represented by a black line.

Table 3.

Overview of the Target Abundance Estimates Averaged Across 1000 Data Sets for Simulation Scenario 1 (No Sequencing Error, No Mismatches) in Setting 1

Condition	Sample	Target	True counts	Unique counts	Proportional counts	eXpress counts	RSEM counts	TF simple counts	TF counts
1	1	A	1000.498	327.577	778.428	1265.152	1000.603	1123.121	1001.075
	1	B	299.596	219.529	521.666	322.354	299.491	315.714	299.572
	2	A	1000.037	326.841	777.931	1264.857	1000.211	1120.598	998.826
	2	B	299.273	219.050	521.379	321.751	299.099	315.026	298.918
2	1	A	3200.576	1048.028	2946.436	4030.062	3200.499	3593.239	3202.774
	1	B	239.476	175.577	493.616	257.628	239.553	252.505	239.594
	2	A	3200.913	1047.151	2944.885	4029.948	3200.398	3590.232	3200.093
	2	B	240.312	176.500	496.334	259.014	240.827	253.833	240.854

The column “TF simple counts” shows the estimated expression levels without the edge effect correction, while the last column shows the estimated expression levels with edge effect correction.

RSEM, RNA-Seq by expectation maximization; TF, theoretical framework.

The approach without correcting for the edge effects (“TF simple”) systematically overestimated the abundances. In particular, the error rate was larger than that for TF; that is, 0.1219 and 0.0573 for targets A and B, respectively. This indicates that the proposed edge effect corrections are important. For the proportional method, the number of short sequences was substantially underestimated (target A, error rate = 0.1509) or overestimated (target B, error rate = 0.9034). This suggests that using only the number of unique reads to allocate the multireads across multiple targets is insufficient, and can lead to relatively large bias. When using eXpress, with the mean and standard deviation for the FL specified, the error rate was equal to 0.2619 and 0.077 for targets A and B, respectively.

In addition, in the first simulation scenario we generated 50 bp-long reads. The same conclusions as is a case of 35 bp-long reads can be made among considered methods (Supplementary Table S1 and Supplementary Figs. S1 and S2).

Similar observations can be made for simulation Scenario 2, where the reads were aligned with up to two mismatches (Table 4). The proportional method led to underestimation in case of genomic target A (error rate = 0.1602) and substantial overestimation of the abundance level of target B (error rate = 0.9621). eXpress overestimated the abundance especially for target A, while TF and RSEM returned very accurate and very precise abundance estimates (Supplementary Figs. S3, S4, S5, S6).

Table 4.

Overview of the Target Abundance Estimates Averaged Across 1000 Data Sets for Simulation Scenario 2 (No Sequencing Error, Allowing for Mismatches) in Setting 1

Condition	Sample	Target	True counts	Unique counts	Proportional counts	eXpress counts	RSEM counts	TF simple counts	TF counts
1	1	A	1000.498	311.859	765.465	1265.142	1000.603	1123.121	1001.075
	1	B	299.596	217.813	534.629	322.363	299.491	315.714	299.572
	2	A	1000.037	311.040	764.957	1264.943	1000.211	1120.598	998.826
	2	B	299.273	217.270	534.353	321.676	299.099	315.026	298.918
2	1	A	3200.576	997.583	2928.598	4030.004	3200.499	3593.239	3202.774
	1	B	239.476	174.225	511.455	257.678	239.553	252.505	239.594
	2	A	3200.913	996.978	2927.181	4029.988	3200.398	3590.232	3200.094
	2	B	240.312	175.089	514.044	258.979	240.827	253.833	240.854

The column “TF simple counts” shows the estimated expression levels without the edge effect correction, while the last column shows the estimated expression levels with edge effect correction.

In simulation Scenarios 3 and 4, the short sequence reads were created with the possibility of sequencing errors (probability 0.5%). Tables 5 and 6 show the abundance estimates in the presence of sequencing errors when mapping without any mismatch (Scenario 3) and when allowing for up to two mismatches (Scenario 4). The tables also show that the bias of TF's and RSEM's estimates was much smaller than that for the unique and proportional methods. For the unique method, the abundance was, as could be expected, strongly underestimated, while for the proportional method the abundance was systematically underestimated for target A, while overestimated for target B. “TF simple” slightly outperformed eXpress in case of both targets (Supplementary Figs. S7, S8, S9, S10).

Table 5.

Overview of the Target Abundance Estimates Averaged Across 1000 Data Sets for Simulation Scenario 3 (Sequencing Error, No Mismatches) in Setting 1

Condition	Sample	Target	True counts	Unique counts	Proportional counts	eXpress counts	RSEM counts	TF simple counts	TF counts
1	1	A	839.747	274.921	653.380	1061.833	839.833	942.586	840.159
	1	B	251.236	184.122	437.613	270.366	251.160	264.794	251.255
	2	A	839.144	274.427	653.127	1061.373	839.293	940.893	838.649
	2	B	250.693	183.503	436.721	269.512	250.555	263.904	250.410
2	1	A	2685.718	879.044	2472.351	3381.891	2685.787	3013.865	2686.359
	1	B	201.079	147.365	414.477	216.251	201.041	211.932	201.096
	2	A	2685.749	879.030	2471.369	3381.620	2685.498	3013.817	2686.316
	2	B	201.669	147.998	416.087	217.176	201.958	212.843	201.960

The column “TF simple counts” shows the estimated expression levels without the edge effect correction, while the last column shows the estimated expression levels with edge effect correction. The sequencing error was set on 0.5%.

Table 6.

Overview of the Target Abundance Estimates Averaged Across 1000 Data Sets for Simulation Scenario 4 (Sequencing Error, Mismatches) in Setting 1

Condition	Sample	Target	True counts	Unique counts	Proportional counts	eXpress counts	RSEM counts	TF simple counts	TF counts
1	1	A	999.820	327.291	744.522	1264.286	999.931	1122.141	1000.201
	1	B	299.387	219.375	505.478	322.136	299.276	315.493	299.362
	2	A	999.327	326.520	794.145	1264.073	999.486	1119.497	997.845
	2	B	299.072	218.916	507.855	321.436	298.913	314.833	298.735
2	1	A	3198.221	1047.036	2868.240	4027.032	3198.133	3589.838	3199.742
	1	B	239.303	175.490	521.760	257.498	239.391	252.378	239.476
	2	A	3198.561	1046.149	2842.716	4027.012	3198.028	3586.797	3197.031
	2	B	240.123	176.411	525.284	258.789	240.657	253.705	240.732

4.2. Setting 2: three genomic targets

We considered three reference targets (A, B, and C). All three targets shared the same sequence region. In addition, target B and target C shared one more sequence region (Fig. 3). The locations of the overlapping regions were at the beginning, in the middle, and at the end of the genomic targets. The unique, proportional method, eXpress, RSEM, “TF simple,” and TF were applied to four simulation scenarios to assess the expression levels (Table 7 and Supplementary Tables S2, S3, S4 and Supplementary Figs. S11, S12, S13). Results were similar to those observed for Setting 1: TF and RSEM estimated the abundance of the three targets with a very small bias, while the unique method systematically underestimated the true abundance level. The proportional method led to a larger bias and error rate than TF.

Table 7.

Overview of the Target Abundance Estimates Averaged Across 1000 Data Sets for Simulation Scenario 1 (No Sequencing Error, No Mismatches) in Setting 2

Condition	Sample	Target	True counts	Unique counts	Proportional counts	eXpress counts	RSEM counts	TF simple counts	TF counts
1	1	A	998.692	327.353	643.365	1263.693	999.044	1122.353	1000.391
		B	301.346	169.730	365.536	318.798	300.864	353.393	301.019
		C	499.217	367.046	790.354	535.349	499.348	564.163	499.413
	2	A	999.473	326.738	643.153	1264.677	1000.018	1120.245	998.511
		B	299.436	168.650	363.697	317.109	299.050	351.144	299.104
		C	500.476	367.484	792.535	536.293	500.317	564.837	500.009
2	3	A	3197.748	1044.553	2537.242	4027.637	3198.133	3581.325	3192.154
		B	240.525	135.809	353.523	253.973	240.560	282.767	240.860
		C	599.818	440.746	1147.326	641.382	599.398	677.443	599.691
	4	A	3198.856	1047.167	2539.428	4028.151	3198.454	3590.287	3200.142
		B	240.390	135.888	353.268	254.186	240.902	282.931	241.000
		C	599.877	441.010	1146.427	641.837	599.767	677.849	600.050

The column “TF simple counts” shows the estimated expression levels without the edge effect correction, while the last column shows the estimated expression levels with edge effect correction.

4.3. Setting 3: paired-end short sequences

Since paired-end reads provide superior mapping results (Li et al., 2015), paired-end sequencing is also applied to mRNA-seq experiments. Table 8 summarizes the results for simulation Scenario 1 in Setting 3. For this setting, we did not have to specify the FLs for eXpress, as eXpress can derive the FL distribution from paired-end reads on-the-fly. Similarly to Setting 1 and Setting 2, TF and RSEM clearly outperformed all other methods. Simulation Scenarios 3 and 4 in Setting 3 included sequencing error (probability 0.5%). This error was applied to only one of the two reads that form a pair. The results for Scenarios 2, 3, and 4 are described in detail in Supplementary Tables S5, S6, S7 and Supplementary Figs. S14, S15, S16, S17. The unique method systematically underestimated the true expression level. For the proportional method, a trend similar to the one observed in Settings 1 and 2 was seen. The abundance was underestimated and overestimated for target A and target B, respectively. eXpress led to smaller bias in expression-level estimation in case of target B compared with target A. For target A, eXpress performed similarly to TF without edge effect correction. In addition, in Scenario 1 we generated reads of length 50 bp. Similar conclusions as is in case of reads 35 bp long can be made among compared methods (Supplementary Table S8 and Supplementary Figs. S18 and S19).

Table 8.

Overview of the Target Abundance Estimates Averaged Across 1000 Data Sets for Simulation Scenario 1 (No Sequencing Error, No Mismatches) in Setting 3

Condition	Sample	Target	True counts	Unique counts	Proportional counts	eXpress counts	RSEM counts	TFsimple counts	TF counts
1	1	A	1001.276	367.032	801.443	1259.133	1001.396	1258.395	1001.473
	1	B	300.989	229.349	500.822	330.018	300.869	329.837	300.873
	2	A	999.589	366.776	800.377	1256.784	999.547	1257.518	1000.775
	2	B	300.114	228.854	499.327	329.234	300.156	329.125	300.223
2	3	A	3200.671	1173.162	2976.908	4023.072	3200.915	4022.270	3201.056
	3	B	240.033	182.788	463.796	262.921	239.789	262.876	239.791
	4	A	3199.445	1172.796	2974.700	4020.694	3198.996	4021.015	3200.058
	4	B	239.824	183.165	464.569	263.452	240.273	263.418	240.286

The column “TF simple counts” shows the estimated expression levels without the edge effect correction, while the last column shows the estimated expression levels with edge effect correction.

5. Discussion

We presented a set of principles underlying multiread allocation and estimation of the abundance of genes, isoforms, and/or exons in the presence of reads with multiple mapping locations. These principles indicate that at least three additional sources of information should be used when estimating the expression level of genomic targets; that is, the RL, the target length, and the length of the overlapping region.

As can be seen from the results, approaches that use this additional information partially (“eXpress” and “RSEM”) or fully (“TF” and “TF simple”) are more precise and accurate than methods that do not consider this information (“unique” and “proportional”).

We point out that Equations (5)–(8), derived from the proposed principles, are used only as an indication of the potential of the underlying principles, and therefore should be considered as a theoretical concept. The practical implementation of the underlying principles can be challenging, as several issues have to be overcome. Similarly to RSEM and eXpress, the FL distribution is not readily available in case of single-end sequencing. One can extract this distribution, after conducting an experiment, with the help of tools such as Bioanalyzer (www.genomics.agilent.com). Paired-end sequencing has the advantage that the FL can be obtained from the data. Moreover, since both ends have to map to a genomic target, the three proposed formulae for total abundance estimation reduce to a single equation. Another issue is identifying the overlapping regions of genomic targets. This process can be very time consuming. However, it has to be executed only once for each reference genome. Two additional challenges with respect to the overlapping regions have to be dealt with. First, it is likely that multiple overlapping regions within one genomic target can be found. As a consequence, additional edge effect corrections have to be defined. Second, in case of nonidentical shared segments, one should divide these shared segments in subparts that are identical to segments in other genomic targets. We believe that these challenges can be tackled, and methods that can be used for this purpose are the topics of this research.

6. Conclusion

RNA-seq has become the major technology for measuring RNA expression either on gene-, transcript-, or exon level. Accurate estimation of the expression level is crucial, but is still problematic in spite of existing methods. Current literature shows that the mapping process might be inaccurate, which is a big impediment in abundance estimation. That is, for a large fraction of short sequence reads several genomic locations can be found. Often, reads aligning to more than one region are ignored, or randomly assigned to one location. However, such strategies lead to incorrect estimation of expression levels. A number of methods have been proposed to estimate abundance in the presence of multireads. They either handle ambiguities in the alignment process (Langmead et al., 2009; Ji et al., 2011), or try to diminish biases occurring in reads quantification (Li and Dewey, 2011; Roberts and Pachter, 2013).

In this article, we highlight the underlying mechanism of multireads allocation. We present a conceptual framework that shows the importance of the RL, the genomic target length, and the overlapping region length for the abundance estimation of genes, transcripts, and/or exons in the presence of multireads. These pieces of information enable determining the correct number of multireads originating from a genomic location, and as such ensure high accuracy in expression quantification. Our results show that algorithms that incorporate (partially) this information into their estimation procedure, such as RSEM and eXpress, outperform the unique method and the proportional method by achieving higher accuracy in single-end and in paired-end sequencing experiments.

Our findings also suggest that the RL, the genomic target length, and the overlapping region length affect ambiguous reads occurrence. Shorter reads and targets incur more ambiguity in reads alignment and larger bias in expression quantification. However, RSEM, eXpress, and the set of proposed equations can handle these impediments.

Footnotes

Acknowledgments

The authors acknowledge Joke Reumers for the useful comments and fruitful discussions.

Author Disclosure Statement

The authors declare they have no conflicting financial interests.

Funding Information

K.G. acknowledges funding by the BOF-BiLA grant (R-7699) from Hasselt University.

Supplementary Material

References

Anders

, and Huber

2010. Differential expression analysis for sequence count data. Genome Biol. 11, R106.

Anders

, Pyl

P.T.

, and Huber

2015. HTSeq—A Python framework to work with high-throughput sequencing data. Bioinformatics, 31, 166–169.

Anders

, Reyes

, and Huber

2012. Detecting differential usage of exons from RNA-seq data. Genome Res. 22, 2008–2017.

Andrews

2016. FastQC: A quality control tool for high throughput sequence data. Available at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc. Accessed October 13, 2019.

Bentley

D.R.

, Balasubramanian

, Swerdlow

H.P.

, et al. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456, 53–59.

Conesa

, Madrigal

, Tarazona

, et al. 2016. A survey of best practices for RNA-seq data analysis. Genome Biol. 17, 13.

Consiglio

, Mencar

, Grillo

, et al. 2016. A fuzzy method for RNA-Seq differential expression analysis in presence of multireads. BMC Bioinformatics, 17, 345.

Eswaran

, Horvath

, Godbole

, et al. 2013. RNA sequencing of cancer reveals novel splicing alterations. Sci. Rep. 3, 1689.

Frazee

A.C.

, Jaffe

A.E.

, Langmead

, et al. 2015. Polyester: Simulating RNA-seq datasets with differential transcript expression. Bioinformatics, 31, 2778–2784.

10.

Grabherr

M.G.

, Haas

B.J.

, Yassour

, et al. 2011. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644–652.

11.

Hashimoto

, de Hoon

M.J.

, Grimmond

S.M.

, et al. 2009. Probabilistic resolution of multi-mapping reads in massively parallel sequencing data using MuMRescueLite. Bioinformatics, 25, 2613–2614.

12.

, Xu

, Zhang

, et al. 2011. BM-map: Bayesian mapping of multireads for next-generation sequencing data. Biometrics, 67, 1215–1224.

13.

Langmead

, Trapnell

, Pop

, et al. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25.

14.

, and Dewey

C.N.

2011. RSEM: Accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12, 323.

15.

, Ruotti

, Stewart

R.M.

, et al. 2010. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26, 493–500.

16.

, Ruan

, and Durbin

2008a. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851–1858.

17.

, Li

, Kristiansen

, et al. 2008b. SOAP: Short oligonucleotide alignment program. Bioinformatics, 24, 713–714.

18.

, and Jiang

2012. Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics, 28, 2914–2921.

19.

Y.L.

, Weng

J.C.

, Hsiao

C.C.

, et al. 2015. PEAT: An intelligent and efficient paired-end sequencing adapter trimming algorithm. BMC Bioinformatics 16 Suppl, 1, S2.

20.

Lister

, O'Malley

R.C.

, Tonti-Filippini

, et al. 2008. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell, 133, 523–536.

21.

Marioni

J.C.

, Mason

C.E.

, Mane

S.M.

, et al. 2008. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509–1517.

22.

Mortazavi

, Williams

B.A.

, McCue

, et al. 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods, 5, 621–628.

23.

Nariai

, Kojima

, Mimori

, et al. 2014. TIGAR2: Sensitive and accurate estimation of transcript isoform expression with longer RNA-Seq reads. BMC Genomics 15 Suppl, 10, S5.

24.

Oshlack

, Robinson

M.D.

, and Young

M.D.

2010. From RNA-seq reads to differential expression results. Genome Biol. 11, 220.

25.

Pasaniuc

, Zaitlen

, and Halperin

2011. Accurate estimation of expression levels of homologous genes in RNA-seq experiments. J. Comput. Biol. 18, 459–468.

26.

Robert

, and Watson

2015. Errors in RNA-Seq quantification affect genes of relevance to human disease. Genome Biol. 16, 177.

27.

Roberts

, and Pachter

2013. Streaming fragment assignment for real-time analysis of sequencing experiments. Nat. Methods, 10, 71–73.

28.

Robinson

M.D.

, McCarthy

D.J.

, and Smyth

G.K.

2010. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26, 139–140.

29.

Taub

, Lipson

, and Speed

T.P.

2010. Methods for allocating ambiguous short-reads. Commun. Inf. Syst. 10, 69–82.

30.

Trapnell

, Williams

B.A.

, Pertea

, et al. 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515.

31.

Wickramasinghe

, Cánovas

, Rincón

, et al. 2014. RNA-Sequencing: A tool to explore new frontiers in animal genetics. Livest. Sci. 166, 206–216.

32.

Zerbino

D.R.

, and Birney

2008. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.05 MB

0.04 MB

0.05 MB

0.54 MB

0.05 MB

0.04 MB

0.05 MB

0.04 MB

0.05 MB