IDBA-MTP: A Hybrid Metatranscriptomic Assembler Based on Protein Information

Abstract

Metatranscriptomic analysis provides information on how a microbial community reacts to environmental changes. Using next-generation sequencing (NGS) technology, biologists can study the microbe community by sampling short reads from a mixture of mRNAs (metatranscriptomic data). As most microbial genome sequences are unknown, it would seem that de novo assembly of the mRNAs is needed. However, NGS reads are short and mRNAs share many similar regions and differ tremendously in abundance levels, making de novo assembly challenging. The existing assembler, IDBA-MT, designed specifically for the assembly of metatranscriptomic data and performs well only on high-expressed mRNAs. This article introduces IDBA-MTP, which adopts a novel approach to metatranscriptomic assembly that makes use of the fact that there is a database of millions of known protein sequences associated with mRNAs. How to effectively use the protein information is nontrivial given the size of the database and given that different mRNAs might lead to proteins with similar functions (because different amino acids might have similar characteristics). IDBA-MTP employs a similarity measure between mRNAs and protein sequences, dynamic programming techniques, and seed-and-extend heuristics to tackle the problem effectively and efficiently. Experimental results show that IDBA-MTP outperforms existing assemblers by reconstructing 14% more mRNAs.

1. Introduction

The traditional approach for studying microorganisms is to isolate and cultivate each single microorganism and then study its behavior, such as gene expression levels, under different environments. As different microbes usually live together to form a microbial community, isolating a single microbe is usually impossible (Eisen, 2007) and, even if possible, changes the microbe's living behavior in a microbial community. Metatranscriptomic studies in the past have been based on microarrays or cDNA clone libraries (Poretsky et al., 2005; Tartar et al., 2009; Booijink et al., 2010). The microarray-based approach (Parro et al., 2007) requires knowledge of target mRNA sequences, which limits its usefulness in relation to novel mRNAs. cDNA clone libraries, on the other hand, can be applied to novel mRNAs, but the approach is labor-intensive and the estimations of expression levels of mRNAs are inaccurate.

High-throughput next-generation sequencing (NGS) technology (Bosch and Grody, 2008; Morozova and Marra, 2008; Fullwood et al., 2009; Pettersson et al., 2009) introduces a new and better approach for studying metatranscriptomic data. By sequencing reads from mRNA sequences of a sample, scientists can reconstruct novel mRNA sequences by assembling reads and can estimate the expression levels of each mRNA by the number of reads aligned to the mRNA sequence. Currently, there are two main NGS technologies for metatranscriptomic data: pyrosequencing technology and synthesis technology. Pyrosequencing technology (Tatusov et al., 1997; Frias-Lopez et al., 2008; Gilbert et al., 2008; Poretsky et al., 2010) produces long reads (of length about 400 bp) with relatively higher cost (over 40 times higher for the same throughput). Since the read length is long, no or limited assembly is required. Pyrosequencing technology has achieved promising results for soil samples (Tatusov et al., 1997) and marine samples (Frias-Lopezet al., 2008; Gilbert et al., 2008). Synthesis technology, on the other hand, produces relatively short reads (of length ranging from 75 to 150 bp) at much lower cost. Since the length of reads produced by synthesis technology is much shorter than the length of the mRNA sequence (about 1000 bp), the reads need to be assembled into longer sequences (contigs) before analysis.

Compared with assembling genomic, transcriptomic, or metagenomic data, assembling metatranscriptomic data is much more difficult because of the following reasons.

1. Repeat patterns across different mRNAs: Repeat patterns usually introduce ambiguity during assembly and are a common problem in all types of assembling. However, the problem is more serious in metatranscriptomes than in other data. Many genes exist in multiple species with similar functions and the resultant proteins share common protein domains (Glazer and Kechris, 2009). As a result, in the metatranscriptomic data, many different mRNAs have similar patterns. According to analysis of GenBank (Benson et al., 2000), based on known gene information, 24.53% of bacteria genes contain at least one repeat pattern of length longer than 100 bp (note that, in this analysis, different versions of the same genes from the same bacteria were ignored and only the repeat patterns in genes from different bacteria were considered). In these circumstances, assemblers, not specially designed for metatranscriptome data, produce either short contigs or chimeric contigs that merge mRNA sequences from more than one gene (Leung et al., 2013). This is consistent with our experiments (see Table 2): these assemblers can either only recover 31% of mRNAs with average contig length of 194 bp and 4.14% error rate (Oasis), or recover more mRNAs (59.29%) with longer average contig length (395 bp) but the error rate is increased to 10.73% (IDBA-UD).

2. Extreme differences in abundances: For the DNA genome assembly problem of a single species, this is not a problem because there is one abundance only. On the other hand, in transcriptomic data and metagenomic data, since the abundances of different mRNAs and the number of genomes vary [can be 100 times and 1,000 times different, respectively (Qin et al., 2010)] because of different expression levels and abundances of species, erroneous reads cannot be identified easily by sampling rates. In metatranscriptomic data, this problem becomes more serious. Since both the abundances of species and the expression levels of mRNAs from the same species may vary, the abundances of different mRNAs can vary much more significantly (over 100,000 times). Thus, low-expressed mRNA sequences are very difficult to reconstruct as correct reads from these sequences and erroneous reads are very difficult to distinguish. As Table 1 shows for our experiments on low-expressed mRNAs, the performance of existing assemblers suffers.

Table 1.

Experimental Result of Simulated Data with Low-Abundance Ratios

Software	Coverage (%)	Maximum length (bp)	Average length (bp)	No. of wrong contigs (length, bp)	No. of correct contigs (length, bp)	Error rate (%)
Oases	25.99	524	172	9 (1,063)	149 (25,690)	3.97
Trinity	9.85	497	287	17 (7,362)	34 (9,837)	42.80
IDBA-UD	48.26	783	342	8 (4,425)	83 (28,480)	13.45
IDBA-MT	52.68	900	279	8 (3,194)	136 (37,993)	7.75
IDBA-MTP	66.00	916	273	5 (1,057)	156 (42,771)	2.40

Thus, existing assemblers for genomic, transcriptomic, and metagenomic data do not perform well on metatranscriptomic data especially for the low-expressed transcriptomes (Leung et al., 2013). To our best knowledge, IDBA-MT (Leung et al., 2013) is the only assembler designed for metatranscriptomic data. IDBA-MT aims at solving the repeat pattern problem. By applying information from paired-end reads, IDBA-MT can resolve some of the chimeric contigs (see Table 2; IDBA-MT can recover more mRNAs while decreasing the error rate from about 10% to 5% when compared to IDBA-UD). However, this approach can work only for high-expressed transcriptomes with high sequencing depths as it relies on paired-end data and fails when there are insufficient sampling reads from the mRNAs (i.e., low-expressed mRNAs).

Table 2.

Experimental Result of Simulated Data with Mixed-Abundance Ratios

	Coverage (%)
Software	Total	≤5x	>5x	Maximum length (bp)	Average length (bp)	No. of wrong contigs (length, bp)	No. of correct contigs (length, bp)	Error rate (%)
Oases	31.00	22.46	8.45	676	194	63 (8,471)	1009 (196,162)	4.14
Trinity	15.10	11.28	3.80	1,270	319	106 (75,713)	310 (99,603)	43.18
IDBA-UD	59.29	42.74	16.38	1,430	395	43 (28,837)	606 (239,887)	10.73
IDBA-MT	64.07	46.53	17.37	1,511	310	37 (18,023)	1005 (312,500)	5.45
IDBA-MTP	69.62	51.29	18.33	1,615	368	41 (23,461)	1127 (415,813)	5.34

Similar to genome assembly, besides de novo assembly, one can apply the reference-based approach. Existing work tries to reconstruct mRNAs by aligning metatranscriptomic reads to known genomes or gene DNA sequences. However, this approach has had only limited success (Xiong et al., 2012) as the genomes of most microbes are still unknown (Eisen, 2007) and the microbe gene sequences mutate frequently.

1.1. Our observations on the reference-based approach

Although the aforementioned reference-based approach has limited success, about 60–70% of the proteins in bacteria have similar sequences as some known proteins (Finn et al., 2000; Tatusov et al., 1997); thus, known reference protein sequences could help in the assembling of novel mRNAs. There are two difficulties to resolve in order to make use of the protein sequences. First, we need to consider amino acid instead of nucleotides. Even if we consider amino acid, it is not trivial due to the following. For proteins with similar functionalities, even though their structures are similar and their sequences share some conserved regions, the amino acid sequences corresponding to these conserved regions might not be exactly the same. Second, to consider amino acid, the information contained in a single read becomes much less (three nucleotides converted to one amino acid). Since one read corresponds to only about 25 amino acids, it is difficult to have a confident alignment (Xiong et al., 2012). Another approach is to align contigs, instead of reads, to proteins. However, as the performance of existing assemblers is not good, the resultant contigs are short or incorrect, and not many confident alignments can be obtained.

1.2. Our contributions

To overcome the first problem of amino acid similarity, we found that even though the amino acid sequences may not be exactly the same, it is known that some amino acids, though different, have similar chemical properties and functionality (Henikoff and Henikoff, 1992). Consequently, the mRNA can be reconstructed using the approach of first decoding the reads into peptide sequences and then aligning these peptide sequences to protein sequences based on the similarity of amino acids (e.g., Blosom 62). Thus, we incorporate the similarity of amino acids into our alignment algorithm. To solve the problem of short reads and low-expressed mRNAs, we make use of the paths of the de Bruijn graph with a small k.

Our proposed assembler, IDBA-MTP, reconstructs mRNA sequences from metatranscriptomic reads, especially for low-expressed mRNAs, using the information of known microbial protein sequences to guide the construction of contigs as follows. IDBA-MTP first constructs a de Bruijn graph from the input reads using a relatively small k (k = 21 bp) to compensate for the missing long k-mers in low-expressed mRNAs. Since k is small, the de Bruijn graph, though connected, has many branches representing repeat regions in the mRNA sequences (due to problems 1 and 2) and with each mRNA represented by one of its paths. In order to determine whether a path represents an mRNA sequence or not, IDBA-MTP will decode the path into a peptide sequence and then align it with known protein sequences. Those paths, which can be aligned to known protein sequences, should be potential candidates for mRNA sequences depending on their similarity and sequencing depths. However, since the number of paths is huge (many paths will not represent any mRNA sequences) and the alignment with the protein sequences is not straightforward because of the similar chemical properties of amino acids, a dynamic programming approach with a seed-and-extend (with the seed derived from the known protein sequences) heuristics is employed to reduce the complexity of the problem.

Since the candidate mRNA sequences are constructed by aligning known protein sequences, mRNA sequences for novel proteins cannot be reconstructed using this approach. An intuitive idea is to run IDBA-MT for novel mRNAs, and then combine the results of IDBA-MT and the output from our reference-based approach. However, some mRNA sequences may be reconstructed by both approaches, which results in redundant or similar contigs. To prevent having redundant contigs, IDBA-MTP will treat those mRNA sequences reconstructed by alignment of known reference proteins as long input reads for IDBA-MT; that is, the output of the first approach will be the input of the second approach. Experiments on simulated data show that even though 48% regions of the mRNAs can be aligned to known reference proteins, existing assemblers can reconstruct only contigs representing at most 62.9% of these regions. IDBA-MTP can reconstruct contigs covering 77.6% of these regions and some novel mRNAs using protein reference sequences. As a result, IDBA-MTP can reconstruct 14% more mRNAs (in terms of the total length of mRNAs) than existing assemblers.

The article is organized as follows. The IDBA-MTP algorithm is described in section 2. Experimental results for IDBA-MTP and other existing assemblers on both simulated and real metatranscriptomic data are presented in section 3. Conclusions are drawn on the performance of IDBA-MTP in section 4.

2. Methodology

Given a set of reads sampled from a set of mRNA sequences (with nucleotides A, C, G, and U), we can construct a de Bruijn graph where each vertex v represents a length-k substring (k-mer) of the reads. There is an edge that connects vertex u to vertex v if and only if the corresponding k-mers for vertex u and vertex v overlap at k−1 positions and appear in a read. An mRNA sequence can be represented by a path of k-mers in the de Bruijn graph. Since there are many paths in the de Bruijn graph and most of them do not represent any mRNA, a correct mRNA sequence R can be reconstructed from the de Bruijn graph if some known protein sequence P can be aligned to the path. If the alignment similarity between R and P is high, R will likely be an mRNA sequence in the sample.

A protein or peptide sequence is represented by a sequence of amino acids (of which there are 20 kinds). Given a length-3m mRNA sequence R, we can decode it into a length-m sequence D(R) of amino acids by converting each nonoverlapping coden (length-3 substring) in R into an amino acid character. Given a protein sequence P and an mRNA sequence R, P and D(R) can be aligned by inserting space characters in P and D(R) to form P′ and D(R)′ of equal length, respectively, and the similarity score based on this alignment is defined as follows: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}score_a \left( P^{\prime} , D ( R ) ^{\prime} \right) = \sum \delta ( P^{\prime} [i] , D ( R ) ^{\prime} [i] ) + p_{{\rm open}} \cdot \hbox{number of gaps} \tag{1}\end{align*} \end{document}

where P′ [i] and D(R)′[i] are the ith amino acid in P′ and D(R)′, respectively; δ(x,y) is the similarity score between amino acids x and y (which depends on their chemical properties and roles in the protein's functionality); p_open is the gap penalty; and a gap is defined as consecutive space characters in P′ or D(R)′ (the gap penalty can be refined to take the gap size into consideration). Note that the similarity score δ(x,y) can be negative and is −∞ whenever a stopping coden in D(R)′ is compared to space or any amino acid in P′. The optimal global similarity score between P and D(R) is defined as the highest similarity score of all alignments of P and D(R). \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}score_g ( P , D ( R ) ) = max_{\cdots \cdots \cdots \cdots \cdot P^{\prime} \cdots D ( R ) ^{\prime}} \{score_a \left( P^{\prime} , D ( R ) ^{\prime} \right) \} \tag{2}\end{align*} \end{document}

Since the decoded protein from an mRNA usually does not exist in the protein database but some part of the decoded protein sequence might match with some regions of some proteins in the database because of their functional similarity, instead of aligning the whole sequence of P and D(R), the optimal local alignment between all substrings of P and D(R) is considered in IDBA-MTP and this information, in terms of contigs, will be needed for mRNA assembly later (see Section 2.3). The optimal local similarity score is defined as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}score_l ( P , D ( R ) ) = max_{{\rm all\ substrings} \ p_s \cdots \cdot \cdot d_s \cdots \cdots \cdots \cdot D ( R ) } \{score_g ( p_s , d_s ) \} \tag{3}\end{align*} \end{document}

The protein-graph alignment (PGA) problem can be defined as follows: given a de Bruijn graph G and a protein P, find a path in G (representing a substring in an mRNA sequence R) such that score_l (P, D(R)) is maximized.

2.1. Dynamic programming

The PGA problem can be solved by dynamic programming based on the principle of optimality. Given are a decoded protein D(R) for an mRNA sequence R and a protein sequence P. Assume that the optimal global alignment of the substring d_sub (represented by a path Q(d_sub)) of D(R) and substring p_sub of P is Opt. Then, the same alignment Opt for any subpath of Q(d_sub) and the corresponding substring of p_sub should also be optimal.

Let S(v, i) define the maximum global similarity score between a substring of P ending at P[i] and all decoded sequences D(R) for path R in the de Bruijn graph G ending at vertex v. Similarly, we define S_M(v, i), S_P(v, i), and S_R(v, i) to be the maximum global similarity score with the following restrictions, respectively: (1) P[i] is aligned with the last amino acid of the corresponding protein sequence decoded from the path ending at vertex v; (2) P[i] is aligned with the space character; (3) the last amino acid of the corresponding protein sequence decoded from the path ending at vertex v is aligned with the space character. The value of S(v, i) is the maximum of 0 (alignment of two null substring), S_M(v, i), S_P(v, i), and S_R(v, i). The value of S_M(v, i), S_P(v, i), and S_R(v, i) can be calculated by considering the alignment of the last coden, and any length-3 path s → v with D(s → v) represents the decoded amino acid of path s → v, and the subproblem of alignment ending as vertex s.

S(v, i), S_M(v, i), S_P(v, i), and S_R(v, i) can be calculated as follows: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}S ( v , i ) = \begin{cases} \qquad \qquad \qquad \quad 0 & \hbox{no path ending at} \ {\it v} \ \hbox{can be} \\ & \hbox{decoded to an amino acid} \\ \max \{0 , S_M ( v , i ) , S_P ( v , i ) , S_R ( v , i ) \} & \qquad \hbox{otherwise}\end{cases}\end{align*} \end{document} \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}S_M ( v , i ) = \begin{cases} \qquad \qquad - \infty & \hbox{no path ending at} \ {\it v} \ \hbox{can be} \\ & \hbox{decoded to an amino acid} \\ S ( s , i - 1 ) + \delta \left( P [i] , D ( s \rightarrow v ) \right) & \qquad \hbox{otherwise}\end{cases}\end{align*} \end{document} \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}S_P ( v , i ) = \begin{cases} \qquad \qquad \quad - \infty & \quad i = 0 \\ \max \{S_P ( v , i - 1 ) , S_M ( v , i - 1 ) + p_{{\rm open}} \} + \delta ( P [i] , space ) & \hbox{otherwise}\end{cases}\end{align*} \end{document} \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}S_R ( v , i ) = \begin{cases} \qquad \qquad \qquad \qquad \quad - \infty & \hbox{no path ending at} \ {\it v} \ \hbox{can be} \\ & \hbox{decoded to an amino acid} \\ \max \{S_P ( s , i ) , S_M ( s , i ) + p_{{\rm open}} \} + \delta ( space , D ( s \rightarrow v ) ) & \qquad \hbox{otherwise}\end{cases}\end{align*} \end{document}

If D(s → v) represents the stopping coden, δ(D(s→v), x) =−∞ . max_v,i{S(v, i)} represents the optimal local similarity score and the corresponding aligned mRNA sequence can be obtained by backtracking. Note that care should be taken for the starting vertex of the path. Since the starting vertex of a path in de Bruijn graph represents the length-k prefix of an mRNA and each subsequent vertex represents an extra nucleotide of the mRNA, we modify zero in-degree vertices in the de Bruijn graph implicitly such that each vertex represents only one single nucleotide (the last nucleotide of the k-mer) of an mRNA. Note that since the protein sequence P is fixed, there is a fixed order of parameter i for processing the dynamic programming. Thus, the algorithm is correct even if there is a loop in the de Bruijn graph (a vertex v is considered multiple time for different i).

Since there are at most 4³=64 length-3 paths s→v to a vertex v, each entry S(v, i), S_M(v, i), S_P(v, i), and S_R(v, i) can be computed in constant time by preprocessing. The time complexity for aligning a length-|P| protein P is O(n|P|) and for a set of protein sequences with total length m is O(nm), where n is the total number of vertices in the de Bruijn graph.

2.2. Seed-and-extend heuristic

Although the dynamic programming approach can solve the PGA problem in O(nm) time, n and m are usually large for real biological data (in the order of millions and thousand millions, respectively) and the running time for the above dynamic programming approach is too long for practical use. In order to speed up the running time, IDBA-MTP applies on seed-and-extend heuristic to speed up the process. Assume that the optimal local alignment of an mRNA and a protein has at least one aligned region with t consecutive matches of amino acids (with similarity score larger than a predefined threshold), the PGA problem can be solved by a seed-and-extend heuristic. Given a simple path (a path with all intermediate vertices have exactly one incoming and one outgoing edge) or a k-mer in the de Bruijn graph representing a length-t peptide (sequence of amino acids), the reference protein sequences containing this peptide can be obtained in constant time after O(m) preprocessing, where m is the total length of the reference proteins. By considering these positions as the starting alignment positions (seeds) and extending the alignment in both forward and backward directions using dynamic programming, a small subset of paths containing the seed as a subpath will be considered and the running time can be greatly reduced in practice.

2.3. Preventing redundant mRNAs

As some reference proteins could have similar sequences, these similar proteins might align to overlapping paths in the de Bruijn graph and similar mRNA sequences may be obtained. Among these similar mRNA sequences, it is likely that only one of them is correct, while the others are only artifacts caused by sequencing errors or misalignment. However, duplicate genes and genes with similar functions in different species may also introduce similar mRNA sequences. IDBA-MTP applies two techniques to remove artifacts. The first approach is to prevent aligning multiple proteins with seeds on the same simple path in the de Bruijn graph. Simple paths in the de Bruijn graph are sorted in decreasing order of lengths and are considered one by one.

Once a protein is aligned to a path R (with the maximum alignment score among all proteins) in the de Bruijn graph, all substrings in R are removed from the seed table and will not be considered as starting positions for alignment. Note that these simple paths could still be considered when extending the alignment of other proteins using dynamic programming. Although the first approach can determine some redundant contigs (contigs represent the same mRNAs), sequence error may introduce error paths in the de Bruijn graph. Therefore, some similar proteins may be aligned to overlapped but similar paths in the de Bruijn graph. In our experiment, there can be 50 similar paths represented by the correct and erroneous paths corresponding to the same mRNA. Thus, we should not output the aligned mRNAs directly. The second approach was considering these mRNAs as long reads and treating them as input to IDBA-MT for de novo assembly. By using these extra-long reads, paired-end reads, and sequencing depths information, IDBA-MT avoids assembling redundant mRNAs and can reconstruct novel mRNAs with no similar reference proteins.

3. Experiments

We compared the performances of Oases (Schulz et al., 2012), Trinity (Grabherr et al., 2011), IDBA-UD (Peng et al., 2012), IDBA-MT (Leung et al., 2013), and IDBA-MTP on a real dataset from mouse gut (Xiong et al., 2012) and two simulated datasets generated from known bacteria gene sequences obtained from GenBank (Benson et al., 2000). Oases and Trinity were designed for assembling transcriptomic data, IDBA-UD for assembling metagenomic data, and IDBA-MT for assembling metatranscriptomic data. All bacteria gene sequences with known sources in the GenBank were downloaded. To prevent selecting mRNAs from the same species (either from the same or from different strains), duplicated sequences were removed and only one version was kept. Note that similar mRNAs obtained from different bacteria would be kept. Similar to Leung et al. (2013), mRNAs sharing at least half of the sequences with other mRNAs were selected for generating a difficult dataset (mRNAs that do not share common sequence regions with others would be isolated in the de Bruijn graph and can be assembled easily). The resultant 658 mRNA sequences were used to generate the simulated data. Although the number of mRNA sequences selected was small compared with the real experiments, this small subset of mRNAs sequences with long repeats represented the most difficult part of assembling metatranscriptomic data.

The reference bacteria protein sequences for IDBA-MTP were downloaded from NCBI database and we used the Blosum-62 scoring matrix, open gap penalty=−10 − (−1)=−9 and gap extend penalty=−1 for calculating the similarity scores of protein sequences. In all experiments on simulated data, all the corresponding protein sequences of the 658 mRNA sequences were removed from the reference protein sequences for testing the performance of IDBA-MTP. When aligning protein to de Bruijn graph, we considered all optimal alignments as input for IDBA-MT because of two reasons: (1) If there is no path similar as the input protein, the local alignment result will output a path shorter than the read length, which does not affect the assembling result. (2) Since IDBA-MT required multiple supports from reads or contigs for constructing the de Bruijn graph and resolving branches, single erroneous aligned path will not mislead IDBA-MT for producing erroneous contig.

For each simulated dataset, we randomly picked length-75 bp paired-end reads from the RNA sequence with 1% sequencing error according to the predefined abundances. The mean insert distance of each paired-end read was 200 bp with a standard deviation of 10 bp. Two sets of simulated data were generated: (1) Low-abundance −100 mRNAs were sampled with 3× sequencing depth for evaluating the performances of the assemblers for mRNAs with low expression levels. (2) Mixture-abundance −658 mRNAs were sampled from 1000× to 3× sequencing depth with the number of mRNAs following the power law (number of mRNAs with a certain abundance is directly proportional to the negative of abundance ratio) for evaluating the performances of the assemblers for mRNAs with different expression levels.

All assemblers were tested on simulated data using default parameters. Each contig produced by the assemblers was aligned to the 658 mRNAs in the samples using Blat (Kent, 2002). A contig was considered correct if and only if at least 95% of the contig region could be aligned to the mRNA sequence with 95% similarity. Some short, even correct, contigs that could not align confidently to the 658 mRNAs were considered incorrect. Regions of mRNAs aligned by correct contigs were considered covered, and the coverage of an assembler was calculated as the ratio of regions in the mRNAs covered by the contigs produced by the assembler. Although Oases, Trinity, and IDBA-UD could produce scaffolds using paired-end reads, the scaffolds performed worse than the contigs in all simulated data because these assemblers connected contigs wrongly and produced long but incorrect scaffolds. Thus, we compared the performances of the assemblers based on the resultant contigs, and the experimental results are shown in Tables 1 and 2.

3.1. Low-abundance mRNAs

When the abundances of mRNAs were low, Oases and Trinity did not perform well in assembly because of the low sequencing depths and the similarity of mRNAs. Oases tended to produce confident but shorter contigs. As a result, it had a low error rate (3.97%) but the lengths of contigs were short (average length = 172 bp) and the coverage was not high (25.99%). Since Trinity was designed for assembling transcriptomic data for eukaryotic mRNAs and was not suitable for assembling prokaryotic mRNAs, the error rate of Trinity was high (42.80%) and the coverage was low (9.85%). IDBA-UD, which was designed for assembling metagenomic data, performed better than Oases and Trinity because it applied various technologies, for example, multiple k-mers, local assembling, and local coverage of contigs for assembling reads sampled from low-abundance genomes (mRNAs in this case). However, since the mRNAs had many similar sequences, IDBA-UD could not determine these chimeric contigs and the error rate was high (13.45%) but the coverage was acceptable (48.26%).

IDBA-UD had such high error rate because it merged two or more mRNA sequences incorrectly to produce chimeric contigs. IDBA-MT, which was designed for assembling metatranscriptomic data, outperformed IDBA-UD because it used paired-end reads information to resolve chimeric contigs. It achieved a relatively high coverage (52.68%) with a low error rate (7.75%). With the information from known protein sequences, IDBA-MTP further improved the coverage to 66.00% and had the lowest error rate (2.40%). We aligned the five erroneous contigs produced by IDBA-MTP to the erroneous contigs produced by IDBA-MT using blat. All the five erroneous contigs can be aligned to the erroneous contigs produced by IDBA-MT, which indicates that IDBA-MTP has not introduced new error compared with IDBA-MT.

3.2. mRNAs with different abundances

For the simulated data with mixed abundances, the overall performance of the assemblers improved because of the mRNAs with high abundances. We have also analyzed the coverage of low-abundance mRNAs (76% mRNA with sequencing depth ≤5×) and high-abundance mRNAs (24% mRNA with sequencing depth >5×). As expected, the high-abundance mRNAs had better overall results than the low-abundance mRNAs. Again, Oases produced short but confident contigs, achieved higher coverage (31.00%) than Trinity, and had the lowest error rate (4.14%). Trinity, which assembled many long and wrong contigs, had the lowest coverage (15.10%) and the highest error rate (43.18%). IDBA-UD had higher coverage (59.29%) and moderate error rate (10.73%). By resolving some chimeric contigs, IDBA-MT had slightly higher coverage (64.07%) and lower error rate (5.45%) than IDBA-UD. IDBA-MTP had the highest coverage (69.62%) and a low error rate (5.34%).

Considering the performance of mRNAs with different abundances, IDBA-MTP could reconstruct 5% and 1% more mRNAs with low and high abundances, respectively, than the best existing assembler IDBA-MT. By using protein reference information, the performance of IDBA-MTP improved not only for the low-abundance mRNAs, but also for the high-abundance mRNAs. We also aligned the 41 erroneous contigs produced by IDBA-MTP to the erroneous contigs produced by IDBA-MT as in Section 3.2. We find that 7 out of 41 erroneous contigs produced by IDBA-MTP can be aligned. This indicates that some of the erroneous contigs produced by IDBA-MTP are due to incorrect alignment between reference protein and the de Bruijn graph. However, IDBA-MTP can prevent slightly more erroneous contigs when compared with IDBA-MT.

3.3. Real metatranscriptomic data

Xiong et al. (2012) isolated mRNAs from the lumen of the cecum and colon of 4 mice at 12 weeks of age, colonized with an altered Schaedler flora containing 8 known species without reference genomes. A total of 3.3 million paired-end reads were generated using Illumina sequencing technology. The read length was about 75 bp and the insert distance was about 300 bp. Similar to Leung et al. (2013), we merged the reads sampled from the four mice into a single dataset as the number of reads in each sample was small. The reads were inputted to existing assemblers for comparison. Since there were no reference genomes, we evaluated the accuracy of output contigs by aligning them to known protein sequences using Blastx with default parameters. A contig was considered “correct” if at least 90% of the contig sequence could be aligned to a single protein sequence. We used the number of aligned contigs instead of the number of aligned proteins to evaluate the result because a contig can be aligned to hundreds of similar proteins and it is difficult to evaluate the software based on the number of discovered proteins. Note that IDBA-UD, IDBA-MT, and IDBA-MTP consider that each k-mer in the de Bruijn graph should belong to at most one contig, and they should not output redundant contigs representing the same protein or protein regions.

The results are shown in Table 3. Similar to simulated data, Oases produced very short contigs. Since it was difficult to obtain confident alignment for short contigs, only 489 (out of 99,611) contigs produced by Oases could be aligned to known protein sequences. Trinity produced longer contigs than other assemblers. However, less than half of them (7,188 out of 19,721) could be aligned to known protein sequences although the contigs were long enough for confident alignment. The performances of IDBA-UD and IDBA-MT were similar with half of the contigs aligned to known protein sequences. IDBA-MTP produced a thousand more contigs than IDBA-UD and IDBA-MT. Since the extra contigs constructed mainly due to using protein reference sequences, most of these extra contigs could be aligned to known protein sequences.

Table 3.

Experimental Result of Real Mouse Gut Data

Software	Maximum length (bp)	Average length (bp)	Contigs number	Total length (bp)	No. of contigs aligned to known proteins (length, bp)
Oases	693	127	99,611	12,655,199	489 (84,044)
Trinity	15,857	500	19,721	9,862,469	7,188 (2,994,588)
IDBA-UD	10,741	490	18,951	9,287,101	9,510 (4,178,162)
IDBA-MT	8,863	490	18,972	9,301,484	9,515 (4,181,949)
IDBA-MTP	9,070	477	20,062	9,581,626	10,429 (4,712,857)

4. Conclusions

Existing assemblers do not perform well on metatranscriptomic data, especially on low-expressed mRNAs. In this article, we have proposed IDBA-MTP to assemble mRNAs, making use of information from the database of millions of known protein sequences. In particular, dynamic programming technique with a seed-and-extend heuristics was introduced to reconstruct mRNA sequences from paths in the de Bruijn graph with maximum similarity scores when aligned with the known protein sequences. Experimental results on both simulated and real biological data showed that IDBA-MTP outperformed existing assemblers on metatranscriptomic data.

However, when applying IDBA-MTP on metatranscriptomic data, there is an issue of running time and memory consumption when compared with existing assemblers. Since the reference proteins database is big and highly redundant (i.e., many proteins with very similar sequences exist), IDBA-MTP requires 20 GB memory for storage of reference proteins and takes 1 or 2 days for aligning millions of reference proteins to the de Bruijn graph even using the seed-and-extend heuristic. This extra memory consumption is large and its running time is much longer than existing assemblers, which take 1 or 2 hours to assemble the reads. The memory consumption and running time can be greatly reduced if the biologists have some target reference proteins. Besides, although it may not be a problem at current state because the extra memory consumption is limited and it takes weeks to generate a metatranscriptomic dataset (the assembling time can be neglected), further research should be performed to increase the speed of IDBA-MTP by preprocessing the reference proteins or parallel processing.

The technique of assembly based on known protein sequence information is applicable not only on metatranscriptomic data. It can also improve the performance on transcriptomic data of single organisms. We are planning to study the usage of protein reference sequence information on transcriptomic assembly of single species.

Footnotes

Acknowledgments

This work was supported by Hong Kong GRF HKU 7111/12E, HKU 719709E, and 719611E; Shenzhen basic research project (JCYJ20120618143038947); NSFC (National Natural Science Foundation of China) (11171086); and Outstanding Researcher Award (102009124).

Author Disclosure Statement

No competing financial interests exist.

References

Benson

, Karsch-Mizrachi

, Lipman

, et al. 2000. GenBank. Nucleic Acids Res., 28, 15–18.

Booijink

, Boekhorst

, Zoetendal

, et al. 2010. Metatranscriptome analysis of the human fecal microbiota reveals subject-specific expression profiles, with genes encoding proteins involved in carbohydrate metabolism being dominantly expressed. Appl. Environ. Microbiol. 76, 5533–5540.

Bosch

, and Grody

2008. Keeping up with the next generation: massively parallel sequencing in clinical diagnostics. J. Mol. Diagn., 10, 484–492.

Eisen

2007. Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes. PLoS Biol., 5, e82.

Finn

, Tate

, Mistry

, et al. 2000. The Pfam protein families database. Nucleic Acids Res., 28, 263–266.

Frias-Lopez

, Shi

, Tyson

, et al. 2008. Microbial community gene expression in ocean surface waters. Proc. Natl. Acad. Sci. USA, 105, 3805–3810.

Fullwood

, Wei

, Liu

, et al. 2009. Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Res. 19, 521–532.

Gilbert

, Field

, Huang

, et al. 2008. Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PLoS ONE, 3, e3042.

Glazer

, and Kechris

2009. Conserved amino acid sequence features in the α subunits of MoFe, VFe, and FeFe nitrogenases. PLoS ONE, 4, e6136.

10.

Grabherr

, Haas

, Yassour

, et al. 2011. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat. Biotechnol., 29, 644–652.

11.

Henikoff

, and Henikoff

1992. Amino acid substitution matrices from protein blocks. PNAS, 89, 10915–10919.

12.

Kent

2002. BLAT—the BLAST-like alignment tool. Genome Res., 12, 656–664.

13.

Leung

, Yiu

, Parkinson

, et al. 2013. IDBA-MT: de novo assembler for metatranscriptomic data generated from next-generation sequencing technology. J. Comput. Biol. 20, 540–550.

14.

Morozova

, and Marra

2008. Applications of next-generation sequencing technologies in functional genomics. Genomics, 92, 255–264.

15.

Parro

, Moreno-Paz

, and Gonzalez-Toril

2007. Analysis of environmental transcriptomes by DNA microarrays. Env. Microbiol., 9, 453–464.

16.

Peng

, Leung

, Yiu

, et al. 2012. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28, 1420–1428.

17.

Pettersson

, Lundeberg

, and Ahmadian

2009. Generations of sequencing technologies. Genomics, 93, 105–111.

18.

Poretsky

, Bano

, Buchan

, et al. 2005. Analysis of microbial gene transcripts in environmental samples. Appl. Environ. Microbiol., 71, 4121–4126.

19.

Poretsky

, Sun

, Mou

, et al. 2010. Transporter genes expressed by coastal bacterioplankton in response to dissolved organic carbon. Environ. Microbiol., 12, 616–627.

20.

Qin

, Li

, Raes

, et al. 2010. A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464, 59–65.

21.

Schulz

, Zerbino

, Vingron

, et al. 2012. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics, 28, 1086–1092.

22.

Tartar

, Wheeler

, Zhou

, et al. 2009. Parallel metatranscriptome analyses of host and symbiont gene expression in the gut of the termite Reticulitermes flavipes. Biotechnol. Biofuels, 2, 25.

23.

Tatusov

, Koonin

, and Lipman

1997. A genomic perspective on protein families. Science, 278, 631–637.

24.

Xiong

, Frank

, Robertson

, et al. 2012. Generation and analysis of a mouse intestinal metatranscriptome through illumina based RNA-sequencing. PLoS ONE, 7, e36009.