Development of Computer Algorithm for Editing of Next Generation Sequencing Metagenome Data

Abstract

The successful adoption of next generation sequencing (NGS) has motivated scientists from diverse fields of biological research, especially genomics and transcriptomics, to generate large genomic data sets that make their analyses more robust and their inferences stronger. However, exploiting these huge data sets is a challenge for molecular biologists. To address this problem, computational software and hardware are being developed in parallel and have become an integral part of life science. While executing the “Genomics project of Indian Drosophila species,” we found strings of Ns in the whole genome sequences generated on the Illumina platform. The present article describes a computer algorithm (implemented in MATLAB and Python) for editing raw sequences, mainly by eliminating bad residues, before submission to a publicly accessible sequence repository. The algorithm will help life scientists analyze large amounts of biological data in a short span of time.

1. Introduction

With enhanced accessibility and affordability, next generation sequencing (NGS) technology has revolutionized the field of biological sciences by providing remarkable possibilities for wide applications related to sequencing of the whole genome, transcriptome, or epigenome of an organism (Mardis, 2008; Schuster, 2008; Metzker, 2010). Using NGS technology, the underlying mechanisms of the genes and regulatory elements associated with disease epidemiology are being well studied, thus providing insight into rare as well as fully characterized disease forms (Audo et al., 2012; Grada and Weinbrecht, 2013). In addition, a wealth of knowledge for comparative biological studies can be inferred through whole genome sequencing of a wide variety of organisms using NGS technology (Kuroda et al., 2001; Hillier et al., 2008; Ekblom and Galindo, 2011). NGS technology has also transformed the field of transcriptomics (RNA-seq) by introducing a new approach to gene expression profiling in place of microarrays (Zhong et al., 2009; Tarazona et al., 2011).

Data gathering has been fuelled by NGS technology, which generates huge numbers of primary sequences in a single experimental run. High quality datasets are required before proceeding to downstream processes such as assembly, annotation, and single nucleotide polymorphism identification (Stapley et al., 2010). It is therefore necessary to preprocess the raw sequences generated through NGS technology, for example by removing low quality nucleotide sequences, adapters, polymerase chain reaction primers, and any bad residues, as these may produce erroneous results. This requires the development of advanced computing software and tools for further analysis and better interpretation of metagenome data (Fernández-Suárez and Birney, 2008; Oliver et al., 2015). Thus, collaboration between computational and molecular biologists helps in developing bioinformatics methodologies that deduce meaningful information from these genomic data within a short span of time.

Various software tools have been designed for removing ambiguities from large nucleotide sequence datasets, such as fasta_clean, Trimmomatic, AdapterRemoval, FASTX-Toolkit, SeqTrim, and TagCleaner, which perform trimming and filtering of raw reads (Bolger et al., 2014; Falgueras et al., 2010; Schmieder et al., 2010; Lindgreen, 2012). However, even after these unwanted sequences have been filtered out, some unknown residues still remain. To clear those away, we have developed an algorithm for editing whole genome sequences (WGS) generated from NGS, which may also be helpful to other researchers dealing with huge genomic data sets.

2. Methods

2.1. Whole genome sequencing

The whole genome sequences of Indian Drosophila were generated from a paired-end library on the Illumina NextSeq 500 platform (Khanna and Mohanty, 2016). The raw reads were filtered using Trimmomatic v0.30 with optimized parameters. The high quality short reads obtained after filtering were assembled de novo into scaffolds using CLC Genomics Workbench (version 6.0).

2.2. Algorithms

2.2.1. Sequence editing

Single or multiline gaps, in the form of strings of Ns of variable length, were present at either the beginning or the end of scaffolds. The scaffolds themselves are of variable length. Manually removing these gaps is a time consuming and error prone task.

The problem of removing these sequences has been formulated as a regular expression based pattern matching problem. Regular expressions are strings of symbols and wildcard characters (*, ?, |, etc.) that define patterns to be searched in an input character sequence (Hopcroft et al., 2006). The regular expressions designed to match multiline input fasta files divided into scaffolds are as follows:

\begin{align*}
\mathrm{RE1} &= {>}\mathrm{scaffold}[0-9]+\ [\mathrm{N}+]\,[\mathrm{A}|\mathrm{T}|\mathrm{G}|\mathrm{C}]\,[\mathrm{N}*] \\
\mathrm{RE2} &= [\mathrm{N}+]\,{>}\mathrm{scaffold}[0-9]+
\end{align*}
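
As a rough illustration of the pattern matching idea, the following Python sketch strips runs of N from both ends of a single sequence string using the standard `re` module. The pattern names and the function `strip_terminal_ns` are hypothetical and are not taken from the original implementation; the sketch also assumes the scaffold's lines have already been joined into one string and that lowercase n should be treated like N.

```python
import re

# Hypothetical patterns (not from the original implementation):
# a run of one or more N/n anchored at the start or at the end of the sequence.
LEADING_N = re.compile(r"^[Nn]+")
TRAILING_N = re.compile(r"[Nn]+$")


def strip_terminal_ns(sequence: str) -> str:
    """Remove N runs from both ends of a sequence; internal N runs are kept."""
    sequence = LEADING_N.sub("", sequence)
    return TRAILING_N.sub("", sequence)


print(strip_terminal_ns("NNNNATGCNNNATGCNN"))  # ATGCNNNATGC
```

Because both patterns are anchored, the internal run of Ns is left untouched, which matches the requirement that only gaps at the ends of a scaffold are removed.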

A finite state machine (FSM) has been designed to find these regular expressions in the input file and eventually remove the matched gaps (Ficara et al., 2008). The FSM can be represented as the nondeterministic automaton shown in Figure 1.

FIG. 1.

Finite state machine for required pattern recognition.

Sequences recorded from the beginning of state q1 to the end of q2 are deleted to remove initial gaps. Similarly, the sequence of q2 recorded just before entering state q4 is removed. The FSM of Figure 1 has been abstracted into a deterministic program implemented in the Python programming language. The step by step details of the gap removal method for each scaffold are presented as pseudo code in Algorithm 1. The scaffold numbers were used as delimiters to mark the beginning of a scaffold. The delimiter may vary according to the type of input file.

Algorithm 1: Initial and trailing gap removal in genome sequence.
Input: Assembled genome sequences as scaffolds with N gaps.
Output: Draft whole genome sequence as scaffolds without gaps.
1. f1 ← file of raw scaffold sequences
2. f2 ← file of gaps removed from beginning of scaffold sequences
3. prev_line = read_one_line(f1);
4. while (prev_line is not empty) do
5. if (find_subsequence(prev_line,'>[s\|S]caffold+'))
6. curr_line = read_one_line(f1);
7. while (first_char(curr_line,'N⁺'))
8. modified_line = left_trim(curr_line,'N⁺','');
9. write modified_line to f2
10. curr_line = read_one_line(f1);
11. end while
12. end if
13. write curr_line to f2
14. prev_line = curr_line
15. Go to step 4
16. end while
17. curr = lastline_of_file(f1)
18. repeat steps 4–16 by traversing backward
19. until beginning_of_file
20. close(f1)
21. close(f2)
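
The effect of Algorithm 1 can be approximated in a few lines of Python. The sketch below is not the authors' program: instead of traversing the file forward and then backward, it holds one scaffold in memory at a time and strips Ns from both ends in a single pass. The function names, the fixed output line width, and the decision to skip scaffolds that consist only of Ns are all assumptions made for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining multiline sequences."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_terminal_gaps(src: TextIO, dst: TextIO, width: int = 60) -> None:
    """Write scaffolds from src to dst without leading or trailing N runs."""
    for header, sequence in read_fasta(src):
        trimmed = sequence.strip("Nn")  # internal N runs are preserved
        if not trimmed:
            continue  # assumption: a scaffold made only of Ns is skipped
        dst.write(header + "\n")
        # assumption: output is re-wrapped to a fixed line width
        for start in range(0, len(trimmed), width):
            dst.write(trimmed[start:start + width] + "\n")


# Small in-memory example; real input would be an opened fasta file.
raw = io.StringIO(
    ">scaffold1\nNNNNNN\nNNATGC\nGGNNCC\nTTNNNN\nNNNN\n>scaffold2\nACGT\n"
)
out = io.StringIO()
trim_terminal_gaps(raw, out, width=6)
print(out.getvalue())
```

In this example the multiline runs of Ns at both ends of scaffold1 are removed, while the two Ns in the middle of the scaffold are kept. Holding a whole scaffold in memory is a simplification that may not suit very long scaffolds.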

The algorithm uses standard subsequence finding procedures to find the beginning of each scaffold. The genome sequence data following each scaffold header are then analyzed for prefixes of one or more Ns, and these variable length runs of Ns are removed from the original fasta file. Identifying and removing gaps at the end of a scaffold is computationally more complex than removal at the beginning, because it requires repeated back and forth movement in the trailing lines of each scaffold. This issue is addressed in step 18 of Algorithm 1 by traversing the fasta file again, but in reverse order. Algorithm 1 reads and processes each character of the file twice and thus performs 2m operations, where m is the total number of characters in the file. The overall time complexity of the search algorithm is therefore still linear, that is, O(m). This complexity is comparable to that of many string matching algorithms proposed for this purpose (Crochemore and Rytter, 2003).

3. Results and Discussion

During the Drosophila genome project, we generated WGS of Drosophila species of Indian origin. The high quality reads were assembled de novo into scaffolds using CLC Genomics Workbench. Some scaffolds contained gaps in the form of strings of Ns of variable length at the beginning, middle, or end of the sequence (Figs. 2–4). The gaps in the middle of a scaffold are of known nucleotide length and can be filled in later by designing primers and sequencing. Therefore, single or multiline strings of Ns in the middle of a scaffold need not be removed (highlighted in Figs. 3 and 4). However, the gaps at the beginning and end of a scaffold should be eliminated before submission to a public sequence repository (NCBI). Because the dataset was large, it was not possible to edit those scaffolds manually. To resolve the problem, the computer algorithm was written in Python and applied while preprocessing the WGS.

FIG. 2.

Editing of string of Ns of varying length in the WGS of Drosophila as highlighted. (A) Original scaffold having strings of Ns at the beginning. (B) Edited scaffold without strings of Ns. WGS, whole genome sequences.

FIG. 3.

Editing of string of Ns of varying length in the WGS of Drosophila as highlighted. (A) Original scaffold having strings of Ns at the beginning. (B) Edited scaffold without strings of Ns.

FIG. 4.

Editing of string of Ns of varying length in the WGS of Drosophila as highlighted. (A) Original scaffold having strings of Ns at the end. (B) Edited scaffold without strings of Ns.

The pseudo code presented in Algorithm 1 was implemented in MATLAB and Python. The programs were run on two machines: a laptop with a Core i5 processor and 4 GB RAM, and a desktop with a Core i7 processor and 8 GB RAM. Execution time differed drastically between the two machines. The Python implementation took about one minute to complete gap removal on the laptop, whereas the same task took about one second on the desktop. The MATLAB program took about 10 minutes to complete gap removal for one file of size 160 MB. The results of gap removal in our fasta files are shown in Figures 2–4. The program was also tested on other genome sequences (Homo sapiens, NCBI accession number: ADDF00000000; Rattus norvegicus, NCBI accession number: AAXM00000000; Mus musculus, NCBI accession number: 000389885.1) retrieved from the NCBI database (https://www.ncbi.nlm.nih.gov/). Because these are postprocessed sequences, noise (strings of Ns of varying lengths) was first added to them, and the results of its removal are shown in Figures 5–7. All the required gaps were removed in all the tested files, giving a 100% accuracy rate on these files. The time required for removal using Python on the desktop was also less than 1 second for all files.
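
The kind of check described above, adding strings of Ns to clean sequences and confirming that they are removed, can be sketched as follows. This is only an illustration under stated assumptions: the function `add_terminal_ns`, the maximum run length, and the toy sequence are hypothetical, and the procedure actually used to generate the noise is not specified in the text. The round trip holds only when the clean sequence does not itself begin or end with N.

```python
import random


def add_terminal_ns(sequence: str, rng: random.Random, max_run: int = 50) -> str:
    """Pad a sequence with N runs of random length (possibly zero) at both ends."""
    lead = "N" * rng.randint(0, max_run)
    trail = "N" * rng.randint(0, max_run)
    return lead + sequence + trail


rng = random.Random(0)          # fixed seed so the example is repeatable
original = "ATGCGGNNCCTT"       # toy sequence with an internal gap
noisy = add_terminal_ns(original, rng)
recovered = noisy.strip("N")    # terminal gap removal on a single string

# The internal gap survives and the added terminal Ns are gone.
assert recovered == original
```

A check of this form only confirms that terminal Ns added artificially are removed; it says nothing about other kinds of sequence error.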

FIG. 5.

Editing of string of Ns in the whole genome sequence of human. (A) Original scaffold having strings of Ns at the beginning. (B) Edited scaffold without strings of Ns.

FIG. 6.

Editing of string of Ns in the whole genome sequence of mouse. (A) Original scaffold having strings of Ns at the end. (B) Edited scaffold without strings of Ns.

FIG. 7.

Editing of string of Ns in the whole genome sequence of rat. (A) Original scaffold having strings of Ns at the beginning. (B) Edited scaffold without strings of Ns.

Several other tools can be used for this purpose. We tried the FASTQ/A Clipper tool of the FASTX toolkit (http://hannonlab.cshl.edu/fastxtoolkit) (Gordon and Hannon, 2010). However, it did not work in the desired manner and also erased complete lines containing valid sequences. Other similar tools for Linux, such as grep and sed (stream editor), were also tried (Crochemore and Rytter, 2003; Abou-Assaleh and Ai, 2004). None of the existing tools we tried solved the given problem. Hence, this algorithm was developed, and it should be useful for researchers working on similar problems.

4. Conclusion

Mining the large genomic datasets generated through high-throughput sequencing technologies and bringing them into an easy-to-understand format has become a great challenge, one that requires the integration of genomics and computational biology. The algorithm developed here may help other researchers when editing WGS.

Acknowledgments

This research was supported by extramural grants from the Department of Science and Technology (DST), India. We would like to acknowledge Jaypee Institute of Information Technology, Noida, India for providing infrastructural support.

Author Disclosure Statement

No competing financial interests exist.

References

Abou-Assaleh, and Ai. 2004. Survey of global regular expression print (grep) tools. Citeseer. Topics in Program Comprehension (CSCI 6306). Halifax, Nova Scotia, Canada.

Audo, Bujakowska K.M., Léveillard, et al. 2012. Development and application of a next-generation-sequencing (NGS) approach to detect known and novel gene defects underlying retinal diseases. Orphanet J. Rare Dis. 7, 1.

Bolger A.M., Lohse, and Usadel. 2014. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics, 30, 2114–2120.

Crochemore, and Rytter. 2003. Jewels of Stringology: Text Algorithms. World Scientific, Singapore.

Ekblom, and Galindo. 2011. Applications of next generation sequencing in molecular ecology of non-model organisms. Heredity, 107, 1–15.

Falgueras, Lara A.J., Fernández-Pozo, et al. 2010. SeqTrim: A high-throughput pipeline for pre-processing any type of sequence read. BMC Bioinformatics, 11, 38.

Fernández-Suárez X.M., and Birney. 2008. Advanced genomic data mining. PLoS Comput. Biol. 4, e1000121.

Ficara, et al. 2008. An improved DFA for fast regular expression matching. ACM SIGCOMM Comput. Commun. Rev. 38, 29–40.

Gordon, and Hannon G.J. 2010. Fastx-toolkit. FASTQ/A short-reads pre-processing tools. Unpublished.

Grada, and Weinbrecht. 2013. Next-generation sequencing: Methodology and application. J. Invest. Dermatol. 133, 1–4.

Hillier L.W., Marth G.T., Quinlan A.R., et al. 2008. Whole-genome sequencing and variant discovery in C. elegans. Nat. Methods, 5, 183–188.

Hopcroft J.E., Motwani, Ullman J.D. 2006. Automata Theory, Languages, and Computation. International Edition. Pearson, 535 pgs.

Khanna, and Mohanty. 2016. Whole genome sequence resource of Indian Zaprionus indianus. Mol. Eco. Res. [Epub ahead of print]; DOI: 10.1111/1755-0998.12582.

Kuroda, Ohta, Uchiyama, et al. 2001. Whole genome sequencing of meticillin-resistant Staphylococcus aureus. Lancet, 357, 1225–1240.

Lindgreen. 2012. AdapterRemoval: Easy cleaning of next-generation sequencing reads. BMC Res. Notes, 5, 337.

Mardis E.R. 2008. Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet. 9, 387–402.

Metzker M.L. 2010. Sequencing technologies—the next generation. Nat. Rev. Genet. 11, 31–46.

Oliver G.R., Hart S.N., Klee E.W. 2015. Bioinformatics for clinical next generation sequencing. Clin. Chem. 61, 124–135.

Schmieder, Lim Y.W., Rohwer, et al. 2010. TagCleaner: Identification and removal of tag sequences from genomic and metagenomic datasets. BMC Bioinformatics, 11, 341.

Schuster S.C. 2008. Next-generation sequencing transforms today's biology. Nature, 200, 16–18.

Stapley, Reger, Feulner P.G., et al. 2010. Adaptation genomics: The next generation. Trends Ecol. Evol. 25, 705–712.

Tarazona, García-Alcalde, Dopazo, et al. 2011. Differential expression in RNA-seq: A matter of depth. Genome Res. 21, 2213–2223.

Zhong, Gerstein, and Snyder. 2009. RNA-Seq: A revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63.