Mingle: A Command Line Utility for Merging Multi-fasta Files

Abstract

Massively parallel sequencing (MPS) has become a standard technique in molecular biology whose application has spread from the analysis of the human genome to that of virtually all other organisms. MPS requires reference genomes to be performed and, in some cases, multiple genomes need to be handled as a single unit to carry out genetic analysis. Nucleic acid sequences are typically stored in “fasta” files, which can contain multiple genomes (“multi-fasta”). Although it is possible to convert a multi-fasta file into a single sequence using specific computer commands, the resulting file will not keep track of the boundaries of the original sequences, making it difficult to determine to what genome read obtained from MPS belong to. In this study we introduce mingle, a shell script that can be used to create custom reference genome by merging multi-fasta files while providing a list of boundaries of the individual genomes that can be used for downstream analysis.

1. Introduction

Massively parallel sequencing (MPS), a technique that has uncovered the complexity of the human genome (Clarke et al., 2012), has become a standard procedure in the laboratory of molecular biology that is providing valuable information related to the genomic structure and genetic expression of microorganisms (Tagini and Greub, 2017; Tang and Larsson, 2017). Apart from de novo genome assembly experiments, MPS requires a short read aligner to map the sequences obtained during the experiment to a known reference genome (Ruffalo et al., 2011; Pfeifer, 2017).

Although there are plenty of short read aligners, there are few available reference genomes limited to the most common organisms; for instance, the integrative genome viewer (IGV) (Robinson et al., 2011), which can graphically display the obtained MPS reads, is supplied with a set of 79 unique genomes belonging to 58 families of organisms, a tiny fraction of the more than three trillion sequences deposited into the GenBank (Sayers et al., 2019). The availability of reference genomes represents a bottleneck in metagenomics, for less popular reference genomes need to be prepared by the operator to perform MPS analysis.

Although it is possible to download reference sequences from the GenBank by different approaches, there are situations where it is not feasible to work with single reference sequences. For instance, it might not be known beforehand what exact organism the reads under investigation belong to, or there could be occurrences of recombination between multiple genomes. In these cases, a panel of reference sequences should be used as a single unit.

The most common file format for nucleic acids and amino acid sequences is the fasta format, which allows multiple entries on the same file (multi-fasta). In this format, each sequence contains a header introduced by the character “>” and terminated by a new line; the actual sequence is then spread into one or more lines each typically having the same number of characters (line width), usually 80 (Zhang, 2016). To restate the process of merging a multigenomic reference file using the fasta format, the resulting file should have a single header and a cluster of nucleotides derived from all the individual genomes. It is possible to perform this action with, for instance, Unix/Linux shell commands (such as: grep -v "^>" file.fa > newfile.fa). However, the individual sequences might have a variable line width that would be passed to the processed file. In addition, using this approach, it is no longer possible to find the boundaries of the individual genomes.

There are some software focused on merging fasta files. For instance, MultiFASTA Builder (www.dnabaser.com/help/tools-converters/MultiFASTA%20Builder/index.html) is a proprietary program for the Windows operative system that takes as an input a list of single fasta files and then creates a multi-fasta. The Sequence Manipulation Suite (Stothard, 2000) is a freeware program also available as an online platform (www.bioinformatics.org/sms2/combine_fasta.html) that does perform merging but does not show the boundaries of the original sequences.

In this study we introduce mingle, a utility for merging multi-fasta files that allows the definition of the output's line width and that provides the boundaries of the individual sequences for downstream applications. Mingle also performs a test on the length of the individual sequences to prove that the boundaries were properly determined.

2. Materials, Methodologies, and Techniques

Mingle version 2.3 was named after the verb “to mingle” meaning to blend, to mix, to coalesce; it was written in Bash language and tested on the Linux distributions Mint 20, openSUSE 13, and Ubuntu 18.04 running Bash version 4.4.19, 4.2.53, and 4.4.19, respectively. In this study, the path to the mingle executable was provided in the Linux environmental variables, thus it was possible to run mingle as any other shell commands simply by typing mingle <options> on a Linux terminal; alternatively, mingle could be launched as a common bash scripts. Mingle's script is given in Supplementary File S1.

Mingle has a manual page that can be displayed with the command: mingle -h. Mingle works following the syntax: mingle <input> <output> <header>, where input is the name of a multi-fasta file (including the extension; this approach was chosen because it was assumed that the user would use any kind of extension or none at all), output provides the name of the final files (one with the extension “.fa” containing the merged sequences and one with the extension “.tsv” containing the boundaries of the individual sequences), and header represents the name of the fasta header (if this is formed by multiple words, it should be provided within quotes). A restriction of mingle is that the headers of the individual sequences should be unique.

Following the “rule of silence” of the Unix philosophy (https://en.wikipedia.org/wiki/Unix_philosophy), the execution of mingle does not show any messages to the standard output; the output of the program is entirely directed to two textual files that represent the end product of the process. To assure that mingle would not stall when dealing with large files, a progression bar could be activated with the flag -i.

Mingle calculates the cumulative length of the provided sequences to determine their boundaries; to assure that the reported lengths are correct, a test can be activated with the flag -t. With this option, mingle will use the EMBOSS (Rice et al., 2000) esearch and efetch commands to extract the GenBank file corresponding to each individual sequence. To work, this option requires a running Internet connection, the EMBOSS executable set in the environment variables, and the header of the sequences formatted according to the EMBOSS specifications, that is: >IDName An Informative comment (http://emboss.open-bio.org/html/use/ch05s02.html). Mingle will use the “IDName” field to fetch the deposited length of the genome.

Mingle can also set the width of the sequence. By default, the number of characters in each row is 100; this value was chosen because it makes easier to calculate the position of a given nucleotide when manually parsing the output file. It is possible to set the width to the more canonical 80 characters with the option -w; alternatively, a custom value can be set with the option -W n, where n is an integer. A file with a heterogeneous line width was prepared by pasting random viroid nucleic acid sequences into a text file called “unevenFile.fa”. A file with amino acid sequences was prepared by pasting the transposase sequence of Aeromonas salmonicida subsp. salmonicida culture-collection (Trudel et al., 2013) into a file called “Tn.fa”.

The alignment of selected sequences was carried out with MAFFT (Katoh et al., 2002) and the visualization of the resulting alignment was obtained with Bioedit (www.mbio.ncsu.edu/bioedit/bioedit.html).

A Bash script (“tester.sh”) was prepared to benchmark the speed of retrieving information from mingle's output as compared with enquiring remote servers (Supplementary File S2).

3. Results

Mingle was designed with the aim of full automation of the merging process. To test this design, all the viroid sequences deposited on the National Center for Biotechnology Information (NCBI) database were extracted using the EMBOSS package. The viroid taxon was chosen because it could be accessed using an identification number and because it contained enough deposited sequences to provide a multi-fasta file without requiring excessive computational resources. Retrieval and counting of the instances obtained was performed with the commands:

$ esearch -db “nucleotide” -query “txid12884[Organism] AND refseq[filter]” | efetch -format fasta > viroid.fa

$ grep “^>” viroid.fa | wc -l

where “$” represents the prompt of a Linux terminal and 47 is the number of individual sequences retrieved. Mingle was then launched with the default line width of 100 characters; the original file “viroid.fa” was compared with the output file “mingled.fa”:

$ mingle viroid.fa mingled “mingled fasta file”

$ head -n 5 viroid.fa

>NC_039241.1 Eggplant latent viroid complete genome, isolate 2

GGGTGGTGTGTGCCACCCCTGATGAGACCGAAAGGTCGAAATGGGGTTTCGCCATGGGTCGGGACTTTAA

ATTCGGAGGATTCGTCCTTTAAACGTTCCTCCAAGAGTCCCTTCCCCAAACCCTTACTTTGTAAGTGTGG

TTCGGCGAATGTACCGTTTCGTCCTTTCGGACTCATCAGGGAAAGTACACACTTTCCGACGGTGGGTTCG

TCGACACCTCTCCCCCTCCCAGGTACTATCCCCTTTCAAGGATGTGTTCCCTAGGAGGGTGGGTGTACCT

$ head -n 5 mingled.fa

>mingled fasta file

GGGTGGTGTGTGCCACCCCTGATGAGACCGAAAGGTCGAAATGGGGTTTCGCCATGGGTCGGGACTTTAAATTCGGAGGATTCGTCCTTTAAACGTTCCT

CCAAGAGTCCCTTCCCCAAACCCTTACTTTGTAAGTGTGGTTCGGCGAATGTACCGTTTCGTCCTTTCGGACTCATCAGGGAAAGTACACACTTTCCGAC

GGTGGGTTCGTCGACACCTCTCCCCCTCCCAGGTACTATCCCCTTTCAAGGATGTGTTCCCTAGGAGGGTGGGTGTACCTCTTTTGGATTGCTCCGGCCT

TCCAGGAGAGATAGAGGACGACCTCTCCCCATACTGGAAACAACCTTGTGGTTCCTGTGGGTCACCCCTCCAAAGCAATAAAGGAAAAGAAAGGCGAAGA

Figure 1 shows the matching of the first three entries to demonstrate that the sequences of the output file corresponded to those of the input file. The boundaries of the individual files were reported on the “mingled.tsv” file:

FIG. 1.

Visualization of the mingling process. The figure shows the terminal's output of the shell commands used to display part of the content of the input file “viroid.fa” (A), which contained all the 47 viroid sequences deposited in the NCBI database at the time of writing, and of the output file “mingled.fa” (B), which contained a single sequence obtained by merging those 47 entries. Shades of grey are used to highlight the position of the first three genomes and demonstrate the process of sequence mingling. NCBI, National Center for Biotechnology Information.

$ head -n 3 mingled.tsv

Start	End	Length	Sequence name
1	333	333	>NC_039241.1 Eggplant latent viroid complete genome, isolate 2
334	637	304	>NC_035620.1 Lychee viroid-like RNA isolate 20160105, complete sequence

The calculated lengths were compared with those deposited into the NCBI database:

$ mingle -t viroid.fa mingle_tested “Mingled file with length test”

$ head -n 3 mingle_tested.tsv

Start	End	Length	True length	Sequence name
1	333	333	333	>NC_039241.1 Eggplant latent viroid complete genome, isolate 2
334	637	304	304	>NC_035620.1 Lychee viroid-like RNA isolate 20160105, complete sequence

To note the additional column “true length” present in “mingle_tested.tsv”. To assure that the reported boundaries corresponded to the actual location of the individual sequences, three random sequences were selected (corresponding to the GenBank IDs NC_22893.1, NC_003613.1, and NC_001351.1), and their reference fasta files were retrieved from the NCBI database and saved to fasta files. These were individually aligned to “mingled.fa” with:

$ mafft –globalpair –maxiterate 1000 –thread 10 cherry_leaf.fa > cherry_leafALN.fa

$ mafft –globalpair –maxiterate 1000 –thread 10 iresine.fa > iresineALN.fa

$ mafft –globalpair –maxiterate 1000 –thread 10 hop_stunt.fa > hop_stuntALN.fa

Figure 2 shows that the boundaries for these sequences were, respectively, 2166-2577 (NC_22893.1), 8255-8624 (NC_003613.1), and 11979-12280 (NC_001351.1). These values were compared with those provided by mingle:

FIG. 2.

Boundaries of selected sequences. Three sequences were randomly selected from the bulk of the 47 viroid genomes available and were individually aligned to the “mingled.fa” file that represented the reference genome. The figure shows only the 5′ and 3′ ends of the sequences for cherry leaf scorch small circular viroid-like RNA (A), Iresine viroid 1 (B), and Hop stunt viroid (C).

$ grep “NC_022893.1” mingled.tsv

2166 2577 412 >NC_022893.1 Cherry leaf scorch small circular viroid-like RNA 1 isolate es.10, complete sequence

$ grep “NC_003613.1” mingled.tsv

8255 8624 370 >NC_003613.1 Iresine viroid 1, complete genome

$ grep “NC_001351.1” mingled.tsv

11979 12280 302 >NC_001351.1 Hop stunt viroid, complete genome

Mingle was also tested for its capability of providing different line width:

$ mingle -w viroid.fa mingled_80 “Standard width of 80”

$ head -n 3 mingled_80.fa

>Standard width of 80

GGGTGGTGTGTGCCACCCCTGATGAGACCGAAAGGTCGAAATGGGGTTTCGCCATGGGTCGGGACTTTAAATTCGGAGGA

TTCGTCCTTTAAACGTTCCTCCAAGAGTCCCTTCCCCAAACCCTTACTTTGTAAGTGTGGTTCGGCGAATGTACCGTTTC

$ mingle -W 50 viroid.fa mingled_50 “Custom width of 50”

$ head -n 3 mingled_50.fa

>Custom width of 50

GGGTGGTGTGTGCCACCCCTGATGAGACCGAAAGGTCGAAATGGGGTTTC

GCCATGGGTCGGGACTTTAAATTCGGAGGATTCGTCCTTTAAACGTTCCT

$ mingle -W 110 viroid.fa mingled_110 “Custom width of 110”

$ head -n 3 mingled_110.fa

>Custom width of 110

GGGTGGTGTGTGCCACCCCTGATGAGACCGAAAGGTCGAAATGGGGTTTCGCCATGGGTCGGGACTTTAAATTCGGAGGATTCGTCCTTTAAACGTTCCTCCAAGAGTCC

CTTCCCCAAACCCTTACTTTGTAAGTGTGGTTCGGCGAATGTACCGTTTCGTCCTTTCGGACTCATCAGGGAAAGTACACACTTTCCGACGGTGGGTTCGTCGACACCTC

The number of characters was the same in each of these files but, as expected, the number of lines was proportional to the line width:

$ A = $(grep -v “^>” mingled.fa | wc -c)

$ B = $(grep -v “^>” mingled.fa | wc -l)

$ C = $(($A-$B))

$ echo $C/$B

15803/159

$ A = $(grep -v “^>” mingled_80.fa | wc -c)

$ B = $(grep -v “^>” mingled_80.fa | wc -l)

$ C = $(($A-$B))

$ echo $C/$B

15803/317

$ A = $(grep -v “^>” mingled_50.fa | wc -c)

$ B = $(grep -v “^>” mingled_50.fa | wc -l)

$ C = $(($A-$B))

$ echo $C/$B

15803/198

$ A = $(grep -v “^>” mingled_110.fa | wc -c)

$ B = $(grep -v “^>” mingled_110.fa | wc -l)

$ C = $(($A-$B))

$ echo $C/$B

15803/144

The data provided so far showed that mingle could convert the individual genomes in a trustful way. The final outcome of this manipulation can be seen in Figure 3. It is customary to visualize mapped MPS reads using programs such as IGV; this step requires a reference genome. By providing a concatenated file with individual genomes, IGV reports each of the entries as a separate chromosome making the navigation cumbersome. Conversely, mingle provides a single cumulative chromosome that the operators can navigate with much ease. The following data will demonstrate other features of mingle not directly associated with the mapping of reads.

FIG. 3.

Differences in the visualization of the MPS reads according to the selected reference genome. This figure shows screenshots of integrative genome viewer with different reference genomes loaded in the environment. The file “viroid.fa” contained 47 individual viroid sequences (A) and “mingle.fa” a single sequence obtained by mingling them (B). To note the difference in the number of entries (arrows on the left) and the length of the shown genomes (arrows on the right) between the two panels.

Identifying genomes parsing mingle's output was faster than enquiring remote servers: by running a Bash file purposely written to parse the local “mingled.tsv” file and to access the GenBank database, we found that the former approach could identify the IDs of selected genomes in a fraction of a second, whereas the remote access took an average of 21 ± 11 minutes (Supplementary File S3).

Since a fasta file can accommodate any custom sequence, the line width might not be constant; therefore, a direct stacking of the sequences could result in a visually unpleasant file. Mingle, in contrast, was capable of producing an output with a constant line width:

$ head unevenFile.fa

>First sequence

TGTGGGTCACCCCCCCAGCGAGTAATAAAAAGACAAGGGGTCGAGGACTCACC

GAGACGAGCGA

>Second sequence

GGCCGCTGGGCACTCCCCACCGTCCTTTTTTGCCAGTTCGCTCCAGGTTTCCCCGGGGATCCCTGAAGCGCTCCTCCGA

GCTTACCTGGCTTCCC

CGCTGAAACAGGGTTTTCACCCTTCCTTTCTTCGGGTTTCCT

>Third sequence

TCTCCCCCGCCGCTCTGGTAAGGCCCCTGCAGCGA

AGATCCCGCTAGTCGAGCGGACTTGTCGATCAACCAACGCGCTTTTCTTTTATTC

$ echo “>Heterogeneous line width” > unevenMerged.fa

$ grep -v “^>” unevenFile.fa >> unevenMerged.fa

$ head unevenMerged.fa

>Heterogeneous line width

TGTGGGTCACCCCCCCAGCGAGTAATAAAAAGACAAGGGGTCGAGGACTCACC

GAGACGAGCGA

GGCCGCTGGGCACTCCCCACCGTCCTTTTTTGCCAGTTCGCTCCAGGTTTCCCCGGGGATCCCTGAAGCGCTCCTCCGA

GCTTACCTGGCTTCCC

CGCTGAAACAGGGTTTTCACCCTTCCTTTCTTCGGGTTTCCT

TCTCCCCCGCCGCTCTGGTAAGGCCCCTGCAGCGA

AGATCCCGCTAGTCGAGCGGACTTGTCGATCAACCAACGCGCTTTTCTTTTATTC

CCTCTTGCGCGGCGGGGCGAGAGTGTTTTGCCTAAGGACCGCTGTTTGGGTCCTTGGGTGAAC

AACCTTGTGGTTCCTGTGGGTCACCCCGCCCCACGGA

$ mingle unevenFile.fa evenFile “Homogeneous line width”

$ head -n 3 evenFile.fa

>Homogeneous line width

TGTGGGTCACCCCCCCAGCGAGTAATAAAAAGACAAGGGGTCGAGGACTCACCGAGACGAGCGAGGCCGCTGGGCACTCCCCACCGTCCTTTTTTGCCAG

TTCGCTCCAGGTTTCCCCGGGGATCCCTGAAGCGCTCCTCCGAGCTTACCTGGCTTCCCCGCTGAAACAGGGTTTTCACCCTTCCTTTCTTCGGGTTTCC

TTCTCCCCCGCCGCTCTGGTAAGGCCCCTGCAGCGAAGATCCCGCTAGTCGAGCGGACTTGTCGATCAACCAACGCGCTTTTCTTTTATTCCCTCTTGCG

Since a fasta file could contain amino acid sequences, mingle was tested for its capability of dealing with them:

$ cat Tn.fa

>transposase subunit 1

MSAKRIAMRKIREVLRLRLEAQLSFRQISICTKVSVGSIQKLLK

SAQALGLSWPLPTELDDGRLAALFYPQADTTISARHVVPDWPSVHQELKRKGVSKQLL

WEEYTQRYPNSCYSYSQFCDRYKSWCQLQKRSMRQIHKAGEKLFIDYCGPTVPIVSPT

TGEVRQAQVFVAVLGASNYTFAEATWSQSLPDWLQSHVRAFEFFGGTPALLIPDNLKS

GVSKACRYDPELNPSYQQLAEHYQVAVMPARPYKPKDKAKAEVGVQIVERWILARLRH

QTFFSLAELNQCIRALLNELNERPFKQLPGNRRDAFEQLDRPALGPLPVHPYRYVAIK

TVKVNIDYHVSYEQHHYSVPHQYVGQQLELHAGDTLLQVYHQQQLVASHPRKTIPGMS

TLPEHMPERHSKQQRWTPGRLKQWAADIGPGTLCWVSERLEEKAHVEQAYRLCLGLLS

LSREYPPSRVEACCQLANREGLVRLKQLKSVLASGRDQLPASPLSYPELPQEHENIRG

PQSFH

>transposase subunit 2

MTQQTLTRLRNLKLTGMADAVQQQLEQASTYEGLPFIERLSLLV

EHEQLSREQRKQTRLVKQARLKLQATVQELDYQSARNLERSQVANLAQGEWLRRGQNL

LITGPCGCGKTYLACALGYQACQQGHSTRYYRLPRLLLELSQAKVDGSYARLLGQLAR

IELLILDDWGLEAVNAEQRNALLEIMDDRYGKYSTIVVSQLPTEEWYASLGDNTLADA

ILDRLMHNAHRLSLKGESMRKRRSELTQSEQSS

$ mingle Tn.fa mingled_aa “merged transposase”

$ cat mingled_aa.fa

>merged transposase

MSAKRIAMRKIREVLRLRLEAQLSFRQISICTKVSVGSIQKLLKSAQALGLSWPLPTELDDGRLAALFYPQADTTISARHVVPDWPSVHQELKRKGVSKQ

LLWEEYTQRYPNSCYSYSQFCDRYKSWCQLQKRSMRQIHKAGEKLFIDYCGPTVPIVSPTTGEVRQAQVFVAVLGASNYTFAEATWSQSLPDWLQSHVRA

FEFFGGTPALLIPDNLKSGVSKACRYDPELNPSYQQLAEHYQVAVMPARPYKPKDKAKAEVGVQIVERWILARLRHQTFFSLAELNQCIRALLNELNERP

FKQLPGNRRDAFEQLDRPALGPLPVHPYRYVAIKTVKVNIDYHVSYEQHHYSVPHQYVGQQLELHAGDTLLQVYHQQQLVASHPRKTIPGMSTLPEHMPE

RHSKQQRWTPGRLKQWAADIGPGTLCWVSERLEEKAHVEQAYRLCLGLLSLSREYPPSRVEACCQLANREGLVRLKQLKSVLASGRDQLPASPLSYPELP

QEHENIRGPQSFHMTQQTLTRLRNLKLTGMADAVQQQLEQASTYEGLPFIERLSLLVEHEQLSREQRKQTRLVKQARLKLQATVQELDYQSARNLERSQV

ANLAQGEWLRRGQNLLITGPCGCGKTYLACALGYQACQQGHSTRYYRLPRLLLELSQAKVDGSYARLLGQLARIELLILDDWGLEAVNAEQRNALLEIMD

DRYGKYSTIVVSQLPTEEWYASLGDNTLADAILDRLMHNAHRLSLKGESMRKRRSELTQSEQSS.

4. Discussion

This study introduced mingle, a utility for merging multi-fasta files while providing the boundaries of each original sequence. Mingle was designed to be a swift bioinformatic tool that can be used on computer clusters and as part of pipelines; it is relatively small (8.3 Kb), it does not require compilation, and works under the Unix/Linux environment, which is widely used in bioinformatic settings, but can be easily adapted to work also on Windows environment. The driving idea for developing mingle was the creation of custom genomes for MPS analysis; the user could patch together several chosen sequences and then use mingle to generate a single reference sequence while keeping track of the boundaries of each of the original genomes. When visualizing the obtained reads from the actual MPS experiment, this information could be used to determine the origin of the reads without the need for querying nucleotide databases, for instance by BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi); enquiring a local file is much faster than connecting to a remote server and does not encounter connection problems.

Another important feature that drove the development of mingle was that its use should be made as easy as possible. Although expert bioinformaticians can approach the merging of sequences writing ad hoc procedures in a variety of ways, we wanted to provide a simple tool that could be adopted by users with general computer language literacy. With a simple syntax, mingle can be applied without requiring a deep bioinformatic knowledge and, therefore, it can be defined as a user-friendly application.

The data provided herein showed that mingle could reliably merge several sequences providing a list of boundaries and genome lengths that matched the value deposited in the NCBI database.

Footnotes

Acknowledgments

The authors thank the discussion forum “Biostars” () for the support on some of the debugging issues encountered while writing mingle.

This study was supported by the Alfried Krupp von Bohlen und Halbach Foundation, Essen, the Deutsche Krebshilfe, Bonn (70112168), the Deutsche Forschungsgemeinschaft (DFG, grant number AL 465/9-1), the HEiKA Initiative (Karlsruhe Institute of Technology/University of Heidelberg collaborative effort), Dr Hella-Buehler-Foundation, Heidelberg, Ingrid zu Solms Foundation, Frankfurt, the DKFZ-MOST Cooperation, Heidelberg (grant number CA149), and the HIPO/POP-Initiative for Personalized Oncology, Heidelberg (H032 and H027).

Author Disclosure Statement

The authors declare there are no competing financial interests.

References

Clarke

, Zheng-Bradley

, Smith

, et al. 2012. The 1000 genomes project: data management and community access. Nat. Methods, 9, 459–462.

Katoh

, Misawa

, Kuma

, et al. 2002. MAFFT a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066.

Pfeifer

S.P.

2017. From next-generation resequencing reads to a high-quality variant data set. Heredity, 118, 111–124.

Rice

, Longden

, and Bleasby

. 2000. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277.

Robinson

J.T.

, Thorvaldsdóttir

, Winckler

, et al. 2011. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26.

Ruffalo

, Laframboise

, and Koyutürk

2011. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics, 27, 2790–2096.

Sayers

E.W.

, Cavanaugh

, Clark

, et al. 2017. GenBank. Nucleic Acids Res. 47, 94–99.

Stothard

2000. The sequence manipulation suite: JavaScript programs for analyzing and formatting protein and DNA sequences. Biotechniques, 28, 1102–1104.

Tagini

, and Greub

2017. Bacterial genome sequencing in clinical microbiology: A pathogen-oriented review. Eur. J. Clin. Microbiol. Infect. Dis. 36, 2007–2020.

10.

Tang

K.-W.

, and Larsson

2017. Tumour virology in the era of high-throughput genomics. Philos. Trans. R. Soc. B Biol. Sci. 372, 20160265.

11.

Trudel

M.V.

, Tanaka

K.H.

, Filion

, et al. 2013. Insertion sequence AS5 (IS-AS5) is involved in the genomic plasticity of Aeromonas salmonicida. Mob. Genet. Elements, 3, e25640.

12.

Zhang

2016. Overview of sequence data formats. Methods Mol. Biol. 1418, 3–17.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.02 MB

0.03 MB