Prokaryotic Contig Annotation Pipeline Server: Web Application for a Prokaryotic Genome Annotation Pipeline Based on the Shiny App Package

Abstract

Genome annotation is a primary step in genomic research. To establish a lightweight and portable prokaryotic genome annotation pipeline for use in individual laboratories, we developed a Shiny app package designated “P-CAPS” (Prokaryotic Contig Annotation Pipeline Server). The package is composed of R and Python scripts that integrate publicly available annotation programs into a server application. P-CAPS is both a browser-based interactive application and a distributable Shiny app package that can be installed on a personal computer. The final annotation is provided in various standard formats and is summarized in an R markdown document. The annotation can be visualized and examined with a public genome browser. A benchmark test showed that the annotation quality and completeness of P-CAPS were reliable and comparable to those of currently available public pipelines.

1. Introduction

As prokaryotic genome sequencing becomes cheaper and more commonplace, demand is growing for easy-to-use genome analysis tools for individual laboratories. Two types of prokaryotic genome annotation pipelines are generally available to users: (1) remote web server-based services and (2) locally installable stand-alone packages. Web server-based pipelines, represented by RAST (Rapid Annotations using Subsystems Technology) (Aziz et al., 2008), annotate the submitted sequences on a remote web server and notify users by e-mail. These pipelines run predetermined processes that are initiated by data submission, which can raise concerns about data privacy. In contrast, a locally installed stand-alone pipeline, represented by Prokka (Seemann, 2014), avoids issues of data transfer and security and can be adjusted by the user. However, its installation requires a suitable computing environment.

To address these issues and provide a more generally accessible genome annotation pipeline, we developed a web-based application (webApp) termed P-CAPS (Prokaryotic Contig Annotation Pipeline Server). The pipeline is managed under the Shiny R package, a webApp framework for R. P-CAPS can be easily deployed on a local Linux system equipped with various open-source bioinformatics tools for genome annotation. Users can also launch P-CAPS as their own webApp service on a Shiny server. The pipeline is maintained in R script, so maintaining it does not require detailed knowledge of HTML or JavaScript. P-CAPS lets users embed additional R scripts for their own analyses. Because the pipeline is handled directly in the web browser, users can plug external programs such as JBrowse (Skinner et al., 2009) into P-CAPS. A graphical overview written in R markdown is provided as a summary report.

2. Description

2.1. Input

The input file of the pipeline consists of nucleotide sequences (contigs) in FASTA format with the extension “.fna,” “.fasta,” or “.fa.”
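The input requirement above can be illustrated with a minimal Python sketch. It is not taken from the P-CAPS source code: the function names `has_allowed_extension` and `read_contigs` are hypothetical, and the case-insensitive extension match is an assumption.

```python
from pathlib import Path

# Extensions named in the text; matching them case-insensitively is an assumption.
ALLOWED_EXTENSIONS = {".fna", ".fasta", ".fa"}


def has_allowed_extension(filename):
    """Return True if the file name ends with one of the accepted FASTA extensions."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS


def read_contigs(lines):
    """Parse FASTA-formatted lines into a {contig_id: sequence} dictionary."""
    contigs = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without an identifier")
            current_id = fields[0]  # identifier is the first word of the header
            if current_id in contigs:
                raise ValueError(f"duplicate contig identifier: {current_id}")
            contigs[current_id] = []
        elif current_id is None:
            raise ValueError("sequence data found before the first FASTA header")
        else:
            contigs[current_id].append(line.upper())
    return {cid: "".join(parts) for cid, parts in contigs.items()}
```

Rejecting malformed input before any annotation tool is started keeps errors close to their cause, which matters in a browser-based workflow where the user only sees the final report.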

2.2. Annotation pipeline

P-CAPS is divided into two steps: (1) gene prediction and (2) functional annotation. In the gene prediction step, we implemented a consensus algorithm that uses two gene predictors, Prodigal (Hyatt et al., 2010) and GeneMarkS (Besemer et al., 2001), to improve the prediction accuracy. The outputs from the two programs are compared and merged into a refined set by the consensus algorithm. The structural RNA regions are predicted by RNAmmer (Lagesen et al., 2007) for ribosomal RNAs (rRNAs) and by tRNAscan-SE (Lowe and Eddy, 1997) for transfer RNAs (tRNAs). Predicted coding sequences (CDSs) are annotated with orthologous genes based on a BLAST (Boratyn et al., 2013) search against the UniRef90 protein database (The UniProt Consortium, 2010) or other protein databases. The KEGG ORTHOLOGY (KO) number is assigned according to the transitive relations in UniProtKB (The UniProt Consortium, 2010). For the analysis of predicted CDSs at the domain level, a protein domain annotation procedure is implemented in P-CAPS using InterProScan (Jones et al., 2014).
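The text does not spell out the rule used by the consensus algorithm, so the following Python sketch shows only one plausible way to merge two gene prediction sets. It assumes that two predictions describe the same gene when they share contig, strand, and 3' end coordinate, that the first predictor's coordinates are kept when both agree on the 3' end, and that genes found by only one predictor are retained. The `Gene` type and the function names are hypothetical and are not part of P-CAPS.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    contig: str
    start: int   # 1-based, inclusive, start <= end
    end: int
    strand: str  # "+" or "-"


def stop_key(gene):
    """Identify a gene by contig, strand, and the coordinate of its 3' end."""
    three_prime = gene.end if gene.strand == "+" else gene.start
    return (gene.contig, gene.strand, three_prime)


def merge_predictions(primary, secondary):
    """Merge two prediction sets under the assumed rule described above.

    Genes sharing a 3' end keep the coordinates from `primary`;
    genes predicted only in `secondary` are added unchanged.
    """
    merged = {stop_key(g): g for g in primary}
    for gene in secondary:
        merged.setdefault(stop_key(gene), gene)
    return sorted(merged.values(), key=lambda g: (g.contig, g.start, g.end))
```

Keying on the 3' end reflects the common observation that gene predictors tend to agree on stop codons more often than on start sites; a production consensus step would also need a policy for choosing between conflicting starts and for resolving overlapping genes.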

2.3. Output

P-CAPS provides the annotation result in various standard file formats such as FASTA, GFF3, and GBFF (GBK), along with a graphical summary report written in R markdown. The summary report presents basic statistics of genome annotation such as sequence length, GC ratio, and annotated feature counts. The annotation completeness by Benchmarking Universal Single-Copy Ortholog (BUSCO) (Simão et al., 2015) and the functional categorization by KO are included in the summary report (Fig. 1).
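The basic statistics listed above can be derived directly from the contigs and the GFF3 output. The sketch below is a hypothetical illustration rather than the P-CAPS report code: the function names are invented, the GC ratio is computed over all bases (including ambiguous ones such as N), and feature types are read from the third column of each GFF3 record.

```python
from collections import Counter


def summarize_contigs(contigs):
    """Return contig count, total length, and GC ratio for {contig_id: sequence}."""
    total_length = sum(len(seq) for seq in contigs.values())
    gc_count = sum(
        seq.upper().count("G") + seq.upper().count("C") for seq in contigs.values()
    )
    # Assumption: GC ratio is GC bases divided by all bases, ambiguous bases included.
    gc_ratio = gc_count / total_length if total_length else 0.0
    return {"contigs": len(contigs), "total_length": total_length, "gc_ratio": gc_ratio}


def count_features(gff3_lines):
    """Count annotated features by type (column 3 of tab-separated GFF3 records)."""
    counts = Counter()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 3:
            counts[columns[2]] += 1
    return counts
```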

FIG. 1.

Example of a graphical summary report from P-CAPS. The summary report is created with R markdown and is downloadable in DOCX, PDF, and HTML formats. P-CAPS, Prokaryotic Contig Annotation Pipeline Server; rRNA, ribosomal RNA; tRNA, transfer RNA.

2.4. Visualization

For manual inspection of the genome annotation, the user can use JBrowse (Skinner et al., 2009), which is integrated in P-CAPS. By default, the annotation results are automatically converted into the format required to launch JBrowse. Any genome browser can then be used to visually scan the annotation results.

2.5. Availability

P-CAPS is an open source application based on the Shiny R package. P-CAPS is demonstrated and serviced at http://panflam.korea.ac.kr/pcaps/mainpage, and a stand-alone version is downloadable from https://github.com/choilab/P-CAPS. The installation guide and manual for the stand-alone version are available from https://github.com/CompSynBioLab-KoreaUniv/P-CAPS/wiki.

3. Results

To evaluate the performance of P-CAPS, we benchmarked the package against publicly available annotation programs (RAST and Prokka) using NCBI (National Center for Biotechnology Information) reference genomes from various taxa: Escherichia coli str. K-12 substr. MG1655 (GCA_000005845.2), Lactobacillus plantarum WCFS1 (GCA_000203855.3), Bacillus subtilis subsp. subtilis str. 168 (GCA_000009045.1), Eubacterium limosum KIST612 (GCA_000152245.2), and Vibrio sp. EJY3 (GCA_000241385.1). P-CAPS performed comparably to the other programs in predicting major genomic features such as CDSs, tRNAs, and rRNAs (Table 1). We assessed the annotation quality in terms of the completeness of the genome annotation using BUSCO (Simão et al., 2015). The annotations from P-CAPS and Prokka were complete or nearly complete in the BUSCO assessment across the tested genomes (Table 2). RAST missed a few CDSs whose start sites were incongruent with those in RefSeq; these misses resulted from short and fragmented protein sequence predictions (Fig. 2). The functionality of P-CAPS as a web application on a Shiny server was tested on a server at Korea University (http://panflam.korea.ac.kr/pcaps/mainpage).

FIG. 2.

Example of the missed and fragmented BUSCOs annotated by RAST. The results were not in accordance with the annotation from P-CAPS and other pipelines. The RAST annotation showed (a) one missed BUSCO and (b) five fragmented BUSCOs on the Escherichia coli genome. BUSCOs, Benchmarking Universal Single-Copy Orthologs; CDS, coding sequence; NCBI, National Center for Biotechnology Information; RAST, Rapid Annotations using Subsystems Technology.

Table 1.

Numbers of Annotated Coding Sequences, Ribosomal RNAs, and Transfer RNAs in Genome Annotations from Prokaryotic Contig Annotation Pipeline Server, NCBI RefSeq, Prokka, and RAST

Annotation source	CDS	rRNA	tRNA
Escherichia coli str. K-12 substr. MG1655
P-CAPS	4366	22	88
NCBI RefSeq	4386	22	89
Prokka	4305	22	88
RAST	4437	22	86
Lactobacillus plantarum WCFS1
P-CAPS	3158	16	72
NCBI RefSeq	3063	15	70
Prokka	3123	16	71
RAST	3194	16	74
Bacillus subtilis subsp. subtilis str. 168
P-CAPS	4310	30	86
NCBI RefSeq	4178	30	86
Prokka	4214	30	87
RAST	4381	30	86
Eubacterium limosum KIST612
P-CAPS	4172	16	58
NCBI RefSeq	4013	16	59
Prokka	4083	16	59
RAST	4284	16	57
Vibrio sp. EJY3
P-CAPS	4884	28	121
NCBI RefSeq	4728	28	120
Prokka	4801	28	121
RAST	4871	28	120

Five complete genome sequences from NCBI GenBank were used for the benchmark test.

CDS, coding sequence; NCBI, National Center for Biotechnology Information; P-CAPS, Prokaryotic Contig Annotation Pipeline Server; RAST, Rapid Annotations using Subsystems Technology; rRNA, ribosomal RNA; tRNA, transfer RNA.

Table 2.

Summary of the Benchmarking Universal Single-Copy Orthologs Completeness Test of Five Benchmarked Genome Annotations

Annotation source	Complete	Fragmented	Missing	Completeness
Escherichia coli str. K-12 substr. MG1655
P-CAPS	781	0	0	100.0%
NCBI RefSeq	780	0	1	99.9%
Prokka	781	0	0	100.0%
RAST	775	5	1	99.2%
Lactobacillus plantarum WCFS1
P-CAPS	443	0	0	100.0%
NCBI RefSeq	443	0	0	100.0%
Prokka	443	0	0	100.0%
RAST	442	1	0	99.8%
Bacillus subtilis subsp. subtilis str. 168
P-CAPS	525	1	0	99.8%
NCBI RefSeq	525	0	1	99.8%
Prokka	525	1	0	99.8%
RAST	514	7	5	99.7%
Eubacterium limosum KIST612
P-CAPS	249	4	1	98.1%
NCBI RefSeq	246	4	4	96.9%
Prokka	249	4	1	98.1%
RAST	247	5	2	97.3%
Vibrio sp. EJY3
P-CAPS	451	1	0	99.7%
NCBI RefSeq	452	0	0	100.0%
Prokka	451	1	0	99.7%
RAST	447	5	0	98.9%

The following BUSCO databases were used according to the taxon of each genome: Enterobacteriales odb9 for E. coli str. K-12 substr. MG1655, Lactobacillales odb9 for L. plantarum WCFS1, Bacillales odb9 for B. subtilis subsp. subtilis str. 168, Clostridia odb9 for E. limosum KIST612, and Gammaproteobacteria odb9 for Vibrio sp. EJY3.

BUSCO, Benchmarking Universal Single-Copy Ortholog.
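As a worked illustration of the completeness column, the sketch below assumes that completeness is the percentage of complete BUSCOs among all BUSCOs searched (complete + fragmented + missing). This definition is an assumption, not a statement from the text; it reproduces the Escherichia coli rows of Table 2 after rounding to one decimal place. The function name is hypothetical.

```python
def busco_completeness(complete, fragmented, missing):
    """Percentage of complete BUSCOs among all BUSCOs searched (assumed definition)."""
    total = complete + fragmented + missing
    if total == 0:
        raise ValueError("at least one BUSCO must have been searched")
    return 100.0 * complete / total


# Escherichia coli rows of Table 2: (complete, fragmented, missing)
ecoli_rows = {
    "P-CAPS": (781, 0, 0),
    "NCBI RefSeq": (780, 0, 1),
    "RAST": (775, 5, 1),
}
ecoli_completeness = {
    source: round(busco_completeness(*counts), 1)
    for source, counts in ecoli_rows.items()
}
```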

Acknowledgments

This work was supported by the Institute of Life Science and Natural Resources and the BK21plus program at Korea University.

Author Disclosure Statement

No competing financial interests exist.

References

Aziz, R.K., Bartels, D., Best, A.A., et al. 2008. The RAST server: Rapid annotations using subsystems technology. BMC Genomics. 9, 75.

Besemer, J., Lomsadze, A., and Borodovsky, M. 2001. GeneMarkS: A self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 29, 2607–2618.

Boratyn, G.M., Camacho, C., Cooper, P.S., et al. 2013. BLAST: A more efficient report with usability improvements. Nucleic Acids Res. 41, W29–W33.

Hyatt, D., Chen, G.L., LoCascio, P.F., et al. 2010. Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 11, 119.

Jones, P., Binns, D., Chang, H.Y., et al. 2014. InterProScan 5: Genome-scale protein function classification. Bioinformatics. 30, 1236–1240.

Lagesen, K., Hallin, P., Rødland, E.A., et al. 2007. RNAmmer: Consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100–3108.

Lowe, T.M., and Eddy, S.R. 1997. tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964.

Seemann, T. 2014. Prokka: Rapid prokaryotic genome annotation. Bioinformatics. 30, 2068–2069.

Simão, F.A., Waterhouse, R.M., Ioannidis, P., et al. 2015. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 31, 9–10.

Skinner, M.E., Uzilov, A.V., Stein, L.D., et al. 2009. JBrowse: A next-generation genome browser. Genome Res. 19, 1630–1638.

The UniProt Consortium. 2010. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 38, D142–D148.