PROSES: A Web Server for Sequence-Based Protein Encoding

Abstract

Recently, the number of the amino acid sequences shared in online databases is growing rapidly in huge amounts. By using sequence-derived features, machine learning algorithms are successfully applied to prediction of protein functional classes, protein–protein interactions, subcellular location, and peptides of specific properties in many studies. Protein Sequence Encoding System (PROSES) is a web server designed as freely and easily accessible for all researchers who want to use computational methods on protein sequence data. That is, PROSES provides users to encode their protein sequences easily without writing any programming code.

1. Introduction

Proteins are large biological molecules that perform many vital functions in living organism such as catalyzing metabolic reactions, replicating DNA, responding to stimuli, and transporting molecules from one location to others (Zhang and Ke, 2014). There is a huge amount of protein sequence data stored in databases such as NCBI's Entrez Protein, GenPept, UniProt, and STRING, that is potentially important to understand these functions by using automatic and reliable computational approaches. Especially, by the extraordinarily fast growing rate of sequence information of proteins, attention has been paid to use structural and physicochemical features of sequences in computational methods. Structural and physicochemical features derived from protein sequence have been used in many studies to predict protein functional classes, protein–protein interactions, subcellular location, and peptides of specific properties (Li et al., 2006). Data encoding that aims to determine, quantify, and describe the residues in a protein sequence provides considerable improvement in the performance of machine learning methods from the point of learning time and predictive accuracy.

In this article, we introduce a web tool named Protein Sequence Encoding System (PROSES) to provide users residue encoding and sequence search. PROSES is an online web server designed as freely and easily accessible for all researchers who want to encode protein sequences as data preprocess for machine learning methods.

In the literature, PROFEAT (Li et al., 2006; Rao et al., 2011), iFrag (Garcia-Garcia et al., 2017), PseAAC (Shen and Chou, 2008), and offline Matlab toolbox (Zhang and Ke, 2014) are the prominent online and offline tools. However, the existing tools do not cover all of the encoding methods provided in the PROSES web server and some of these tools are not freely accessible. Moreover, with PROSES web server, we also facilitate the sequence searching, file conversion, and feature encoding for predicting protein–protein interactions. The web server is especially useful for those who are not familiar with programming languages. PROSES is available at (http://proses.yalova.edu.tr). The use of web server and detailed explanation of provided methods can be found at the web server's link (http://proses.yalova.edu.tr/help.html).

2. Proses's Modules

We design the PROSES web server with user-friendly and responsive interface. The web server consists of four main modules: protein encoding, protein sequence search, protein–protein interaction, and file converter (see Supplementary Figs. S1–S3 for flowcharts of modules). Each module is placed in the web site at different pages but the functions in each module are used in the other modules as well. For example, the functions of the search module are used in the protein–protein interaction page for generating the sequences of interacted proteins by looking at their UniProt ids. The diagram of PROSES's modules and relationship among them are shown in Figure 1.

FIG. 1.

Flowcharts of the PROSES's modules. PROSES, Protein Sequence Encoding System.

In protein encoding module, our web server provides to encode amino acid sequence with 12 different encoding methods: amino acid composition (Bhasin and Raghava, 2004), amino acid pair (Chen et al., 2007), conjoint triad (Shen et al., 2007), composition-transition-distribution (Dubchak et al., 1995), dipeptide composition (Bhasin and Raghava, 2004), composition moment vector (Ruan et al., 2005), orthonormal encoding (Maetschke et al., 2005), OETMAP (Gök and Özcerit, 2013), Taylor's venn diagram (Taylor, 1986), and residue-couple model (Guo et al., 2000).

In the protein encoding module, there are two options for inputting the sequence data: with a multiline textbox or a fasta file. After uploading the data, a table with its sequences and IDs appears. From this table, users can select the listed sequences one-by-one or all the sequences and then choose an encoding method. Finally, the encoded sequence data will appear in a popup window that provides the results in diverse file formats.

PROSES provide users to search a protein sequence by fetching the sequence from the UniProt web site (www.uniprot.org) through UniProt ids of proteins. The search module is used for querying protein sequence that can be used in encoding or protein–protein interaction module. The aim of the protein–protein interaction module is to generate a feature vector for each interacted protein pairs. Proteins in the interacted pair are encoded independently, and then the extracted feature vectors are concatenated as a final feature vector to predict the protein interactions. The users need to give interaction data in one of the txt, xls, xlsx, or csv file formats that contain uniprot ids of interacted proteins in two columns.

PROSES provides users to work with commonly used file formats. Users can keep the encoded data in txt, xls, xlsx, csv, or arff file formats and can also make conversion among these file formats.

Footnotes

Acknowledgments

We thank our undergraduate students Onur Duman and Furkan Zambak for their assistance in the building of PROSES web server.

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

Bhasin

, and Raghava

G.P.S.

(2004). Classification of nuclear receptors based on amino acid composition and dipeptide composition. J. Biol. Chem., 279, 23262–23266.

Chen

, Liu

, Yang

, et al. (2007). Prediction of linear B-cell epitopes using amino acid pair antigenicity scale. Amino Acids. 33, 423–428.

Dubchak

, Muchnik

, Holbrook

S.R.

, et al. (1995). Prediction of protein folding class using global description of amino acid sequence. Proc. Natl Acad. Sci. U. S. A. 92, 8700–8704.

Garcia-Garcia

, Valls-Comamala

, Guney

, et al. (2017). iFraG: A protein—protein interface prediction server based on sequence fragments. J. Mol. Biol. 429, 382–389.

Gök

, and Özcerit

A.T.

(2013). A new feature encoding scheme for HIV-1 protease cleavage site prediction. Neural Comput. Appl. 22, 1757–1761.

Guo

, Lin

, Sun

, et al. (2000). Novel method for protein subcellular localization: combining residue-couple model and SVM. In Proceedings of Third Asia-Pacific Bioinformatics Conference, 17–21 January 2005, Singapore (pp. 117–129). Available at https://www.comp.nus.edu.sg/∼wongls/psZ/apbc2005/camera-ready/116.pdf (Accessed April 23, 2018).

Z.-R.

, Lin

H.H.

, Han

L.Y.

, et al. (2006). PROFEAT: A web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 34, W32—W37.

Maetschke

, Towsey

, and Boden

(2005). BLOMAP: An encoding of amino acids which improves signal peptide cleavage site prediction, pp. 141–150. In Proceedings of the 3rd Asia-Pacific bioinformatics conference.

Rao

H.B.

, Zhu

, Yang

G.B.

, et al. (2011). Update of PROFEAT: A web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 39, W385—W390.

10.

Ruan

, Wang

, Yang

, et al. (2005). Highly accurate and consistent method for prediction of helix and strand content from primary protein sequences. Artif. Intell. Med. 35, 19–35.

11.

Shen

H.-B.

, and Chou

K.-C.

(2008). PseAAC: A flexible web server for generating various kinds of protein pseudo amino acid composition. Anal. Biochem. 373, 386–388.

12.

Shen

, Zhang

, Luo

, et al. (2007). Predicting protein–protein interactions based only on sequences information. Proc. Natl Acad. Sci. U. S. A., 104, 4337–4341.

13.

Taylor

W.R.

(1986). The classification of amino acid conservation. J Theor Biol. 119, 205–218.

14.

Zhang

, and Ke

(2014). Protein encoding: A Matlab toolbox of representing or encoding protein sequences as numerical vectors for bioinformatics. J. Chem Pharm Res. 6, 2000–2007.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.27 MB