Abstract
Recently, a deep learning-based method for enhancing the Position-Specific Scoring Matrix (PSSM), Bagging Multiple Sequence Alignment (MSA) Learning (Guo et al., 2020), has been proposed, and its effectiveness has been demonstrated empirically. EPTool is an implementation of Bagging MSA Learning that provides a complete training and evaluation workflow for the PSSM-enhancing model. It can handle different input data sets and various computing algorithms to train the enhancing model, and ultimately improves PSSM quality for proteins with insufficient homologous sequences. In addition, EPTool includes several convenient applications, such as a PSSM feature calculator and PSSM feature visualization. In this article, we present EPTool and briefly introduce its functionalities and applications. Detailed instructions for access and use are also provided.
1. Introduction
Position-Specific Scoring Matrix (PSSM) features, which reflect per-residue evolutionary patterns in the sequence profile, are commonly used in protein structure property prediction. The quality of PSSM features is largely determined by the underlying multiple sequence alignments (MSAs). Building an MSA requires searching a large-scale sequence database, for example, UniRef (Suzek et al., 2007) or UniClust (Mirdita et al., 2017), with the query amino acid sequence. Several software tools, such as Jackhmmer (Wheeler and Eddy, 2013) and HHblits (Remmert et al., 2012), can search public data sets for homologs of the target protein and then use them to calculate the PSSM. EPTool is designed to enhance the PSSM quality of proteins with low-quality PSSM features. The enhanced PSSM features can then be used during the inference phase of protein secondary structure prediction.
The deep learning models implemented in EPTool extract local and long-distance information from the amino acid sequence. As a result, a low-quality PSSM is converted into a high-quality PSSM. The program uses Jackhmmer to extract the MSA from the UniRef50 data set, and this MSA is then fed into the program to generate high-quality PSSM features.
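The MSA extraction step can be illustrated with a minimal Python sketch that assembles a Jackhmmer command line. This is not the script shipped with EPTool: the function name, the file names, and the default iteration and thread counts are assumptions chosen for illustration. Only the standard HMMER options -N (number of iterations), -A (save the final alignment), and --cpu (worker threads) are used.

```python
import shlex


def build_jackhmmer_command(query_fasta, database_fasta, msa_out,
                            iterations=3, cpus=4):
    """Return the argument list for one Jackhmmer search.

    -N sets the number of search iterations, -A writes the final
    multiple alignment to a file, and --cpu sets the thread count.
    The defaults here are illustrative, not EPTool's settings.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return [
        "jackhmmer",
        "-N", str(iterations),
        "-A", msa_out,
        "--cpu", str(cpus),
        query_fasta,      # query sequence file comes first
        database_fasta,   # target sequence database comes last
    ]


# Hypothetical file names; replace them with real paths before running.
command = build_jackhmmer_command("query.fasta", "uniref50.fasta", "query.sto")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where HMMER and the sequence database are installed.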
EPTool provides several methods to compute the PSSM. Users can try different PSSM calculation methods to generate PSSM features and evaluate their performance on their own prediction networks, or combine them to form new features. The generated PSSM features can be applied not only to protein secondary structure prediction but also to other tasks, such as solvent accessibility and dihedral angle prediction.
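To make the idea of computing a PSSM from an MSA concrete, the following sketch shows one common textbook formulation: per-column residue counts with a pseudocount, converted to log-odds scores against a uniform background. It is not claimed to match any of the calculation methods implemented in EPTool; the function name, the pseudocount value, and the uniform background are assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_from_msa(msa, pseudocount=1.0):
    """Compute a simple log-odds PSSM from equal-length aligned sequences.

    Returns one row per alignment column and one score per amino acid,
    in the order given by AMINO_ACIDS. Gaps and non-standard characters
    are ignored when counting. A uniform background is assumed.
    """
    if not msa:
        raise ValueError("msa must contain at least one sequence")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = dict.fromkeys(AMINO_ACIDS, 0.0)
        for seq in msa:
            residue = seq[col].upper()
            if residue in counts:
                counts[residue] += 1.0
        # The pseudocount keeps unseen residues from producing log(0).
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = [
            math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        ]
        pssm.append(row)
    return pssm


# Toy alignment with three sequences and three columns.
toy_msa = ["ACD", "ACE", "A-D"]
toy_pssm = pssm_from_msa(toy_msa)
print(len(toy_pssm), len(toy_pssm[0]))
```

With few homologous sequences, the counts in each column are small and the pseudocount dominates the scores, which is one way to see why shallow alignments yield low-quality PSSMs.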
Users are free to obtain MSA data in their own way to train the unsupervised PSSM-enhancing models; extracting the MSA from the UniRef50 data set with Jackhmmer is only one option.
EPTool contains a variety of features for related development. (1) A script to run Jackhmmer (Wheeler and Eddy, 2013) for MSA extraction is included. (2) Three computation methods are provided to obtain PSSM features from an MSA. (3) A trained unsupervised PSSM-enhancing model is provided to generate high-quality PSSM features. (4) A PSSM-enhancing model that can be trained from scratch is also available, so users can train their own enhancing model. (5) The original PSSM can be compared with the enhanced PSSM by visualizing both as gray-scale images. Figure 1 shows an example generated by EPTool for the protein 6O4M.

Figure 1. Gray-scale images of the PSSMs.
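A gray-scale rendering of a PSSM can be sketched as follows. The min-max scaling to the 0 to 255 range and the plain-text PGM output are assumptions made for illustration; the mapping and image format used by EPTool may differ, and both function names are hypothetical.

```python
def pssm_to_grayscale(pssm):
    """Min-max scale PSSM scores to integer gray levels in 0..255.

    The lowest score maps to 0 (black) and the highest to 255 (white).
    A constant matrix maps to all zeros.
    """
    values = [score for row in pssm for score in row]
    if not values:
        raise ValueError("pssm must contain at least one score")
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        return [[0 for _ in row] for row in pssm]
    return [
        [round(255 * (score - lowest) / span) for score in row]
        for row in pssm
    ]


def to_pgm(pixels):
    """Serialize gray levels as a plain-text PGM (P2) image string."""
    height = len(pixels)
    width = len(pixels[0])
    lines = ["P2", f"{width} {height}", "255"]
    lines += [" ".join(str(level) for level in row) for row in pixels]
    return "\n".join(lines) + "\n"


# Toy 2 x 3 score matrix; a real PSSM has 20 columns per residue.
toy_scores = [[-1.0, 0.0, 4.0], [4.0, -1.0, 0.0]]
image_text = to_pgm(pssm_to_grayscale(toy_scores))
print(image_text)
```

Rendering the original and enhanced PSSMs with the same scaling and placing the two images side by side allows a direct visual comparison of the kind described above.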
EPTool is designed for maximum flexibility; the enhanced PSSM can be used directly in the inference phase of a prediction task without retraining the prediction network. The PSSM calculation module makes it easy to modify the PSSM computing method, so users can implement their own PSSM calculation algorithm in the EPTool source code. Hence, EPTool can serve as a useful basis for further development.
EPTool contributes to the community in three ways. First, EPTool can be used as a tool for generating PSSM features from an MSA, which can be used in various protein structure property prediction tasks. Second, EPTool can generate enhanced PSSM features for proteins with low-quality PSSMs and thereby help improve the performance of protein secondary structure prediction. The choice of prediction network is left to the user. Third, the open-source PSSM-enhancing network can be embedded into the framework of any prediction task, and we encourage members of the machine learning and bioinformatics communities to participate in the development of EPTool.
2. Method
The associated online material, which includes a detailed tutorial for using EPTool, is available at https://github.com/xiaozhi0689/EPTool. Functionality descriptions, specific instructions, and examples are provided for the following tasks: (1) downloading the relevant software and required data sets; (2) downloading and running EPTool, handling programs and menus, and exporting the original PSSM features; (3) following the tutorial to handle the models and generate the enhanced PSSM features; (4) comparing and visualizing the gray-scale images of the original and enhanced PSSMs; and (5) using other data sets to train the PSSM-enhancing model from scratch. For details of the model implementation, see Bagging MSA (Guo et al., 2020).
Footnotes
Author Disclosure Statement
The authors declare they have no competing financial interests.
Funding Information
This study was partially supported by US National Science Foundation IIS-1718853, the CAREER grant IIS-1553687, and Cancer Prevention and Research Institute of Texas (CPRIT) award (RP190107).
