Abstract
Abstract
Accurate splice-site prediction is essential to delineate gene structures from sequence data. Several computational techniques have been applied to create a system to predict canonical splice sites. For classification tasks, deep neural networks (DNNs) have achieved record-breaking results and often outperformed other supervised learning techniques. In this study, a new method of splice-site prediction using DNNs was proposed. The proposed system receives an input sequence data and returns an answer as to whether it is splice site. The length of input is 140 nucleotides, with the consensus sequence (i.e., “GT” and “AG” for the donor and acceptor sites, respectively) in the middle. Each input sequence model is applied to the pretrained DNN model that determines the probability that an input is a splice site. The model consists of convolutional layers and bidirectional long short-term memory network layers. The pretraining and validation were conducted using the data set tested in previously reported methods. The performance evaluation results showed that the proposed method can outperform the previous methods. In addition, the pattern learned by the DNNs was visualized as position frequency matrices (PFMs). Some of PFMs were very similar to the consensus sequence. The trained DNN model and the brief source code for the prediction system are uploaded. Further improvement will be achieved following the further development of DNNs.
1. Introduction
A
For computational prediction, it is necessary to know the pattern of the sequences of splice sites, which are regions where introns are excised from the pre-mRNA, while leaving the exons intact. In general, the exon/intron boundary is called the donor (5′) splice site, whereas the intron/exon boundary is called the acceptor (3′) splice site and is conserved with the dinucleotide AG, known as the canonical splice site. Although several eukaryotic organisms contain two kinds of spliceosomes, that is, the U2-type and U12-type, the vast majority of introns are the U2-type. Most of the donor sites in U2-type introns contain the GT dinucleotide at the intron boundary, whereas the GC dinucleotide is observed in <1% of cases (Sheth et al., 2006). This donor site is recognized by the U1 snRNA of the spliceosome through base-pairing with an ACUUACCU motif; nevertheless, the donor-site pattern is not uniform and tolerates several variations in the motif, except for the GT dinucleotide at the boundary. The acceptor site almost always contains the AG dinucleotide at the intron boundary, with an even less explicit pattern around the AG dinucleotide. As such, all GT (AG) dinucleotides within the DNA molecule are defined as candidate donor (acceptor) sites and are judged as either a true or a false site.
To accurately predict splice sites, different computational algorithms have been developed, including Bayesian networks (Chen et al., 2005), support vector machine (SVM) (Degroeve et al., 2005; Baten et al., 2006; Sonnenburg et al., 2007; Wei et al., 2013; Meher et al., 2016), hidden Markov model (Pertea et al., 2001; Stanke and Waack, 2003; Zhang et al., 2010; Pashaei et al., 2016), and random forests (Pashaei et al., 2016).
Deep neural networks (DNNs) have achieved record-breaking results, primarily due to the recent revival of deep convolutional neural networks (CNNs) for processing of images (Krizhevsky et al., 2012), video, speech, and audio on highly challenging data sets using purely supervised learning. There are several successful examples of the application of DNNs to detect genomic problems. DeepSEA is used to predict chromatin effects of sequence alterations and prioritize functional single-nucleotide polymorphisms by learning a regulatory sequence code from large-scale chromatin-profiling data (Zhou and Troyanskaya, 2015). Improved performance for this task was reported using DanQ (Quang and Xie, 2016), a hybrid DNN (Graves, 2005). DeepSplice is a novel tool for splice-site prediction based on deep CNNs (Zhang et al., 2016) that differs from other methods for splice-site prediction, because predictions are made by combining the information of the donor and acceptor sites instead of each individual site. However, no DNN system for individual splice-site prediction has yet been developed; therefore, it is unknown whether such a system can achieve remarkable performance. In this study, we presented a new system for individual splice-site prediction based on DNN. The generated data set was tested with some previous methods.
2. Methods
2.1. Data set
The Homo Sapiens Splice-Site Data set HS3D, downloaded from www.sci.unisannio.it/docenti/rampone, was used to create a prediction model and test the performance of the proposed system. This data set has been used in previous methods (Zhang et al., 2010; Wei et al., 2013; Meher et al., 2016; Pashaei et al., 2016); therefore, comparisons with previous studies are relatively simple. The data set contains 2796 true donor sites, 2880 true acceptor sites, 271,937 false donor sites, and 329,374 false acceptor sites. The length of both the true and false splice-site sequences is 140 nucleotides, with the consensus nucleotides GT at positions 71 and 72 for donor sites and the consensus nucleotides AC at positions 69 and 70 for acceptor splice sites.
First, all true splice sites were selected and we randomly selected the same number of false sites. In this case, the ratio between the number of true and false splice sites was 1:1. Second, a 1:10 data set was constructed, which contained all true splice sites, and we randomly selected false splice sites with numbers 10 times greater than that for true splice sites.
2.2. DNN architecture
Each input layer consisted of sequence data encoded as one hot vector, where each position consists of a five-element vector with one nucleotide bit set to one. Two convolutional layers follow the input layer, with max-pooling layers. A hybrid model combining CNN and bidirectional long short-term memory network (BLSTM) was shown to outperform the CNN model in predicting the function of noncoding DNA (Quang and Xie, 2016); therefore, the proposed hybrid model was tested. In this study, the model without the BLSTM layer was referred to as the CNN model and that with the BLSTM layer as the hybrid model. The CNN model was compared with the hybrid model. The last two layers were fully connected and the binary output layer with SoftMax activation returned the probability of a splice site. Dropout (Srivastava et al., 2014) was used on the convolutional, BLSTM, and fully connected layers.
The Adam (Kingma and Ba, 2014) optimizer was used for training with categorical cross-entropy as a loss of function. The number of convolutional layers was set to two. Balancing between discovering more complex relationships was associated with a risk of overfitting (Karsoliya, 2012). Other hyperparameters such as the number of filters and the window sizes of the convolutional and BLSTM layers were determined based on a random search in consideration of the training and processing time (Bergstra and Bengio, 2012). The proposed model, as shown in Figure 1, was implemented using Keras (https://github.com/fchollet/keras), which is a highly modular DNN library written in Python.

A graphical illustration of the proposed model. An input sequence is encoded to one hot vector of a five-row bit matrix. Two convolutional layers follow with max-pooling layers. Only in the hybrid model, additional BLSTM layers follow. Dropout layers followed the convolutional and BLSTM layers during the training periods. After a fully connected layer, the probability of a splice site is determined. BLSTM, bidirectional long short-term memory network; LSTM, long short-term memory network.
2.3. Estimated parameters for performance evaluation
The performance evaluation of the proposed method was performed with a 10-fold cross-validation of the data set. Then, the average performance estimation was calculated. This process was repeated five times and the final average with the 95% confidence interval was reported. Estimated parameters include sensitivity (Sn), specificity (Sp), global accuracy (Q9) (Zhang and Zhang, 2002), and the Matthews correlation coefficient (Mcc), which were defined as
where
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.
In addition, the area under the receiver operating characteristic curve (ROC-AUC) was calculated for comparisons between models with and without a BLSTM layer.
2.4. Visualization of what the proposed model learned
The pattern learned by the proposed model was visualized as position frequency matrices (PFMs), which were generated using a similar approach described by DeepBind (Alipanahi et al., 2015). The convolutional filters of the hybrid model with the highest ROC-AUC learned from the 1:1 data set were used for visualization.
Let S = {s1, …, si, …, s140} be an input sequence and Yk, i be the output value of kth (1 ≤ k ≤ 32) convolutional filter for some position i (1 ≤ i ≤ 133). For filter k, we only consider sequence s if Yk,i > 0 for some position i. We find the position j = argmax i Yk,i, and extract subsequence sj–m+1… of length m, where m is the length of the motifs (i.e., the length of convolutional filters) in the proposed model. When
3. Results and Discussion
3.1. Comparison between the hybrid and convolutional neural network models
Figure 2 shows the comparison of ROC-AUCs between the CNN and hybrid models. At the donor site, ROC-AUC of the hybrid models was higher than that of the CNN models for 66% of units of the 1:1 data set and for 76% of units of the 1:10 data set. At the acceptor site, ROC-AUC of the hybrid models was higher than that of the CNN models for 66% of units of the 1:1 data set and 68% of units of the 1:10 data set. Table 1 shows the Sn, Sp, Q9, and Mcc values of the hybrid and CNN models. All estimated parameters of the hybrid model were higher than those of the CNN model for both the 1:1 and 1:10 data sets, although there were some differences in parameters between models. Therefore, the hybrid model, rather than the CNN model, was used in the following analyses. One possible explanation for these results is that additional BLSTM layers can improve the prediction ability. The current candidate sequences have consensus sequences in the middle, and it is assumed that the closer a base is to the middle, the more important it is for splice-site prediction. Therefore, additional BLSTM layers are helpful in that the forward LSTM can learn the successive sequence information from the 5′ end to the middle, whereas the backward LSTM can learn the information from the 3′ end to the middle.

Scatter plots comparing ROC-AUCs of the CNN and hybrid models. Results for the donor site
3.2. Comparison between the proposed model and previous methods
Table 2 shows the performance comparison of the proposed model with previous methods, such as a first-order Markov model with SVM (MM1-SVM), reduced MM1-SVM, SVM with a Bayes kernel, and MM1 with random forest (MM1-RF) (Pashaei et al., 2016). It is remarkable that the proposed model outperformed the previous methods in all performance metrics both for the donor and acceptor sites in the 1:1 data set. In the 1:10 dataset, the Sp values of DNN were obviously higher, whereas the Sn values of DNN were lower than those of other methods for both splice sites. The Mcc and Q9 values of the proposed model were higher than those of all previous methods for the donor site, whereas the Q9 value of the proposed model was lower than some of the previous methods for the acceptor site. Adjustment of the probability cutoff value of the splice site can modify the balance between Sp and Sn values. When setting the cutoff value at 0.3, the Q9 and Mcc values of the proposed model were 91.98 ± 0.2 and 82.05 ± 0.2, respectively, which were higher than those of all previous methods. The evaluation of the cutoff value adjustment is shown in Figure 3.

Adjustment of the cutoff value of probability that an input sequence is splice site. Results for the donor
DNN, deep neural network; MM1-SVM; first-order Markov model with support vector machine.
3.3. Visualization of what the proposed model learned
In PFMs generated from the first 32 convolutional filters, the PFM that is most similar to the PFM of the consensus motif of the canonical donor splice site (Burset et al., 2000) is shown in Figure 4. The PFM that is the most similar to the consensus motif of canonical acceptor site was determined in the same manner. The consensus motif and PFM generated from the model were very similar, thus visualization of the convolutional filters may have the potential to detect unknown motifs.

The consensus motifs and the motifs generated from the first convolutional layers. PFMs of the donor
4. Conclusions
In conclusion, a system to predict a splice site from the information of sequence data alone was developed using DNNs. The performance evaluation results showed that the proposed model outperformed previous methods using the same data set. In addition, combining CNNs and BLSTMs may improve prediction performance. There is an extremely active community researching deep learning; therefore, we believe that current and future insights will lead to greater performance for this task.
5. Availability
The source code and the trained models were uploaded as Deep Splice Site Prediction system (“DSSP”) to the GitHub repository (https://github.com/DSSP-github/DSSP) to facilitate the use of our classifiers and the development of similar tools. The sequence data used in training the model consisted of 140 bases; however, the system can be used for shorter sequences by complementing missing bases as “N.”
Footnotes
Author Disclosure Statement
No competing financial interests exist.
