Abstract
Meiotic recombination has a crucial role in the biological process involving double-strand DNA breaks. Recombination hotspots are regions with a size varying from 1 to 2 kb, which are closely related to the double-strand breaks. With the increase in both sperm data and population data, it has been demonstrated that computational methods can help to identify recombination spots while saving time and cost compared with experimental verification approaches. To obtain better identification performance and to investigate the potential role of various DNA sequence-derived features in building computational models, we designed a computational model by extracting features including the position-specific trinucleotide propensity (PSTNP) information, the electron-ion interaction potential (EIIP) values, nucleotide composition (NC) and dinucleotide composition (DNC). Finally, the support vector machine (SVM) model was trained by using the 172-dimensional features selected by means of the
Introduction
Meiotic recombination [1] is a process initiated by double-strand DNA breaks during meiosis, a special kind of cell division. During the reductional division of eukaryotic cells, it drives the exchange of genetic information between non-sister chromatids of homologous chromosomes. In this process, some regions of the genome show higher rates of recombination than expected under neutrality and are named hotspots, while the others are named coldspots [2, 3]. Identifying hotspots and coldspots is one of the most important ways to understand the molecular mechanisms and the evolution of genomes.
Computational methods are superior to traditional experimental methods in saving time and reducing cost. In recent years, researchers in bioinformatics and computational biology have developed several computational models based on machine learning. Jiang et al. [4] built the first computational predictor, RF-DYMHC, to detect the hotspots and coldspots of yeast meiotic recombination from genome sequences; it employed a random forest (RF) as the classifier and gapped dinucleotide composition as the discriminating features. This predictor achieved an accuracy of 82.05% in 2-fold cross-validation [4]. The underlying data come from Gerton et al. [5], who estimated the relative recombination rate of Saccharomyces cerevisiae using DNA microarrays with single-gene resolution and obtained 6200 genes by repeating the experiments seven times. Jiang et al. [4] collected 5266 sequences from the dataset of Gerton et al. and, for every sequence, took the median to denote the relative recombination rate. Sequences with a relative hybridization ratio exceeding 1.5 were marked as hotspots, and those with a ratio lower than 0.82 were marked as coldspots. As a result, the benchmark dataset was made up of 490 hotspots and 591 coldspots. This benchmark dataset laid a solid foundation for the further design of computational schemes for identifying recombination spots.
Qiu et al. [6] created the predictor iRSpot-TNCPseAAC by combining pseudo amino acid components (PseAAC) and trinucleotide composition (TNC) features with the SVM classification algorithm. In the jackknife test, the sensitivity, specificity, and accuracy of iRSpot-TNCPseAAC were 87.14%, 79.59%, and 83.72%, respectively. Similarly, Al Maruf et al. [7] constructed iRSpot-SF, which builds its feature vector by fusing multiple feature extraction methods; 17 effective features were selected by recursive feature elimination, and an accuracy of 84.58% was obtained in ten-fold cross-validation. The authors of [8] recursively extracted features in combination with a linear-kernel SVM: the comprehensive feature vectors were ranked and the optimal features were retained, giving an overall accuracy of 84.09% in the jackknife test. Kabir and Hayat [9] established an ensemble predictor, iRSpot-GAEnsC, by embedding eight base classifiers whose outputs were integrated by a simple genetic algorithm and majority voting; its accuracy was 84.46%. Chen et al. [10] constructed a prediction model named iRSpot-PseDNC by extracting pseudo-dinucleotide composition (PseDNC) information and local structural characteristics of DNA as feature vectors. The accuracy of iRSpot-PseDNC in the jackknife test and in 5-fold cross-validation reached 82.01% and 85.19%, respectively.
By analyzing various kinds of information in biological sequences, we developed a new prediction model to further improve the performance of computational models in identifying recombination spots. The feature extraction methods take full account of the sequence information; the physical properties of the nucleotides and the position information are potential information sources for identifying recombination spots. In this study, we design a predictor based on an SVM classifier to identify recombination spots by considering position-specific trinucleotide propensity (PSTNP) information, electron-ion interaction potential (EIIP) values, nucleotide composition (NC) and dinucleotide composition (DNC). The optimal feature subset was selected with the aid of
The research flowchart of this paper is provided in Fig. 1. The remainder of the paper is organized as follows: Section 2 describes the materials and methods, including the benchmark dataset, feature extraction, feature selection, and a basic introduction to SVM; Section 3 presents the results of the designed method and offers comparisons with existing recombination spot predictors; conclusions are given in Section 4.
The flowchart of the proposed model.
All the experiments were conducted by using MATLAB 2020b and LIBSVM package 3.22 (
Benchmark dataset
A high-quality training dataset is a prerequisite for developing computational models and comparing their performance [18]. The first dataset used to identify hotspots and coldspots was built in [9] and includes 490 recombination hotspots and 591 coldspots. Yang et al. [19] then made a compact version of this dataset by taking the sequence length distribution into account. Because studies have shown that hotspots are associated with regions of high GC content [5], they unified the sequence length to 131 bp: 131 bp is the length of the shortest sequence in their dataset, and for any longer sequence the 131-bp segment with the maximum GC content was retained. As shown in Table 1, Yang’s dataset therefore contains 490 hotspots and 591 coldspots, all with a sequence length of 131 bp. To make the comparison comprehensive and unbiased, this paper uses Yang’s dataset as the benchmark dataset; detailed information is provided in the supplementary materials. For convenience, the hotspot sequences are referred to below as positives and the coldspot sequences as negatives.
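The length-unification step can be illustrated with a short sliding-window sketch. This is only an illustration under stated assumptions, not the procedure of Yang et al. [19] as implemented: the function name `max_gc_window` is hypothetical, and ties between windows with equal GC content are resolved in favour of the leftmost window, which the text does not specify.

```python
def max_gc_window(seq: str, width: int = 131) -> str:
    """Return the `width`-long segment of `seq` with the highest GC count.

    Assumption: when several windows tie, the leftmost one is kept.
    """
    if len(seq) < width:
        raise ValueError("sequence is shorter than the window width")
    # GC count of the first window.
    gc = sum(base in "GC" for base in seq[:width])
    best_gc, best_start = gc, 0
    # Slide the window one base at a time, updating the count incrementally.
    for start in range(1, len(seq) - width + 1):
        gc += (seq[start + width - 1] in "GC") - (seq[start - 1] in "GC")
        if gc > best_gc:
            best_gc, best_start = gc, start
    return seq[best_start:best_start + width]
```

A sequence that is already exactly 131 bp long is returned unchanged, which matches the role of the shortest sequence in fixing the common length.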
Benchmark datasets for different species
Feature vector extraction from biological sequences plays a pivotal role in the design of a computational model and is carried out before the classification algorithm is trained. Because the function of cells is influenced by multiple factors, it is necessary to extract sequence information from various aspects [20]. In this section, we transform each DNA sample into a vector with the aid of four feature extraction strategies.
Suppose the DNA sample
Position-specific trinucleotide propensity (PSTNP) [21, 22, 23] characterizes the position information of a sequence by accounting for its trinucleotide composition. For the sequence with length
where
where the symbol ‘
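A minimal sketch of a PSTNP encoding follows, assuming the commonly used formulation: for each position, the frequency of each trinucleotide in the positive training set minus its frequency in the negative training set, with a sample encoded by the value of the trinucleotide it actually carries at each position. The function names are hypothetical. For 131-bp sequences this yields 129 values, which is consistent with the 213-dimensional hybrid vector reported later (129 + 64 + 4 + 16).

```python
from collections import Counter

def position_frequencies(seqs):
    """Per-position trinucleotide frequencies for equal-length sequences."""
    length = len(seqs[0])
    tables = []
    for i in range(length - 2):
        counts = Counter(s[i:i + 3] for s in seqs)
        tables.append({tri: n / len(seqs) for tri, n in counts.items()})
    return tables

def pstnp_encode(seq, pos_tables, neg_tables):
    """Encode `seq` as (positive - negative) propensity at each position."""
    features = []
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        # Trinucleotides unseen at this position contribute frequency 0.
        features.append(pos_tables[i].get(tri, 0.0) - neg_tables[i].get(tri, 0.0))
    return features
```

Under a jackknife or cross-validation protocol, the two frequency tables would normally be built from the training sequences only, so that the held-out sample does not influence its own encoding.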
Nair and Sreenadhan [24] proposed the electron-ion interaction potential (EIIP) to express the distribution of electron-ion energy along a DNA sequence [25]. EIIP is an appealing feature encoding scheme with remarkable discriminative power, and it has also been used in enhancer and non-enhancer discrimination [21], DNA
where the subscripts denote different types of trinucleotides.
EIIP values of nucleotides
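The trinucleotide-level EIIP features can be sketched as follows. The sketch assumes the widely used nucleotide EIIP values (A 0.1260, C 0.1340, G 0.0806, T 0.1335) and the common construction in which the EIIP of a trinucleotide is the sum of its three nucleotide values, weighted by the normalized frequency of that trinucleotide in the sequence; the function name is hypothetical.

```python
from itertools import product

# Commonly cited nucleotide EIIP values (assumed here).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]

def eiip_features(seq):
    """Return a 64-dimensional vector: EIIP(xyz) * frequency(xyz)."""
    total = len(seq) - 2  # number of overlapping trinucleotides
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    for i in range(total):
        tri = seq[i:i + 3]
        if tri in counts:  # skip windows with ambiguous bases such as N
            counts[tri] += 1
    return [
        sum(EIIP[base] for base in tri) * counts[tri] / total
        for tri in TRINUCLEOTIDES
    ]
```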
The occurrence rate of each nucleotide of the recombination spots sample in Eq. (1) can be expressed in the following form [27]:
where
Besides the nucleotide composition information, the dinucleotide composition (DNC) has underlying discriminant information for identifying the recombination spots [28], which was employed to construct the features vectors
where
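The two composition features can be sketched in a few lines, assuming the usual definitions: NC is the relative frequency of each of the four nucleotides, and DNC is the relative frequency of each of the 16 dinucleotides counted with an overlapping sliding window. The function names are illustrative.

```python
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

def nucleotide_composition(seq):
    """4-dimensional NC vector in the order A, C, G, T."""
    return [seq.count(base) / len(seq) for base in "ACGT"]

def dinucleotide_composition(seq):
    """16-dimensional DNC vector over overlapping dinucleotides."""
    total = len(seq) - 1
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    # A sliding window is used because str.count would miss overlaps.
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [counts[pair] / total for pair in DINUCLEOTIDES]
```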
As a classical feature evaluation filtering method based on statistical measurement,
Support vector machine and model evaluation
Support vector machine (SVM) is an effective supervised learning method, which can be used to solve classification or regression problems, and has a critical application in bioinformatics [31, 32, 33, 34]. In this study, the SVM model with radial basis function (RBF) is established utilizing the LIBSVM package [35], in which the penalty parameter
Three methods are commonly used to evaluate a computational model: the jackknife test, the k-fold cross-validation test and the independent dataset test. The jackknife test is the most commonly used of them. In the jackknife test, each sample is in turn held out as the test set while the remaining samples form the training set, so every sample serves in both the training and the test role. Because this procedure yields relatively objective results, the jackknife test is usually used to evaluate the performance of a predictor. Four measurements, sensitivity (Sn), specificity (Sp), accuracy (Acc) and Matthews correlation coefficient (MCC) [18, 36, 37, 38], are employed in this part as below:
where
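The jackknife protocol and the four measurements can be sketched as below, assuming the standard confusion-matrix definitions of Sn, Sp, Acc and MCC. The function names and the `fit_predict` callback are hypothetical placeholders for any classifier, for example an SVM trained on the remaining samples.

```python
import math

def jackknife_predictions(samples, labels, fit_predict):
    """Leave-one-out: predict each sample from a model trained on the rest.

    `samples` and `labels` are lists; `fit_predict(train_x, train_y, x)`
    is a user-supplied (hypothetical) function returning a label for x.
    """
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(fit_predict(train_x, train_y, samples[i]))
    return predictions

def confusion_counts(labels, predictions):
    """Return (TP, FN, TN, FP) with 1 = hotspot and 0 = coldspot."""
    tp = sum(y == 1 and p == 1 for y, p in zip(labels, predictions))
    fn = sum(y == 1 and p == 0 for y, p in zip(labels, predictions))
    tn = sum(y == 0 and p == 0 for y, p in zip(labels, predictions))
    fp = sum(y == 0 and p == 1 for y, p in zip(labels, predictions))
    return tp, fn, tn, fp

def evaluate(tp, fn, tn, fp):
    """Return (Sn, Sp, Acc, MCC) from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc
```

As a consistency check, the counts TP = 482, FN = 8, TN = 580 and FP = 11 on the 490 positives and 591 negatives give the rates reported for the SVM in this paper (98.37%, 98.14%, 98.24% and 0.965). These counts are inferred from the reported rates; they are not stated in the text.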
Features optimization
Feature selection is an essential step in establishing a computational model, and it directly affects the performance of identifying recombination spots. Feature vectors can be evaluated by filter approaches, wrapper methods and embedded approaches [39, 40]. Extracting potential feature information from all aspects can raise the performance of the computational model, but at the cost of high-dimensional features and poor computing efficiency. Feature selection approaches are therefore usually used to find an optimal feature subset and improve identification efficiency [28, 41, 42, 43]. As mentioned in Section 2, four kinds of feature vectors were extracted from different perspectives. To get the best prediction model, we utilized the
Single feature subset optimization
Firstly, we ranked each feature type using
Optimal performances of each feature extraction method by using jackknife test
IFS curves for each type of feature.
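The incremental feature selection (IFS) procedure behind such curves can be sketched generically. The sketch assumes the features have already been ranked by some filter score and takes the model evaluation as a callback; `ranked` and `evaluate_subset` are hypothetical names, and in this study the callback would correspond to the jackknife accuracy of an SVM trained on the chosen columns.

```python
def incremental_feature_selection(ranked, evaluate_subset):
    """Grow the feature set one ranked feature at a time.

    ranked          -- feature indices, best-ranked first
    evaluate_subset -- callable mapping a list of indices to a score
    Returns (best_subset, curve), where curve is [(k, score), ...].
    """
    curve = []
    for k in range(1, len(ranked) + 1):
        curve.append((k, evaluate_subset(ranked[:k])))
    # max() keeps the first maximum, i.e. the smallest best subset on ties.
    best_k, _ = max(curve, key=lambda point: point[1])
    return ranked[:best_k], curve
```

Plotting the second element of each pair in `curve` against the first gives an IFS curve, and the peak marks the size of the optimal subset.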
Secondly, the optimized PSTNP subset was taken as the primary feature vector and combined with the three other optimal subsets, and each combination was evaluated with the jackknife test; the performances of the different feature combinations are shown in Table 4. We first combined the optimized PSTNP and the optimized EIIP features, which individually achieved the best and the second-best prediction performance. This combination improved on either single feature type, so NC was then added. However, the new combination reduced the prediction performance. The same happened when the optimized PSTNP, the optimized EIIP and DNC were combined. Therefore, the integration of the optimized PSTNP with 101 elements and the optimized EIIP with 40 elements was used as the final feature vector for comparison with the hybrid feature optimization.
Performances of the different optimized feature combination by using jackknife test
In this part, we first combined the features, and then employed feature selection methods to optimize the feature combination. The identification performances with different hybrid features are listed in Table 5. As shown in Table 5, the feature dimensions with the hybrid features can reach 213, with the Acc of 97.32%. Then the 213 elements were ranked by
Performances of the hybrid features by using jackknife test
IFS curve for hybrid features optimization.
By comparing the best results of Tables 4 and 5, we can find that the performance of the second combination mode is better than that of the first one, which may be due to information loss in the first method. Ranking the
Classifier selection and feature extraction are both critical to target identification, and considering the joint effect of the feature subset and the classifier is a common way to improve performance. To choose an ideal classifier for the same feature subset, the five classifiers most commonly used in bioinformatics tasks were selected as candidates: K-Nearest Neighbor (KNN), Random Forest, boosting ensembles, discriminant analysis and SVM. The performances of the five classifiers in the jackknife test are reported in Table 6. The Sn, Sp, Acc and MCC values of the support vector machine are the highest, at 98.37%, 98.14%, 98.24% and 0.965, respectively. In view of this, the SVM model was selected as the final classifier in the design of the model for identifying recombination spots.
Comparison of different predictors for identifying hotspots and coldspots
Comparison of existing predictors for identifying hotspots and coldspots.
Comparison between Liu’s method and our method on the same dataset of identifying hotspots and coldspots on 5-fold cross validation test.
In recent years, several predictors based on different feature extraction and classification methods [6, 7, 8, 9, 10, 19] have been proposed to identify recombination spots. To verify the performance of our predictor, we used the jackknife test to compare the different models on the same dataset. It can be seen from Fig. 4 that our method has the best performance: its Sn, Sp and Acc are 18.29, 10.07 and 13.78 percentage points higher, and its MCC is 0.275 higher, than those of iRSpot-GAEnsC [9]. It should be noted that iRSpot-GAEnsC has the best Acc among the previously known methods. In addition, Liu et al. [13] further processed the training dataset with the CD-HIT software to reduce redundancy. After removing the sequences with a similarity higher than 75%, 478 positive samples and 572 negative samples were obtained. The Sn, Sp, Acc and MCC obtained by this method are 75.29%, 88.81%, 82.65% and 0.651, respectively. The comparisons are shown in Fig. 5.
All the test results indicate that our method performs better than the existing methods. One of the underlying reasons is that an informative feature subset and an efficient classifier complement and reinforce each other, which enhances identification performance.
Conclusion
This paper mainly studied the role of feature extraction and classification algorithms in the prediction of recombination hotspots. We designed an improved model for predicting recombination hotspots by combining PSTNP, EIIP, NC, DNC features and an SVM classifier with RBF kernel. Some effective features were selected from the hybrid features by using
Acknowledgments
The authors acknowledge the General Research Project of Education Department of Liaoning Province (Grant: JL202014).
