Abstract
Biomarker plays an important role in early disease diagnosis including cancer. The World Health Organization defines a biomarker as any structure or process in the body that is measurable and affects the prognosis or outcome of the disease. Today, biomarkers can be identified using bioinformatics tools. The detection of biomarkers in the field of bioinformatics is considered more as a problem of feature selection. Many feature selection algorithms have been used for biomarker discovery however these algorithms do not have enough accuracy or have computational complexity. For this reason, the researchers discard the high accuracy algorithms because they are time consuming. We redesigned an efficient algorithm based on parallel algorithms. We used the Cancer Genome Atlas (TCGA) including breast cancer patients. The proposed algorithm has the same accuracy and increases the speed of algorithm.
Introduction
Any matter, structure, or process that is measurable in the body and can affect the prediction or trend of the disease is known as a biomarker [1, 2]. Biomarkers can diagnose the disease before clinical symptoms appear, so it is vital to use them in early detection of diseases [3]. The discovery of cancer biomarkers results in the early detection of cancer, which will have a significant impact on the mortality rate of the disease. Biomarkers are obtained from the analysis of biomolecules such as DNA, RNA and proteins and can themselves be proteins, genes, hormones and enzymes [4].
The discovery of biomarkers is a matter of feature selection in bioinformatics, especially when the distinction between features is important [5]. Feature selection means finding a subset of attributes with the minimum possible size that contain the necessary information for the intended purpose. In the biomarker detection problem, we also encounter a large number of features and samples. The goal is to select a subset of the minimal features that are very close and efficient to the examples. The feature selection algorithms, in addition to returning a set of features as output, reduce the dimension and the redundancy of data and increase the accuracy [6]. With the help of these algorithms, it is possible to identify and validate biomarkers in several steps. First, the needed genomic and proteomic data is collected and organized by databases. In addition, unnecessary data is deleted. The second step involves feature selection and, in some cases, use classification methods. Appropriate tools should validate the candidate biomarkers at a later stage [7].
Feature selection algorithms
Feature selection algorithms can be grouped into three categories: filter, wrapper, and embedded. Filter approach work based on the inherent nature of the data. These types of algorithms are usually simple and do not have high computational complexity. However, these algorithms are not very accurate and usually not stable. It means, for each execution, it usually returns different attributes [8]. An example of these algorithms is the Relief algorithm that is unstable and has been used for the detection of biomarkers [9]. Wrapper algorithms use classification methods for ranking features. Wrapper algorithms have high computational complexity and are not sufficiently fast, but due to using classification, the relationships between features are considered and algorithms are highly accurate [8]. One of the most widely used algorithms for biomarker discovery is subset that was used as a wrapper algorithm [10]. Because this algorithm has constraints as a feature selection algorithm, a combination of this algorithm and a recursive feature elimination algorithm (SVM-RFE)1
SVM-RFE: Support Vector Machine - Recursive Feature Elimination.
Feature selection algorithms generally have different performances on different data types in terms of biology [8]. In addition, due to the growing amount of data in biology area, the choice of fast algorithms has priority. Only the filter algorithms with this data volume can return the output at an acceptable time. As stated, these algorithms do not have the required precision in detecting cancer biomarkers, while misinterpreting biomarkers can cause many problems [8]. The use of wrapper algorithms is now increasing rapidly due to the high accuracy but the computational complexity of these algorithms and their running time is significant [12]. One of the main solutions to reduce the computational complexity of algorithms is the parallelization method. Parallelization means increasing the number of processors and dividing the work or data into several sub-tasks so, the calculations are divided between the processors and the response time is reduced [13, 14].
The SVM-RFE algorithm is one of the best feature selection algorithms for biomarker detection and is considered by many researchers in this field. This algorithm is stable, meaning that the output of the algorithm is constant at each repeat. It also has good accuracy in detecting biomarkers, and it is very time-consuming just because of the use of a support vector machine algorithm [15, 16]. This algorithm performs very well on genomic data [12]. In this paper, first, the dataset is introduced then the parallelization method for this algorithm is described.
Data source
The Cancer Genome Atlas (TCGA) is a large dataset of genomic variation in more than 33 types of cancer, which is valuable for computing tools. This dataset contains the genomic changes, DNA sequences, and gene expression in a cell tumor relative to a healthy cell. This study covers gene expression for more than a thousand patients with breast cancer and includes the expression of genes, metabolism, and clinical data for 2012–2015 [15]. By integrating these collections, the number of samples is more than 1,800 patients and the number of features is more than 2000 features. This data consists of rows as samples (patients) and columns show features. The second column is the patient’s status, which consists of two modes: live or dead, that plays the role of the tag. In fact, we are looking for biomarkers that play the biggest role in survival or death.
mSVM-RFE
In 2005, a different implementation of the SVM-RFE algorithm was introduced which at each step instead of one support vector machine, it uses several support vector machines [16]. This algorithm was known as a tool to select the effective genes in cancer classification and it returned better features than the SVM-RFE algorithm. In addition, the quality of the classifier is better [16]. Due to using several classes at the classification stage, this algorithm was called the mSVM-RFE algorithm.
The algorithm receives the data from the user, then
And the ranking score for each feature such as
The proposed parallel algorithm.
In the next step, the feature with the smallest ranking is eliminated. The second and third steps are repeated until for all features, their ranking be calculated. In the end, five (or more) features with the highest ranking are returned as an output [16]. In this algorithm, as in SVM-RFE, in each iteration, we can eliminate several features instead of one by one.
This algorithm is more expensive than the SVM-RFE algorithm because it uses multiple SVMs instead of one machine. However, this cost will ultimately lead to a better feature selection and more accurate ranking. In addition, one of the ways that we can increase the stability of algorithms is to select the features on different subsets [16]. Both algorithms were performed on the breast, lung, colon, and blood dataset and the best features for breast cancer and blood cancer were obtained by mSVM-RFE. On colon cancer, the selected features by SVM-RFE were better [16].
We used R studio in this research implemented the mSVM-RFE algorithm in R [17]. There are several libraries for parallelism in R that the most important is the Rmpi library [18]. Suppose that
In the first step, the
In general, the time complexity of algorithms in the recursive part will be as follows:
Where
Suppose that we have
Finally, the algorithm estimates generalization error using a varying number of top features. In this paper, the error is calculated for four features. It means the error for each fold, estimated using the first feature then, using the first two features then, the first three features, and finally all four features. These four functions are executed in parallel on processors and each fold is assigned to one processor. Therefore, the computational complexity of this part of algorithm will be decreased. The parallel algorithm is described as Fig. 1.
Experimental results
We execute the proposed parallel algorithm on a system with Intel Core i3
The accuracy of the parallel algorithm
We executed the algorithm with data consisting of 500 samples and 600 features, and we considered the number of subsamples generated by the
The output of the sequential and parallel mode.
Parallel runtime relative to the sequential mode can eventually reduce by the number of processors, which is ideal because runtime depends on other things, such as data communication between processes. In all sequential implementations, we set the number of subsamples (
In Fig. 3 the results obtained by executing parallelized mSVM-RFE are compared with sequential methods. The vertical axis represents the number of samples and the horizontal axis shows execution time. It can be observed from Fig. 3 that the execution time of parallel algorithms has been a significant reduction and reports a 60 percent reduction. In this proposed parallel algorithm, some parts of the algorithm have not been parallelized. When the number of features is bigger than 100, the features will be cut in half each round. This part of the algorithm takes much time and executes sequentially. For this reason, we have not more reduction.
The execution time of parallel vs Sequential mSVM-RFE.
The execution time of parallel mSVM-RFE with different subsamples.
In the next experiment, the number of samples was identical. First, we tested for 4 subsamples and then 8 subsamples. In this comparison, the sample size is 500 and the number of features is 600. In addition, in this evaluation, we do not consider the execution time for the generalization error.
Figure 4 shows the execution of the algorithm in a situation where the number of subsamples is 4 and 8. As we can see, in the case of 4 processors, the best runtime is achieved because each processor has only one dedicated subsample and thus more parallelism. When each processor has more than one subsamples, each processor begins to reduce features down to less than 100, which is time-consuming. In addition, each processor must run more machines due to having more subsamples. By comparing the runtime in single-core with four-cores, we find that the speed of algorithm has more than doubled. Although with 4-cores, we expect the runtime to be nearly four times higher, when the number of subsamples exceeds the number of processors, we cannot assign subsamples to the processors at one stage. Therefore, if we have 8 subsamples and 2 processors, we need to assign subsamples to processors in 4 steps, which is time-consuming. Besides, when the processors finish, the value of the features must be collected from different processors and then integrated into the master processor.
As shown in Fig. 4, with the increase in the number of subsamples, the run-time is significantly increased. However, we should not forget that the reason for creating more subsamples is actually increasing the accuracy of the algorithm. As the graphs show, when the number of subsamples is equal to the number of processors, the best balance is achieved.
The classification error (Cross-Validation).
The speed up.
As it was stated before, the algorithm estimates classification error for each fold in the external 10-fold. Classification error can eventually indicate which error is less fault and more suitable for classification. In addition, by comparing the classification error in parallel mode with sequential mode, we can show how the accuracy of the algorithm has changed. This error has been evaluated for data with a sample size of 500 and 600 attributes. As seen in Fig. 5, in both mSVM-RFE and parallel mSVM-RFE the minimal classification error can be reached when we use two features. It is observed that the classification error in two sequential and parallel modes is much closed. It means that parallelism has not reduced the accuracy of the algorithm.
Figure 6 shows the rate of speeding up resulting from parallelism. This experiment is run on two computers running 2 and 4 simultaneous processing and in all cases, the number of subsamples is also 4. The speedup of 4 processes into one process is better than the two processes to one process. In the case of 4-cores, only one sample is assigned to each processor, but in 2-cores, each processor has two subsamples. Therefore, as the number of processors increases, we observe more speed. On a computer with 4 processors, the rate of acceleration is about two and a half. The best rate of acceleration occurred at 500 data size, which in quad-processor was about 2.8 times. In addition, in dual-processor it is about 1.90. It can be said that in this case, the best data distribution has been done, which has been proportional to the number of processors and sub-samples as well as the volume of sub-samples. We did not parallelized a small percentage of the algorithm (classification error), which is the reason for the difference between the obtained rate and the ideal state.
In addition to the parallelized part of algorithm, the classification error can also be estimated in parallel. This error is calculated for each sub-sample for the final properties. Therefore, the error calculation for each sub-instance can be done in parallel so that each processor can calculate the error for each sub-instance.
In this implemented algorithm, the support vector machine has a linear kernel that does not support non-numeric data, therefore for a wider use of the algorithm, its kernel can be changed to support non-numeric data as well. In addition, we mentioned that the algorithm for each sub-sample starts to decrease the number of features from the beginning, that is, for example, if we have 8 sub-samples and the number of features is 500, the algorithm starts reducing the number of features 8 times from 500 to 100 or less because each reduction is based on a different sub-pattern. This process can be summed up as doing it once at the beginning of the algorithm, although it may affect the algorithm’s accuracy a bit.
