Abstract
The aim of this study is to diagnose the stage of renal cell carcinoma and to predict the prognosis of breast cancer using RNA sequencing and microarray data, two representative types of gene expression data. To identify biomarkers for prediction, the top-N genes of each class (cancer or noncancer) are recommended by a collaborative filtering method based on three gene similarity coefficients. We then construct a machine learning classification model that uses the union of the recommended genes as the final feature set. The optimal genetic markers are the feature set with the highest classification performance in the model. In experiments, the proposed method performed better than a machine learning model trained on all gene features without feature selection. It also performed better than other studies based on existing correlation-based feature selection.
1. Introduction
Precision medicine takes into consideration the diversity of patients in the prevention and treatment of diseases (Jameson and Longo, 2015). As such, it relies on technology that can analyze a large number of genes precisely, quickly, and at low cost. Recently, next-generation sequencing (NGS) has replaced the existing Sanger sequencing method as the most commonly applied method to produce the required analysis at low cost. NGS is mainly used in clinical trials to diagnose cancer and inform treatment decisions, and it contributes significantly to the generation of big genomic data (Dong et al., 2015; Ahn, 2017). As the size and variety of data increase and as machine learning and deep learning technologies develop rapidly, there have been attempts to solve biomedical problems by applying these technologies. A biomarker is an indicator capable of measuring the medical status of a patient accurately and reproducibly. Many studies have been conducted to identify biomarkers for use in cancer diagnosis and prognosis prediction, providing information such as cancer stage, type, and relapse possibility. However, as this field is still new, there is insufficient biological data available for analysis and technology development.
Meanwhile, the recommender system has been at the forefront of the electronic commerce field since its initial development in the 1990s. Collaborative filtering is the basis for the recommender system and is typically carried out using one of two methods. The first method is user based: the similarity between two users is quantified from the items they have in common. For each user, the preferences of other users with a high similarity score are used to predict the preferences of the target user. The second method is item based: the similarity between items, instead of users, is quantified and used for recommendation (Lee, 2013). Typically, these methods are able to make high-quality recommendations using the existing accumulated data, regardless of the number of users or items.
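The user-based variant can be illustrated with a minimal sketch. The small user-item matrix and the choice of cosine similarity are assumptions made for illustration only; the cited work may quantify similarity differently.

```python
import numpy as np

# Hypothetical user-item matrix: rows are users, columns are items.
# 1 means the user purchased or liked the item, 0 means no interaction.
R = np.array([
    [1, 1, 0, 0],   # user 0 (target user)
    [1, 1, 1, 0],   # user 1
    [0, 0, 0, 1],   # user 2
], dtype=float)

def user_based_scores(R, target):
    """Score every item for `target` from the preferences of similar users."""
    norms = np.linalg.norm(R, axis=1)
    # Cosine similarity between the target user and every user (assumed measure).
    sims = (R @ R[target]) / (norms * norms[target])
    sims[target] = 0.0                # ignore self-similarity
    scores = sims @ R                 # similarity-weighted vote per item
    scores[R[target] > 0] = -np.inf   # do not re-recommend known items
    return scores

scores = user_based_scores(R, target=0)
top_item = int(np.argmax(scores))     # index of the recommended item
```

Here user 1 shares two items with the target user, so the item that only user 1 has interacted with receives the highest score. The item-based variant swaps the roles of rows and columns.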
Collaborative filtering was developed to overcome shortcomings in electronic commerce customer data, such as short purchasing histories or unpredictable preferences. This study therefore applies it to the biological domain, which suffers from insufficient data sets, to improve the identification of disease-related genes. In other words, on the assumption that using the recommended genomic features in combination with machine learning will help improve classification performance, this study introduces an efficient gene-based collaborative filtering method.
2. Related Research
2.1. Research on cancer diagnosis prediction
During the initial stages of cancer, symptoms may not be obvious, so it may be difficult for patients to identify or acknowledge the disease. To address this problem, research has been conducted to improve early cancer diagnosis prediction models and to discover essential biomarkers. Single nucleotide polymorphisms (SNPs) are a common type of genetic variation in which a single DNA base differs at a given site on the chromosome (Jagga and Gupta, 2015). One line of research uses machine learning to discover new SNP markers, in particular to predict the possibility of early onset of multiple myeloma, a deadly form of cancer. In addition to research on early cancer diagnosis, other studies try to diagnose cancer in its early, asymptomatic stages (Jagga and Gupta, 2014). Among the various types of cancer, clear cell renal cell carcinoma has a strong resistance to chemotherapy during treatment. That study established a model, based on gene expression profiles, that can classify the early stage (I and II) and the later stage (III and IV) of the cancer. Using fast correlation-based feature selection (FCBF) to identify biomarkers for classifying cancer progression, it identified 62 genes and achieved 76% classification performance with a random forest model (Zhu et al., 2007).
2.2. Research on cancer prognosis prediction
Studies on predicting and classifying cancer prognosis as “high-risk” and “low-risk” have been conducted (Chen et al., 2014; Xu et al., 2016). One such study sets up a model that assigns nonsmall-cell lung cancer patients to a “high-risk group” or a “low-risk group” by applying an artificial neural network for classification and prediction. This model records a classification performance of 83%. However, as only 440 patients and 12,000 microarray data are used, the study is limited by its small sample size.
In addition, studies have used machine learning to find new markers that surpass existing biomarkers. There are 70 genetic markers available for diagnosis in breast cancer. However, a study using support vector machine (SVM)-based recursive feature elimination improved the classification accuracy by 34% with 50 newly discovered gene features.
As already described, in the field of cancer diagnosis and prognosis prediction, there have been continuous efforts to improve prediction models and find new biomarkers using machine learning algorithms. However, these studies are confronted with several common problems. First, the data sample sizes are too small. A sufficient training data size is required to develop an effective classification model using machine learning. However, in reality, the number of patients and available data is too small compared with the actual vastness of gene expression. Second, genes can be expressed in endless ways, which results in the issue of dimensionality.
3. Gene-Based Collaborative Filtering Using The Recommender System
Separate from the cancer diagnosis and prognosis prediction methods described in Section 2, there have been studies attempting to apply the recommender system to this field (Xu et al., 2016). Recommender system technology has made significant advancements in various fields thus far but not in the biological domain. This study analyzes the applicability of the recommender system to examining complex scientific data.
In the recommender system, collaborative filtering (Hu et al., 2017) acts as the main algorithm. New items are recommended based on the similarity among items already evaluated by users. Top-N recommendation refers to the process through which the top-N items are recommended to users based on their predicted preferences. This study develops a gene-based collaborative filtering algorithm that determines the appropriate number of genes, N, to be considered in cancer diagnosis and identifies those genes by applying the top-N recommendation algorithm. The process is composed of two main parts. The first is to calculate the gene interest (Gi) of individual patients, and the second is to determine the number of genes needed for recommendation. Once both steps are completed, the top-N genes for cancer diagnosis are recommended.
3.1. Efficient gene-based collaborative filtering using similarity
In this section, the features of gene expression data used for cancer diagnosis and prognosis prediction are explained, and a method of achieving a more efficient Gi than the existing Gi mentioned in Section 2.2 using these features is suggested. In addition, a method of making use of Ochiai similarity (Ochiai, 1957) and Kulczynski-2 similarity (Peter and Robert, 1973) in the gene similarity matrix, one of the processes required to calculate Gi, is proposed. The process of designing a machine learning model using the genomic features recommended through the mentioned process and an evaluation of its performance is described.
3.2. Data features
In the experiment, RNA sequencing (RNA-seq) expression data and microarray data are used. RNA-seq was first commercialized in 2007, thanks to the rapid development of NGS. RNA-seq serves as an accurate and precise method for analyzing gene expression and is useful for the qualitative and quantitative measurement of gene expression (Jaccard, 1912). Microarray is a traditional gene expression measurement method that has been used since the 1990s. It measures a large number of gene expression levels in the target samples. Short oligonucleotides are first immobilized on the chip, and then RNA is extracted from the sample to be tested. The chip is then hybridized, and the degree of expression is measured from the previously labeled fluorescent dye.
There have been attempts to discover new biomarkers for various kinds of cancer using these expression analysis data, but due to the background noise that occurs during analysis and the high dimensionality of the data, it has been difficult to resolve the classification issue (Zhu et al., 2007). In this study, data produced by these two methods are used for gene expression analysis.
3.3. Efficient gene-based collaborative filtering
The efficient gene-based collaborative filtering process proposed in this study is shown in Figure 1. Gene expression data Ge is the matrix of gene expression levels by patient. The class label appropriate for each patient is assigned according to the cancer diagnosis and the objective of the expression data. First, the data are segmented by class so that genes can be recommended for each class. Then, (1) the inner product is calculated on the segmented data of each class to identify the relationships among the genes. (2) To measure the similarity among the genes, a gene-based similarity measure is applied; in this study, three similarity measures (Jaccard similarity, Ochiai similarity, and Kulczynski-2 similarity) are suggested. (3) The matrix generated by these calculations is called the gene similarity matrix. (4) After the gene similarity matrix is generated, Gi is obtained. Then, the corresponding gene names are listed in descending order of the calculated value.
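The four steps can be sketched for the data of a single class as follows. This is a minimal illustration, not the authors' implementation: the binarization threshold, the use of Jaccard similarity as the example coefficient, and the definition of Gi as the sum of a gene's similarities to the genes expressed in a patient are all assumptions, and the function name `gene_rank` is hypothetical.

```python
import numpy as np

def gene_rank(Ge, threshold=0.0):
    """Return gene interest (Gi) and gene rank (GR) for one class.

    Ge: patients x genes expression matrix of a single class.
    threshold: assumed cutoff above which a gene counts as expressed.
    """
    B = (Ge > threshold).astype(float)     # binarized expression (assumption)
    C = B.T @ B                            # (1) inner product: co-expression counts
    n = np.diag(C)                         # number of patients expressing each gene
    union = n[:, None] + n[None, :] - C
    # (2)-(3) Jaccard gene similarity matrix; 0 where neither gene is expressed.
    S = np.divide(C, union, out=np.zeros_like(C), where=union > 0)
    np.fill_diagonal(S, 0.0)               # a gene does not vote for itself
    Gi = B @ S                             # (4) gene interest per patient (assumed form)
    GR = np.argsort(-Gi, axis=1, kind="stable")  # gene indices, highest Gi first
    return Gi, GR
```

Calling `gene_rank` once per class yields one GR matrix per class. Each row of GR lists the gene indices of one patient in descending order of Gi.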

Figure 1. Procedure for gene-based collaborative filtering. The types of genes are represented graphically: the type of a gene is indicated by shape and the expression level by darkness. Gene expression data Ge is the matrix of the gene expression levels by patient. GR, gene rank.
In the generated gene rank (GR) matrix, the highly ranked genes are more likely to be meaningful as markers among the genes that are expressed in the patients and have similar patterns. From the arranged matrix GR, genes are recommended by class based on (1) the number of top-N genes to be examined and (2) the number of patients to be studied. In other words, the genes that are common to the top-N genes of a certain number of patients are investigated. Starting from the N at which the first common gene appears, N increases in intervals of 10 genes. The union of the recommended genes by class is organized as the final genomic feature set.
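The selection of common genes can be sketched in plain Python. The sketch assumes that a gene is "common" when it appears in the top-N ranks of every patient row passed in; the function names are hypothetical, and the defaults of 10 for the step and 40 for the number of candidate sets follow the values stated in the text.

```python
def common_top_n(GR, n):
    """Genes that appear in the top-n ranks of every patient row in GR."""
    tops = [set(row[:n]) for row in GR]
    return set.intersection(*tops) if tops else set()

def candidate_sets(GR, step=10, count=40):
    """Common-gene sets for up to `count` values of N, growing by `step`
    from the first N at which a common gene appears."""
    n_genes = len(GR[0])
    start = next((n for n in range(1, n_genes + 1) if common_top_n(GR, n)), None)
    if start is None:
        return []
    sizes = list(range(start, n_genes + 1, step))[:count]
    return [common_top_n(GR, n) for n in sizes]

def union_feature_set(set_a, set_b):
    """Final feature set: union of the genes recommended for the two classes."""
    return sorted(set_a | set_b)
```

For example, with two patients ranked `[3, 1, 2, 0]` and `[1, 3, 0, 2]`, no gene is shared at N = 1, and genes 1 and 3 become common at N = 2.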
To generate a gene similarity matrix as mentioned in Section 3.1, this study suggests three methods. The similarity among the genes is measured based on Jaccard, Ochiai, and Kulczynski-2 similarity coefficients. In addition to the Jaccard-based gene similarity method applied in related research, this study seeks to apply more diverse similarity coefficients. According to related research (Jagga and Gupta, 2015), when the row and column of the
Table 1. Similarity Coefficients and Gene Similarity Used
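For reference, the standard definitions of the three coefficients can be written in terms of co-occurrence counts. The notation is an assumption made here, not taken from the original table: c_{ij} denotes the number of patients in whom genes i and j are both expressed, so that c_{ii} is the number of patients expressing gene i.

```latex
\begin{align}
J(i,j)   &= \frac{c_{ij}}{c_{ii} + c_{jj} - c_{ij}} \\
O(i,j)   &= \frac{c_{ij}}{\sqrt{c_{ii}\, c_{jj}}} \\
K_2(i,j) &= \frac{1}{2}\left(\frac{c_{ij}}{c_{ii}} + \frac{c_{ij}}{c_{jj}}\right)
\end{align}
```

Here J is the Jaccard coefficient, O the Ochiai coefficient, and K_2 the Kulczynski-2 coefficient. All three take values between 0 and 1 and equal 1 when the two genes are expressed in exactly the same patients.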
3.4. Construction of a machine learning model
The procedure in Section 3.3 generates 40 gene sets per class. All combinations across the two classes, for a total of 40 × 40 = 1600 feature sets, are evaluated by training random forest (Breiman, 2001) and SVM (Cortes and Vapnik, 1995) machine learning models and measuring their performance (classification accuracy, recall, precision, F1-score, and area under the ROC curve [AUC]) on the independent test data, which make up 20% of the data. Among the results, the feature set with the highest prediction accuracy is selected as the optimum top-N.
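The search over feature sets might look like the following sketch. The text does not name a software library, so the use of scikit-learn is an assumption, as are the hyperparameters (50 trees, linear kernel), the random seed, and the function name `best_feature_set`. As in the text, the sketch scores each feature set on the held-out 20%.

```python
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def best_feature_set(X, y, sets_a, sets_b, seed=0):
    """Evaluate the union of every (class A set, class B set) pair and
    return (best accuracy, model name, sorted feature indices)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    best = (-1.0, None, None)
    for a, b in product(sets_a, sets_b):   # 40 x 40 = 1600 pairs in this study
        cols = sorted(set(a) | set(b))
        models = {
            "random_forest": RandomForestClassifier(
                n_estimators=50, random_state=seed),   # assumed settings
            "svm": SVC(kernel="linear"),               # assumed kernel
        }
        for name, model in models.items():
            model.fit(X_tr[:, cols], y_tr)
            acc = accuracy_score(y_te, model.predict(X_te[:, cols]))
            if acc > best[0]:
                best = (acc, name, cols)
    return best
```

`X` is expected to be a NumPy array of patients by genes and `y` the class labels. Recall, precision, F1-score, and AUC could be added alongside `accuracy_score` in the same loop.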
4. Experiment and Results
This study proposes an efficient gene-based collaborative filtering method and suggests a way to discover genomic markers with better classification performance by using not only Jaccard similarity but also Ochiai and Kulczynski-2 similarities when calculating the gene similarity matrix. To verify this, comparative experiments are conducted under diverse conditions. A total of 80% of the data are used for training and 20% for testing; 5% of the training data are used for validation.
For cancer diagnosis prediction, which is one objective of this study, gene expression data (Deng et al., 2017) on renal cell carcinoma are used. The data set consists of RNA-seq data used in research to develop a machine learning model capable of classifying the early stage (I and II) and the later stage (III and IV) of cancer. The total number of patients is 487, of whom 376 are used for training and 111 for testing. Each patient's disease stage was looked up in cBioPortal by patient ID, but due to inconsistency of information, instead of using existing training data, we used four patients' data. The complete data set is composed of 16,383 genomic features.
For cancer prognosis prediction, which is another objective of this study, microarray gene expression data are used. These data are used to draw up models and identify genomic markers capable of classifying “relapse” and “nonrelapse” in breast cancer patients. A total of 24,481 gene expression levels of 97 patients are observed. Among the patients, the data from 76 are for training and the data from 21 are for testing.
The results of the experiments using RNA-seq gene expression data for diagnosis under diverse experimental conditions are given in Tables 2 and 3. Depending on the type of similarity measurement basis used to calculate the gene similarity matrix, the use of the Jaccard similarity-based measurement method showed the highest recorded accuracy (shown in boldface), which was as high as 78.38% in the random forest model, and the genetic feature number was 171. The results of experiments using microarray data are given in Tables 4 and 5. When the SVM learning model based on Kulczynski-2 similarity was used, the genetic feature number was 128, and the highest accuracy (shown in boldface) of classification prediction was 85.71%. As shown in Table 6, our method outperformed other existing methods by 1.5% for RNA-seq and 5% for microarray gene expression data.
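As a quick arithmetic check, the two reported accuracies are consistent with the test set sizes given above. The counts of correctly classified patients (87 of 111 and 18 of 21) are inferred here from the percentages and are not stated in the text.

```python
from fractions import Fraction

def accuracy_pct(correct, total):
    """Accuracy as a percentage rounded to two decimals."""
    return round(float(Fraction(correct, total) * 100), 2)

# Inferred counts of correctly classified test patients (assumption).
rna_seq = accuracy_pct(87, 111)     # 111 renal cell carcinoma test patients
microarray = accuracy_pct(18, 21)   # 21 breast cancer test patients
```

With these counts, `rna_seq` evaluates to 78.38 and `microarray` to 85.71, matching the reported values. One more or one fewer correct patient in either test set would change the rounded percentage.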
Table 2. Performance Using Random Forest (RNA-Sequencing)
AUC, area under the ROC curve.
Table 3. Performance Using Support Vector Machine (RNA-Sequencing)
Table 4. Performance Using Random Forest (Microarray)
Table 5. Performance Using Support Vector Machine (Microarray)
Table 6. Comparison with Other Models
FCBF, fast correlation-based feature selection; MBEGA, Markov blanket-embedded genetic algorithm.
The experiments showed that training on the recommended genomic markers performed significantly better than training on all features without genomic marker recommendation.
As shown in Tables 2 and 3, when the RNA-seq gene expression data were used to diagnose the stage of renal cell carcinoma, the best classification performance was obtained by recommending genes with the Jaccard-based gene similarity matrix and classifying with the random forest model. In the experiments using microarray data to predict recurrence of breast cancer, shown in Tables 4 and 5, the best performance was obtained by recommending genes with Kulczynski-2 similarity and classifying with the SVM model.
5. Conclusion and Future Tasks
This study seeks to resolve issues in data overflow and data set size, both of which hinder analysis of biological data, by applying a recommender system fusion technology to the discovery of genomic markers for cancer diagnosis and prognosis prediction. First, to integrate information more effectively than in existing gene-based collaborative filtering methods, this article recommends genes by class for integration. Second, using the recommended genomic markers, during the performance evaluation, a classification and prediction model using machine learning was constructed. By applying this model to renal cell carcinoma stage diagnosis, classification performance was found to have improved compared with when existing correlation-based feature selection methods are used. Finally, when using the method proposed in this study, the number of gene signatures used for classification prediction was reduced from 16,383 to 171 in renal cell carcinoma stage prediction data. In data predicting breast cancer recurrence, 24,481 genes were reduced to 128 genes. The data used in this study are RNA-seq and microarray data. In the future, results can be made more precise by applying a wider range of data such as gene mutation data, clinical trial data, and epidemiologic data.
In this study, we used collaborative filtering based on three similarity coefficients to create a separate feature set of recommended genes for each coefficient. In addition to this method, using the overlap among the genes recommended by the three methods is expected to improve performance. The number of genetic markers found in this study exceeded 100. However, because the classification accuracy is higher than that obtained in previous studies using the same data, identifying the genes that were not found in previous studies is expected to yield an interesting analysis.
Acknowledgments
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (Grant No. 2018R1D1A1B07048790).
Author Disclosure Statement
The authors declare there are no competing financial interests.
