New Machine Learning Applications to Accelerate Personalized Medicine in Breast Cancer: Rise of the Support Vector Machines

Abstract

Artificial intelligence, machine learning, health care robots, and algorithms for clinical decision-making are currently being sought after in diverse fields of clinical medicine and bioengineering. The field of personalized medicine stands to benefit from new technologies that harness omics big data, in particular to individualize and accelerate cancer diagnostics and therapeutics. In this overarching context, breast cancer is one of the most common malignancies worldwide, with multiple underlying molecular etiologies and subtypes that display diverse clinical outcomes. Disease stratification for breast cancer is, therefore, vital to its effective and individualized clinical care. The support vector machine (SVM) is a rising machine learning approach that offers robust classification of high-dimensional big data, with a decision boundary defined by a small number of data points (support vectors), achieving differentiation of subgroups in a short amount of time. Considering the rapid timelines required for both diagnosis and treatment of most aggressive cancers, this new machine learning technique has important clinical and public applications and implications for high-throughput data analysis and contextualization. This expert review describes and examines, first, the SVM models employed to forecast breast cancer subtypes using diverse systems science data, including transcriptomics, epigenetics, proteomics, and radiomics, as well as biological pathway, clinical, pathological, and biochemical data. Then, we compare the performance of the present SVM and other diagnostic and therapeutic prediction models across the data types. We conclude by emphasizing that data integration is a critical bottleneck in systems science, cancer research and development, and health care innovation and that SVM and machine learning approaches offer new solutions and ways forward in biomedical, bioengineering, and clinical applications.

Introduction

Diagnosis and treatment of cancers remain among the most formidable challenges in health care innovation. A critical bottleneck in this context is data integration and sense-making. The recent rise of artificial intelligence, machine learning, and other high-throughput data analysis algorithms offers new prospects. In this context, the support vector machine (SVM) is a supervised machine learning algorithm for classifying data sets (Schölkopf et al., 2013). SVM has been reported to perform better than other traditional supervised classifiers such as the decision tree (Rokach and Maimon, 2005) and the supervised artificial neural network (Zhang, 2016). SVM classifies high-dimensional big data using a decision boundary defined by a small number of data points (support vectors), thus achieving differentiation of subgroups in a short amount of time.

Considering the rapid timelines required for both diagnosis and treatment of most aggressive cancers, this new machine learning technique has important clinical and public applications and implications for high-throughput data analysis and contextualization.

This expert review describes and examines, first, the SVM models employed to forecast breast cancer subtypes (BCSs) using diverse systems science data, including transcriptomics, epigenetics, proteomics, and radiomics, as well as biological pathway, clinical, pathological, and biochemical data. Then, we compare the performance of the present SVM and other diagnostic and therapeutic prediction models across the data types.

Support Vector Machines

Relevance to personalized medicine and cancer research

In the SVM algorithm, a decision boundary is drawn to separate the data points, with the different classes lying on its two sides. This boundary is called a hyperplane. During training, the position of the hyperplane is optimized to provide the largest distance margin between the two sides of the data. The data points closest to the boundary are called support vectors. They determine the position of the decision boundary.
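The geometry described above can be illustrated with a minimal sketch. It does not train an SVM; it assumes a hyperplane is already given and only measures the margin and picks out the closest points. The function names, the toy coordinates, and the hyperplane `w = (1, 1)`, `b = -2` are hypothetical choices made for illustration.

```python
import math

def signed_distance(x, w, b):
    """Signed Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin_and_support_vectors(points, labels, w, b, tol=1e-9):
    """Return the smallest class-signed distance and the points that attain it."""
    # label * distance is positive only when a point lies on its own class's side.
    distances = [y * signed_distance(x, w, b) for x, y in zip(points, labels)]
    margin = min(distances)
    support = [x for x, d in zip(points, distances) if abs(d - margin) <= tol]
    return margin, support

# Toy two-class data (hypothetical values) and a hand-picked hyperplane x1 + x2 - 2 = 0.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]
margin, support = margin_and_support_vectors(points, labels, w=(1.0, 1.0), b=-2.0)
print(margin, support)
```

In this toy case only one point from each class lies at the minimum distance, so moving any of the other points slightly would leave the boundary unchanged.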

This technique is important for its application in classification and in the selection of informative features, such as genes, proteins, and pathways. Yet, the classification of big data is still computationally expensive. Thus, in various applications, a hybrid methodology is used in which a data filtration component precedes the SVM. In addition, SVM copes well with high dimensionality (many parameters) in the applied data sets. Therefore, groups having small sample sets and a high number of dimensions can be properly classified (Gao et al., 2017).
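A hybrid of this kind might be sketched as follows, under stated assumptions. The filter here is a simple standardized mean-difference score, used as a stand-in for the information gain criterion of Gao et al. (2017), and the classifier is a linear SVM fitted by plain subgradient descent on the regularized hinge loss. All function names, hyperparameters (`lam`, `lr`, `epochs`), and data values are hypothetical.

```python
import math

def mean_difference_scores(X, y):
    """Score each feature by |mean(+1) - mean(-1)| divided by the pooled std."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == -1]
        mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
        var = (sum((v - mean_pos) ** 2 for v in pos)
               + sum((v - mean_neg) ** 2 for v in neg)) / len(X)
        scores.append(abs(mean_pos - mean_neg) / math.sqrt(var + 1e-12))
    return scores

def select_top_k(X, y, k):
    """Filtration step: indices of the k highest-scoring features, in column order."""
    scores = mean_difference_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on lam/2 * |w|^2 + mean hinge loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for row, label in zip(X, y):
            # Only samples inside the margin contribute to the hinge subgradient.
            if label * (sum(wj * xj for wj, xj in zip(w, row)) + b) < 1:
                for j, xj in enumerate(row):
                    grad_w[j] -= label * xj / n
                grad_b -= label / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    return [1 if sum(wj * xj for wj, xj in zip(w, row)) + b >= 0 else -1 for row in X]

# Six hypothetical samples with four features; columns 0 and 2 carry the class signal.
X = [[2.0, 0.5, 1.8, 0.1], [2.2, -0.3, 2.1, -0.2], [1.9, 0.1, 2.0, 0.3],
     [-2.0, 0.4, -1.9, 0.2], [-2.1, -0.2, -2.2, -0.1], [-1.8, 0.0, -2.0, 0.0]]
y = [1, 1, 1, -1, -1, -1]

keep = select_top_k(X, y, k=2)
X_reduced = [[row[j] for j in keep] for row in X]
w, b = train_linear_svm(X_reduced, y)
predicted = predict(X_reduced, w, b)
```

When such a filter is combined with cross-validation, the feature selection is normally repeated inside each training fold, so that the held-out samples do not influence which features are kept.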

Breast cancer is the most frequently diagnosed cancer and the leading cause of cancer death among women worldwide (Bray et al., 2018). Several clinical and pathological features such as tumor size and grade, lymphovascular invasion of tumor cells, and axillary lymph node (ALN) status are reported as important factors in the characterization of breast cancer, besides the presence of estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor-2 (HER2) in the tumor cells and the expression level of Ki-67 (Coates et al., 2015; Yamanouchi et al., 2019). Breast cancer is also known for its spatial and temporal heterogeneity, which causes problems during the dissection process (Agner et al., 2014; De Ronde et al., 2014; Hsu et al., 2012; Tyanova et al., 2016; Vidić et al., 2018; Yersal and Barutca, 2014).

According to the current classification system (Goldhirsch et al., 2013), BCSs are defined as luminal A (ER and/or PR positive, HER2 negative, low levels of the protein Ki-67), luminal B with HER2 negative (ER and/or PR positive, high levels of Ki-67), luminal B with HER2 positive, HER2 enriched (ER and PR negative), and basal like (triple-negative breast cancer, TNBC). It is important to identify the BCSs not only for evaluating the prognosis of the disease but also for providing adequate therapy, since the drug responses of the subtypes vary (Januškevičienė and Petrikaitė, 2019) (Fig. 1).

FIG. 1. A conceptual description of disease stratification using support vector machine applications.
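The receptor-based subtype definitions listed above can be written as a small decision rule. The sketch below follows the simplified definitions given in this review, not the full consensus criteria. It assumes that Ki-67 has already been dichotomized into low and high (the text gives no cutoff) and that luminal B with HER2 positive means ER and/or PR positive with HER2 positive regardless of Ki-67. The function name and label strings are hypothetical.

```python
def assign_subtype(er_positive, pr_positive, her2_positive, ki67_high):
    """Map receptor status to a subtype label using the simplified definitions in the text."""
    hormone_positive = er_positive or pr_positive
    if hormone_positive:
        if her2_positive:
            # Assumption: Ki-67 is not used once HER2 is positive.
            return "luminal B (HER2 positive)"
        return "luminal B (HER2 negative)" if ki67_high else "luminal A"
    # ER and PR both negative from here on.
    return "HER2 enriched" if her2_positive else "basal-like (triple negative)"

# Hypothetical patient profiles: (ER, PR, HER2, Ki-67 high).
profiles = [
    (True, True, False, False),
    (True, False, False, True),
    (False, True, True, False),
    (False, False, True, True),
    (False, False, False, True),
]
subtypes = [assign_subtype(*profile) for profile in profiles]
```

Labels produced by such a rule are what a supervised model such as SVM would be trained to predict from omics, imaging, or clinical features.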

In this expert review, we present a brief overview of BCS classification through SVM models, discuss the applicability of various data types in these models, and compare the accuracy of the present prediction models. We also emphasize the applicability of SVM in disease stratification using breast cancer as a case study. It should be noted that the predictive performance of the models is expected to improve considerably once the lack of data integration strategies is addressed.

Omics-Based Prediction Models

Several SVM models have been developed using omics data sets, which comprise high-throughput data from different biological levels (Table 1). Provided that the data are of adequate quality, these models reach sufficient prediction accuracy.

Table 1.

Selected Studies with Significant Relevance to Stratification of Breast Cancer Subtypes Using Support Vector Machines

Approach	Technology/data	SVM type	Reference
Transcriptome	RNA-seq	v-SVM	Sokolov et al. (2016)
Transcriptome	RNA-seq	SVDD	Sokolov et al. (2016)
Transcriptome	miRNA	SVM with 5-fold CV	Hsu et al. (2012)
Transcriptome	Microarray	SVM	De Ronde et al. (2014)
Methylome	DNA methylation	SVM with 5-fold CV	Flanagan et al. (2010)
Proteome	SILAC	SVM with one-vs-rest	Tyanova et al. (2016)
Proteome	2D-DIGE	SVM with LOOCV	Waldemarson et al. (2016)
Pathway	Pathway enrichment analysis	SVM	Wu et al. (2017)
Pathway	Pathway enrichment analysis	SVM with 20-fold CV	Graudenzi et al. (2017)
Radiome	MRI	SVM with LOOCV	Sutton et al. (2016)
Radiome	Ultrasound	SVM with 3-fold CV	Guo et al. (2018)
Radiome	DCE-MRI	SVM	Agner et al. (2014)
Radiome	DWI	SVM	Vidić et al. (2018)
Clinical–pathological	Patient clinical data and tumor characteristics	SVM	Wu et al. (2014)
Biochemical	Raman spectra of lipids, nucleic acids, and proteins	SVM	Becker-Putsche et al. (2013)

2D-DIGE, two-dimensional, differential in-gel electrophoresis; CV, cross-validation; DCE-MRI, dynamic contrast-enhanced magnetic resonance imaging; DWI, diffusion-weighted imaging; LOOCV, leave-one-out cross-validation; miRNA, microRNA; MRI, magnetic resonance imaging; RNA-seq, RNA sequencing; SILAC, stable isotope labeling with amino acids in cell culture; SVDD, support vector data description; SVM, support vector machine; v-SVM, variant of support vector machine.

Expression levels of coding and noncoding RNAs (ncRNAs) differ across physiological conditions and specific stages of development and are therefore frequently used to construct predictive models based on intercondition expression differences of messenger RNAs (mRNAs) and ncRNAs. mRNA and microRNA (miRNA) transcriptome data sets produced using different technologies, such as RNA sequencing (RNA-seq) or microarray, are among the omics data sets most frequently employed with machine learning techniques, including SVM, for the differentiation of BCSs (De Ronde et al., 2014; Hsu et al., 2012; Lan et al., 2018; Sokolov et al., 2016). Typically, SVM methods are applied to two or more classes. In contrast, a one-class SVM is trained on samples of a single class, without explicit negative examples, and this has been reported not to affect the accuracy of the model (Sokolov et al., 2016).

Various machine learning methods employing transcriptome data sets are used in the stratification of BCSs. However, the accuracy of these methods varies with the data set and the sampling used. For instance, using RNA-seq data and comparing performance by the area under the curve (AUC) score, Sokolov et al. (2016) reported that a variant of SVM (v-SVM) and logistic regression (LREG) (Dayton, 1992) outperformed support vector data description (SVDD) (Tax and Duin, 2004).

In another study, a similar performance comparison was carried out using microarray data instead of RNA-seq to identify subtype-specific chemotherapy response predictors. The optimal predictors for different subtypes were identified through nearest mean (Dabney and Storey, 2007), naive Bayes (Rish, 2001), and 3-nearest neighbor (Akkus and Güvenir, 1996) along with LREG, whereas the SVM prediction model performed poorly (De Ronde et al., 2014). Furthermore, when RNA-seq-based miRNA expression data were used to distinguish BCSs (Hsu et al., 2012), the SVM with fivefold cross-validation showed higher performance in terms of the AUC metric than methods involving the Fisher score (Gu et al., 2012) and Hellinger distance (Lan et al., 2018; Wu and Karunamuni, 2014). High accuracy of SVM was also shown in the identification of potential miRNA biomarkers using subtype-specific microarray data.
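Since several of the comparisons above are expressed as AUC, a minimal sketch of how this metric can be computed may be useful. It uses the pairwise interpretation of the AUC: the fraction of positive-negative sample pairs in which the positive sample receives the higher score, with ties counted as one half. The function name and the example scores are hypothetical and are not taken from any of the cited studies.

```python
def auc_score(labels, scores):
    """AUC as the fraction of correctly ordered positive-negative pairs."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for three positive and three negative samples.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = auc_score(labels, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the metric is convenient for comparing classifiers that output scores on different scales.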

Besides transcriptome data, genome-wide DNA methylation profiling (methylome) data were also considered in familial breast cancer to identify distinct profiles defined by mutation status. Flanagan et al. (2010) analyzed transcriptome and methylation profiles to compare the mutation status (BRCA1, BRCA2, and BRCAx) and intrinsic BCSs. SVM classification with fivefold cross-validation using gene expression data yielded 100% accuracy in the prediction of intrinsic subtypes and 90% accuracy in the prediction of the BRCA1 mutation, whereas predictions for BRCA2 and BRCAx failed. With methylation profiles, the prediction of intrinsic BCSs failed, whereas the prediction of mutation status improved (Flanagan et al., 2010). Several other studies evaluating epigenetic data through machine learning prediction models showed high potential in some of the classification case studies (Alag, 2019; Lo Bosco et al., 2016; Robertson, 2005).
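Many of the models listed in Table 1 are evaluated by k-fold cross-validation, such as the fivefold scheme mentioned above. The sketch below shows how the sample indices can be partitioned into folds; leave-one-out cross-validation (LOOCV) is the special case in which the number of folds equals the number of samples. The function name is hypothetical, and the split is deliberately unshuffled and unstratified to keep the example short, whereas in practice samples are usually shuffled or stratified by class first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Fivefold cross-validation on ten hypothetical samples.
folds = list(k_fold_splits(10, 5))

# LOOCV on four hypothetical samples: every sample is held out exactly once.
loocv_test_sets = [test for _, test in k_fold_splits(4, 4)]
```

Each model is then trained on the training indices of a fold and scored on its test indices, and the reported accuracy or AUC is the average over folds.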

Considering that proteomics reflects cellular functions more closely than information from genomics, epigenetics, and transcriptomics, breast cancer subtyping was also studied at the proteomic level (Tyanova et al., 2016; Waldemarson et al., 2016). Quantifying proteins through stable isotope labeling with amino acids in cell culture (SILAC) and applying SVM with the one-vs-rest approach, Tyanova et al. (2016) achieved stratification of BCSs with high accuracy. In addition, Waldemarson et al. (2016) employed two-dimensional, differential in-gel electrophoresis and microarray technologies to obtain proteomic and transcriptomic data sets, respectively. They performed pairwise comparisons among subtypes and presented a clear distinction between basal-like and luminal A tumors using SVM with the leave-one-out cross-validation (LOOCV) approach.
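The one-vs-rest approach mentioned above turns a multiclass problem, such as several BCSs, into one binary problem per class. The sketch below shows only this decomposition. To stay short, the binary learner is a simple centroid-based linear scorer, which is a stand-in and not an SVM; in an actual application a binary SVM would be passed in its place. All names, labels, and coordinates are hypothetical.

```python
def fit_centroid_scorer(X, y):
    """Stand-in binary learner (NOT an SVM): linear score from the two class centroids."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == -1]
    dim = len(X[0])
    mu_pos = [sum(r[j] for r in pos) / len(pos) for j in range(dim)]
    mu_neg = [sum(r[j] for r in neg) / len(neg) for j in range(dim)]
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    # The boundary passes through the midpoint between the two centroids.
    b = -sum(wj * (p + q) / 2 for wj, p, q in zip(w, mu_pos, mu_neg))
    return lambda x: sum(wj * xj for wj, xj in zip(w, x)) + b

def fit_one_vs_rest(X, labels, fit_binary):
    """Train one binary scorer per class, treating all other classes as negatives."""
    scorers = {}
    for cls in sorted(set(labels)):
        y = [1 if label == cls else -1 for label in labels]
        scorers[cls] = fit_binary(X, y)
    return scorers

def predict_one_vs_rest(scorers, x):
    """Assign the class whose scorer returns the highest score."""
    return max(scorers, key=lambda cls: scorers[cls](x))

# Hypothetical two-feature samples for three subtypes.
X = [(5.0, 0.0), (4.0, 1.0), (-3.0, 4.0), (-2.0, 5.0), (-3.0, -4.0), (-2.0, -5.0)]
labels = ["luminal A", "luminal A", "basal-like", "basal-like",
          "HER2 enriched", "HER2 enriched"]

scorers = fit_one_vs_rest(X, labels, fit_centroid_scorer)
predictions = [predict_one_vs_rest(scorers, x)
               for x in [(4.5, 0.5), (-2.5, 4.0), (-2.0, -4.0)]]
```

Because each class gets its own scorer, the per-class scores can also be inspected separately, for example to see which subtypes are hardest to tell apart from the rest.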

The predictive performance of omics-based approaches could be improved once the quality of the data is enhanced.

Pathway-Based Prediction Models

The idea that harnessing the pathway information and relationships between the cellular molecules can help reduce or eliminate the noise in omics data has paved the way for pathway-based, higher performance prediction models.

The regulatory pathways such as JAK/STAT, inflammatory mediator regulation of TRP channels, and glutamatergic synapse as well as disease pathways, including basal cell carcinoma, non-small cell lung cancer, and amyotrophic lateral sclerosis, were reported as common pathway signatures of BCSs (Karagoz et al., 2015; Turanli et al., 2019). The SVM-based classification of BCSs using pathway-based biomarkers was accurate (>90%) in the correct prediction of luminal and basal-like patients, yet inaccurate in the prediction of other BCSs (Wu et al., 2017).

Similarly, metastatic behavior of breast tumors was also studied using pathway-based prediction modeling, where type 1 diabetes mellitus, cytokine–cytokine receptor interaction, and Hedgehog signaling were associated with the nonmetastasis group. An overall accuracy slightly lower than 90% was achieved using SVM with 20-fold cross-validation (Graudenzi et al., 2017). These studies indicate the possible contribution of pathway-based biomarkers as features in the diagnosis and prediction of breast cancer subtyping and staging.

Imaging-Based Prediction Models

In addition to omics-based and pathway-based models, imaging-based prediction models are also frequently used. Radiomics provides quantitative imaging features that describe tumor phenotypes and can be correlated with genetic information (Forghani et al., 2019). The acquired imaging data can be used to develop prediction models that characterize tumors in greater detail. In this sense, magnetic resonance imaging (MRI) and ultrasonography are widespread techniques for breast cancer diagnosis, and several studies have employed SVM to improve the interpretation of images and to test whether image characteristics can differentiate among BCSs.

MRI data were analyzed using SVM with LOOCV (Sutton et al., 2016). However, the prediction performance was questionable, and the accuracy was limited for each subtype. Similarly, analysis of high-throughput ultrasound features together with cancer biomarker information through SVM with threefold cross-validation resulted in limited accuracy in terms of AUC values (Guo et al., 2018). On the other hand, Agner et al. (2014) showed superior performance of computer-aided diagnosis (CAD) methods with dynamic contrast-enhanced MRI (DCE-MRI) in differentiating BCSs. Although DCE-MRI is reported to be a sensitive technique for detecting TNBC and screening BRCA mutation carriers, it was found to be problematic because the imaging profiles of triple-negative lesions and benign fibroadenomas are highly similar.

However, the employment of CAD methods has been proposed to increase diagnostic specificity of DCE-MRI since high accuracy (97%) could be achieved through SVM classification of these subgroups. In addition, integration of diffusion-derived parameters (mean, standard deviation, skewness and kurtosis of apparent diffusion coefficient, relative enhanced diffusivity, and intravoxel incoherent motion) with MRI boosted the performance of SVM significantly in the classification of benign and malignant breast tumors (Vidić et al., 2018).
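The histogram descriptors named above (mean, standard deviation, skewness, and kurtosis) can be computed from the voxel values of a region of interest as in the following sketch. It assumes population (biased) moment estimators and reports excess kurtosis; the cited study may have used different conventions. The function name and the example values are hypothetical.

```python
import math

def histogram_features(values):
    """Mean, standard deviation, skewness, and excess kurtosis of voxel values."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3, and 4 (population estimators).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness and kurtosis are undefined for constant values")
    std = math.sqrt(m2)
    return {
        "mean": mean,
        "std": std,
        "skewness": m3 / std ** 3,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis: 0 for a normal distribution
    }

# Hypothetical apparent diffusion coefficient values from one lesion (arbitrary units).
features = histogram_features([1.0, 2.0, 3.0, 4.0, 5.0])
```

A feature vector of this kind, one per lesion, is the sort of input on which an SVM classifier for benign versus malignant tumors could be trained.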

Prediction Models Utilizing Clinical, Pathological, and Biochemical Data

A rich source of information on tumors can be gathered through clinical, pathological, and biochemical analyses, which should be considered an indispensable part of cancer research, especially for supporting machine learning predictions in critical and uncertain situations (Shah et al., 2019; Visweswaran et al., 2010). In this sense, a study that could serve as a tutorial example was presented by Wu et al. (2014), who constructed an SVM model to classify BCSs as well as positive and negative ALN metastasis groups based on pathological information of the primary tumor and clinical features (such as age at diagnosis, tumor size, positive lymph nodes, histology grade, ER status, PR status, HER2 status, and the presence of lymphovascular invasion).

The SVM model correctly predicted ALN metastases in 75% of patients using pathological and clinical information. Subgroup analysis by BCS showed no difference in predictive ability, and this predictive performance was inferior, with only 60% accuracy. From the biochemical perspective, Becker-Putsche et al. (2013) constructed an SVM model using protein, lipid, and nucleic acid data acquired from Raman spectroscopy. The study was carried out at the single-cell level on six breast cancer cell lines representing different BCSs. The classification performance at the cell level was 97%.

Conclusions and Outlook

This expert review examined the employment of SVM prediction models in the classification of BCSs using diverse data types. The predictive performance of SVM methods involving radiomic data was among the highest in discriminating BCSs, although it varied between imaging modalities. Radiomics provides big data consisting of features that are usually easy to interpret and is practical given its noninvasive nature. Radiomics and applications of machine learning approaches are likely to receive greater research and clinical interest in the near future.

Where consistency and cost-effectiveness are the main considerations, omics- and pathway-based prediction models, which present acceptable accuracy, should be preferred. Freely available gene expression repositories, together with a focus on data-intensive computation, could prevent the repetition of the same or similar experimental studies. Although more applications are needed to improve the consistency of the data, information at the biochemical, clinical, and pathological levels could be a key feature for further classification studies.

Although these studies collectively show that biological data at various levels are indeed useful in SVM applications, models based on the integration of diverse data are expected to yield markedly more accurate results. Further studies in this direction are expected, and in particular, the development of new SVM models for integration of radiomic images with molecular omics data would be of great interest for health care innovation.

Disease stratification for improved cancer care and treatment is vital to realize the overarching aim of personalized medicine. SVM is a promising approach for development and application of the best medical practices and will be encountered more frequently in 2020 and the coming decade as the concept of precision/personalized medicine moves from theory to mainstream clinical practice.

We conclude by emphasizing that data integration is a critical bottleneck in systems science, cancer research and development, and health care innovation and that SVM and machine learning approaches offer new solutions and ways forward in biomedical, bioengineering, and clinical applications.

Footnotes

Author Disclosure Statement

The authors declare they have no conflicting financial interests.

Funding Information

No funding was received for this article.

References

Agner, Rosen, Englander, et al. (2014). Computerized image analysis for identifying triple-negative breast cancers and differentiating them from other molecular subtypes of breast cancer on dynamic contrast-enhanced MR images: A feasibility study. Radiology, 272, 91–99.

Akkus and Güvenir. (1996). K nearest neighbor classification on feature projections. Proc ICML, 1, 12–19.

Alag. (2019). Machine learning approach yields epigenetic biomarkers of food allergy: A novel 13-gene signature to diagnose clinical reactivity. PLoS One, 14, e0218253.

Becker-Putsche, Bocklitz, Clement, Rösch, and Popp. (2013). Toward improving fine needle aspiration cytology by applying Raman microspectroscopy. J Biomed Opt, 18, 047001.

Bray, Ferlay, Soerjomataram, Siegel, Torre, and Jemal. (2018). Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin, 68, 394–424.

Coates, Winer, Goldhirsch, et al. (2015). Tailoring therapies—Improving the management of early breast cancer: St. Gallen International Expert Consensus on the Primary Therapy of Early Breast Cancer 2015. Ann Oncol, 26, 1533–1546.

Dabney and Storey. (2007). Optimality driven nearest centroid classification from genomic data. PLoS One, 2, e1002.

Dayton. (1992). Logistic Regression Analysis. Stat. 474–574.

De Ronde, Bonder, Lips, Rodenhuis, and Wessels LFA. (2014). Breast cancer subtype specific classifiers of response to neoadjuvant chemotherapy do not outperform classifiers trained on all subtypes. PLoS One, 9, e88551.

Flanagan, Cocciardi, Waddell, et al. (2010). DNA methylome of familial breast cancer identifies distinct profiles defined by mutation status. Am J Hum Genet, 86, 420–433.

Forghani, Savadjiev, Chatterjee, Muthukrishnan, Reinhold, and Forghani. (2019). Radiomics and artificial intelligence for biomarker and prediction model development in oncology. Comput Struct Biotechnol J, 12, 995–1008.

Gao, Ye, Lu, and Huang. (2017). Hybrid method based on information gain and support vector machine for gene selection in cancer classification. Genomics Proteomics Bioinformatics, 15, 389–395.

Goldhirsch, Winer, Coates, et al. (2013). Personalizing the treatment of women with early breast cancer: Highlights of the St Gallen International Expert Consensus on the Primary Therapy of Early Breast Cancer. Ann Oncol, 24, 2206–2223.

Graudenzi, Cava, Bertoli, et al. (2017). Pathway-based classification of breast cancer subtypes. Front Biosci, 22, 1697–1712.

Gu, Li, and Han. (2012). Generalized Fisher score for feature selection. Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, 266–273. Barcelona, Spain.

Guo, Hu, Qiao, et al. (2018). Radiomics analysis on ultrasound for prediction of biologic behavior in breast invasive ductal carcinoma. Clin Breast Cancer, 18, e335–e344.

Hsu, Liu, Chang, and Chen. (2012). Cancer classification: Mutual information, target network and strategies of therapy. J Clin Bioinforma, 2, 16.

Januškevičienė and Petrikaitė. (2019). Heterogeneity of breast cancer: The importance of interaction between different tumor cell populations. Life Sci, 239, 117009.

Karagoz, Sinha, and Arga. (2015). Triple negative breast cancer: A multi-Omics Network Discovery Strategy for candidate targets and driving pathways. OMICS, 19, 115–130.

Lan, Peng, McGowan, Hutvagner, and Li. (2018). An isomiR expression panel based novel breast cancer classification approach using improved mutual information. BMC Med Genomics, 11, 118.

Lo Bosco G, Rizzo R, Fiannaca A, La Rosa M, and Urso A. (2016). A Deep Learning Model for Epigenomic Studies. In: 12th International Conference on Signal-Image Technology Internet-Based Systems (SITIS) 688–692. Naples, Italy.

Rish. (2001). An Empirical Study of the Naïve Bayes Classifier. IJCAI 2001 Work Empir Methods Artif Intell, 3, 41–46.

Robertson. (2005). DNA methylation and human disease. Nat Rev Genet, 6, 597–610.

Rokach and Maimon. (2005). The Data Mining and Knowledge Discovery Handbook Decision Trees. Springer, Boston, MA: Springer Science+Business Media, Inc.

Schölkopf, Luo, and Vovk. (2013). Empirical Inference. Berlin, Heidelberg: Springer-Verlag.

Shah, Kendall, Khozin, et al. (2019). Artificial intelligence and machine learning in clinical development: A translational perspective. NPJ Digit Med, 2, 69.

Sokolov, Paull, and Stuart. (2016). One-class detection of cell states in tumor subtypes. Pac Symp Biocomput, 21, 405–416.

Sutton, Dashevsky, Oh, et al. (2016). Breast cancer molecular subtype classifier that incorporates MRI features. J Magn Reson Imaging, 44, 122–129.

Tax and Duin. (2004). Support vector data description. Machine Learn, 54, 45–66.

Turanli, Karagoz, Bidkhori, et al. (2019). Multi-omic data interpretation to repurpose subtype specific drug candidates for breast cancer. Front Genet, 10, 420.

Tyanova, Albrechtsen, Kronqvist, Cox, Mann, and Geiger. (2016). Proteomic maps of breast cancer subtypes. Nat Commun, 7, 10259.

Vidić, Egnell, Jerome, et al. (2018). Support vector machine for breast cancer classification using diffusion-weighted MRI histogram features: Preliminary study. J Magn Reson Imaging, 47, 1205–1216.

Visweswaran, Angus, Hsieh, Weissfeld, Yealy, and Cooper. (2010). Learning patient-specific predictive models from clinical data. J Biomed Inform, 43, 669–685.

Waldemarson, Kurbasic, Krogh, et al. (2016). Proteomic analysis of breast tumors confirms the mRNA intrinsic molecular subtypes using different classifiers: A large-scale analysis of fresh frozen tissue samples. Breast Cancer Res, 18, 69.

Wu and Karunamuni. (2014). Profile Hellinger distance estimation. Statistics, 4, 1–30.

Wu, Tseng, Yang, et al. (2014). Prediction of axillary lymph node metastases in breast cancer patients based on pathologic information of the primary tumor. Med Sci Monit, 8, 577–581.

Wu, Wang, Jiang, Lu, and Tian. (2017). A pathways-based prediction model for classifying breast cancer subtypes. Oncotarget, 8, 58809–58822.

Yamanouchi, Kuba, and Eguchi. (2019). Hormone receptor, human epidermal growth factor receptor-2, and Ki-67 status in primary breast cancer and corresponding recurrences or synchronous axillary lymph node metastases. Surg Today [Epub ahead of print]; DOI: 10.1007/s00595-019-01831-8.

Yersal and Barutca. (2014). Biological subtypes of breast cancer: Prognostic and therapeutic implications. World J Clin Oncol, 5, 412–424.

Zhang. (2016). A gentle introduction to artificial neural networks. Ann Transl Med, 4, 370.