Identification of Genes Involved in Breast Cancer Metastasis by Integrating Protein–Protein Interaction Information with Expression Data

Abstract

The selection of relevant genes for breast cancer metastasis is critical for the treatment and prognosis of cancer patients. Although much effort has been devoted to gene selection procedures using different statistical analysis methods or computational techniques, the interpretation of the variables in the resulting survival models has been limited so far. This article proposes a new Random Forest (RF)-based algorithm to identify important variables highly related to breast cancer metastasis. The algorithm is based on the importance scores of two variable selection algorithms: the mean decrease Gini (MDG) criterion of Random Forest and the GeneRank algorithm with protein–protein interaction (PPI) information. We call the new gene selection algorithm PPIRF. The improved prediction accuracy illustrates the reliability and high interpretability of the gene list selected by the PPIRF approach.

1. Introduction

Breast cancer metastasis is a complex biological process regulated by multiple important genes. Because differences in gene expression among the tumor cells of different patients determine their metastatic potential, there has been a trend toward studying breast cancer metastasis solely through the analysis of large-scale gene expression data. Selecting small sets of genes that could be used for diagnostic purposes in clinical practice is critical for the treatment and prognosis of breast cancer. Therefore, there is a strong incentive to propose new methods that can effectively identify a small set of genes correlated with breast cancer metastasis.

Recently, a number of gene selection methods have been proposed for breast cancer metastasis prediction, including statistical test methods and correlation coefficient methods. The first category used Cox survival analysis models or other statistical tests to determine the most significant gene sets (van de Vijver et al., 2002; van't Veer et al., 2002; Wang et al., 2005; Weigelt et al., 2005). For example, two large-scale expression studies by van't Veer et al. (2002) and Wang et al. (2005) each identified ∼70 gene markers that were 60%–70% accurate for metastasis prediction. However, these two gene sets shared only three common genes. This suggests that screening genes as independent factors cannot yield highly robust gene sets. The second category used a correlation coefficient (such as Pearson's correlation coefficient) between genes to establish coexpressed gene subnetworks or modules and to filter out the most relevant collections (Chuang et al., 2007; Ruan et al., 2010; Chen and Deem, 2013). However, changes in gene chip type and experimental environment inevitably lead to changes in the value of the correlation coefficient, making it difficult to obtain a universal method based on coexpression subnetworks or modules. Therefore, it is reasonable to incorporate new feature information and develop effective models to analyze the breast cancer risk for an individual patient.

Recently, protein–protein interaction (PPI) network information has been shown to be effective in picking out gene sets that are more explanatory (Binder and Schumacher, 2009; Johannes et al., 2010; Garcia et al., 2012; Wang et al., 2012, 2013). In this study, we incorporated this information into breast cancer metastasis prediction through a new algorithm called PPIRF. Our contributions lie not only in incorporating PPI information into breast cancer metastasis prediction but also in using the GeneRank ordering (Morrison et al., 2005) as a complement to the variable importance of the Random Forest (RF) classifier. The 10-fold cross-validation (CV) results demonstrated that using PPI as a priori knowledge of the genes can improve the accuracy. Furthermore, about 50 genes were identified as markers by the method. These genes can discriminate patients who developed distant metastases from those remaining metastasis free for 5 years.

2. Materials

2.1. Breast cancer data sets

In this study, we selected four breast cancer data sets (GEO accession Nos. GSE6532, GSE2034, GSE7390, GSE11121) from the GEO (www.ncbi.nlm.nih.gov/geo/) database (Barrett et al., 2007). These data sets were obtained from node-negative breast cancer patients who received only surgical treatment without systemic chemical treatment. The microarray platform of the data sets was Affymetrix HG-U133A. Raw data were preprocessed with the R package “affy” (robust multichip average; Irizarry et al., 2003). The resulting data comprise 809 patients with 22,283 genes measured in each patient. Table 1 lists the breast cancer data sets used in the study.

Table 1.

Breast Cancer Data Sets Used in the Study

Data set	Number of patients	Metastasis	Nonmetastasis	Median relapse time (years)
GSE6532	125	28	97	3.87
GSE7390	198	62	136	5.21
GSE11121	200	46	154	2.61
GSE2034	286	107	179	2.65
Total	809	243	566	3.59

2.2. Protein–protein interaction data

The Human Protein Reference Database (HPRD) (Prasad et al., 2009) is a database of curated proteomic information pertaining to human proteins. By mapping the proteins to the genes present on the microarray used in the experiments, we obtained 39,240 interactions between 7932 genes. A special case arises when multiple protein isoforms exist for a gene: as long as an interaction between one isoform of a gene and one isoform of another gene is confirmed, an interaction between the two genes is recorded. We obtained the PPI data from the HPRD (June 29, 2010). The element $h_{ij} \in \{0, 1\}$ of the interaction matrix H (7932 × 7932) indicates whether the i-th gene and the j-th gene interact:

\begin{align*} h_{ij} = \begin{cases} 1 & \text{if gene } i \text{ and gene } j \text{ interact} \\ 0 & \text{otherwise} \end{cases} \tag{1} \end{align*}
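As a minimal sketch of how the matrix in Equation (1) could be assembled, the following Python snippet builds a symmetric 0/1 matrix from a list of interacting gene pairs. The function name `build_interaction_matrix`, the gene labels, and the pairs are hypothetical placeholders, not the HPRD export format or real interactions.

```python
def build_interaction_matrix(genes, interactions):
    """Return H with H[i][j] = 1 if gene i and gene j interact, else 0."""
    index = {gene: k for k, gene in enumerate(genes)}
    n = len(genes)
    H = [[0] * n for _ in range(n)]
    for a, b in interactions:
        # Skip pairs involving a gene that is not on the microarray.
        if a not in index or b not in index:
            continue
        i, j = index[a], index[b]
        H[i][j] = 1
        H[j][i] = 1  # the network is undirected, so H is symmetric
    return H


# Toy input (placeholder labels, not real interaction data).
genes = ["G1", "G2", "G3", "G4"]
pairs = [("G1", "G2"), ("G2", "G4"), ("G1", "G9")]  # "G9" is not on the array
H = build_interaction_matrix(genes, pairs)
```

For the full data set, a sparse representation would likely be preferable, since only 39,240 of the 7932 × 7932 entries correspond to interactions.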

3. Methods

3.1. Overview of the method

As shown in Figure 1, the PPIRF algorithm combines two classical gene selection methods based on the RF classifier. The first uses the importance score of the mean decrease Gini (MDG) criterion together with recursive gene screening. The second applies the GeneRank algorithm to calculate a gene rank by integrating gene expression data with PPI information. By combining the two rankings, a new feature ranking (NewRank [NR]) that carries a priori knowledge is obtained and integrated into the RF classifier. Finally, we used four data sets of patients with node-negative breast cancer to measure the classification results of PPIRF.

FIG. 1.

The flowchart of PPIRF method. PPI, protein–protein interaction; RF, Random Forest.

3.2. Random forest

RF is a classifier that consists of multiple decision trees and has been applied successfully to various bioinformatic problems in recent years (Breiman, 2001; Liaw and Wiener, 2002; Strobl et al., 2009). Each classification tree is built from a bootstrap sample, and only a randomly chosen percentage of all the features is used. Many studies have demonstrated that RFs provide a variable ranking mechanism for selecting important variables. Therefore, this study applies the RF method to deal with the large number of predictor variables through random feature selection.

3.3. Variable selection technique

Selecting a smaller number of variables before sample classification is critical for breast cancer gene selection. To select a subset of variables with classification capability, this study uses two variable selection methods: the mean decrease Gini method and the GeneRank method. The final variable selection procedure is based on the combination of the two variable importance scores.

3.3.1. Mean decrease Gini method

To establish variable importance, this study first uses the MDG method for variable selection (Strobl et al., 2007). MDG is the sum of all decreases in Gini impurity due to a given variable, normalized by the number of trees (Menze et al., 2009). The Gini index at node v of the RF, Gini(v), is defined as follows:

\begin{align*} \mathrm{Gini}(v) = \sum_{t=1}^{T} \hat{p}_t^v (1 - \hat{p}_t^v) \tag{2} \end{align*}

where $\hat{p}_t^v$ is the proportion of class-t observations at node v. The Gini information gain of gene $X_i$ ($i = 1, \ldots, N$, where N represents the number of genes) for splitting node v, $\mathrm{Gain}(X_i, v)$, is defined as follows:

\begin{align*} \mathrm{Gain}(X_i, v) = \mathrm{Gini}(X_i, v) - w_L \mathrm{Gini}(X_i, v^L) - w_R \mathrm{Gini}(X_i, v^R) \tag{3} \end{align*}

The formula represents the difference between the impurity at node v and the weighted average of the impurities at its two child nodes, $v^L$ and $v^R$. The weights $w_L$ and $w_R$ denote the proportions of samples falling into the left and right child nodes, respectively. At each node, a random set of mtry features out of N is evaluated ($\mathrm{mtry} = \sqrt{N}$), and the gene with the maximum $\mathrm{Gain}(X_i, v)$ is used for splitting node v.

The importance score for gene $X_i$ can be calculated as follows:

\begin{align*} \mathrm{Im}X_i = \frac{1}{ntree} \sum_{v \in S_{X_i}} \mathrm{Gain}(X_i, v) \tag{4} \end{align*}
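To make Equations (2) to (4) concrete, the following Python sketch computes the Gini impurity of a node, the gain of one split, and the averaged importance. It is an illustration on toy class labels rather than the implementation used in the study. The function names are ours, and we assume that the index set in Equation (4) collects the nodes split on the gene across all trees.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node, Equation (2): sum over classes of p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def gain(parent, left, right):
    """Gini gain of a split, Equation (3): parent impurity minus weighted child impurities."""
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return gini(parent) - w_left * gini(left) - w_right * gini(right)


def importance(gains_for_gene, ntree):
    """Equation (4): gains of all nodes split on the gene, summed and divided by ntree."""
    return sum(gains_for_gene) / ntree


# Toy node: 2 metastasis (1) and 4 nonmetastasis (0) samples, split into two children.
parent = [1, 1, 0, 0, 0, 0]
left, right = [1, 1, 0], [0, 0, 0]
g = gain(parent, left, right)  # 4/9 - (1/2)(4/9) - (1/2)(0) = 2/9
```

In this toy split the right child is pure, so all remaining impurity sits in the left child and the gain is 2/9.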

3.3.2. GeneRank method

GeneRank (Morrison et al., 2005) is an intuitive modification of Google PageRank algorithm (Brin and Page, 1998). GeneRank provides an alternative method to calculate a rank for each node by combining prior knowledge into the PPI network. This study attempted to use the GeneRank algorithm for gene selection. The GeneRank method includes several steps as given below.

Let the set $G = \{X_1, X_2, \cdots, X_N\}$ be the N genes on a microarray. The GeneRank scores are then obtained as the solution $\mathbf{r}^*$ of the equation

\begin{align*} (\mathbf{I} - d \cdot \mathbf{W}^T \mathbf{D}^{-1}) \mathbf{r}^* = (1 - d) \cdot \mathbf{ex} \tag{5} \end{align*}

where 0 < d < 1 is the damping factor. As $d \to 0$, the ranking is mostly determined by the fold-change information, whereas $d \to 1$ corresponds to a ranking that depends more on the network structure. Morrison et al. (2005) suggested using d = 0.5. The vector $\mathbf{ex} = [\mathrm{ex}_1, \mathrm{ex}_2, \cdots, \mathrm{ex}_N]^T$ ($\mathrm{ex}_i \ge 0$, $i = 1, 2, \cdots, N$) contains the absolute values of the expression changes for the genes in G. W is the adjacency matrix of G, with elements $w_{ij} \in \{0, 1\}$. W is symmetric because the network is undirected, that is, $\mathbf{W}^T = \mathbf{W}$. We define $\deg_i$ to be the degree of the i-th gene in the network and $\mathbf{D} = \mathrm{diag}(\deg_1, \deg_2, \cdots, \deg_N)$ to be the corresponding diagonal matrix:

\begin{align*} \deg_i = \sum_{j=1}^{N} w_{ij} \tag{6} \end{align*}

The rank of gene $X_i$ was obtained from the solution $r_i^*$ ($i = 1, 2, \cdots, N$) of Equation (5). $R_{X_i}$ was defined as follows:

\begin{align*} R_{X_i} = \frac{1}{\mathrm{rank}(r_i^*)} \tag{7} \end{align*}

where $\mathrm{rank}(\cdot)$ is the position of the value when all values are sorted in descending order. As $r_i^*$ increases, $R_{X_i}$ increases and the feature becomes more important. The transformation in Equation (7) was applied to prevent single genes from having a weight $r_i^*$ that is far larger than the others. Note that all genes without known prior knowledge were assigned the same value, namely the minimal GeneRank result $r_i^*$.
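The GeneRank computation can be sketched in a few lines of Python. The sketch below solves Equation (5) by fixed-point iteration, $\mathbf{r} \leftarrow (1 - d)\,\mathbf{ex} + d\,\mathbf{W}^T \mathbf{D}^{-1} \mathbf{r}$, and then applies the reciprocal-rank transform of Equation (7). The iteration scheme, the handling of isolated genes (degree 0), and the tie-breaking by index are our assumptions. A direct linear solver would serve equally well, and the assignment of the minimal score to genes without prior knowledge is omitted here.

```python
def generank(W, ex, d=0.5, tol=1e-12, max_iter=1000):
    """Iteratively solve (I - d W^T D^-1) r = (1 - d) ex for r."""
    n = len(ex)
    deg = [sum(row) for row in W]
    r = list(ex)
    for _ in range(max_iter):
        new_r = []
        for i in range(n):
            # (W^T D^-1 r)_i: each neighbour j passes on r_j / deg_j.
            flow = sum(W[j][i] * r[j] / deg[j] for j in range(n) if deg[j] > 0)
            new_r.append((1 - d) * ex[i] + d * flow)
        if max(abs(a - b) for a, b in zip(new_r, r)) < tol:
            return new_r
        r = new_r
    return r


def reciprocal_rank(scores):
    """Equation (7): R = 1 / rank, with rank 1 for the largest score.

    Ties are broken by index order, which is an arbitrary choice.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    R = [0.0] * len(scores)
    for position, i in enumerate(order, start=1):
        R[i] = 1.0 / position
    return R


# Toy path network 0 - 1 - 2 in which only gene 0 shows an expression change.
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
ex = [1.0, 0.0, 0.0]
r = generank(W, ex, d=0.5)   # approximately [7/12, 1/3, 1/12]
R = reciprocal_rank(r)       # [1.0, 0.5, 1/3]
```

In the toy network, gene 1 has no expression change of its own but still receives a nonzero score through its neighbours, which is the effect the network term is meant to capture.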

The fold-change information and the PPI matrix were used to calculate a ranking of the genes ($\mathrm{rank}(r_i^*) \in \{1, 2, 3, \ldots, N\}$), from which $R_{X_i}$ follows by Equation (7). The importance score $\mathrm{Im}X_i$ (i = 1, 2, …, N) can be obtained for each gene $X_i$ and was used to evaluate the contribution of each feature to predicting the classes.

3.3.3. Feature ranking criterion

By combining the two sorting methods, we get a new gene scoring method called NR, which is defined as follows:

\begin{align*} \mathrm{NR}_{X_i} = R_{X_i} \times \mathrm{Im}X_i \tag{8} \end{align*}
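A short Python illustration of Equation (8) follows. The gene labels and numeric values are invented for the example and are not results from the study. It shows that the combined ordering can differ from both the GeneRank ordering and the MDG ordering.

```python
def new_rank(R, Im):
    """Equation (8): NR = R * Im for every gene present in both score tables."""
    return {gene: R[gene] * Im[gene] for gene in R if gene in Im}


# Hypothetical scores: R from Equation (7), Im from Equation (4).
R = {"g1": 1.0, "g2": 0.5, "g3": 1.0 / 3.0}
Im = {"g1": 0.02, "g2": 0.10, "g3": 0.03}

NR = new_rank(R, Im)
ranked = sorted(NR, key=NR.get, reverse=True)  # genes from most to least important
```

Here g1 leads the GeneRank ordering, yet g2 ends up first after the combination because its MDG importance is five times larger.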

This ranking method takes into account both the impact of a particular gene in the classification algorithm and the connectivity of the gene in the underlying biological network, so the score reflects both sources of information. We iteratively fit the RF classifier, record the out-of-bag (OOB) error rate, and then discard the 10% of genes with the smallest NR. As ∼70 genes have previously been identified as markers for breast cancer metastasis, the threshold value was set to 50 to control the number of genes in the subset. After fitting all forests, we choose the smallest set of genes with a low OOB error rate.
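The backward elimination loop described above might be organized as in the following Python sketch. Several details are assumptions on our part: the callback `fit_and_score` stands in for fitting an RF and computing NR, the threshold of 50 is read as the smallest subset size at which elimination stops, and "low OOB error" is interpreted as lying within a tolerance of the best recorded error. The toy callback produces artificial numbers purely to exercise the loop.

```python
import math


def recursive_elimination(genes, fit_and_score, drop_fraction=0.10, min_genes=50):
    """Repeatedly fit, record the OOB error, and drop the lowest-NR genes.

    fit_and_score(genes) must return (oob_error, {gene: NR score}).
    """
    history = []
    current = list(genes)
    while True:
        oob_error, nr = fit_and_score(current)
        history.append((list(current), oob_error))
        if len(current) <= min_genes:
            break
        n_drop = max(1, math.floor(drop_fraction * len(current)))
        n_keep = max(min_genes, len(current) - n_drop)
        current = sorted(current, key=lambda g: nr[g], reverse=True)[:n_keep]
    return history


def pick_subset(history, tolerance=0.005):
    """Smallest recorded subset whose OOB error is within `tolerance` of the best."""
    best = min(err for _, err in history)
    candidates = [subset for subset, err in history if err <= best + tolerance]
    return min(candidates, key=len)


def toy_fit_and_score(genes):
    """Artificial stand-in for an RF fit: gene "g17" gets NR 17.0."""
    nr = {g: float(g[1:]) for g in genes}
    oob_error = 0.30 + 0.001 * abs(len(genes) - 60)  # fake curve, lowest at 60 genes
    return oob_error, nr


genes = ["g%d" % i for i in range(200)]
history = recursive_elimination(genes, toy_fit_and_score)
sizes = [len(subset) for subset, _ in history]
chosen = pick_subset(history)
```

Keeping the full history, rather than only the final subset, is what allows the subset to be chosen after all forests have been fitted.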

3.4. Performance evaluation

The classification accuracy of the methods was assessed by two criteria: the area under the precision–recall curve (AUPR) and the area under the ROC curve (AUC). AUC is a composite indicator that reflects the sensitivity and specificity of a continuous predictor. The closer the AUC value is to 1, the better the classification. However, the AUC criterion is less informative when there are orders of magnitude more negative than positive examples. We therefore also used the P-R curve, which plots the ratio of true positives among all positive predictions (precision) at each given recall rate. The AUPR is the area under the P-R curve; the closer the AUPR value is to 1, the better the classification. Based on the AUC and AUPR values, the prediction accuracy of the PPIRF algorithm was evaluated by 10-fold CV tests. In a 10-fold CV, the training data set was split into 10 subsets, of which 1 subset was used as the test set while the other 9 subsets were used for training the classifier. The trained classifier was then evaluated on the test set. The process was repeated 10 times, and the mean accuracy was taken to obtain a fair evaluation.
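The two criteria and the fold construction can be illustrated with a small Python sketch. How the areas were computed is not stated in this section, so we assume the pairwise-ranking form of the AUC and use average precision as one common estimate of the AUPR. The fold assignment without shuffling is likewise a simplification, and all labels and scores are toy values.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / n_pos


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


# Toy predictions: 1 = metastasis, 0 = nonmetastasis.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
toy_auc = auc(labels, scores)                # 5 of 6 positive-negative pairs ordered correctly
toy_ap = average_precision(labels, scores)   # (1/1 + 2/3) / 2
splits = list(k_fold_indices(25, k=10))
```

In practice the fold assignment would usually be randomized, and possibly stratified by class, before the mean over the 10 test folds is taken.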

4. Results and Discussion

4.1. The performance of breast cancer metastasis prediction

To illustrate the effect of the method, we chose VarSelRF (Díaz-Uriarte and De Andres, 2006) and RRFE (Johannes et al., 2010) for comparison. Both methods have been reported to select gene subsets that classify with high accuracy. First, VarSelRF is an RF-based approach to gene selection, which is the same as part of our PPIRF. Díaz-Uriarte showed that VarSelRF returns very small sets of genes compared to other variable selection methods, while retaining predictive performance. However, VarSelRF does not consider the role of PPI information in gene selection. Therefore, we chose the VarSelRF method for comparison to illustrate the importance of PPI information. Second, PPIRF was compared with RRFE, which is based on the support vector machine (SVM) algorithm with pathway knowledge. With the pathway knowledge, the RRFE method has been reported to outperform most gene selection methods. We therefore chose RRFE as the second comparison algorithm.

Table 2 shows the AUC and AUPR values of the three methods on the four data sets. PPIRF attains higher AUC and AUPR values than the other two methods on three of the data sets. The exception is GSE2034, where its AUC (0.7109) is slightly below that of RRFE (0.7226), although its AUPR is slightly higher (0.5869 vs. 0.5726). Our empirical comparison with the two methods suggests that PPIRF is a promising approach to gene feature selection from high-dimensional gene expression data. Figure 2 shows the ROC and PR curves of the three methods on the four data sets.

FIG. 2.

ROC curve and PR curve of three methods in four data sets.

Table 2.

Comparison of Area Under ROC Curve and Area Under Precision–Recall Curve with Three Methods

Data set	VarSelRF AUC (SD)	RRFE AUC (SD)	PPIRF AUC (SD)	VarSelRF AUPR (SD)	RRFE AUPR (SD)	PPIRF AUPR (SD)
GSE6532	0.5191 (0.109)	0.5681 (0.028)	0.6723 (0.030)	0.2362 (0.062)	0.3327 (0.016)	0.3603 (0.013)
GSE7390	0.5113 (0.137)	0.5356 (0.036)	0.5918 (0.025)	0.3167 (0.069)	0.3248 (0.028)	0.3769 (0.020)
GSE2034	0.5661 (0.225)	0.7226 (0.041)	0.7109 (0.035)	0.4055 (0.075)	0.5726 (0.032)	0.5869 (0.029)
GSE11121	0.6302 (0.103)	0.6063 (0.015)	0.6808 (0.021)	0.3766 (0.053)	0.2891 (0.011)	0.4354 (0.014)

AUC, area under ROC curve; AUPR, area under precision–recall curve; PPI, protein–protein interaction; RF, Random Forest; SD, standard deviation.

4.2. Comparison with different data sets

According to the AUC and AUPR values in Table 2, the PPIRF algorithm generally performed better than the other methods on the four data sets. The result of GSE11121 presented the most significant improvement. Therefore, we chose GSE11121 as the training data to pick out the 74 genes. To illustrate the ability of the 74 genes to predict breast cancer metastasis, we used them as the feature set for the four breast cancer data sets. We chose RF as the classifier and used the AUC and AUPR values to evaluate classification performance. Figure 3 shows the classification results for the four data sets based on the 74 genes. As shown in Figure 3, the AUC values of GSE7390, GSE2034, and GSE11121 were greater than 0.6, and the AUPR values were greater than 0.3. This indicates that the 74 genes selected from one data set also yield good classification results on the other data sets.

FIG. 3.

The ROC curve and PR curve of four data sets based on 74 genes (100 probes).

4.3. Predicted genes related with breast cancer metastasis

Table 3 lists the top 10 genes in the 74-gene list that are related to breast cancer. For instance, CD44, which is critical in tumor progression and metastasis, has been found to be upregulated in breast cancer (Anand and Kumar, 2014). SRC, TP53, and CDKN2A are already known to be metastasis related (De Mattos-Arruda et al., 2014; Fernandez et al., 2014; Sänger et al., 2014; Xiao et al., 2014). These results indicate that the gene list has strong predictive power and biological significance.

Table 3.

The Top 10 Genes Validated by PubMed in the 74-Gene List

ID	Gene symbol	Gene ID	Chosen times	PubMed hits
208305_at	PGR	5241	15	1609
213324_at	SRC	6714	21	1408
201746_at	TP53	7157	30	1201
210916_s_at	CD44	960	27	1067
205051_s_at	KIT	3815	11	837
207039_at	CDKN2A	1029	15	512
214732_at	SP1	6667	13	461
207004_at	BCL2	596	13	401
211627_x_at	ESR1	2099	45	367
202095_s_at	BIRC5	332	20	342

In addition, we analyzed the biological pathways associated with the 74 genes according to the Reactome database (www.reactome.org/). The 20 pathways with the smallest false discovery rate (FDR) are shown in Table 4. Although 21 of the 74 genes have unknown pathways, several pathways associated with cancer were identified, such as signaling by ERBB2 and loss of function of SMAD2/3 in cancer. This indicates that the gene selection method can return sets of genes that are highly correlated.

Table 4.

Pathway Analysis of the 74 Genes from the Prognostic Signature

Pathway	Genes related with the pathways
Nuclear receptor transcription pathway	ESR1; PGR; NR3C1
Oncogene-induced senescence	TP53; SP1; CDKN2A; CDKN2A
VEGFA-VEGFR2 pathway	CRK; RAF1; FYN; SRC; JAK1; MAPK14; PRKCA; CDK1; PRKACA; HSP90AA1; HSP90AA1; RAF1
TGF-beta receptor signaling activates SMADs	TGFBR1; SMAD4; SMAD3; PPP1CA; SMAD2
Downstream signal transduction	CRK; GRB2; KIT; RAF1; FYN; SRC; JAK1; PRKCA; CDK1; PRKACA; RAF1
Hormone-sensitive lipase (HSL)-mediated triacylglycerol hydrolysis	PPP1CA; PRKACA
Loss of function of SMAD2/3 in cancer	TGFBR1; SMAD4; SP1; SMAD3; PPP1CA; SMAD2; CDKN2A
Insulin-like growth factor-2 mRNA binding proteins (IGF2BPs/IMPs/VICKZs) bind RNA	CD44
Downregulation of TGF-beta receptor signaling	TGFBR1; SMAD3; PPP1CA; SMAD2
Signaling by PDGF	CRK; GRB2; KIT; RAF1; FYN; SRC; JAK1; PRKCA; CDK1; PRKACA; RAF1
Signaling by ERBB2	GRB2; KIT; RAF1; FYN; SRC; JAK1; PRKCA; CDK1; PRKACA; HSP90AA1; HSP90AA1; RAF1
Signaling by ERKs	CRK; GRB2; RAF1; SRC; JAK1; MAPK14; CDK1; RAF1
NGF signaling via TRKA from the plasma membrane	CRK; GRB2; KIT; RAF1; FYN; SRC; JAK1; MAPK14; PRKCA; CDK1; PRKACA; RAF1
Hemostasis	GRB2; RAF1; SRC; CREBBP; PRKCA; GNA13; EP300; RAF1; F2RL2; ACTN2; CREBBP; TP53; CRK; RACGAP1; FYN; JAK1; MAPK14; CENPE; PRKACA; CD44
Signaling by EGFRvIII in cancer	GRB2; KIT; RAF1; FYN; SRC; JAK1; PRKCA; CDK1; PRKACA; HSP90AA1; HSP90AA1; RAF1
Signaling by RAS	GRB2; RAF1; SRC; JAK1; MAPK14; CDK1; RAF1
Signaling by EGFR in cancer	GRB2; KIT; RAF1; FYN; SRC; JAK1; PRKCA; CDK1; PRKACA; HSP90AA1; HSP90AA1; RAF1
Signaling by interleukins	CRK; GRB2; RAF1; FYN; JAK1; IKBKB; CDK1; PRKACA; RAF1
Cell cycle, mitotic	AURKA; CSNK2A1; TOP2A; CREBBP; PRKCA; EP300; CDKN2A; HSP90AA1; CDC20; CENPF; CENPE; PPP1CA; CDK1; CENPF; CCNA2; PRKACA; CDKN2A; HSP90AA1; BIRC5
Signaling by FGFR	GRB2; KIT; RAF1; FYN; SRC; JAK1; PRKCA; CDK1; PRKACA; RAF1
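
As an illustration of how pathways can be ranked by FDR, the following minimal Python sketch scores over-representation with a hypergeometric test and adjusts the resulting p-values with the Benjamini–Hochberg procedure. This is an assumption made for illustration: the exact statistical procedure behind Table 4 is not described here, and the function names, the background size, and the pathway sizes in the usage example are hypothetical.

```python
from math import comb


def hypergeom_sf(k, background, pathway_size, selected):
    """P(X >= k): chance of seeing at least k pathway genes among the selected genes."""
    upper = min(pathway_size, selected)
    total = comb(background, selected)
    hits = sum(
        comb(pathway_size, i) * comb(background - pathway_size, selected - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical inputs: (selected genes found in pathway, pathway size).
    pathways = {"pathway_A": (3, 50), "pathway_B": (5, 400), "pathway_C": (1, 300)}
    background, selected = 20000, 74  # background size is an assumption
    pvals = [hypergeom_sf(k, background, size, selected) for k, size in pathways.values()]
    for name, q in sorted(zip(pathways, benjamini_hochberg(pvals)), key=lambda x: x[1]):
        print(f"{name}: FDR = {q:.3g}")
```

Sorting by the adjusted value and keeping the 20 smallest entries would give a ranking of the same form as Table 4.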

4.3.1. Survival analysis of the predicted genes

SurvExpress is a web-based tool that provides survival analysis and risk assessment for cancer data sets (Aguirre-Gamboa et al., 2013). We used it to perform survival analysis and risk assessment on breast cancer data sets (e.g., GSE2034).

Figure 4 shows the Kaplan–Meier plot by risk group, with red and green curves denoting the high- and low-risk groups, respectively. The title of the plot reports the concordance index (CI), which is a generalization of the area under the receiver operating characteristic curve (AUROC) used in classification problems. CI values close to 0.5 indicate essentially random prediction, whereas higher values indicate better prediction. In addition, the p-value of the log-rank test for the difference between the two groups is smaller than 0.05, which shows that the 74 genes separate the two risk groups significantly. The hazard ratio estimated by the Cox proportional hazards model represents the relative risk between the two risk groups.
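
To make the meaning of the concordance index concrete, the sketch below computes Harrell's C for right-censored data in plain Python. It is a minimal illustration under stated assumptions, not the SurvExpress implementation: the function name and the toy inputs are hypothetical, and ties in risk score are counted as half-concordant by convention.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ordered correctly by risk.

    A pair (i, j) is comparable when patient i has an observed event
    strictly before the follow-up time of patient j. The pair is
    concordant when the patient with the earlier event has the higher risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data (hypothetical): follow-up time, event flag, predicted risk.
    times = [2, 5, 7, 9]
    events = [1, 0, 1, 1]
    risk = [0.9, 0.1, 0.5, 0.2]
    print(concordance_index(times, events, risk))
```

A perfectly ordered risk score yields 1.0, a reversed one 0.0, and a constant score 0.5, which matches the "random" reference value mentioned above.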

FIG. 4.

Kaplan–Meier curves and performance of the 74 genes as biomarkers in the GSE2034 data set. Low and high risks are drawn in green and red, respectively.
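
The Kaplan–Meier curves and the log-rank comparison of the two risk groups can be sketched in a few lines of plain Python. The code below is a minimal illustration, assuming right-censored data and a two-group log-rank test with one degree of freedom; the function names and toy data are hypothetical and do not reproduce the GSE2034 analysis.

```python
from math import erfc, sqrt


def kaplan_meier(times, events):
    """Return [(t, S(t))] evaluated at each distinct event time."""
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    times = list(times_a) + list(times_b)
    events = list(events_a) + list(events_b)
    in_a = [True] * len(times_a) + [False] * len(times_b)
    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        n = sum(1 for u in times if u >= t)                       # at risk overall
        n_a = sum(1 for u, g in zip(times, in_a) if u >= t and g)  # at risk in group A
        d = sum(1 for u, e in zip(times, events) if u == t and e)  # events overall
        d_a = sum(1 for u, e, g in zip(times, events, in_a) if u == t and e and g)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank variance is zero; groups cannot be compared")
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return chi2, erfc(sqrt(chi2 / 2))


if __name__ == "__main__":
    # Toy data (hypothetical): the high-risk group fails earlier.
    high_t, high_e = [1, 2, 3], [1, 1, 1]
    low_t, low_e = [4, 5, 6], [1, 1, 1]
    print(kaplan_meier(high_t, high_e))
    print(logrank_test(high_t, high_e, low_t, low_e))
```

A p-value below 0.05 from `logrank_test` corresponds to the kind of significant separation between the red and green curves described for Figure 4.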

In addition, a heat map was used for the survival analysis (Fig. 5). According to the predictions, the samples were divided into high-risk and low-risk groups. In the actual outcome data, 179 patients did not develop distant metastasis and the others did. Comparing the actual groups with the predicted groups shows that the 74 genes perform well in survival risk prediction.

FIG. 5.

Heat map showing the expression of the 74 genes (rows) across GSE2034 samples (columns) by risk group.
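
The comparison between predicted risk groups and actual outcomes can be summarized as a cross-tabulation. The following sketch assumes that samples are split at the median of a per-sample prognostic index; this cutoff is an assumption, since the rule used to form the groups in Figure 5 is not stated here, and the function names and toy values are hypothetical.

```python
from statistics import median


def split_by_median(prognostic_index):
    """Label each sample 'high' or 'low' risk relative to the median score."""
    cutoff = median(prognostic_index)
    return ["high" if score > cutoff else "low" for score in prognostic_index]


def cross_tabulate(predicted_groups, metastasis):
    """Count samples for every (predicted group, observed metastasis) pair."""
    table = {
        ("high", True): 0,
        ("high", False): 0,
        ("low", True): 0,
        ("low", False): 0,
    }
    for group, outcome in zip(predicted_groups, metastasis):
        table[(group, bool(outcome))] += 1
    return table


if __name__ == "__main__":
    # Toy data (hypothetical): prognostic index and observed distant metastasis.
    scores = [0.1, 0.9, 0.4, 0.7]
    observed = [False, True, True, True]
    groups = split_by_median(scores)
    for key, count in cross_tabulate(groups, observed).items():
        print(key, count)
```

Large counts in the ("high", True) and ("low", False) cells relative to the other two would indicate agreement between the predicted and actual groups.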

5. Conclusion

Metastasis is one of the main causes of death for breast cancer patients. Selecting the most relevant subset of features from high-throughput gene expression data is critical for predicting distant metastasis of breast cancer. This study presented a new algorithm, PPIRF, that selects genes relevant to breast cancer metastasis with high accuracy. The main contribution of this study lies in data integration through the combination of different gene selection methods. Experiments on four breast cancer data sets show that the proposed gene ranking technique yields better classification performance than earlier methods. In addition, incorporating PPI information into the gene selection process makes the gene list more biologically interpretable. In principle, the proposed algorithm can be applied to any similar gene selection problem. We expect that the method will provide more reliable functional linkage information and allow functional relationships between proteins to be discovered more accurately and efficiently. However, the algorithm has so far been applied only to breast cancer, and its applicability to other gene selection problems is not guaranteed. As a next step, we will test the method on more cases to assess its applicability and to further improve the algorithm.

Acknowledgments

This work was partially supported by the National Basic Research Program of China (Grant No. 2012CB910400), the National Natural Science Foundation of China (Grant No. 81330059), the National Major Scientific and Technological Special Project for “Significant New Drugs Development” (2013ZX09507001), the National Science and Technology Support Plan Project (2015BAH12F01), and the Science and Technology Commission of Shanghai Municipality (14DZ2270100).

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

Aguirre-Gamboa, et al. 2013. SurvExpress: An online biomarker validation tool and database for cancer gene expression data using survival analysis. PLoS One, 8, e74250.

Anand M.T., and Kumar. 2014. CD44: A key player in breast cancer. Indian J. Cancer, 51, 247–250.

Brin, and Page. 1998. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30, 107–117.

Breiman. 2001. Random forests. Machine Learn., 45, 5–32.

Barrett, Troup D.B., Wilhite S.E., et al. 2007. NCBI GEO: Mining tens of millions of expression profiles—Database and tools update. Nucleic Acids Res., 35, D760–D765.

Binder, and Schumacher. 2009. Incorporating pathway information into boosting estimation of high-dimensional risk prediction models. BMC Bioinformatics, 10, 18.

Chen, and Deem M.W. 2013. Hierarchy of gene expression data is predictive of future breast cancer outcome. Phys. Biol., 10, 056006.

Chuang H.Y., Lee, Liu Y.T., et al. 2007. Network-based classification of breast cancer metastasis. Mol. Syst. Biol., 3, 140.

Díaz-Uriarte, and De Andres S.A. 2006. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7, 3.

De Mattos-Arruda, Weigelt, Cortes, et al. 2014. Capturing intra-tumor genetic heterogeneity by de novo mutation profiling of circulating cell-free tumor DNA: A proof-of-principle. Ann. Oncol., 25, 1729–1735.

Fernandez S.V., Bingham, Fittipaldi, et al. 2014. TP53 mutations detected in circulating tumor cells present in the blood of metastatic triple negative breast cancer patients. Breast Cancer Res., 16, 445.

Garcia, Millat-Carus, Bertucci, et al. 2012. Interactome–transcriptome integration for predicting distant metastasis in breast cancer. Bioinformatics, 28, 672–678.

Irizarry, et al. 2003. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249–264.

Liaw, and Wiener. 2002. Classification and regression by random forest. R News, 2, 18–22.

Morrison J.L., Breitling, Higham D.J., et al. 2005. GeneRank: Using search engine technology for the analysis of microarray experiments. BMC Bioinformatics, 6, 233.

Menze B.H., Kelm B.M., Masuch, et al. 2009. A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics, 10, 213.

Ruan, Dean A.K., and Zhang. 2010. A general co-expression network-based approach to gene expression analysis: Comparison and applications. BMC Syst. Biol., 4, 8.

Strobl, Boulesteix A.L., Zeileis, et al. 2007. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8, 25.

Strobl, Malley, and Tutz. 2009. An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychol. Methods, 14, 323.

Sänger, Ruckhäberle, Bianchini, et al. 2014. OPG and PgR show similar cohort specific effects as prognostic factors in ER positive breast cancer. Mol. Oncol., 8, 1196–1207.

Johannes, Brase J.C., Fröhlich, et al. 2010. Integration of pathway knowledge into a reweighted recursive feature elimination approach for risk stratification of cancer patients. Bioinformatics, 26, 2136–2144.

Prasad T.S.K., Goel, Kandasamy, et al. 2009. Human protein reference database—2009 Update. Nucleic Acids Res., 37, D767–D772.

Van De Vijver M.J., He Y.D., van't Veer L.J., et al. 2002. A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med., 347, 1999–2009.

Van't Veer L.J., Dai, Van De Vijver M.J., et al. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536.

Wang, Klijn J.G.M., Zhang, et al. 2005. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet, 365, 671–679.

Wang, Xu, San Lucas F.A., et al. 2013. Incorporating prior knowledge into Gene network study. Bioinformatics, 29, 2633–2640.

Wang, Lo S.H., Zheng, et al. 2012. Interaction-based feature selection and classification for high-dimensional biological data. Bioinformatics, 28, 2834–2842.

Weigelt, Hu, He, et al. 2005. Molecular portraits and 70-gene prognosis signature are preserved throughout the metastatic process of breast cancer. Cancer Res., 65, 9155–9158.

Xiao, Chen, Liu, et al. 2014. Diallyl disulfide suppresses SRC/Ras/ERK signaling-mediated proliferation and metastasis in human breast cancer by up-regulating miR-34a. PLoS One, 9, e112720.