Abstract
Cancer cells have anomalous development and proliferation due to disturbances in their control systems. The study of the behavior of cellular control system requires high-throughput dynamical data. Unfortunately, this type of data is not largely available. This fact motivates the main issue of this article: how to use static omics data and available biological knowledge to get new information about the elements of the control system in cancer cells. Two important measures to access the state of the cellular control system are the gene expression profile and the signaling pathways. This article uses a combination of these two static omics data to gain insights on the states of a cancer cell. To extract information from this kind of data, a statistical computational model was formalized and implemented. In order to exemplify the application of some aspects of the developed conceptual framework, we verified the hypothesis that different types of cancer cells have different disturbed signaling pathways. To this end, we developed a method that recovers small protein networks, called motifs, which are differentially represented in some subtypes of breast cancer. These differentially represented motifs are enriched with specific gene ontologies as well as with new putative cancer genes.
Introduction
Cancer cells have anomalous development and proliferation due to disturbances in their control systems. To increase the knowledge about these phenomena, some quantitative models and experimental studies have been proposed either in an abstract theoretical perspective1,2 or in some particular case studies.3,4 The cell control systems are constituted by the integration of a gene system and signaling pathways that regulate gene expression and metabolic pathways. The signaling pathways are composed of a biochemical network that has a hierarchical modular structure, that is, many complex networks are decomposed in simpler ones, the modules or motifs. 5
A very informative data for studying such complex phenomena would be dynamical data evaluated through time series of omics measurements taken from a single cell or from a set of synchronized cells. Unfortunately, this kind of data is not largely available. This fact motivates the main issue of this article: how to use static omics data and available biological knowledge to get new information about the elements of the control dynamics of cancer cells.
Two important measures to access the state of the cell control system are the gene expression profile and the signaling pathways (ie, control signals built primarily from protein-protein interaction [PPI]). Tumor biopsies data from large cohorts of patients are abundant (eg, the one from The Cancer Genome Atlas) and allow the assessment of gene expressions, which are important to evaluate the impact of genetic heterogeneity in the control systems. This information may be relevant, for example, to distinguish the control systems’ behavior in different types of cancer or in different stages of a given cancer type. Another available data that has some structural information about the cell control system dynamics is the human PPI network.
This article uses a combination of two static omics data: the human PPI network 6 and the transcript expression measurements taken from breast cancer tissues 7 to model the states of a cancer cell. The integration of biological networks and gene expression data is not new. Several reports have addressed the combined use of both types of data using different approaches.8–14 For each type of cancer, this integrated approach results in a subgraph of the human PPI graph, labeled with relative (with respect to the normal tissue) transcript expression values associated to each protein. This labeled PPI graph connects the problem of extracting information about cancer cell dynamics from static omics data and complementary biological knowledge to the formal approach of probabilistic graph models, such as gene expression network 15 and motif network analysis. 16 This network analysis models the phenomena of different kinds by the same abstract model, a graph. It takes abstract graph structure measures, which depend on nodes and arcs, and interprets them in the context of the phenomena studied. The labeled subgraphs under this study will be called as motifs.
To exemplify the application of some aspects of the developed conceptual framework, we will verify the hypothesis that different subtypes of breast cancer have different representation of motifs. This article proposes an analytical method, which integrates PPI networks and differential tissue transcript expression data to identify motifs (ie, small subgraphs) of disturbed signaling pathways and measure the distribution of the observed state of these motifs in breast cancer subtypes. Motifs showing a differential representation in the subtypes are then analyzed, and new cancer genes are proposed.
Following this section, the “Materials and methods” section presents the methods used to extract, analyze, and model the data. The description of the data includes their source, semantics, and preparation. The “Model of analysis” section presents the mathematical description of the operators used to identify the motifs, estimate the motif distributions, and recognize the characteristic motifs of each cancer subtype. In the following “Results” section, the analysis outputs are given, including identification of motifs, characteristics of motif recognition, data enrichment of the motifs recognized, analysis of motifs internal connections for annotation of new cancer proteins, and corresponding enrichment. The article is ended with the discussion of potential applications of the technique presented here and other similar techniques for learning about the structure of the cellular control system and the disturbed subsystems.
Materials and Methods
Data source
In this subsection, we describe the TGA expression data, the PPI data and the enrichment technique.
TCGA data
The TCGA expression data were retrieved from the cBio portal. A z-score threshold was used to classify the genes as upregulated (z-score >3) or downregulated (z-score <–3). Moreover, to improve the number of available data, both RNA-Seq and microarray experiments were combined. For samples with both types of data, only microarray data were used. This data were acquired using the CGDS-R function, available at cBio. 17 Datasets were all from the same release (January 2015). TCGA clinical data were downloaded from the TCGA public site (http://tcgs-data.nci.nih.gov/tcga/tcgaDownload.jsp). The breast cancer samples were stratified using the molecular signatures described in Refs. 18 and 19.
PPI data
The human PPI network was obtained from the STRING database, which includes both known and predicted PPIs. 20 For each association, a confidence score is calculated based on the evidence collected from different data types. Here, we selected interactions with score 700 or 900 from experimental or in silico evidences, respectively.
Enrichment data
The protein datasets derived from the analyzed motifs were submitted to clusterProfiler package, a enrichment tool provided by the R platform. 21 All enrichments were carried out against the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, 22 with an adjusted (using the BH method) P-value cutoff equal to 0.05.
Model of analysis
In this subsection, we describe the computational methods developed in this article. First, we introduce some concepts and definitions. Second, we describe the adopted strategy for the motif identification. Finally, we present our methodology to assess the frequencies of the motifs that were obtained in the identification step.
Definitions
A graph is an ordered pair of sets (
Definition 2.1. A PPI network graph Vv is a graph (
A subgraph is a graph
Definition 2.2. Given a subgraph G of PU, Genes(G) is an ordered list of genes associated to the vertices of G.
In this work, gene expression information is gathered from cancer tissue subtypes. For each subtype, several samples were collected. For each subtype and each sample, the expression levels of a given gene are given by its z-score, which is a real number.
Definition 2.3. Given a gene g, z(g,l,i) is the z-score of g for the ith sample of subtype 1.
Let the elements of the set {–1,0,1} be denoted as labels. The labels –1, 0, and 1 are also known as suppressed, normal, and overexpressed, respectively. We define f: ℝ ↠ {–1,0,1} as a function that takes values from z-scores to labels.
Now, let G be a graph with at least two vertices. We say that G is connected if for every pair (u,v) of vertices of G there is a set of edges

Connected graphs with four vertices.
Let G and H be two graphs. We say that G is isomorph to H if there is a bijection b between the vertices sets of G and H such that two vertices u and v of G are adjacent (ie, both are present in the same edge of G) if and only if the vertices b(u) and b(v) of H are adjacent.
With the definitions presented so far, we can now introduce the concept of motif used in this article.
Definition 2.4. A motif 〈G,L〉 is composed by a connected graph G and an ordered list of labels L such that there is a bijection between vertices in G and labels in L.
In Figure 2, we present an example of motif. Finally, let

Example of a motif (G,L) with vertices {a,b,c,d} and L = <+1,0,+1,–1>.
Based on these definitions, we developed algorithms for motif identification that will be explained in the next section.
Motif identification
In this section, we present two algorithms for motif identification, which is, for a given set of motifs, the search and counting of their occurrences in the PPI network graph, cancer tissue subtypes, and their respective samples. The first algorithm is a basic search and count algorithm, which is described in a higher level, that is, we show only the general steps to perform the motif identification. The second algorithm is a modification of the basic search and count algorithm for connected graphs with four vertices, which we further explored in this work with computational experiments.
The basic search and count algorithm
The basic search and count algorithm receives as input a set of cancer tissue subtypes and their respective set of samples, an integer k ≥ 2 and a PPI network graph PU. It returns a table Count, which stores, for each subtype and each sample, the number of observations of the motif

Pseudocode of the basic search and count algorithm.
In this basic search and count algorithm, we assume that the counting issues that might arise due to symmetries between motifs are solved through an appropriate implementation of the for loops in the lines 2–10, 3–9, and 5–7. In the following, we show an implementation of this algorithm for connected graphs with four vertices; this implementation includes a way to circumvent these counting issues.
An algorithm for connected graphs with four vertices
The general strategy of this algorithm is, for each vertex in the PPI network graph PU, to explore its adjacent vertices (and also some vertices adjacent to adjacent vertices) and store all connected subgraphs with four vertices found during that procedure. For each search initiated in a vertex v, the exploration is constrained to vertices smaller than v. Hence, we assume that there is a strict total order relation in the set of vertices of PU, as in the graph depicted in Figure 4.

Example of PPI network graph PU.
We start the description of this algorithm by a recursive function to identify the connected subgraphs of four vertices in a PPI network graph PU. To this end, we first introduce an additional concept: let G be a graph and X be a subset of vertices of G. G[X] is a subgraph of G induced by X if G[X] is a subgraph of G such that for every pair of vertices u and v of G [X], {u,v} is an edge of G[X] if and only if {u,v} is an edge of G.
The IDENTIFY-SUBGRAPHS function receives a PPI network graph PU and a vertex v of PU, and includes into a given set G all connected subgraphs of four vertices in PU that can be found in an exploration starting in v. In an initial call of this function, X is set as {v}, and k = 3. The pseudocode of this function is shown in Figure 5.

Pseudocode of a recursive function to identify the subgraphs from a PPI network graph.
In Table 1, we show a simulation of the IDENTIFY-SUBGRAPHS function for an exploration of the PPI network graph depicted in Figure 4 that starts in vertex 7. The main algorithm receives as input a set of cancer tissue subtypes and their respective set of samples and a PPI network graph PU. It returns a table Count which stores, for each subtype and each sample, the number of observations of the motif

Pseudocode of a search and count algorithm for connected graphs with four vertices.
Simulation of the IDENTIFY-SUBGRAPHS function.
After the execution of the 19th and last recursive call.
In the pseudocode shown in Figure 6, the function EXTRACT-TOPOLOGY called in line 11 avoids two motifs that are symmetric to each other of being counted separately. Once for connected graphs with four vertices, there are a few possible configurations when the graph isomorphism is taken into account. Figure 1 shows the six equivalent classes defined by all isomorphisms; we treated each case directly through a table that maps different pairs

A procedure to avoid two motifs that are symmetric to each other of being counted separately. At each figure, vertices names and their associated labels are, respectively, outside parentheses and between parentheses. The motifs of (A) and (B) are symmetric to each other, and the EXTRACT-TOPOLOGY function maps them to the same unique motif (C).
Motif frequency analysis
In this section, we describe the methodology of the analysis of the motif statistics produced using the algorithms presented in the previous section (ie, the data stored in the Count table). The adopted strategy involves the use of the Shannon entropy to assess motifs whose number of occurrences was concentrated in one or more subtypes.
Let
Results
The PPI graph used for this work has 7335 nodes, with degrees (ie, number of connections) varying from 1 to 3090 (for the protein UBC), with an average degree of 8. It is important to emphasize that present-day human PPI networks are still incomplete with a high rate of false negatives. We deal with this problem by using mass concentration analysis of relevant motifs. The breast cancer subtypes considered are presented in Table 2.
Number of samples in each breast cancer subtype.
Motif identification
In the following, a motif is just a subgraph where a vertex is associated with any protein with a given label. Consequently, the number of different motifs is 6 × 3 4 = 486, since there are six topologies (Fig. 1) and three possible labels for each vertex.
As expected, the vast majority of motifs have labels L = (0,0,0,0), indicating that all proteins have no differential expression compared to a normal cell in that respective sample. Their count is over 95 times the number of occurrences of any other labeled motif in a given subtype. In order to reduce the search space, we removed these motifs from further analysis.
In the following, for the sake of simplicity, we use the letters s, n, and o to indicate the expression levels –1 (suppressed), 0 (normal), and 1 (overexpressed), respectively.
Differential motifs
The counting algorithm of the “Motif identification” section and Expression (2) provide an estimate of motif distribution among breast cancer subtypes. Using Shannon's entropy (Expression (1)), we identified the motifs that are more represented in each breast cancer subtype.
With four subtypes, the maximum entropy is –4 × 0.25 log, 0.25 = 2, representing an uniform distribution. Table 3 shows all the motifs with entropy lesser than one, indicating a concentration of occurrences in one or two subtypes. We are interested in those that are strongly represented in one subtype and have a small relative number of occurrences in others.
Entropy of motifs and distribution among subtypes.
For instance, the first motif in Table 3, K4-osss (topology K4 and expression levels
Enrichment of differential motifs with specific ontologies in breast cancer subtypes
The differential motifs identified in Table 3 were further analyzed in terms of the composition of proteins in the PPI subgraph constructed for each breast tumor subtype. From the samples of each breast cancer subtype, we took the union of all instances of a motif, obtaining a set of proteins. These non-redundant lists (one for each subtype) were then evaluated regarding the enrichment of ontologies from the KEGG. 22 A total of 27 and 12 KEGG categories were exclusively present in TNBC and luminal A, respectively (Fig. 8). For luminal A, Kite-ssos was the most enriched motif in the KEGG categories as shown in Figure 8, while for TNBC, Linear-osoo was the most enriched motif.

KEGG enrichment analysis. The dotchart shows the enrichment analysis performed by using clusterProfiler. All motifs selected by Shannon entropy method are represented at the X-axis, while in the Y-axis, all the enriched KEGG pathways with P-value ≤ 0.05 are listed. The adjusted P-values are sorted from less (blue) to more (red) significant. Furthermore, the dot size is based on gene ratio, which is the observed number of genes in the experimental set within the respective KEGG pathway.
Interestingly, most of the KEGG categories found exclusively in TNBC are involved with tumorigenesis, including transcription misregulation in cancer, renal cell carcinoma, PI3K-AKT signaling pathway, pathways in cancer, mismatch repair, base excision repair, and microRNAs in cancer. This strongly suggests that our method is capable to identify motifs whose nodes are related to tumorigenesis. Furthermore, this also shows that the motifs identified by our methods belong to pathways related to cancer.
Use of motifs to complement the annotation of cancer pathways
One of the most complete annotations of cancer-related pathways is provided by the KEGG database. Its pathways in cancer entry have been widely used as a comprehensive landmark for a systemic view of cancer genes and proteins.
As mentioned earlier, the data in Figure 8 strongly support the view that the motifs identified by our method belong to cancer-related pathways. It is, thus, reasonable to speculate that the set of proteins belonging to our motifs might harbor new putative cancer genes. To evaluate this possibility, we first selected all instances of the motifs that contained at least one node from KEGG's pathways in cancer entry. Among these instances, half (51%) contained only one node from that respective KEGG entry. Instances with two, three, and four nodes from the KEGG entry corresponded roughly to 33%, 14%, and 2% of all instances, respectively. There was no difference between the breast cancer subtypes regarding the above distribution. Next, we select all other proteins belonging to the same motifs but not belonging to the KEGG's pathways in cancer. Proteins classified as overexpressed or underexpressed were separately ranked based on their frequency in the motifs.
The top candidates are listed in Table 4. Using the S-score method, recently developed by some of us, 23 we estimated that 52% of the genes listed in Table 4 (54 out of 103 genes) are cancer genes for breast tumor. One interesting protein that our method classifies as potentially oncogenic for triple negative is PSMB4, a subunit of the 26S proteasome that has been classified as a driver oncogene for several types of tumors. 24 Another interesting protein found by our method is YEATS2, a component of a histone acetyltransferase complex. YEATS2 has been found to be recurrently altered in chondrosarcoma tumors. 25
The ranked proteins associated with pathway in cancer.
In the list of underexpressed proteins, the transcription factor GATA3 was identified by our method as important in triple negative, and several reports have linked alterations in GATA3 with different aspect of breast tumor biology. 26 The data presented in Table 4 strongly suggest that our method is able to identify known and new cancer genes.
Discussion
The method presented in this study identifies the recurrent four-node motifs in the human PPI network superimposed with expression data from breast cancer patients from the TCGA project. The PPI graph permits to assess regions of the four-dimensional gene expression space and, based on gene vector distribution, recognize the differential gene vectors. The S-score method corroborated that most of the genes found are oncogenes for breast cancer tumor. Furthermore, up to a reasonable noise level in the PPI network, our method is quite robust, since we have considerable amount of data for the estimation of this kind of distribution (ie, the ones with very high mass concentration). Although breast cancer data were used here, the method can be used for any type of tumor, assuming that there are the corresponding gene expression data. Besides generating a representation map of such motifs in a cohort of breast cancer patients, the method allowed us to gain significant biological insights and identify potential new cancer proteins. Both pieces of information extracted by the method are relevant to increase our understanding of the structure and dynamics of the cancer cell control system. The identification of new cancer proteins tends to progressively complete the architecture of the signal network of cancer cells. Changes in the cancer cell architecture are usually associated with dynamics alterations. The distribution of motifs may allow the investigation of cell control system architecture transformations, which imply in increasing or decreasing the control disturbances in different generations of cancer cells. The effect of such alterations on the cell control systems should be associated with several phenotypes, including cancer malignancy. Thus, under this point of view, methodologies of the kind presented in this article may contribute to the discovery of relevant new knowledge about cancer cell control systems.
Author Contributions
Conceived and designed experiments: SJS, JB, MDG, MR. Data generation: AF, MDG, MR. Analyzed the data: MDG, JB, MR. Biological analysis: SJS, AF. Wrote the first draft: JB, SJS, MDG, MR. Contributed to the writing of the manuscript: JB, SJS, MR, MDG. Agree with manuscript results and conclusions: All authors. Jointly developed the structure and arguments for the paper: JB, SJS, MDG. Made critical revisions and approved final version: JB, SJS, MDG, MR. All authors reviewed and approved of the final manuscript.
