Discovering frequent induced subgraphs from directed networks

Abstract

Directed networks find many applications in computer science, social science and biomedicine, among others. In this paper we propose a new graph mining algorithm that is capable of locating all frequent induced subgraphs in a given set of directed networks. We present an incremental coding scheme for representing the canonical form of a graph, study its properties, and develop new techniques for pattern generation suitable for directed networks. We prove that our algorithm is complete, meaning that no qualified pattern is missed by the algorithm. Furthermore, our algorithm is correct in the sense that all patterns found by the algorithm are frequent induced subgraphs in the given networks. Experimental results based on synthetic data and gene regulatory networks show the good performance of our algorithm, and its application in network inference.

Keywords

Apriori algorithm, graph mining, network inference, structural pattern discovery

1. Introduction

Directed networks are graphs in which each node or vertex has a label, and each edge is directed from one vertex to another. Such graphs are used to model complex information or structures in a variety of applications. For example, in Internet engineering, the world wide web is modeled as a directed graph, in which each node represents a web page, and each edge is a hyperlink connecting two web pages [5]. In molecular biology, gene regulatory networks are represented by directed graphs, in which each node is a gene or transcription factor, and each edge represents a regulatory relationship between two genes [19]. In social network analysis, communications in a message board are modeled by a directed graph. In this graph, nodes represent individuals and edges represent the flow of influence, where the influence is calculated by considering the propagating terms in messages among individuals [21]. Figure 1 shows a directed network $G$ containing 5 nodes and 6 edges. Each node has a label represented by an English letter, and is associated with a number inside parentheses, where the number is randomly but uniquely assigned to distinguish the node from the other nodes in $G$ .

Figure 1.

Example of a directed network.

Graph mining aims to find frequent patterns in multiple graphs or networks. (We use the terms “graph” and “network” interchangeably in the paper, and the term “network” refers to “directed network” when the context is clear.)

As an extension of frequent itemset mining [1] and tree mining [32, 33, 34, 35], frequent pattern mining in graphs has been an active research topic. A variety of graph mining algorithms have been developed since the early 2000s [13, 17, 18, 23, 30]. Here we present a new algorithm, named FISHmine, for finding all Frequent Induced SubgrapHs in directed networks. Figure 2 shows an induced subgraph of the network $G$ in Fig. 1. The subgraph in Fig. 2 has three nodes, and consists of all the edges of $G$ with both end nodes in the subgraph. Note that, if we remove the edge from the node labeled $a(2)$ to the node labeled $a(1)$, we would obtain a general subgraph of $G$ that is not induced.

Figure 2.

An induced subgraph of the network in Fig. 1.

Discovering frequent induced subgraphs (FISHs) from directed networks is useful in a variety of applications. For example, many software tools have been developed to infer gene regulatory networks (GRNs) from gene expression data [19]. However, the accuracies of these tools are low. By finding common FISH patterns from the GRNs constructed by the existing software tools, one may be able to identify the true regulatory relationships between transcription factors and target genes. As another example, an influence diagram is a compact, directed graphical representation of a decision situation. In general, multiple influence diagrams can be obtained from multiple products. Finding common FISH patterns in multiple influence diagrams, which could be from the same or different manufacturers, may help identify common building blocks in making the products and also help identify common decision flows among the different manufacturers [24]. Below we review several methods that are closely related to our pattern finding algorithm.

1.1 Related work

AGM (Apriori-based Graph Mining) [13] is the first frequent subgraph mining algorithm that uses a pairwise-join based pattern growth strategy to generate frequent patterns. The algorithm employs the Apriori principle [1] and generates candidate subgraphs of size $(k+1)$ , i.e., $(k+1)$ -subgraphs, by joining two frequent $k$ -subgraphs that share the same $(k-1)$ -subgraphs. FSG [17] is another frequent subgraph mining algorithm, which is similar to AGM in that both algorithms grow patterns level by level through a pairwise join method. However, FSG grows patterns by edges while AGM grows patterns by vertices. FSG finds all frequent connected subgraphs in multiple undirected graphs.

The gSpan [30] program is arguably the most widely used tool for graph mining. The gSpan algorithm grows undirected connected subgraphs in a depth first search (DFS) manner by adding an edge to each possible position on the rightmost path of a known frequent subgraph. Specifically, the algorithm uses DFS lexicographic ordering to construct a tree-like lattice over all possible patterns, resulting in a hierarchical search space called a DFS code tree. Each node of this search tree represents a DFS code. This search tree is traversed in a DFS manner and all subgraphs with non-minimal DFS codes are pruned so that redundant candidate generations are avoided. With the assistance of the DFS code tree, gSpan can discover connected patterns from undirected graphs efficiently.

FFSM [10] focuses on graphs that are dense with a small number of labels. FFSM adopts the same canonical adjacency matrix (CAM) representation used by AGM. A tree-like structure, namely a suboptimal CAM tree, is constructed to include all possible patterns. Each node in this suboptimal CAM tree is created by either a join or an extension operation. FFSM maintains embedding lists for the discovered patterns to avoid explicit subgraph isomorphism testing in the support counting phase [15].

Gaston [23] is a unified graph, sequence and tree extraction algorithm. Given a database of graphs, Gaston finds all frequent subgraphs by using a level-wise approach in which simple paths are first considered, then more complex trees and finally the most complex graphs are considered. Consequently the subgraph mining procedure is invoked only when needed. To determine the frequency of a graph, Gaston employs an occurrence list based approach in which all occurrences of a small set of graphs are stored in main memory. Gaston also maintains embedding lists when growing patterns, to avoid unnecessary subgraph isomorphism testing.

The limitations of these existing methods are that they are mainly designed and implemented for finding frequent general, not induced, subgraphs in undirected networks. By contrast, our work focuses on finding frequent induced subgraphs in directed networks. In addition, our algorithm tackles both connected and disconnected subgraphs, which are common in social and biological networks as those networks are often sparse graphs. Because of these three properties (induced subgraphs, directed networks and disconnected subgraphs), our pattern growth procedure is different from those employed by the existing methods. Moreover, we incorporate a novel de-canonicalization component, which was not used in the existing methods, into our pattern mining process to ensure that no candidate pattern is missed, and all candidate patterns can be generated by our algorithm.

The rest of this paper is organized as follows. Section 2 presents basic concepts and terminologies. Section 3 describes in detail the FISHmine algorithm and shows the correctness and completeness of the algorithm. Section 4 presents experimental results as well as an application of FISHmine in gene network inference. Section 5 concludes the paper.

2. Preliminaries

We consider directed networks or graphs in which each node has a label and edges have a direction associated with them. Each graph has a finite number of nodes and edges. We allow different nodes to have the same label; cf. Fig. 1. The graphs have neither self loops (edges that connect nodes to themselves) nor multiple directed edges (two or more edges that connect the same two nodes in the same direction). Furthermore, the networks and subgraphs need not be connected.

We adopt the following commonly used definitions and terms to explain our algorithm design.

(Node-labeled directed graph).

A node-labeled directed graph $G$ is defined as $G=(V,E,L)$ where (i) $V=\{v_{1},v_{2},\ldots,v_{n}\}$ is a set of nodes or vertices, (ii) $E\subset V\times V$ is a set of edges in which $<u,v>$ denotes the edge from node $u$ to node $v$, and (iii) for each node $v\in V$, $L(v)$ is the label of $v$.

The number of vertices, i.e., the cardinality of $V$ , denoted by $|V|=n$ , is called the order or size of graph $G$ . The network in Fig. 1 is a graph of size 5. A $k$ -(sub)graph is a (sub)graph of size $k$ .

(Induced subgraph).

Let $G_{h}=(V_{h},E_{h},L_{h})$ and $G_{g}=(V_{g},E_{g},L_{g})$ be two graphs with $|V_{h}|\leqslant|V_{g}|$. $G_{h}$ is said to be an induced subgraph of $G_{g}$ if there is an injective function $f$ that maps the nodes of $G_{h}$ to the nodes of $G_{g}$ such that (i) for each node $v$ in $V_{h}$, there is a node $f(v)$ in $V_{g}$ where $L_{h}(v)=L_{g}(f(v))$, and (ii) for all ordered pairs of nodes $u$, $v$ in $V_{h}$, edge $<u,v>$ is in $E_{h}$ if and only if edge $<f(u),f(v)>$ is in $E_{g}$.

Note that the induced subgraph isomorphism problem is different from the general subgraph isomorphism problem. In induced subgraph isomorphism, the absence of an edge $<u,v>$ in $G_{h}$ implies that the corresponding edge $<f(u),f(v)>$ must also be absent in $G_{g}$, whereas that edge may be present in $G_{g}$ in general subgraph isomorphism. In this work, we deal with induced subgraph isomorphism rather than general subgraph isomorphism.

(Support).

Given a set $\cal D$ of directed graphs or networks, the support of a graph $H$ , denoted $\textit{sup}(H)$ , is the number of networks $G\in\cal D$ where $H$ is an induced subgraph of $G$ .
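The definitions of induced subgraph and support can be checked directly by brute force. The Python sketch below is illustrative only and is not the verification procedure used by FISHmine; the function names, the (labels, adjacency matrix) representation and the three-node pattern on $a(1)$, $a(2)$, $b(4)$ are assumptions made for this example. The network is the one given by Eq. (1).

```python
from itertools import permutations

def is_induced_subgraph(h_labels, h_adj, g_labels, g_adj):
    """Return True if H embeds into G as an induced subgraph (brute force)."""
    k, n = len(h_labels), len(g_labels)
    for image in permutations(range(n), k):  # image[i] plays the role of f(v_i)
        # Condition (i): labels must be preserved by f.
        if any(h_labels[i] != g_labels[image[i]] for i in range(k)):
            continue
        # Condition (ii): an edge is in H iff the mapped edge is in G.
        if all(h_adj[i][j] == g_adj[image[i]][image[j]]
               for i in range(k) for j in range(k) if i != j):
            return True
    return False

def support(h_labels, h_adj, database):
    """Number of graphs in the database containing H as an induced subgraph."""
    return sum(is_induced_subgraph(h_labels, h_adj, gl, ga) for gl, ga in database)

# Network of Fig. 1, numbered as in Eq. (1): a(1), a(2), c(3), b(4), c(5).
G_LABELS = ["a", "a", "c", "b", "c"]
G_ADJ = [[0, 0, 1, 1, 0],
         [1, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 1, 1, 0, 0],
         [0, 0, 0, 1, 0]]

# Pattern on a(1), a(2), b(4) keeping every edge of G among these nodes.
H_LABELS = ["a", "a", "b"]
H_INDUCED = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
# Same pattern with the edge a(2) -> a(1) dropped: a general, non-induced subgraph.
H_GENERAL = [[0, 0, 1], [0, 0, 0], [0, 1, 0]]

print(is_induced_subgraph(H_LABELS, H_INDUCED, G_LABELS, G_ADJ))  # True
print(is_induced_subgraph(H_LABELS, H_GENERAL, G_LABELS, G_ADJ))  # False
print(support(H_LABELS, H_INDUCED, [(G_LABELS, G_ADJ)]))          # 1
```

The search enumerates all injective maps, so it is exponential in the pattern size and is meant only to make the definitions concrete.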

(All frequent induced subgraph mining).

Given a set $\cal D$ of directed networks and a minimum support (minsup), the all frequent induced subgraph mining problem is to find all induced subgraphs in $\cal D$ such that the support of each found induced subgraph is no less than minsup.

(Node-adjacency matrix).

Suppose we arbitrarily and uniquely number the nodes of a graph $G$ with $|V|=n$ nodes as $v_{1},v_{2},\ldots,v_{n}$ . Then, the node-adjacency matrix of $G$ , denoted by $\textit{NAM}(G)$ , is a square $n\times n$ matrix whose rows and columns correspond to the numbered nodes such that the element at the $i$ th row and the $j$ th column of $\textit{NAM}(G)$ is one if there is an edge from node $i$ to node $j$ , and zero otherwise.

In general, the form of a node-adjacency matrix depends on the node numbering, where the node numbering is given by a permutation of indices $1,\ldots,n$ . Suppose the node numbering in Fig. 1 is given by the permutation $\pi=(1,2,3,4,5)$ . Equation (1) below shows the node-adjacency matrix of the network in Fig. 1, where the node numbers are as specified in the parentheses in Fig. 1. Notice that, since the graphs we consider here do not have self loops, the diagonal entries of all node-adjacency matrices are zeros.

$\displaystyle A_{\pi}=\left(\begin{array}[]{ccccc}0&0&1&1&0\\ 1&0&0&0&0\\ 0&0&0&0&0\\ 0&1&1&0&0\\ 0&0&0&1&0\\ \end{array}\right)$ (1)

The above form would have been adequate for graphs without node labels. However, for a graph whose nodes have labels, its node-adjacency matrix needs to be augmented by its node label array in order to fully describe the graph. Given a particular permutation of node indices, we can always construct a node-adjacency matrix and its corresponding node label array. For example, the node-adjacency matrix of the network in Fig. 1 can be alternatively constructed based on a new permutation $\pi^{\prime}=(1,2,4,3,5)$ , as shown in Eq. (2). In the original permutation $\pi$ , we have $c(3)$ (the node labeled $c$ is numbered 3) and $b(4)$ (the node labeled $b$ is numbered 4). In the new permutation $\pi^{\prime}$ , these two nodes are renumbered such that the node labeled $b$ is now numbered 3 and the node labeled $c$ is now numbered 4. This permutation concept plays an important role in our network mining algorithm.

$\displaystyle A_{\pi^{\prime}}=\bordermatrix{&a(1)&a(2)&b(4)&c(3)&c(5)\cr a(1)&0&0&1&1&0\cr a(2)&1&0&0&0&0\cr b(4)&0&1&0&1&0\cr c(3)&0&0&0&0&0\cr c(5)&0&0&1&0&0}$ (2)
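To make the dependence on node numbering concrete, the following Python sketch builds the node-adjacency matrix of the network in Fig. 1 under the two permutations discussed above. The edge list is read off Eq. (1); the function name `node_adjacency_matrix` and the list-of-lists representation are assumptions made for this illustration.

```python
def node_adjacency_matrix(edges, order):
    """Build the node-adjacency matrix for the given node ordering.

    `order` lists the original node numbers in the order in which they
    index the rows and columns of the matrix.
    """
    position = {node: i for i, node in enumerate(order)}
    n = len(order)
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[position[u]][position[v]] = 1  # edge from u to v
    return matrix

# Edges of the network in Fig. 1, using the node numbers in parentheses.
EDGES = [(1, 3), (1, 4), (2, 1), (4, 2), (4, 3), (5, 4)]

A_PI = node_adjacency_matrix(EDGES, [1, 2, 3, 4, 5])        # Eq. (1)
A_PI_PRIME = node_adjacency_matrix(EDGES, [1, 2, 4, 3, 5])  # Eq. (2)

for row in A_PI_PRIME:
    print(row)
```

Both matrices describe the same network; only the row and column order of nodes 3 and 4 differs.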

3. The proposed approach

3.1 Algorithm specific definitions and notation

(Node descriptor).

Given the node-adjacency matrix $A$ of graph $G=(V,E,L)$ where $V=\{v_{1},v_{2},\ldots,v_{n}\}$ , the node descriptor of the $i$ th node, $v_{i}$ , of $G$ is a pair comprising the node’s label and its connectivity string, defined as follows. Let

$\displaystyle A=\left(\begin{array}{cccc}a_{11}&a_{12}&\cdots&a_{1n}\\ a_{21}&a_{22}&\cdots&a_{2n}\\ \vdots&\vdots&\ddots&\vdots\\ a_{n1}&a_{n2}&\cdots&a_{nn}\\ \end{array}\right)$ (3)

where

$\displaystyle a_{ij}=\begin{cases}1&\mbox{if }<v_{i},v_{j}>\,\in E\\ 0&\mbox{if }<v_{i},v_{j}>\,\notin E\end{cases}$ (4)

Let $L(v_{i})=L(i)$ be the label of $v_{i}$ . Let $C(v_{i})=C(i)$ be the connectivity string of $v_{i}$ where

$\displaystyle C(i)=\begin{cases}a_{11}&\mbox{if }i=1\\ a_{12}a_{21}a_{22}&\mbox{if }i=2\\ a_{1i}a_{i1}\ldots a_{(i-1)i}a_{i(i-1)}a_{ii}&\mbox{if }i\geqslant 3\end{cases}$ (5)

The node descriptor of $v_{i}$ , denoted $nd(v_{i})=nd(i)$ , is defined as $nd(i)=(L(i),C(i))$ .

Table 1 shows the descriptor of each node in the matrix in Eq. (1), which represents the network $G$ in Fig. 1. This matrix corresponds to one permutation of the nodes in $G$. Our scheme for encoding a node by its descriptor is incremental in the sense that, under a given node numbering, the later a node appears (i.e., the larger its number), the longer its descriptor is: by Eq. (5), the connectivity string of the $i$th node has $2i-1$ bits. Since both node labels and connectivity strings can be treated as character strings, we can define a total order on node labels and on connectivity strings, respectively, based on the lexicographical order of strings. The following extends this total order to the ordering of node descriptors.

Table 1

Illustration of node descriptors

Node	Connectivity string	Node descriptor
$a(1)$	0	($a$, 0)
$a(2)$	010	($a$, 010)
$c(3)$	10000	($c$, 10000)
$b(4)$	1001010	($b$, 1001010)
$c(5)$	000000010	($c$, 000000010)
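The entries of Table 1 can be reproduced mechanically from Eq. (1) and Eq. (5). The Python sketch below is one possible encoding; the helper names `connectivity_string` and `node_descriptor` and the 0-based indexing are assumptions of this illustration.

```python
def connectivity_string(a, i):
    """Connectivity string C(i+1) of Eq. (5) for the 0-based node index i."""
    bits = []
    for j in range(i):
        bits.append(str(a[j][i]))  # edge from an earlier node j to node i
        bits.append(str(a[i][j]))  # edge from node i to an earlier node j
    bits.append(str(a[i][i]))      # diagonal entry (always 0 without self loops)
    return "".join(bits)

def node_descriptor(labels, a, i):
    """Pair (label, connectivity string) of the node at 0-based index i."""
    return (labels[i], connectivity_string(a, i))

# Matrix of Eq. (1) with the node labels of Fig. 1.
LABELS = ["a", "a", "c", "b", "c"]
A = [[0, 0, 1, 1, 0],
     [1, 0, 0, 0, 0],
     [0, 0, 0, 0, 0],
     [0, 1, 1, 0, 0],
     [0, 0, 0, 1, 0]]

DESCRIPTORS = [node_descriptor(LABELS, A, i) for i in range(len(LABELS))]
for d in DESCRIPTORS:
    print(d)
```

The string for the node at 0-based index `i` has `2 * i + 1` characters, which is the incremental growth described above.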

(Ordering of node descriptors).

For any two nodes $v_{i}$ and $v_{i^{\prime}}$ in graph $G=(V,E,L)$ , we say $nd(i)<nd(i^{\prime})$ if one of the following two conditions holds: (i) $L(v_{i})<L(v_{i^{\prime}})$ , or (ii) $L(v_{i})=L(v_{i^{\prime}})$ and $C(i)<C(i^{\prime})$ .

By concatenating all the node descriptors together, we can obtain a code for the whole node-adjacency matrix, as defined below.

(Matrix descriptor).

Given a node-adjacency matrix of graph $G$ , denoted $\textit{NAM}(G)$ , its matrix descriptor, denoted $md(\textit{NAM}(G))$ , is defined as the sequence formed by concatenating all the node descriptors of $G$ . That is, $md(\textit{NAM}(G))$ equals $nd(1)nd(2)\ldots nd(n)$ , where $nd(i)$ , $1\leqslant i\leqslant n$ , is the descriptor of the $i$ th node in the matrix.

Again, this scheme for coding a node-adjacency matrix is incremental in that the sequence is obtained by concatenating the node descriptors with gradually increasing lengths. Refer to the node-adjacency matrix in Eq. (1). By concatenating the node descriptors in Table 1, we obtain the following descriptor for the matrix: ( $a$ , 0) ( $a$ , 010) ( $c$ , 10000) ( $b$ , 1001010) ( $c$ , 000000010). In general, there are many node permutations for a given network $G$ with $n$ nodes, and these permutations may lead to the same or different node-adjacency matrices. To compare two matrix descriptors, we can treat each matrix descriptor as a string of $n$ special characters, where each special character corresponds to a node descriptor. We compare two special characters based on the ordering of their corresponding node descriptors. Then, we can compare the two matrix descriptors or strings based on their lexicographical order.

(Canonical form of graph).

Let $\cal M$ denote the set of all node-adjacency matrices for graph $G$ . There exists a matrix $A_{c}\in\cal M$ such that $md(A_{c})\leqslant md(A)$ for all $A\in\cal M$ , i.e., $md(A_{c})$ is the minimum matrix descriptor for $G$ . We say $A_{c}$ is canonicalized, i.e., $A_{c}$ is the canonical node-adjacency matrix or canonical form of $G$ . The graph $G$ is canonicalized if $G$ is represented by $A_{c}$ , i.e., the nodes of $G$ are numbered based on the node ordering in $A_{c}$ .

By introducing the canonical form of a graph, we are able to check whether a graph $G$ is a duplicate of (i.e., isomorphic to) another graph $G^{\prime}$. Specifically, we first canonicalize each graph, i.e., represent each graph by its canonical form, and then compare the canonical forms of the graphs. $G$ is a duplicate of $G^{\prime}$ if and only if their canonical forms are the same.
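As a small illustration of duplicate detection through canonical forms, the Python sketch below finds the minimum matrix descriptor by trying every node permutation. This exhaustive search is an assumption made for clarity, not necessarily how FISHmine canonicalizes graphs, and the function names are hypothetical. Python compares tuples of (label, connectivity string) pairs lexicographically, which matches the ordering of node descriptors defined above.

```python
from itertools import permutations

def matrix_descriptor(labels, a):
    """Sequence of node descriptors nd(1) ... nd(n) for the given numbering."""
    return tuple(
        (labels[i],
         "".join(f"{a[j][i]}{a[i][j]}" for j in range(i)) + str(a[i][i]))
        for i in range(len(labels))
    )

def canonical_form(labels, a):
    """Return (descriptor, labels, matrix) with the minimum matrix descriptor."""
    best = None
    for perm in permutations(range(len(labels))):
        p_labels = [labels[p] for p in perm]
        p_matrix = [[a[p][q] for q in perm] for p in perm]
        d = matrix_descriptor(p_labels, p_matrix)
        if best is None or d < best[0]:
            best = (d, p_labels, p_matrix)
    return best

# The network of Fig. 1 under the two numberings of Eq. (1) and Eq. (2).
G1 = (["a", "a", "c", "b", "c"],
      [[0, 0, 1, 1, 0], [1, 0, 0, 0, 0], [0, 0, 0, 0, 0],
       [0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
G2 = (["a", "a", "b", "c", "c"],
      [[0, 0, 1, 1, 0], [1, 0, 0, 0, 0], [0, 1, 0, 1, 0],
       [0, 0, 0, 0, 0], [0, 0, 1, 0, 0]])

# The two numberings describe the same graph, so the canonical forms agree.
print(canonical_form(*G1)[0] == canonical_form(*G2)[0])  # True
```

The search visits all $n!$ permutations, so it is only practical for small patterns.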

While various canonical forms, including the one above, have been studied previously [13, 17, 30], the following properties and algorithm design are new: they either have not been addressed by, or differ from, the previous methods.

Theorem 1.

$A_{c}$ is the canonical form of graph $G$ if and only if $nd_{A_{c}}(i)\leqslant nd_{A_{c}}(j)$ , $1\leqslant i\leqslant j\leqslant n$ , and $nd_{A_{c}}(i)\leqslant nd_{A}(i)$ , $1\leqslant i\leqslant n$ , for every node-adjacency matrix $A\in\cal M$ , where $nd_{A}(i)$ is the node descriptor of the $i$ th node based on the node numbering in $A$ and $\cal M$ is the set of all node-adjacency matrices for graph $G$ .

Proof.

(Only if) Suppose $A_{c}$ is the canonical form of $G$. Then we must have $nd_{A_{c}}(1)\leqslant nd_{A_{c}}(2)\leqslant\ldots\leqslant nd_{A_{c}}(n)$ . If not, assume $nd_{A_{c}}(i)>nd_{A_{c}}(i+1)$ for some $i$ . We could switch the columns and rows of the $i$ th and $(i+1)$ th nodes of $A_{c}$ to get a new node-adjacency matrix $A^{\prime}$ such that $nd_{A^{\prime}}(i)<nd_{A_{c}}(i+1)<nd_{A_{c}}(i)$ . Thus, $md(A^{\prime})<md(A_{c})$ , contradicting the fact that $A_{c}$ is the canonical form. Similarly, since $A_{c}$ is the canonical form, we know that $nd_{A_{c}}(i)\leqslant nd_{A}(i)$ , $1\leqslant i\leqslant n$ , for every node-adjacency matrix $A\in\cal M$ . (If) Conversely, suppose the two conditions hold and assume, for contradiction, that $A_{c}$ is not the canonical form. Then there would exist a node-adjacency matrix $A\in\cal M$ with $md(A)<md(A_{c})$ , and hence $nd_{A}(i)<nd_{A_{c}}(i)$ at the first position $i$ where the two matrix descriptors differ. This contradicts the assumption that $nd_{A_{c}}(i)\leqslant nd_{A}(i)$ , $1\leqslant i\leqslant n$ , for every node-adjacency matrix $A\in\cal M$ . Hence, $A_{c}$ is the canonical form of $G$ , which completes the proof. ∎

(Core of node-adjacency matrix).

Let $A$ be a node-adjacency matrix with $k$ nodes (i.e., $A$ has $k$ rows and columns). The sub-matrix consisting of the first $k-1$ rows and columns (i.e., the first $k-1$ nodes) of $A$ is called the core of $A$ .

The concatenation of the node descriptors of the first $k-1$ nodes, referred to as the $(k-1)$ -prefix, of $A$ forms the matrix descriptor of the core of $A$ .

(Core-canonicalized form).

Let $A$ be a node-adjacency matrix. $A$ is said to be in core-canonicalized form if the core of $A$ is canonicalized.

According to Theorem 1, it is easy to see that if a matrix $A$ is in canonical form, the core of $A$ is automatically canonicalized.

(Equivalence class).

Let $G_{1}$ and $G_{2}$ be two graphs with the same size. The nodes of $G_{1}$ and $G_{2}$ are numbered based on some order, and their corresponding node-adjacency matrices are $A_{1}$ and $A_{2}$ respectively. $G_{1}$ and $G_{2}$ are said to be in the same equivalence class if $A_{1}$ and $A_{2}$ share the same core.

When $A_{1}$ and $A_{2}$ are canonicalized, their shared core is canonicalized too.

3.2 Properties of the canonical form

Let $G$ be a graph of $n$ nodes and $A_{c}$ be its canonical form. The nodes of $G$ are numbered based on the node ordering in $A_{c}$ , i.e., $G$ is canonicalized. The last node of $G$ is $v_{n}$ and the second to the last node of $G$ is $v_{n-1}$ .

Property 1.
Suppose we remove $v_{n}$ as well as the outgoing and incoming edges of $v_{n}$ from $G$ . Call the remaining subgraph $G^{\prime}$ . Then $G^{\prime}$ is canonicalized. To prove this property, we assume, for contradiction, that $G^{\prime}$ is not canonicalized for the remaining $n-1$ nodes. Suppose $G^{\prime\prime}$ is canonicalized for the remaining $n-1$ nodes. Let $\textit{NAM}(G^{\prime})$ be the node-adjacency matrix based on which the nodes of $G^{\prime}$ are numbered. Let $\textit{NAM}(G^{\prime\prime})$ be the node-adjacency matrix based on which the nodes of $G^{\prime\prime}$ are numbered. Then $md(\textit{NAM}(G^{\prime\prime}))<md(\textit{NAM}(G^{\prime}))$ . Thus, the concatenation of $md(\textit{NAM}(G^{\prime\prime}))$ and $nd(v_{n})$ would be smaller than the concatenation of $md(\textit{NAM}(G^{\prime}))$ and $nd(v_{n})$ , which equals $md(A_{c})$ . This contradicts the fact that $G$ is canonicalized. Hence $G^{\prime}$ must be canonicalized.
Property 2.
Suppose we remove any node $v$ before $v_{n}$ (i.e., $v$ ’s number is smaller than $v_{n}$ ’s number) as well as the outgoing and incoming edges of $v$ from $G$ . Call the remaining subgraph $G^{\prime}$ . $G^{\prime}$ may or may not be canonicalized. On the other hand, let $\textit{NAM}(G^{\prime})$ be the node-adjacency matrix based on which the nodes of $G^{\prime}$ are numbered. Then $\textit{NAM}(G^{\prime})$ must be in core-canonicalized form, i.e., the core of $G^{\prime}$ must be canonicalized. Of particular interest is a special case of this property where $v=v_{n-1}$ .

3.3 Algorithm design

The proposed FISHmine algorithm employs the Apriori principle [1], which says that if a pattern is frequent (i.e., its support is greater than or equal to minsup), then all its sub-patterns must also be frequent. Specifically, we join smaller subgraphs that are already frequent to generate candidate subgraphs of larger sizes. Each candidate subgraph must go through an induced subgraph isomorphism test [26, 28], and the candidate subgraph becomes a qualified pattern if it passes the test. The following subsections describe the major phases of our algorithm.

3.3.1 Phase 1: Single-node subgraph discovery

Any single-node subgraph is automatically canonicalized. It is straightforward to find all frequent 1-subgraphs (i.e., frequent node labels) and their supporting lists. Specifically, we pre-process the graphs in the dataset $\cal D$ . For each node label, we create a supporting list comprising the identifiers (ids) of the graphs that contain a node with that label.

3.3.2 Phase 2: Two-node subgraph discovery

The second phase of our pattern growth algorithm is to find all frequent 2-subgraphs (i.e., subgraphs of size 2). Phase 1 produces all frequent 1-subgraphs (i.e., frequent node labels). In phase 2, we generate candidate 2-subgraphs by considering all possible pairs of the frequent node labels. Suppose $a$ is a frequent node label. Figure 3 shows four possible candidate 2-subgraphs generated from $a$ , $a$ . Two of these four candidate subgraphs are actually identical to each other. Notice that a subgraph need not be connected (Fig. 3a shows a disconnected $2$ -subgraph). These examples in Fig. 3 will also be used to explain the self-join situation that occurs when generating candidate $(k+1)$ -subgraphs from frequent $k$ -subgraphs in phase 3 of our algorithm described in the following subsection.

Figure 3.

Four candidate $2$ -subgraphs generated from node labels $a$ , $a$ .

For each candidate 2-subgraph, its supporting list is constructed by taking the intersection of the supporting lists of the two corresponding node labels. If the cardinality of the intersection list is greater than or equal to minsup, we calculate the support of the candidate 2-subgraph by invoking the candidate pattern verification procedure (i.e., induced subgraph isomorphism test [26, 28]). We keep only frequent 2-subgraphs. Clearly, each frequent 2-subgraph is a frequent induced subgraph.
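Phases 1 and 2 can be sketched as follows. This Python fragment is a simplified illustration under stated assumptions: graphs are (labels, adjacency matrix) pairs, graph ids are list positions, and the function names are hypothetical. It builds the supporting lists of node labels and then enumerates candidate 2-subgraphs whose intersected supporting lists are large enough; the induced subgraph isomorphism test that would follow is not shown.

```python
def label_supporting_lists(database):
    """Map each node label to the set of ids of graphs containing it."""
    lists = {}
    for gid, (labels, _adj) in enumerate(database):
        for label in set(labels):
            lists.setdefault(label, set()).add(gid)
    return lists

def candidate_two_subgraphs(lists, minsup):
    """Yield (labels, matrix, supporting ids) for candidate 2-subgraphs."""
    frequent = sorted(l for l, ids in lists.items() if len(ids) >= minsup)
    candidates = []
    for i, x in enumerate(frequent):
        for y in frequent[i:]:
            shared = lists[x] & lists[y]  # graphs that could support the pair
            if len(shared) < minsup:
                continue
            # (edge x->y, edge y->x): none, forward, backward, both.
            configs = [(0, 0), (1, 0), (0, 1), (1, 1)]
            if x == y:
                configs.remove((0, 1))  # mirror image of (1, 0), cf. Fig. 3
            for fwd, back in configs:
                candidates.append(((x, y), [[0, fwd], [back, 0]], shared))
    return candidates

DATABASE = [
    (["a", "b"], [[0, 1], [0, 0]]),
    (["a", "b", "c"], [[0, 1, 0], [0, 0, 1], [0, 0, 0]]),
    (["a"], [[0]]),
]
LISTS = label_supporting_lists(DATABASE)
CANDIDATES = candidate_two_subgraphs(LISTS, minsup=2)
print(len(CANDIDATES))  # 10
```

The intersection only gives an upper bound on the support: for example, a graph with a single node labeled $a$ appears in the list for the pair $a$, $a$ but cannot contain that 2-subgraph, which is why the verification step is still required.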

3.3.3 Phase 3: $(k+1)$-subgraph discovery via level-by-level pairwise join

Let $G_{1}$ and $G_{2}$ be two $k$ -subgraphs that are in the same equivalence class. Let $A_{1}$ and $A_{2}$ be their corresponding node-adjacency matrices. By definition, $A_{1}$ and $A_{2}$ share the same core. Consider how to join $G_{1}$ and $G_{2}$ to generate $(k+1)$ -subgraphs. Let the nodes of $G_{1}$ ( $G_{2}$ respectively) be $u_{1},\ldots,u_{k}$ ( $v_{1},\ldots,v_{k}$ respectively). Look at the last nodes $u_{k}$ in $G_{1}$ and $v_{k}$ in $G_{2}$ . There are two cases. Note that these two cases do not exist for undirected graphs, since edges in undirected graphs do not have directions.

Case 1:
$nd(u_{k})\neq nd(v_{k})$ . Without loss of generality, assume $nd(u_{k})<nd(v_{k})$ . In this case, we generate four $(k+1)$ -subgraphs by assigning $v_{k}$ to be the last node and $u_{k}$ to be the second to the last node in all the four $(k+1)$ -subgraphs. Furthermore, we create edges as follows. In the first $(k+1)$ -subgraph, there is no edge between $u_{k}$ and $v_{k}$ . In the second $(k+1)$ -subgraph, there is an edge from $u_{k}$ to $v_{k}$ . In the third $(k+1)$ -subgraph, there is an edge from $v_{k}$ to $u_{k}$ . In the fourth $(k+1)$ -subgraph, there is an edge from $u_{k}$ to $v_{k}$ and there is an edge from $v_{k}$ to $u_{k}$ .
Case 2:
$nd(u_{k})=nd(v_{k})$ . Thus, the label of $u_{k}$ must be the same as the label of $v_{k}$ . Here, we deal with a self-join situation. We can generate three $(k+1)$ -subgraphs as follows. As in case 1, $v_{k}$ is the last node and $u_{k}$ is the second to the last node in all the three $(k+1)$ -subgraphs. In the first $(k+1)$ -subgraph, there is no edge between $u_{k}$ and $v_{k}$ . In the second $(k+1)$ -subgraph, there is an edge from $u_{k}$ to $v_{k}$ . In the third $(k+1)$ -subgraph, there is an edge from $u_{k}$ to $v_{k}$ and there is an edge from $v_{k}$ to $u_{k}$ . We do not need to generate the $(k+1)$ -subgraph $G$ in which there is an edge from $v_{k}$ to $u_{k}$ , since the labels of $u_{k}$ and $v_{k}$ are the same and $G$ would be the same as the second $(k+1)$ -subgraph above. Figure 3 illustrates this self-join case.
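The two cases can be sketched in Python as follows. This is a minimal illustration, assuming graphs are given as a label list plus an adjacency matrix whose first $k-1$ rows and columns form the shared core; the names `last_node_descriptor` and `join` are hypothetical. The example joins the matrices that appear later as Eq. (7) and Eq. (8) and recovers the matrix of Eq. (6).

```python
def last_node_descriptor(labels, a):
    """Node descriptor (label, connectivity string) of the last node."""
    i = len(labels) - 1
    conn = "".join(f"{a[j][i]}{a[i][j]}" for j in range(i)) + str(a[i][i])
    return (labels[i], conn)

def join(labels1, a1, labels2, a2):
    """Join two k-subgraphs sharing a core into candidate (k+1)-subgraphs."""
    k = len(labels1)
    core = k - 1
    # Both graphs must be in the same equivalence class (same core).
    assert labels1[:core] == labels2[:core]
    assert [r[:core] for r in a1[:core]] == [r[:core] for r in a2[:core]]
    nd_u = last_node_descriptor(labels1, a1)
    nd_v = last_node_descriptor(labels2, a2)
    if nd_v < nd_u:  # the node with the smaller descriptor goes second to last
        labels1, a1, labels2, a2 = labels2, a2, labels1, a1
    # (edge u->v, edge v->u): none, forward, backward, both (Case 1).
    options = [(0, 0), (1, 0), (0, 1), (1, 1)]
    if nd_u == nd_v:
        options.remove((0, 1))  # Case 2: the backward variant is a duplicate
    candidates = []
    for uv, vu in options:
        m = [row[:] + [0] for row in a1] + [[0] * (k + 1)]
        for j in range(core):  # copy the edges between v_k and the core
            m[j][k] = a2[j][core]
            m[k][j] = a2[core][j]
        m[core][k], m[k][core] = uv, vu
        candidates.append((labels1 + [labels2[-1]], m))
    return candidates

A1 = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0]]  # Eq. (7)
A2 = [[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]]  # Eq. (8)
RESULT = join(["a"] * 4, A1, ["a"] * 4, A2)
print(len(RESULT))                          # 4 (Case 1)
print(len(join(["a"], [[0]], ["a"], [[0]])))  # 3 (Case 2, self-join)
```

Canonicalization and support counting of the generated candidates are separate steps and are not part of this sketch.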

Observe that the $(k+1)$ -subgraphs generated from two $k$ -subgraphs may or may not be canonicalized. If both of the two $k$ -subgraphs are already canonicalized, the generated $(k+1)$ -subgraphs are also canonicalized. If neither of the two $k$ -subgraphs is canonicalized, the generated $(k+1)$ -subgraphs need to be further canonicalized. This necessitates a canonicalization process, through which the candidate subgraphs are canonicalized so that we can detect and remove duplicate candidate subgraphs that may possibly exist. We then calculate the support values of the resulting non-redundant candidate subgraphs by invoking the candidate pattern verification procedure (i.e., induced subgraph isomorphism test [26, 28]).

3.3.4 Missed candidate patterns

Performing candidate subgraph generation by considering only canonical forms may miss some candidate patterns. For example, consider the node-adjacency matrix $A$ below, which is in canonical form.

$\displaystyle A=\bordermatrix{&a&a&a&a&a\cr a&0&0&0&0&1\cr a&0&0&0&0&1\cr a&0&0&0&0&0\cr a&0&0&1&0&0\cr a&0&0&0&0&0}$ (6)

After removing the last node, we get the following matrix $A_{1}$ , which is also in canonical form.

$\displaystyle A_{1}=\bordermatrix{&a&a&a&a\cr a&0&0&0&0\cr a&0&0&0&0\cr a&0&0&0&0\cr a&0&0&1&0}$ (7)

On the other hand, if we remove the second to the last node, we get the following matrix $A_{2}$ , which is not in canonical form. Rather, the corresponding canonical form should be $A_{3}$ .

$\displaystyle A_{2}=\bordermatrix{&a&a&a&a\cr a&0&0&0&1\cr a&0&0&0&1\cr a&0&0&0&0\cr a&0&0&0&0}$ (8)

$\displaystyle A_{3}=\bordermatrix{&a&a&a&a\cr a&0&0&0&0\cr a&0&0&0&1\cr a&0&0&0&1\cr a&0&0&0&0}$ (9)

To avoid duplicate patterns of size 4, all frequent 4-subgraphs must have been canonicalized. As a result, $A_{2}$ must be represented by $A_{3}$ . However, when joining the two canonical forms $A_{1}$ and $A_{3}$ , we cannot get $A$ back. Thus we would miss $A$ in the process of joining frequent 4-subgraphs to generate candidate 5-subgraphs.
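The missed pattern can be confirmed with a short brute-force check. The Python sketch below is an illustration under assumptions: all nodes carry the label $a$ as in Eqs. (6) to (9), the helper names are hypothetical, canonical descriptors are found by exhaustive permutation, and `join_candidates` simply appends the last node of the second matrix after the last node of the first with the four possible edge configurations between them.

```python
from itertools import permutations

def descriptor(a):
    """Matrix descriptor for a graph whose nodes all have label 'a'."""
    return tuple(
        ("a", "".join(f"{a[j][i]}{a[i][j]}" for j in range(i)) + str(a[i][i]))
        for i in range(len(a))
    )

def canonical_descriptor(a):
    """Minimum matrix descriptor over all node permutations (brute force)."""
    n = len(a)
    return min(descriptor([[a[p][q] for q in perm] for p in perm])
               for perm in permutations(range(n)))

def join_candidates(a1, a2):
    """Join two k-node matrices that share their first k-1 rows and columns."""
    k = len(a1)
    out = []
    for uv, vu in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        m = [row[:] + [0] for row in a1] + [[0] * (k + 1)]
        for j in range(k - 1):
            m[j][k] = a2[j][k - 1]
            m[k][j] = a2[k - 1][j]
        m[k - 1][k], m[k][k - 1] = uv, vu
        out.append(m)
    return out

A = [[0, 0, 0, 0, 1], [0, 0, 0, 0, 1], [0, 0, 0, 0, 0],
     [0, 0, 1, 0, 0], [0, 0, 0, 0, 0]]                          # Eq. (6)
A1 = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0]]  # Eq. (7)
A2 = [[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]]  # Eq. (8)
A3 = [[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0]]  # Eq. (9)

TARGET = canonical_descriptor(A)
FROM_A3 = {canonical_descriptor(m) for m in join_candidates(A1, A3)}
FROM_A2 = {canonical_descriptor(m) for m in join_candidates(A1, A2)}
print(TARGET in FROM_A3)  # False: joining the canonical forms misses A
print(TARGET in FROM_A2)  # True: joining A1 with A2 recovers A
```

Every candidate obtained from $A_{1}$ and $A_{3}$ either has more than three edges or contains an isolated node, whereas the graph of $A$ has three edges and no isolated node.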

3.3.5 De-canonicalization process

To avoid missing candidate patterns, we propose a de-canonicalization process which aims to de-canonicalize $A_{3}$ to restore $A_{2}$ so as to generate all candidate 5-subgraphs from frequent 4-subgraphs. To de-canonicalize a canonical form, we examine the permutations of the node indices in the canonical form. Notice that it is impossible for any permutation to have a smaller core than the core of the canonical form, because it would violate the definition of the canonical form. In addition, those permutations whose cores are larger than the core of the canonical form are discarded. As a result, only the permutations that share the same core as the canonical form are kept. These permutations are in the same equivalence class, i.e., they share the same canonicalized core. For a pair of frequent $k$ -subgraphs in their core-canonicalized forms, our joining algorithm generates no more than four candidate $(k+1)$ -subgraphs. As we will show later, this de-canonicalization process ensures that no candidate pattern is missed during the pattern growth phase.
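A minimal Python sketch of de-canonicalization is given below, assuming exhaustive enumeration of node permutations and a (labels, matrix) representation; the function name `decanonicalize` is hypothetical. It keeps the distinct matrices whose core equals the core of the canonical form, and applied to $A_{3}$ of Eq. (9) it restores $A_{2}$ of Eq. (8).

```python
from itertools import permutations

def core_of(labels, a):
    """Labels and sub-matrix of the first k-1 nodes, as hashable tuples."""
    k = len(labels)
    return (tuple(labels[:k - 1]), tuple(tuple(r[:k - 1]) for r in a[:k - 1]))

def decanonicalize(labels, a):
    """Distinct renumberings of a canonical form that share its core."""
    target = core_of(labels, a)
    kept = {}
    for perm in permutations(range(len(labels))):
        p_labels = [labels[p] for p in perm]
        p_matrix = [[a[p][q] for q in perm] for p in perm]
        if core_of(p_labels, p_matrix) == target:
            key = (tuple(p_labels), tuple(map(tuple, p_matrix)))
            kept[key] = (p_labels, p_matrix)  # drop repeated matrices
    return list(kept.values())

A2 = [[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]]  # Eq. (8)
A3 = [[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0]]  # Eq. (9)

FORMS = decanonicalize(["a"] * 4, A3)
print(len(FORMS))                      # 3
print(any(m == A2 for _, m in FORMS))  # True
```

For this example only the sink node can be placed last without changing the core, so the three kept matrices differ in which of the first three nodes is the isolated one.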

3.3.6 The FISHmine algorithm

The framework of FISHmine is given in Algorithm 1. At step 7, the algorithm finds all frequent $(k+1)$ -subgraphs based on frequent $k$ -subgraphs in $\textit{FS}_{k}$ by (i) applying the de-canonicalization process to $\textit{FS}_{k}$ to obtain $\textit{FS}_{k}^{d}$ , (ii) performing pairwise join of graphs in each equivalence class in $\textit{FS}_{k}^{d}$ to generate all candidate $(k+1)$ -subgraphs, (iii) invoking the canonicalization process to detect and remove duplicate candidate $(k+1)$ -subgraphs, and (iv) applying the induced subgraph isomorphism test [26, 28] to each non-redundant candidate $(k+1)$ -subgraph to calculate the support of the candidate $(k+1)$ -subgraph. Those candidate $(k+1)$ -subgraphs whose support values are greater than or equal to minsup become frequent $(k+1)$ -subgraphs and are stored in $\textit{FS}_{k+1}$ .

Algorithm 1. FISHmine

1: procedure FISHmine
2:   $\textit{FS}_{1}\leftarrow$ Find all frequent 1-subgraphs
3:   $\textit{FS}_{2}\leftarrow$ Find all frequent 2-subgraphs
4:   $\textit{FS}\leftarrow\textit{FS}_{1}\cup\textit{FS}_{2}$
5:   $k\leftarrow 2$
6:   while $|\textit{FS}_{k}|>0$ do
7:     $\textit{FS}_{k+1}\leftarrow$ Find all frequent $(k+1)$ -subgraphs based on frequent $k$ -subgraphs in $\textit{FS}_{k}$
8:     $\textit{FS}\leftarrow\textit{FS}\cup\textit{FS}_{k+1}$
9:     $k\leftarrow k+1$
10:   end while
11:   return $\textit{FS}$
12: end procedure
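The level-wise control flow of the framework can be sketched in Python as below. This is only a skeleton under assumptions: the three callables are hypothetical placeholders for phases 1 to 3, and the toy usage represents patterns as strings instead of graphs so that the loop can be run in isolation.

```python
def fishmine(find_frequent_1, find_frequent_2, grow):
    """Level-wise driver: collect frequent patterns until a level is empty."""
    fs_k = list(find_frequent_2())           # frequent 2-subgraphs
    fs = list(find_frequent_1()) + fs_k      # all frequent patterns so far
    k = 2
    while len(fs_k) > 0:
        # Stands in for de-canonicalization, pairwise join,
        # canonicalization and support counting at level k.
        fs_k = list(grow(fs_k, k))
        fs.extend(fs_k)
        k += 1
    return fs

# Toy stand-ins: a pattern is a string and growth stops at length 4.
def toy_grow(level, k):
    return [p + "a" for p in level] if k < 4 else []

PATTERNS = fishmine(lambda: ["a"], lambda: ["aa"], toy_grow)
print(PATTERNS)  # ['a', 'aa', 'aaa', 'aaaa']
```

The loop terminates as soon as one level yields no frequent pattern, which is justified by the Apriori principle.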

3.4 Analysis of the algorithm

The proposed algorithm is correct in the sense that (1) it is complete, meaning that it finds all frequent induced subgraphs, (2) the found patterns are not redundant, and (3) any discovered pattern is indeed a frequent induced subgraph. We note that the completeness property in (1) does not mean that our algorithm finds all important patterns in real-world applications since how to define all important patterns requires the involvement of domain experts. The non-redundancy property in (2) is guaranteed because we use the canonicalization process to detect and remove duplicate patterns, and each frequent pattern is uniquely represented by its canonical form. The property of (3) is also guaranteed because we use the induced subgraph isomorphism test to calculate the support of a pattern, and only frequent patterns, i.e., patterns whose support values are greater than or equal to minsup, are output by the algorithm. We now show our algorithm is complete.

Theorem.

The FISHmine algorithm does not miss any qualified pattern; that is, it discovers all frequent induced subgraphs.

Proof.

We prove the theorem by induction.

Basis: It is easy to see that all frequent single-node subgraphs (i.e., 1-subgraphs) and 2-subgraphs can be found correctly by the algorithm.

Inductive step: Assume all frequent $k$-subgraphs have been found correctly (induction hypothesis). Let $P$ be a frequent $(k+1)$-subgraph, $k\geqslant 2$. We show that $P$ must be generated by our algorithm. Let $A_{c}$ be the canonical form of $P$. By removing the last node of $A_{c}$ (equivalently, of $P$), we obtain a $k$-subgraph, denoted by $P_{1}$, which is automatically canonicalized according to Property 1. By removing the second to the last node of $P$, we obtain a $k$-subgraph, denoted by $P_{2}$, which may or may not be canonicalized according to Property 2, but its core is automatically canonicalized. Because $P_{1}$ and $P_{2}$ are induced subgraphs of $P$, each occurs in every input graph in which $P$ occurs; both are therefore frequent and, by the induction hypothesis, have been found. Through the de-canonicalization process, all the candidate $(k+1)$-subgraphs, including $P$, must be generated from the frequent $k$-subgraphs, which proves the completeness of our algorithm. ∎

Notice that when applying the de-canonicalization process to each frequent $k$-subgraph, $k!$ permutations are considered. Suppose there are $m$ frequent $k$-subgraphs. Our algorithm generates $O(4\times(m\times k!)^{2})=O((m\times k!)^{2})$ candidate $(k+1)$-subgraphs. The canonicalization process considers $(k+1)!$ permutations to obtain the canonical form of a $(k+1)$-subgraph. The space used by our algorithm is linear in the dataset size, the size of patterns and the number of frequent patterns found by the algorithm.
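To give a feel for how quickly these quantities grow, the following sketch evaluates the stated bound $4\times(m\times k!)^{2}$ and the $(k+1)!$ permutation count for sample values. The function names and the sample values of $m$ and $k$ are illustrative assumptions; the formulas themselves are the ones quoted above.

```python
from math import factorial

def candidate_bound(m, k):
    """Upper bound 4 * (m * k!)^2 on the number of candidate (k+1)-subgraphs."""
    return 4 * (m * factorial(k)) ** 2

def canonicalization_permutations(k):
    """Permutations examined to canonicalize one (k+1)-subgraph."""
    return factorial(k + 1)

# Hypothetical setting: m = 10 frequent 4-subgraphs.
bound = candidate_bound(10, 4)              # 4 * (10 * 24)^2 = 230400
perms = canonicalization_permutations(4)    # 5! = 120
```

Even for this small setting the bound exceeds two hundred thousand candidates, which is consistent with the factorial cost of de-canonicalization being the dominant limitation for large patterns.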

4. Experiments and results

4.1 Performance evaluation of the algorithm

The proposed FISHmine algorithm is implemented in Java. We conducted a series of experiments to evaluate the performance of FISHmine using synthetic data. All the experiments were carried out on a Mac Pro, which has one Quad-Core Intel Xeon E5 processor, with 3.7 GHz speed. The processor has a total of 4 cores, with 10 MB L3 cache. The computer has 16 GB memory.

The synthetic data were created by a graph generator written by the authors. This generator has four parameters: (1) the lower bound of graph sizes (LB); (2) the upper bound of graph sizes (UB); (3) the number of graphs in the dataset (T); and (4) a flag F indicating whether node labels are unified (0), duplicate (1) or unique (2). By unified labels, we mean that all nodes in all graphs in the dataset have the same label. This is equivalent to saying that all the nodes are unlabeled. By duplicate labels, we mean that a node label can occur repeatedly in a graph and also in other graphs in the dataset. For example, the graph in Fig. 1 has duplicate node labels (two nodes have the same label $a$). By unique labels, we mean that the node labels in a graph must be distinct, i.e., no two nodes in the graph have the same label, though a label can occur in different graphs in the dataset. The degree distribution of the generated graphs follows a power law.
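A generator with the same four parameters might look like the sketch below. It is not the authors' generator: the function name, the label alphabet, the fixed seed and the use of preferential attachment are all assumptions. Preferential attachment is a common way to obtain heavy-tailed degrees, but on graphs of 5 to 10 nodes it can only loosely approximate a power law.

```python
import random
import string

def generate_dataset(lb, ub, t, f, alphabet=string.ascii_lowercase, seed=0):
    """Return t graphs as (labels, edges); f selects the labelling mode.

    f = 0: unified labels, f = 1: duplicates allowed, f = 2: unique per graph.
    Assumes ub <= len(alphabet) when f = 2.
    """
    rng = random.Random(seed)
    dataset = []
    for _ in range(t):
        n = rng.randint(lb, ub)  # graph size in [lb, ub], inclusive
        if f == 0:
            labels = ["a"] * n
        elif f == 1:
            labels = [rng.choice(alphabet) for _ in range(n)]
        else:
            labels = rng.sample(alphabet, n)
        edges = set()
        degree = [0] * n
        for v in range(1, n):
            # Preferential attachment (assumption): pick an earlier node with
            # probability proportional to its current degree plus one.
            weights = [degree[u] + 1 for u in range(v)]
            u = rng.choices(range(v), weights=weights)[0]
            # Orient the new edge at random.
            edge = (u, v) if rng.random() < 0.5 else (v, u)
            edges.add(edge)
            degree[u] += 1
            degree[v] += 1
        dataset.append((dict(enumerate(labels)), edges))
    return dataset

# Analogue of a dataset with LB = 5, UB = 10, T = 100 and unique labels.
unique_dataset = generate_dataset(5, 10, 100, 2)
```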

The performance measures used in the experimental study included the running time of our algorithm and the number of qualified patterns found by the algorithm. These performance measures were calculated with respect to different minsup values and dataset sizes where graphs’ node labels were unified, duplicate, or unique. Using these performance measures allowed us to acquire an in-depth knowledge of the behavior of our algorithm, and the effectiveness of the de-canonicalization procedure employed by the algorithm.

Figure 4.

Effect of minsup on the running time of our algorithm.

In the first experiment, we created three datasets, named D100_5_10_0, D100_5_10_1, and D100_5_10_2, respectively, by using the following parameter values of the graph generator: LB $=$ 5, UB $=$ 10, T $=$ 100 and F $=$ 0, 1, 2. We ran the FISHmine algorithm on each dataset ten times, with the minsup values changing from 4% to 18%. Figure 4 shows the average running times of our algorithm on each dataset. When minsup becomes large, few candidate patterns qualify to be solutions, and consequently the time spent by the algorithm decreases. For graphs with unified node labels (i.e., unlabeled graphs), the algorithm generates few candidate patterns from these graphs, specifically one 1-subgraph and three 2-subgraphs, and hence the algorithm requires less time. By contrast, for graphs with unique node labels, many candidate 1-subgraphs and 2-subgraphs can be generated, and hence the algorithm requires more time to verify them. We also observed that, in generating candidate $k$-subgraphs, when $k$ becomes larger, the algorithm has to consider more permutations. As a result, the algorithm generates more candidate patterns, and hence requires more running time to verify the candidates.

Figure 5 shows the number of qualified patterns found by our algorithm as a function of minsup. As minsup increases, the number of qualified (i.e., frequent) patterns decreases. For graphs with unified node labels (i.e., unlabeled graphs), since matching a pattern with a graph only considers their structures without checking their node labels, many candidate patterns pass the verification step and become qualified patterns. By contrast, for graphs with unique node labels, the verification procedure considers both their structures and node labels, and hence the algorithm generates relatively fewer qualified patterns.

Figure 5.

Effect of minsup on the number of frequent patterns found by our algorithm.

Figure 6 details the number of qualified (i.e., frequent) patterns with respect to different pattern sizes where $\textit{minsup}=$ 4%. It can be seen from the figure that, for large pattern sizes (e.g., size $\geqslant$ 7), our algorithm is still able to find qualified patterns for graphs with unified node labels. However, for graphs with duplicate and unique node labels, there are no frequent patterns of these large sizes.

Figure 6.

Number of frequent patterns of different sizes.

In the next experiment, we created 10 different datasets by varying the dataset sizes, from 100 to 1000. The minsup value was fixed at 4%; the other parameters had the same values as those in the previous experiment. Table 2 shows the running times (in minutes) of our algorithm on these 10 datasets with the three different types of graphs respectively. It can be seen from the table that, as the dataset size increases, the time required by our algorithm increases. This is understandable since with larger datasets, the algorithm needs to spend more time in performing the induced subgraph isomorphism test.

Table 2

Effect of the dataset size on the running time (in minutes) of our algorithm

Dataset size	Unified	Duplicate	Unique
100	33.16	8.40	153
200	23.26	19.05	200
300	46.4	29.2	304
400	63.22	37.4	351
500	61.9	37.50	371
600	98.14	55.1	521
700	121.20	58.47	529
800	118.10	67.1	751
900	162.16	77.3	691
1000	155.56	78.05	726

Figure 7 shows the number of frequent patterns found by our algorithm for varying dataset sizes. Since the minsup value is fixed at 4%, a qualified pattern needs to occur in more (fewer, respectively) input graphs in a larger (smaller, respectively) dataset. As a result, the number of qualified patterns is roughly the same for the different dataset sizes with respect to each type of graph. For unlabeled graphs (i.e., graphs with unified node labels), when matching a pattern with a graph, our algorithm only considers their structures and hence produces more qualified patterns, as explained earlier. For graphs with duplicate and unique labels, the algorithm has to consider both their structures and node labels when matching a pattern with a graph. Consequently, relatively fewer qualified patterns are found. These results confirm that the more constraints a pattern must satisfy, the fewer qualified patterns one can find.

Figure 7.

Effect of the dataset size on the number of frequent patterns found by our algorithm.
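The remark that a fixed minsup translates into a larger absolute requirement on larger datasets can be made concrete with a small calculation. The helper below is a hypothetical illustration; it uses exact rational arithmetic so that a threshold such as 4% or 3/7 is not distorted by floating-point rounding.

```python
from fractions import Fraction
from math import ceil

def min_graph_count(minsup, dataset_size):
    """Smallest number of input graphs a pattern must occur in to be frequent.

    `minsup` should be a Fraction or a string such as "0.04" or "3/7";
    passing a float would import its binary rounding error.
    """
    return ceil(Fraction(minsup) * dataset_size)

small = min_graph_count("0.04", 100)    # 4 graphs out of 100
large = min_graph_count("0.04", 1000)   # 40 graphs out of 1000
```

With minsup fixed at 4%, a pattern must therefore appear in ten times as many graphs when the dataset grows from 100 to 1000 graphs.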

We also investigated the effectiveness of the de-canonicalization process. Table 3 shows the number of qualified patterns found by our FISHmine algorithm, which includes the de-canonicalization procedure. The table also shows the results from a baseline algorithm, which is the same as FISHmine except that the de-canonicalization step is omitted. Level $k$ in the table lists the number of frequent $k$-subgraphs found by each algorithm. The dataset used in this experiment contains 1000 graphs with LB $=$ 5, UB $=$ 10, $\textit{minsup}=$ 8%, where the graphs contain duplicate node labels. We can see from Table 3 that FISHmine finds 398 frequent patterns whereas the baseline algorithm only finds 352 frequent patterns. The baseline algorithm misses 46 qualified patterns, indicating the effectiveness of the proposed de-canonicalization process. We also observed a total of 149 duplicate patterns in the pattern growth stage. These duplicate patterns were detected by FISHmine and discarded.

Table 3

Impact of the de-canonicalization process

Level	Baseline	FISHmine
1	1	1
2	3	3
3	15	15
4	163	194
5	90	103
6	53	55
7	22	22
8	5	5
Total	352	398
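As a quick check on Table 3, the per-level gap between the two algorithms can be tabulated directly from the reported counts. The snippet below only re-derives numbers already in the table; the variable and function names are illustrative.

```python
# Frequent k-subgraph counts per level, copied from Table 3.
baseline = {1: 1, 2: 3, 3: 15, 4: 163, 5: 90, 6: 53, 7: 22, 8: 5}
fishmine = {1: 1, 2: 3, 3: 15, 4: 194, 5: 103, 6: 55, 7: 22, 8: 5}

def missed_per_level(full, partial):
    """Levels at which `partial` finds fewer patterns than `full`, with the gap."""
    return {level: full[level] - partial[level]
            for level in full if full[level] != partial[level]}

missed = missed_per_level(fishmine, baseline)
total_missed = sum(missed.values())
```

The 46 patterns missed by the baseline are confined to levels 4, 5 and 6 (31, 13 and 2 patterns, respectively); the two algorithms agree at every other level.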

In the last experiment, we compared the proposed FISHmine algorithm with a closely related tool, gSpan [30]. Since gSpan is designed for undirected graphs whereas FISHmine is mainly for directed networks, we converted each input undirected graph $G$ into a directed network $G^{\prime}$ as follows. For each edge $\{u,v\}$ in $G$, we created two edges in $G^{\prime}$, namely an edge $<u,v>$ from $u$ to $v$ and an edge $<v,u>$ from $v$ to $u$. We then fed the input undirected graphs and their converted directed networks to gSpan and FISHmine, respectively.
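The conversion described above amounts to replacing every undirected edge with a pair of opposite directed edges. A minimal sketch, assuming edges are given as pairs of node identifiers without self-loops (the function name is hypothetical):

```python
def to_directed(undirected_edges):
    """Replace each undirected edge {u, v} with the directed edges (u, v) and (v, u)."""
    directed = set()
    for u, v in undirected_edges:
        directed.add((u, v))
        directed.add((v, u))
    return directed

# A path a - b - c becomes four directed edges.
converted = to_directed([("a", "b"), ("b", "c")])
```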

Experimental results showed that these two tools detected quite different patterns. Specifically, FISHmine can detect disconnected subgraphs whereas gSpan cannot. For example, a pattern with two isolated nodes $a$, $b$ that are not connected to each other can be detected by FISHmine. Such disconnected subgraphs can never be found by gSpan. On the other hand, all frequent subgraphs detected by FISHmine must be induced while gSpan is able to find general subgraphs that are not necessarily induced.
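The distinction between general and induced subgraphs can be illustrated on a single graph. The sketch below assumes that pattern nodes are already identified with nodes of the host graph, so no isomorphism search is performed; the function names are hypothetical and this is not the induced subgraph isomorphism test used by FISHmine.

```python
def induced_edges(edges, nodes):
    """Edges of the host graph whose endpoints both lie in `nodes`."""
    nodes = set(nodes)
    return {(u, v) for (u, v) in edges if u in nodes and v in nodes}

def is_general_subgraph(pattern_edges, edges):
    """Every pattern edge is present in the host graph; extra host edges are allowed."""
    return set(pattern_edges) <= set(edges)

def is_induced_subgraph(pattern_nodes, pattern_edges, edges):
    """The pattern has exactly the host edges among its nodes, no more and no fewer."""
    return set(pattern_edges) == induced_edges(edges, pattern_nodes)

# Host graph: 1 -> 2, 2 -> 3, 1 -> 3, plus an isolated node 4.
host_edges = {(1, 2), (2, 3), (1, 3)}
path = {(1, 2), (2, 3)}

general = is_general_subgraph(path, host_edges)             # True
induced = is_induced_subgraph({1, 2, 3}, path, host_edges)  # False: 1 -> 3 is missing
isolated_pair = is_induced_subgraph({3, 4}, set(), host_edges)  # True: disconnected pattern
```

The path on nodes 1, 2, 3 is a general subgraph but not an induced one, whereas the edgeless pair {3, 4} is a valid disconnected induced pattern, matching the two differences noted above.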

4.2 An application in gene network inference

As an application of the proposed work, we used FISHmine to identify common patterns in gene regulatory networks (GRNs) constructed by GRN inference tools [19, 25]. A GRN is represented by a directed graph in which each node represents a transcription factor or target gene, and each edge from node $u$ to node $v$ indicates that $u$ regulates the expression of $v$ . Many tools have been developed for GRN inference using gene expression profiles [19], though the tools are far from perfect. We hope to improve these network inference tools through the use of FISHmine to find common patterns in the results produced by the tools. These common patterns may reveal true regulatory relationships between transcription factors and target genes. This is similar to finding common patterns in phylogenetic trees constructed by different tree inference tools so as to identify the true evolutionary relationships among species [35].

We collected seven popular network inference tools including CLR [8], GENIE3 [11], MRNET [22], Inferelator [2], Jump3 [12], ScanBMA [31] and TimeDelay-ARACNE (TD-ARACNE) [36]. Among these seven tools, CLR, GENIE3 and MRNET are designed for steady state data while Inferelator, Jump3, ScanBMA and TimeDelay-ARACNE are for time-series data [19]. The datasets used here were taken from the dialogue for reverse engineering assessments and methods (DREAM) challenges [20]. The CLR, GENIE3 and MRNET tools were run on the DREAM4 [20] 10-gene knockdown dataset. The Inferelator, Jump3, ScanBMA and TimeDelay-ARACNE tools were run on the DREAM4 10-gene time-series dataset [4]. The 10-gene gold standard available from the DREAM4 challenges was used as the ground truth in the study.

Table 4 shows the ground truth and the seven networks inferred by the seven tools respectively for the ten genes G1, G2, $\ldots$ , G10 in the DREAM4 datasets. When minsup is set to 3/7, and the minimum size of patterns of interest is 3, FISHmine identifies a qualified pattern (connected subgraph) that occurs in the ground truth. This pattern comprises edges $<$ G1, G2 $>$ and $<$ G1, G5 $>$ . This result is encouraging, indicating that FISHmine can help identify true regulatory links between genes using the networks produced by the existing GRN inference tools.

Table 4
Results from the seven tools and ground truth used in the study

Truth	CLR	GENIE3	MRNET	Inferelator	Inferelator (cont.)	Jump3	ScanBMA	TD-ARACNE
$<$ G1, G2 $>$	$<$ G1, G2 $>$	$<$ G1, G3 $>$	$<$ G1, G5 $>$	$<$ G1, G4 $>$	$<$ G6, G3 $>$	$<$ G1, G2 $>$	$<$ G1, G2 $>$	$<$ G1, G2 $>$
$<$ G1, G3 $>$	$<$ G1, G6 $>$	$<$ G1, G5 $>$	$<$ G1, G6 $>$	$<$ G1, G5 $>$	$<$ G6, G5 $>$	$<$ G1, G3 $>$	$<$ G1, G5 $>$	$<$ G1, G4 $>$
$<$ G1, G4 $>$	$<$ G1, G7 $>$	$<$ G1, G7 $>$	$<$ G1, G7 $>$	$<$ G2, G3 $>$	$<$ G6, G9 $>$	$<$ G1, G4 $>$	$<$ G2, G10 $>$	$<$ G1, G5 $>$
$<$ G1, G5 $>$	$<$ G1, G10 $>$	$<$ G2, G6 $>$	$<$ G1, G10 $>$	$<$ G2, G4 $>$	$<$ G6, G10 $>$	$<$ G1, G5 $>$	$<$ G4, G6 $>$	$<$ G1, G6 $>$
$<$ G3, G4 $>$	$<$ G2, G4 $>$	$<$ G2, G8 $>$	$<$ G2, G4 $>$	$<$ G2, G5 $>$	$<$ G7, G3 $>$	$<$ G1, G8 $>$	$<$ G5, G1 $>$	$<$ G1, G8 $>$
$<$ G3, G7 $>$	$<$ G2, G5 $>$	$<$ G2, G10 $>$	$<$ G2, G5 $>$	$<$ G2, G9 $>$	$<$ G7, G5 $>$	$<$ G2, G4 $>$	$<$ G5, G4 $>$	$<$ G2, G3 $>$
$<$ G4, G3 $>$	$<$ G2, G10 $>$	$<$ G3, G1 $>$	$<$ G3, G4 $>$	$<$ G2, G10 $>$	$<$ G7, G9 $>$	$<$ G2, G5 $>$	$<$ G6, G8 $>$	$<$ G2, G9 $>$
$<$ G6, G2 $>$	$<$ G3, G4 $>$	$<$ G3, G7 $>$	$<$ G3, G5 $>$	$<$ G3, G2 $>$	$<$ G8, G4 $>$	$<$ G2, G8 $>$	$<$ G7, G2 $>$	$<$ G3, G8 $>$
$<$ G7, G3 $>$	$<$ G3, G5 $>$	$<$ G3, G8 $>$	$<$ G3, G6 $>$	$<$ G3, G6 $>$	$<$ G8, G5 $>$	$<$ G2, G9 $>$	$<$ G7, G3 $>$	$<$ G3, G10 $>$
$<$ G7, G4 $>$	$<$ G3, G6 $>$	$<$ G4, G2 $>$	$<$ G3, G10 $>$	$<$ G3, G7 $>$	$<$ G8, G9 $>$	$<$ G3, G4 $>$	$<$ G7, G4 $>$	$<$ G5, G2 $>$
$<$ G8, G2 $>$	$<$ G3, G10 $>$	$<$ G4, G9 $>$	$<$ G4, G7 $>$	$<$ G3, G10 $>$	$<$ G8, G10 $>$	$<$ G3, G7 $>$	$<$ G7, G5 $>$	$<$ G5, G7 $>$
$<$ G8, G6 $>$	$<$ G4, G7 $>$	$<$ G5, G1 $>$	$<$ G4, G10 $>$	$<$ G4, G1 $>$	$<$ G9, G2 $>$	$<$ G4, G8 $>$	$<$ G7, G9 $>$	$<$ G6, G2 $>$
$<$ G9, G10 $>$	$<$ G4, G10 $>$	$<$ G6, G2 $>$	$<$ G5, G8 $>$	$<$ G4, G2 $>$	$<$ G9, G6 $>$	$<$ G5, G1 $>$	$<$ G8, G5 $>$	$<$ G6, G3 $>$
$<$ G10, G3 $>$	$<$ G5, G8 $>$	$<$ G6, G8 $>$	$<$ G6, G8 $>$	$<$ G4, G5 $>$	$<$ G9, G7 $>$	$<$ G5, G2 $>$	$<$ G8, G6 $>$	$<$ G7, G3 $>$
$<$ G10, G4 $>$	$<$ G6, G10 $>$	$<$ G7, G1 $>$	$<$ G7, G8 $>$	$<$ G4, G8 $>$	$<$ G9, G8 $>$	$<$ G5, G3 $>$	$<$ G8, G10 $>$	$<$ G7, G9 $>$
	$<$ G7, G8 $>$	$<$ G7, G3 $>$	$<$ G7, G9 $>$	$<$ G4, G10 $>$	$<$ G9, G10 $>$	$<$ G5, G4 $>$	$<$ G9, G7 $>$	$<$ G7, G10 $>$
	$<$ G7, G9 $>$	$<$ G7, G5 $>$	$<$ G8, G9 $>$	$<$ G5, G1 $>$	$<$ G10, G2 $>$	$<$ G6, G8 $>$	$<$ G9, G10 $>$	$<$ G8, G10 $>$
	$<$ G8, G9 $>$	$<$ G8, G2 $>$		$<$ G5, G2 $>$	$<$ G10, G3 $>$	$<$ G7, G3 $>$	$<$ G10, G3 $>$	$<$ G9, G10 $>$
		$<$ G8, G6 $>$		$<$ G5, G4 $>$	$<$ G10, G4 $>$	$<$ G7, G9 $>$	$<$ G10, G4 $>$
		$<$ G8, G10 $>$		$<$ G5, G6 $>$	$<$ G10, G6 $>$	$<$ G8, G6 $>$	$<$ G10, G7 $>$
		$<$ G9, G4 $>$		$<$ G5, G7 $>$	$<$ G10, G8 $>$	$<$ G9, G8 $>$	$<$ G10, G9 $>$
		$<$ G10, G2 $>$		$<$ G5, G8 $>$	$<$ G10, G9 $>$	$<$ G9, G10 $>$
						$<$ G10, G1 $>$
						$<$ G10, G8 $>$
						$<$ G10, G9 $>$

5. Conclusions

We have presented a new algorithm (FISHmine) capable of discovering all frequent induced subgraphs, both connected and disconnected, from directed networks. We have proved the correctness, particularly the completeness, of the proposed algorithm. Our FISHmine algorithm differs from existing graph mining algorithms [10, 13, 17, 23, 30] in several ways. First, FISHmine aims to discover frequent induced subgraphs, both connected and disconnected, from directed networks. By contrast, the existing algorithms mainly find connected general, not induced, subgraphs in undirected networks. Because of the differences in the networks and patterns found in the networks, the pattern growth procedure used by FISHmine differs from those employed by the existing algorithms. Moreover, we incorporate a novel de-canonicalization component into our pattern mining process to ensure that our algorithm is complete. A shortcoming is that this de-canonicalization procedure is time-consuming, since $k!$ permutations must be considered for each frequent $k$-subgraph in the pattern mining process. Thus, our approach cannot handle very large patterns or graphs (e.g., graphs with more than 100 nodes).

We have implemented our algorithm in Java. This implementation has been evaluated on synthetic directed graphs with unified, duplicate or unique node labels, as well as gene regulatory networks with 10 genes. Our experimental results demonstrated the effectiveness of the proposed algorithm and its potential use in gene network inference. In practice, the patterns found by FISHmine become hypotheses, which can be validated in wet lab experiments [7]. In the future, we plan to extend FISHmine to handle more complicated networks whose edges are labeled or associated with weights, and whose nodes have attributes. We will also look into applications of FISHmine in various domains besides biology.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive suggestions and useful comments. This research was supported in part by the National Key R&D Program of China (Nos. 2016YFB1000602 and 2017YFB0701501), MOE Research Center for the online education foundation (No. 2016ZD302), National Natural Science Foundation of China (Nos. 61440057, 61272087, 61363019 and 11690023), and SUNY Oneonta Faculty Research and Creative Activity Grant Program.

References

1. Agrawal and Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th International Conference on Very Large Data Bases, 1994, pp. 487–499.

2. Bonneau, Reiss D.J., Shannon, Facciotti, Hood, Baliga N.S. and Thorsson, The Inferelator: An algorithm for learning parsimonious regulatory networks from systems biology data sets de novo, Genome Biology 7(5) (2006).

3. Bonnici, Giugno, Pulvirenti, Shasha and Ferro, A subgraph isomorphism algorithm and its application to biochemical data, BMC Bioinformatics 14 (2013), S13.

4. Byron, Herbert K.G. and Wang J.T.L., Bioinformatics Database Systems, CRC Press, 2016.

5. Chang, Healey, McHugh J.A.M. and Wang J.T.L., Mining the World Wide Web: An Information Search Approach, Springer, New York, 2001.

6. Cygan, Jakub and Arkadiusz, The hardness of subgraph isomorphism, arXiv preprint arXiv:1504.02876, 2015.

7. Elloumi, Iliopoulos C.S., Wang J.T.L. and Zomaya A.Y. (eds), Pattern Recognition in Computational Molecular Biology: Techniques and Approaches, John Wiley & Sons, Inc., 2015.

8. Faith J.J., Hayete, Thaden J.T., Mogno, Wierzbowski, Cottarel, Kasif, Collins J.J. and Gardner T.S., Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles, PLoS Biology 5(1) (2007).

9. Fan, Wang and Wu, Graph homomorphism revisited for graph matching, Proceedings of the VLDB Endowment 3(1) (2010), 1161–1172.

10. Huan, Wang and Prins, Efficient mining of frequent subgraphs in the presence of isomorphism, in: Proceedings of the 3rd IEEE International Conference on Data Mining, 2003, pp. 549–552.

11. Huynh-Thu V.A., Irrthum, Wehenkel and Geurts, Inferring regulatory networks from expression data using tree-based methods, PLoS One 5(9) (2010), e12776.

12. Huynh-Thu V.A. and Sanguinetti, Combining tree-based and dynamical systems for the inference of gene regulatory networks, Bioinformatics 31(10) (2015), 1614–1622.

13. Inokuchi, Washio and Motoda, Complete mining of frequent patterns from graphs: Mining graph data, Machine Learning 50(3) (2003), 321–354.

14. Inokuchi, Washio and Motoda, A general framework for mining frequent subgraphs from labeled graphs, Fundamenta Informaticae 66 (2005), 53–82.

15. Jiang, Coenen and Zito, A survey of frequent subgraph mining algorithms, The Knowledge Engineering Review 28(1) (2013), 75–105.

16. Kimelfeld and Phokion, The complexity of mining maximal frequent subgraphs, ACM Transactions on Database Systems 39(4) (2014), 32.

17. Kuramochi and Karypis, Finding frequent patterns in a large sparse graph, in: Proceedings of the SIAM International Conference on Data Mining, 2004.

18. Lin, Zhong, Duan, Jin and Bi, A directed labeled graph frequent pattern mining algorithm based on minimum code, in: Proceedings of the International Conference on Multimedia and Ubiquitous Engineering, 2009, pp. 353–359.

19. Lingeman J.M. and Shasha, Network Inference in Molecular Biology: A Hands-on Framework, Springer, New York, 2012.

20. Marbach, Prill R.J., Schaffter, Mattiussi, Floreano and Stolovitzky, Revealing strengths and weaknesses of methods for gene network inference, Proceedings of the National Academy of Sciences of the United States of America 107(14) (2010), 6286–6291.

21. Matsumura, Goldberg D.E. and Llora, Mining directed social network from message board, in: Proceedings of the 14th International Conference on World Wide Web, 2005, pp. 1092–1093.

22. Meyer P.E., Kontos, Lafitte and Bontempi, Information-theoretic inference of large transcriptional regulatory networks, EURASIP Journal on Bioinformatics and Systems Biology, 2007.

23. Nijssen and Kok J.N., The Gaston tool for frequent subgraph mining, Electron. Notes Theor. Comput. Sci. 127(1) (2005), 77–87.

24. Oliver R.M. and Smith J.Q. (eds), Influence Diagrams, Belief Nets and Decision Analysis, John Wiley & Sons, Inc., 1990.

25. Patel and Wang J.T.L., Semi-supervised prediction of gene regulatory networks using machine learning algorithms, Journal of Biosciences 40(4) (2015), 731–740.

26. Shasha, Wang J.T.L. and Giugno, Algorithmics and applications of tree and graph searching, in: Proceedings of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 2002, pp. 39–52.

27. Ullmann J.R., An algorithm for subgraph isomorphism, J. ACM 23(1) (1976), 31–42.

28. Wang J.T.L., Zhang and Chirn G.W., Algorithms for approximate graph matching, Information Sciences 82(1–2) (1995), 45–74.

29. Wang J.T.L., Shasha, Shapiro B.A., Rigoutsos and Zhang, Finding patterns in three dimensional graphs: Algorithms and applications to scientific data mining, IEEE Transactions on Knowledge and Data Engineering 14(4) (2002), 731–749.

30. Yan and Han, gSpan: Graph-based substructure pattern mining, in: Proceedings of the 2002 IEEE International Conference on Data Mining, 2002.

31. Young W.C., Raftery A.E. and Yeung K.Y., Fast Bayesian inference for gene regulatory networks using ScanBMA, BMC Systems Biology 8(1) (2014).

32. Zaki M.J., Efficiently mining frequent trees in a forest, in: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.

33. Zhang and Wang J.T.L., New techniques for mining frequent patterns in unordered trees, IEEE Transactions on Cybernetics 45(6) (2015), 1113–1125.

34. Zhang and Wang J.T.L., Mining frequent agreement subtrees in phylogenetic databases, in: Proceedings of the SIAM International Conference on Data Mining, 2006, pp. 222–233.

35. Zhang and Wang J.T.L., Discovering frequent agreement subtrees from phylogenetic data, IEEE Transactions on Knowledge and Data Engineering 20(1) (2008), 68–82.

36. Zoppoli, Morganella and Ceccarelli, TimeDelay-ARACNE: Reverse engineering of gene networks from time-course data by an information theoretic approach, BMC Bioinformatics 11 (2010).
