Abstract
Graphs are considered to be one of the best studied data structures in discrete mathematics and computer science. Hence, data mining on graphs has become quite popular in the past few years. The problem of finding frequent itemsets in conventional data mining on transactional databases, thus transformed to the discovery of subgraphs that frequently occur in the graph dataset containing either single graph or multiple graphs. Most of the existing algorithms in the field of frequent subgraph discovery adopts an Apriori approach based on generation of candidate set and test approach. The problem with this approach is the costlier candidate set generation, particularly when there exist more number of large subgraphs. The research goals in frequent subgraph discovery are to evolve (i) mechanisms that can effectively generate candidate subgraphs excluding duplicates and (ii) mechanisms that find best processing techniques that generate only necessary candidate subgraphs in order to discover the useful and desired frequent subgraphs. In this paper, a two phase approach is proposed by integrating Apriori algorithm on graphs to frequent subgraph (FS) tree to discover frequent subgraphs in graph datasets.
Introduction
Data Mining is the analysis step in the process of “knowledge discovery in databases” [1]. Data mining aims at extracting the interesting patterns and knowledge from large amounts of data. A significant cause is the explosive growth in the miscellaneous data that is being gathered within the business and experimental world. Main application areas include transportation systems, bioinformatics, social network analysis, communication networks, chemoinformatics, robotics, link analysis etc. The structures present in the real world data can be as simple as sequences that arises in the prediction of protein structure or trees in natural language parsing or it can be as complex as the web graphs and the citation graphs.
An effective method to model these data in the complex datasets is to use graphs. The most important aspect of graph modelling is that it helps to deal with the problems which were not solved before. Graphs serve as important data structures in many applications such as social network, web, bioinformatics etc. Some of the real world examples are shown in Fig. 1 [2–5]. Thus, the problem of finding frequent itemsets on transactional databases in conventional data mining, becomes the problem of extracting subgraphs that frequently occur in the graph dataset containing either a single graph or multiple graphs. The very important concept in graph based mining is the extraction of the frequent subgraphs present in the graph dataset. For example, consider the problem of finding recurring substructures in the mining of chemical compounds. This problem can be solved using a pattern discovery algorithm based on graphs, where each one of the compounds is represented as a graph, whose vertices denotes the atoms and edges denotes the bonds between them. Each vertex can be assigned a label corresponding to the atom involved and each edge can be assigned a label corresponding to the type of the bond. After creating these graphs, the recurring substructures present in different compounds become frequent ones.
Frequent subgraph mining is the procedure of finding out all subgraphs that are frequent in a graph dataset and that have an occurrence count greater than or equal to a predefined threshold. A simple example of frequent subgraph is shown in Fig. 2, where the graph dataset has four different graphs i.e G1, G2, G3, G4 and the subgraph shown in the bottom of Fig. 2 is present in graphs G1, G2 and G3. So it can be said that this subgraph has occurred frequently in the graph dataset with count = 3.
The goal of this paper is to extract all the frequent subgraphs within the graph dataset. To achieve this goal, a two-phase approach is proposed. The first phase uses the Apriori algorithm used in itemset discovery on graphs and the second phase constructs a compact data structure to store the subgraphs.
The rest of the paper is organized as follows: Section 2 discusses some of the existing works in the field of frequent subgraph mining, Section 3 includes definitions and background of frequent subgraph mining, Section 4 explains the proposed approach, Sections 5 and 6 contains experimental results and conclusion respectively.
Related work
The extraction of frequent subgraphs present in a graph dataset containing either single or a set of graphs has been popular in research area of data mining. For this, several frequent subgraph mining algorithms were proposed earlier.
Washio and Motoda [6] gave an interesting survey on data mining methods based on graphs. The earlier studies on finding frequent subgraphs in a single graph [7] and multiple graphs [8] used a greedy strategy to find some of the most interesting subgraphs. However, this strategy missed out some of the most important subgraphs, but complexity of the graph isomorphism problem was reduced. A frequent substructure based algorithm called Apriori Graph based Mining (AGM) [9] uses a level-wise search strategy and generate candidates based on vertex that increases the substructure size. The disadvantage was its high complexity due to multiple candidate generation. Frequent Subgraph algorithm(FSG) [10] generates candidate based on edges. In FSG, canonical labels are used to check isomorphic graphs. As it uses isomorphic testing, it is very costly and will also generate multiple candidates. Fast Frequent Subgraph Mining algorithm [11] (FFSM) uses vertical level search strategy to reduce the number of candidate generation. Methods used in FFSM algorithms are canonical adjacency matrix with FFSM-join and FFSM-extension, suboptimal CAM tree. Limitation of FFSM algorithm is that it is a NP-complete problem. GREW algorithm [12] can scale to large graphs and discovers non-trivial subgraphs that occur in the large portions of the input graph and in the lattice of frequent ones. Two versions of GREW are GREW-SE (single-edge collapsing) and GREW-ME (multi-edge collapsing) which can effectively operate on very large graphs that are sparse in nature. Dynamic GREW algorithm [13] consider dynamic graphs with edge insertions and deletions changing over time. Limitation of this algorithm is the overhead in identifying dynamic patterns.
Some existing works adopt a pattern-growth approach, which is a non Apriori-based algorithm. Few of them are discussed below.
Graph based Substructure pattern mining algorithm [14] (Gspan) uses DFS strategy, lexicographic order, minimum DFS code and rightmost extension. It discovers frequent substructures without candidate generation. CloseGraph algorithm [15] mines only closed frequent graph pattern. A frequent pattern is closed if there exists no proper super-pattern with the same support in the data set. It is more efficient than Gspan and also uses DFS strategy, lexicographic order, minimum DFS code and rightmost extension for finding closed frequent patterns. Subdue algorithm [16] focuses not only on the frequent pattern discovery but also compresses the graph data set. It uses a minimum description length (MDL) approach and compression based methodology in Subdue algorithm where the graphs are represented using the adjacency matrix. In GraphSig algorithm [17] each graph is converted to a set of feature vectors and each region in the graph is represented by a vector. To evaluate the statistical significance of patterns, prior probabilities of features are computed experimentally. So, GraphSig uses feature vector for graph representation. [18] Temporal subgraph pattern mining algorithm (TSP) give the patterns which forms a connective subgraph and contain temporal information. It does not generate unnecessary subgraphs. TSP algorithm used concatenation function, super pattern, forward unnecessary checking scheme, backward unnecessary checking scheme to find out closed frequent temporal subgraph [19, 20]. It uses an extension and DFS search strategy. Gaston discovers the frequent substructures in a number of phases of increasing complexity [21]. Further attempts, both in single and multiple input graphs in the field of frequent subgraphs mining, have been made by Yan et al. [22] and Inokuchi et al. [23], Bringmann and Nijssen [24], Huan et al. [25], Kuramochi and Karypis [26].
The most commonly used existing algorithms adopts an apriori based approach and its limitations are (i) multiple candidate generation, (ii) repeated database scan for very large dataset and (iii) expensive subgraph isomorphism checking. In this paper, a new approach is proposed to overcome these limitations.
Background
This section defines the notations and terms related to the problem.
Proposed method
In this paper, a new algorithm is presented for extracting all subgraphs that are connected and occuring frequently in a graph dataset containing either a single graph or a set of graphs. We use a two phase approach to achieve our goal. The first phase finds frequently occurred k-edge subgraphs using a level-by-level expansion in Apriori Algorithm [27]. Due to the limitation of the former algorithm we then introduce a second phase. The second phase is the projection of graph dataset to a tree data structure using the output of the first phase. The output of second phase gives all the remaining frequent subgraphs. The overall two phase approach is shown in Fig. 5. The proposed method starts with preprocessing step followed by Phase I and Phase II.
Preprocessing
A dataset containing a set of transactions can take different forms. So, this dataset is transformed to a set of graphs i.e. D = {G1, G2, G3, . . . G n }, each graph G i is represented as an undirected graph. Now, graph G i is of the form <V, E>, where V is a finite set of vertices and E is a finite set of edges. Then, each graph is represented using an adjacency list where graph associates each vertex with the collection of its neighbouring vertices or edges by which the memory is saved.
Phase 1
The phase I uses the Apriori Algorithm for finding frequent itemsets, as shown in Algorithm1, where edges correspond to items in Frequent Itemset discovery. In Frequent Itemset discovery, size of the frequent itemsets is extended by adding a single item each time. Similarly, in the proposed algorithm the size of the subgraphs that occur frequently are increased by adding an edge one after the other. Apriori Algorithm has two steps one is candidate-generation and other is support counting.
In this paradigm, the mining starts by enumerating all the k-edge graphs where k = 0, which are nothing but the vertices. Then in each iteration, the candidate subgraphs are generated by adding one edge to the previous frequent ones.
Candidate generation
Candidate generation phase starts by adding all the frequent k sized subgraphs (with k = 0) to the set of candidates. These candidates are then used to create k + 1 sized subgraphs by joining two frequent k sized subgraphs. The common k subgraph which is present in two k + 1 frequent subgraph is referred as core of that subgraph. The easiest way to obtain the complete set of candidate subgraphs is to join all pairs of k sized frequent subgraphs having a core. But this approach has a problem that, a specific k sized subgraph, can have atmost k different k – 1 sized subgraphs. Even if we consider all such possible subgraphs and all the resulting join operations, which will result in multiple generation of same candidate subgraph and generates candidate subgraphs that does not satisfy downward closure property. The proposed method addresses the problem by joining frequent subgraphs only if they share a common k – 1 sizedsubgraph.
Next, it counts the frequency for each of these candidates, and prunes subgraphs. This is done by computing the subgraphs’s support in the graph dataset D. If the minimum support threshold is lesser than the support, the given subgraph can be regarded as frequent and stored in the frequent set.
Support counting
Support counting of a subgraph g is very important, so as to determine whether g is a frequent subgraph or not. Computation of the support of subgraph g require finding the graphs {G1, G2 . . . G n } in which g occurs. An occurrence-list of a subgraph is maintained to count the support values. This list stores the vertex pair in each of the graphs being searched for. This eliminates the need for explicitly performing the subgraph isomorphism test, which itself is a NP problem, across all database graphs thereby improving the performance. When a subgraph is enlarged to get a child subgraph in the step of candidate generation, the embedding of the child subgraph must contain the embedding of the parent subgraph. Thus, each iteration ends up with a set of subgraphs that are frequent of fixed size. The extracted subgraphs that are frequent will satisfy the property of down closure, which helps to prune the latticeeffectively.
The Apriori algorithm fails when the dataset is very large because it has to scan the dataset multiple times resulting in increased computational time. It becomes costlier to handle a huge number of candidates and is tedious to scan the dataset number of times and check a large set of candidates by pattern matching. Due to this major drawback, Apriori algorithm was used to find the connected frequent one edge and two edge subgraphs. Thus the final output of Phase I is frequent two-edge subgraph. Using three graphs the Phase I is shown in Fig. 6.
Input:G = Graph dataset, σ = minimum support
G k = set of all subgraphs of size k
Output: Set of frequent subgraphs of cardinality 1 to k
1: k ← 0
2: F k ← Set of vertices in G
3:
4: Ck+1 ← candidate - gen (F k )
5:
6: g . count ← 0
7:
8:
9: g . count ← g . count + 1
10.
11.
12.
13: Fk+1 = Fk+1 ∪ g
14:
15:
16: k ← k + 1
17:
Now to find all the other connected frequent subgraphs, this output serves as the input to Phase II.
Input: G = Graph dataset, σ = minimum support
Output: Set of frequent subgraphs of cardinality 1 to k
1: Store adjacency list (g) , where g ∈ G
2: F0 ← set of vertices in G
3: k ← 1
4:
5: C k ← candidate - gen (Fk-1)
6:
7: support (c i ) = δ (g)/T
8:
9: F k . update (c i )
10:
11:
12: F = F ∪ F k
13: k ← k + 1
14:
15: BuildFS - Tree
16: Traverse FS - Tree for frequent subgraphs
Phase II
Here a tree structure named Frequent Subgraph tree (FS-tree) is used which takes the concept of prefix tree. The frequent two edge graph obtained from phase 1 serves as the candidates to FS-tree. Phase II of the proposed method includes following two-steps. Build a compact data structure, prefix-tree which can save considerable amount of memory for storing the graph transaction by projecting the graph dataset to this tree and is built using two passes over the dataset. Extracts frequent subgraphs directly from tree. Traversal through tree gives the frequent ones.
Building FS-tree
An FS-tree is fundamentally a prefix tree for the graph dataset,in which each path represents a set of graph that share the same prefix and each node represents one subgraph. The root of a tree is created and labelled with NULL. Now the children coming under the root node will be the previously obtained frequent two edge graphs. So the first level of the prefix tree is the frequently occurred two edge subgraphs where each node will have their support value also.
F: set of frequent two-edge subgraphs from Phase I.
Output: a tree with all subgraphs of cardinality 1 to k. F is the set of frequent two-edge subgraphs and support of each frequent subgraph. Create the root node of a FS-tree, T and label it as NULL. Add each subgraph in F as the child C
i
of root node where i={1,2,.,|F|} Call insert tree (S,T), where S is subgraph present in the set of graphs. The function insert tree(S,T) is performed as follows. If T has a child C and S is the one edge expansion of C, then insert S as the child of C, then increment S count by 1.
Now the next levels are obtained by scanning the database once, adding subgraphs that share same prefix as its child and incrementing occurrence by one.
Once the tree is constructed, each node represent the subgraphs in the graph transaction and its support value. Now all the subgraphs, that is less than the predefined or user-specified minimum threshold, can be eliminated. Thus all the subgraphs above the minimum threshold is considered to be frequent. The output of Phase II using minimum threshold as 2 is depicted in Fig. 7. The algorithm for the proposed approach is shown in Algorithm 2.
Experimental evaluation
The performance of the proposed algorithm is experimentally evaluated using real graphs derived from the mutagensis data and graphs that are synthetically generated. The former type of dataset allows to evaluate the effectiveness of the proposed approach for discovering the larger patterns and it’s scalability to large real datasets. The latter type of datasets allows us to evaluate the performance of our proposed approach on datasets whose characteristics changes (eg. number of graphs, average graph size, different support threshold values).
Experiment environment
The experiments were conducted on 2 GB DDR3 RAM and 2.13 GHz Pentium(R) CPU. All experiments were implemented using python. The operating system used was Ubuntu 14.04.
Experiments on synthetic data
For the multiple graph setting experiments, a graph dataset is generated synthetically. To start with, graph dataset with 10 graphs were generated with atmost 8 vertices and atmost 14 edges. Each graph in the dataset is stored in memory using adjacency list.
Initially, this graph dataset is experimented on the proposed approach to find frequent subgraphs. This algorithm was evaluated by changing the minimum threshold value. When minimum threshold was set to 0.1, 52 frequent one edge subgraphs and 20 two edge frequent subgraphs were found. So, when the threshold is very low, the number of frequent subgraph is very large. When minimum threshold is changed to 0.25, 15 frequent subgraphs with one edge and 8 frequent subgraphs with two edge were found. No further frequent subgraphs with 0.25 threshold. With 0.45 as minimum threshold, only 5 frequent subgraph with one edge and no further frequent subgraphs were found. Increasing the minimum threshold value resulted in lesser number of frequent subgraphs.
Later the size of the graph dataset was increased to 20 graphs. Again for lower minimum thresholds, algorithm generated large number of frequent subgraphs of different edges. But when the threshold was increased, the number of frequent subgraphs was very less.
GraphGen is a Graph generator for Neo4j. It generates a collection of labeled, undirected and connected graphs (http://graphgen.graphaware.com/#/). Using Graphgen, 10,000 graphs were generated and experimented on 100 graphs having atmost 20 vertices and 20 edges in each graph. Here also, minimum threshold value was changed as 0.3, 0.4 and 0.5. For minimum threshold 0.3, 60 frequent one edge subgraphs, 26 frequent two edge subgraphs and 10 frequent three edge subgraphs were identified from the FS-tree and no further frequent subgraphs obtained. With 0.4, 14 frequent one edge subgraph,10 frequent two edge subgraphs and 2 frequent three edge subgraphs were found from theFS-tree and no further frequent subgraphs. With 0.5, 8 frequent one edge subgraphs, 6 frequent two edge subgraphs.With 0.6, only two frequent one edge subgraphs with different execution times. From this experiment, it is proved that when the size of graph dataset is increased, the need for multiple database scans in traditional algorithm is eliminated in our algorithm by adding frequent two edge subgraphs to the FS-tree. The Table 1 shows the details of dataset used for evaluating results.
The graph shown in Fig. 8 for execution time against size of graphs for apriori algorithm and proposed algorithm where the execution time is measured in seconds with the minimum threshold as 0.3. From the experiments conducted, it can be concluded that as the number of input graphs in the graph dataset increases, the execution time in traditional algorithm also increases due to the multiple dataset scans. The comparison shows that proposed approach took less time than the traditional apriori algorithm. The graph shown in Figs. 9 and 10 for execution time against minimum threshold for 10 graphs and 1000 graphs respectively.
Experiments on real data
Experiment evaluation on real dataset is a must because it helps to evaluate the effectiveness of the proposed approach for finding the larger subgraphs and its scalability to other real world datasets. Experiments were conducted on the Mutagenesis data (http://www.eecs.wsu.edu/mgd/gdb.html). The Mutagenesis data describes several chemical compounds and classifies them as mutagenic or not mutagenic and which consists of 188 chemical compounds. Here, each chemical compound is taken as a graph. Thus, the mutagensis dataset contains 188 graphs. Each chemical compound data is preprocessed and converted to a graph format. To store this data in memory, adjacency list representation was used. The graph shown in Fig. 11 for execution time against varying minimum threshold for mutagensisdata.
Initially, started with the lowest minimum threshold 0.1. Around 70 frequent one edge subgraphs where discovered and 58 frequent two edge subgraphs. This is given to phase 2 where the FS-tree is constructed. There are 10 frequent three edge subgraphs obtained from the tree. The variations in the minimum threshold and execution time is similar to the earlier graphs we discussed.
Thus from the experiment conducted, it can be concluded that the execution time of the two phase approach for varying minimum threshold is less than the traditional algorithm because the multiple dataset scans where eliminated through the two phase approach. When the traditional apriori algorithm was implemented, it is found that the candidate generation step generates large numbers of subgraphs. Thus the algorithm attempts to load up the candidate set with as many as possible before each scan. As a result the time taken for execution is more. But the proposed approach, after finding all the two frequent subgraphs, each two frequent subgraph is added to the tree as nodes.
Then all the subgraphs that can be formed from the two frequent subgraph is added to the tree. Here, it is observed that the number of candidate subgraphs generated by the proposed method is lesser than that of the traditional algorithm.
From the experimental results it can be seen that Apriori takes more time to find frequent subgraphs and scan the dataset multiple times. The Table 2 shows the execution time for different input size and different minimum threshold with a comparison of two algorithms. The Table 3 shows the comparison of execution time of proposed method with FSG and SUBDUE algorithm with support level 10%.
Conclusion
Thus in this paper, an integrated technique that combines FS-Tree with the Apriori candidate generation is proposed so that the method overcomes the limitations of the Apriori. When tested with the synthetic graph dataset we understood the limitation of Apriori algorithm, that it has to scan the database multiple times. From the results it is evident that as the number of graphs in the input graph database increases the database scan increases and the algorithm has to generate many candidate subgraphs which are not even necessary. As the size of the input graph database increases, it is much expensive to handle a large number of candidate sets and also is tiresome to repeatedly scan the dataset and check a large set of candidates by pattern matching. The small sized k edge frequent subgraphs obtained from Apriori algorithm with the FS-tree are combined so as to overcome the limitation of Apriori algorithm. The experimental work on synthetic dataset and real dataset shows that the proposed method outperforms the apriori algorithm.
Footnotes
Acknowledgments
We extend our sincere gratitude towards Amrita Vishwa Vidyapeetham for providing the facilities for successfully completing this work. We thank Dr. M.R. Kaimal, Chairman of Department of Computer Science and Engineering for his valuable guidance and insightful comments during this work.
