Abstract
Mining maximal frequent patterns is significant in many fields, but the mining efficiency is often low. The bottleneck lies in too many candidate subgraphs and extensive subgraph isomorphism tests. In this paper we propose an efficient mining algorithm. There are two key ideas behind the proposed methods. The first is to divide each edge of every certain graph (converted from equivalent uncertain graph) and build search tree, avoiding too many candidate subgraphs. The second is to search the tree built in the first step in order, avoiding extensive subgraph isomorphism tests. The evaluation of our approach demonstrates the significant cost savings with respect to the state-of-the-art approach not only on the real-world datasets as well as on synthetic uncertain graph databases.
Introduction
With the emergence of large data, more and more people are concerned about data mining. The key technologies of them are to visualize the data. For example, the data objects are represented as nodes and the relationships among them are represented as edges. In this way data and relation are abstracted into graphs and therefore data mining is transformed into graph mining.
Formerly, the mining frequent subgraph patterns often focused on the certain graph (data of which is precise and complete) on which the association probability between every 2 vertexes was 1. However, in the real world the association among the vertices is uncertain due to errors or delays from data transmission or acquisition in the measurement system. Uncertainty is the intrinsic nature of the real world. Fig. 1 shows how a real world is correspondingly converted into certain graph and then converted into a corresponding uncertain one.

G1, G2 and G3 represent PPI network, certain graph and uncertain graph respectively.
In many fields (e.g. wireless sensor network (WSN), protein-protein interaction(PPI) network, intelligent transportation system, the social network, road network etc.), a lot of key information is often included in frequent patterns, so the work efficiency can be greatly improved if many frequent subgraph patterns are mined effectively. For example, by mining frequent patterns in PPI network, we can predict the relationship among proteins more effectively [1] and by mining frequent patterns in WSN, we can design routing protocols faster [2].
Given an uncertain graph database D as shown in Fig. 2, how can we mine frequent patterns in it efficiently? Before answer this question, we have to know the possible world semantic model [3, 4]. As for an uncertain graph G with n edges, there will be 2 n possible worlds. For example, the uncertain graph G1 in Fig. 2 implicates 8 possible worlds as shown in Fig. 3. The subgraph pattern g1 in Fig. 2 is embedded in the implicated graphs g5, g8 in Fig. 3, that is to say, g1 is subgraph isomorphic to g5, g8. Hence the expected support of g1 on G1 is sum of existence probability [5] of possible world g5, g8, namely 0.42. Similarly, the expected support of g1 on G2 is 0.42 and the expected support of g1 in D is 0.42 + 0.42 = 0.84. If g is a frequent subgraph pattern and the expected support of g is greater than user specific threshold, say 0.82, it’s no doubt that g1 in Fig. 2 is a frequent subgraph pattern. But note that g2 in Fig. 2 also satisfies the rule (0.9 > 0.82), is g2 also a subgraph pattern? It is obviously not because it occurs in D only once. Hence it is necessary to explain what is essence of “frequent”. In practice frequent pattern means that some data signals occur relatively more times, so “frequent” in certain graph database means that occurrence number of the pattern in database is relatively greater. But in uncertain graph database the situation changes. Some patterns which have lower occurrence frequencies, namely the numbers of occurrences, are with higher existence probabilities while others with higher occurrence frequencies have lower existence probabilities. As mentioned above, existence probability is generated from errors, delays occurring in data transmission or acquisition in the measurement system, so the patterns with lower existence probability may imply key information due to their higher occurrence frequencies. In short, as for uncertain graph database, “frequent” means higher occurrence frequency first of all, that is to say, the frequent patterns mined from uncertain graph database should guarantee higher occurrence frequency in the first.

Subgraph pattern in uncertain graph database.
Motivated by the above issues, we propose an efficient algorithm (LMFP) to mine maximal frequent subgraph patterns in uncertain graph database. There are two key ideas behind the proposed methods. The first is to divide each edge of every certain graph (converted from equivalent uncertain graph) in its frequency to guarantee that each selected edge is the most frequent, avoiding too many candidate subgraphs. The second key idea is to search always from upper to lower layers, from left to right, avoiding extensive subgraph isomorphism tests.
There are mainly three steps as below: Divide graph edges into sets: All the uncertain graphs are transformed to certain graphs and graph edges are divided into several categories in descending order of their frequency on certain graph. Build layered search tree: Build a three-dimensional search tree according to the edge sets divided in the first step. Search in layered tree of space: Search from upper to lower layers, from left to right in three-dimensional layered tree of space until K maximal frequent patterns are mined.
In conclusion, the main contributions of this paper are as follows. Scan the database only once, which reduces the running time greatly. Design an edge set divided algorithm in descending order of their frequency, which reduces the processing time from exponential complexity to quadratic complexity. Build a three-dimensional layered search tree with edge sets. This guarantees the efficiency and orderliness in search. Design a layered search algorithm to guarantee that subgraph mined each time is necessarily frequent, avoiding extensive subgraph isomorphism tests. Design a formula for calculating the threshold of the minimum layer number to improve the search efficiency.
The rest of paper is organized as follows: Section 2 reviews the related work. Section 3 defines the problem of the maximal frequent patterns in uncertain graph database. Section 4 presents our proposed LMFP algorithms. We present the experimental results in section 5 and conclude the paper in section 6.
The frequent pattern mining has been widely studied.
Certain frequent subgraph pattern mining
There are many frequent subgraph pattern mining algorithms on certain graph, which can be classified mainly into 2 categories. One is named level-wise approach, basing on a framework of the Apriori algorithm [6] and using a “bottom up” strategy, such as AGM, FSG. The other is named pattern-growth approach, basing on a framework of FP-Growth [7], such as gSpan, FFSM, CloseGraph etc. In this approach the number of occurrences of every graph is counted in the dataset and a FP-tree structure is built by inserting instances. Then graphs are sorted in descending order according to the occurrence frequency in the database.
Recently Vandana Bhatia [8] proposed a novel approximate subgraph mining algorithm implemented on distributed graph environment. The algorithm uses a novel two-step optimization to prune performing subgraph, overcoming the challenges of performing frequent subgraph mining on a massive large graph.
Uncertain frequent itemset mining
Until now a lot of research has been done on mining frequent itemset not only in uncertain databases [9–13] but also in uncertain data streams [14–15]. Similarly, uncertain frequent itemset mining is mainly classified into two categories, level-wise approach [10, 17] and pattern-growth approach [13, 20]. U-Apriori [10] was first proposed to mine uncertain frequent itemset. When it scans the given database each time, U-Apriori generates candidate patterns with length k + 1 out of patterns with length k. Hence U-Apriori has to scan the database many times. MBP [17] extracts valid pattern results from uncertain databases with relatively high efficiency due to applying the Poisson Cumulative Distribution Function. Another algorithm IMBP [16] reduces the number of candidate item sets, which shortens running time but loses accuracy.
Unlike level-wise approach, pattern-growth approach scans database only 2 times and none of candidate patterns is generated during the whole process. UF-Growth [19], a variation of the FP-tree, performs its own mining operations by employing a UF-tree. In UF-tree, each node stores not only item but also occurrence frequency and expected support. By this way, UF-Growth is more efficient than FP-tree mining algorithm.
Uncertain subgraph
Yuan employed a filtering-and-verification framework to speed up the pattern matching on big uncertain graphs [21]. Chen Lin constructed the cycle index of an uncertain random graph and presented a method to calculate the cycle index of an uncertain random graph [22]. Xiulian Gao proposed the concept of α-connected graph, α-connectedness index and then showed the computing method [23].
Uncertain frequent subgraph pattern mining
Zou et al. [24, 25] first proposed uncertain frequent subgraph pattern mining, then a new concept named expected support was proposed to evaluate the subgraph frequency [26]. Papapetrou et al. [27] proposed an index used to reduce the number of comparisons, which increases the mining efficiency. Mohamed Moussaoui et al. [28] proposed a novel mining approach on possibilistic flexible graph. The approach uses two similarity measures to approximate graph distance and to get possibilistic information of the semantic vertices, which can reduce time consumption.
Problem definition
The main symbols and notations used in the paper are summarized in Table 1. We use uppercase letter D for uncertain databases and G for uncertain graph of D. We use lowercase letter g to represent certain graph converted correspondingly by G. Hence G in D denotes an uncertain graph G in uncertain graph database D and gconby_G denotes a certain graph g converted correspondingly from uncertain graph G. We use g ⊆ T g′ to denote that g is subgraph isomorphic to g′. A minimum support threshold value is denoted by min _ sup and expected support of g on D is denoted by sup(g, D). It is denoted as G ⇒ g, that is, a certain graph g is implicated by uncertain graph G. We use p (e) to denote existence probability of edge e.
Symbols and Definition
Symbols and Definition
Obviously, if the existence probability of all the graph edges is 1, the uncertain graph is converted into a certain graph.
As shown in Fig. 1, G1 is a protein-protein interaction (PPI) network, on which each rectangular box denotes a protein, each line denotes interaction between 2 proteins and line thickness indicates the interaction strength between 2 proteins. G2 is a certain graph converted from G1, on which each vertex represents a protein correspondingly and each line proteins interaction correspondingly. G3 is a uncertain graph converted correspondingly from G2, on which association probability of each edge (substitute for line thickness) between 2 proteins indicates the interaction strength.
Obviously, as for an uncertain graph G with n edges, there will be 2
n
possible worlds as shown in Fig. 3. If a certain graph g is implicated by G [25], namely G ⇒ g, as for each edge e ∈ G, the existence probability of possible world g is calculated by

Possible worlds of G1 associated with a probability.
Here f is an injection from V to V′. l, l′ are the vertex label assignment functions respectively in g, g′.

Graph g2 is subgraph isomorphic to graph g1.
Subgraph isomorphism is a # P-complete problem [30] and therefore reducing subgraph isomorphism tests can decrease computational cost greatly.
we can get.
Secondly, we prove sup(g, D) = p (g, p, G
i
) × k. As mentioned above in definition 5,
No matter which edge is added to g1, its support is always less than the given min _ sup. So g1 is a maximal frequent pattern.
In this paper we study how to mine K maximal frequent patterns in the uncertain graph database. This problem can be formally defined as follows.
The algorithm mainly consists of three steps: Divide graph edge sets (EDCG): All the uncertain graphs are transformed to certain graphs and graph edges are divided into several categories in descending order according to their frequency on certain graph. Build layered search tree (BSSL): Build a three-dimensional layered search tree according to the edge sets divided in the first step. Search in layered tree space (SKFP): Search from upper to lower layers, from left to right in three-dimensional layered tree of space until K maximal frequent patterns are mined.
Divide edge into sets
Graph edges are divided into several categories in descending order according to their frequency on certain graph
In this step, all the uncertain graphs are transformed to certain graphs. That is, existence probability of each edge on uncertain graph is set to 1. Then graph edges on certain graph are divided into several categories in descending order of their frequency. Here e
i
with frequency (number of occurrences in database) of m is denoted by
All uncertain graphs are transformed into certain graphs and edge sets are divided into Edges of set E
m
are redivided. If As for Repeat the 2nd, 3rd step until E1 is derived.
This is shown as below.

Uncertain graph database D.
In this step, we build a layered search space. The key technology is that all the edge sets from the same mother graph are in the same three-dimensional space and there are no edges connecting different spaces, even if there are common vertexes in them. In the same three-dimensional space, edge sets with different occurrence frequency are on the different layer and if there are common vertexes in them, connecting them with virtual edges.
This can be summarized in Algorithm 2.
Put Put Repeat the 2nd step until all the edge sets are in search space. From upper to lower layer, if Let g
i
1
be encoded with minimum DFS code [31]. Note that minimum DFS code begins from upper to lower layer according to the sequence Since maximum value of i1 is n, there may be n′ graphs of g
i
1
(Here n′ ⩾ n). Transform
g i 1 back to corresponding uncertain graph G and we can get search graph set G = { G1, G2, . . . , Gn′ }.
This can be shown as below.
As for
Put g in search space and we can get min(d) = k. Hence

Build search space layered.
In this step we choose any one of graph sets G = { G1, G2, . . . , G n } achieved in algorithm 2, add an edge of G i into subgraph of g each time in the order of minimum DFS code and compute its value of sup. If sup(g ∪ e) < min _ sup, then g is a frequent subgraph. Backtrack and keep searching until K maximal frequent patterns are achieved.
This can be summarized in Algorithm 3.
Choose G
i
out of graph set G = { G1, G2, . . . , G
n
}. Choose an edge of G
i
in the order of minimum DFS code and add it to subgraph of g. If sup(g ∪ e) < min _ sup, then g is a frequent subgraph. Add it to Q. If not all of edges of G
i
are visited, then backtrack and turn to the step 2, else if not all the K maximal frequent patterns are mined, turn to the step 1. If K maximal frequent patterns are all mined, then end.
This is shown as below.
The value of sup(gconby_G, D)
The value of sup(gconby_G, D)
Let min _ sup = 0 . 6 and we can get sequence number 1,2,3,4,6,7 can satisfy sup(g) ⩾ min _ sup. Since the frequency of number 3,6,7 is 1, that is to say, they occur in D only once, they shouldn’t be chosen as maximal frequent patterns (in practice we prune the branch on the lower layers). Hence the maximal frequent patterns are number 1,2,4 as shown in Fig. 7.

Maximal frequent patterns with min _ sup = 0.6.
In this section we present the extensive experimental results and performance evaluation for LMFP. Since each subgraph achieved from algorithms is necessarily frequent, we design experiments only to evaluate the efficiency and the scalability. We use one dataset with real uncertain edges and other three datasets with synthetic uncertain edges to exam the performance of LMFP. Then we compare it with state-of-the-art approach, namely UGRAP [27] and MUSE [25].
Our methods are implemented on a Windows 7 machine with 2.5 GHz CPU of single core and 4GB RAM. Programs are all implemented in Visual C++6.0.
Datasets
The characteristics of these datasets above are shown in Table 3.
Datasets Summary
Datasets Summary
As we can see, different characteristics in different database are shown as dataset property, dataset category, number of graph, average edges per graph, average nodes per graph, average existence probability value per edge and number of distinct labels, that is to say, different characteristics lie in four aspects, the authenticity of edge probability, the number of graphs, the size of the graphs and the number of distinct labels. In four datasets, PPI is the only one with real edge probability while other three with synthetic edge probability. DTP has the largest number of graphs among the four datasets while its average edges and average nodes are smallest. In TCGA The average existence probability value per edge is the highest for 0.44. GeneCard contains the largest number of distinct labels for 11254 while DTP the smallest for 28.
As we know from section 4, every result of LMFP is necessarily maximal frequent pattern. So we don’t have to evaluate the precision and because we only find K maximal frequent patterns, the recall is not necessarily to be evaluated, either.
So we performed extensive experiments to evaluate the efficiency, the scalability and the memory usage. For comparison, we use 2 state-of-the-art approaches, UGRAP [27] and MUSE [25] The default parameter values ɛ, δ and min _ sup of MUSE were set to 0.1,0.1 and 0.3 respectively. UGRAP index was set to construct 2 Bloom filters per graph, covering paths of length 2 and 3. The default parameter values K and min _ sup of LMFP were set to 20 and 0.3 respectively. The experimental results are shown as below.
Efficiency
We first investigated the time efficiency of LMFP on the 4 databases with respect to the threshold min _ sup and the parameter K. Fig. 8(a) shows in PPI the execution time of 3 algorithms decreases when min _ sup varies from 0.3 to 0.8. This is because the higher min _ sup is, the less edges are needed for result frequent subgraph, then search time decreases naturally.

Execution time vs. support threshold min _ sup for the datasets (a) PPI, (b) DTP, (c) TCGA and (d) GeneCard.
We can also find the execution time increases rapidly while min _ sup is less than 0.4, because the edges of frequent subgraph increase obviously. Similarly, the execution time decreases with min _ sup increasing in other 3 synthetic databases as shown in Fig. 8(b) (c) (d). We can find execution time of algorithm LMFP is the shortest in 4 datasets and the difference of execution time becomes more apparent while the min _ supvalue is less than 0.4.
Fig. 9 shows the execution time of 3 steps (EDCG, BSSL, SKFP) in our algorithm. As we expected, the total execution time decreases when min _ sup becomes higher. Generally speaking, step 1st takes longer time than step 2 and step 3, because more time has to be spent in it to divide edges into sets.

Execution time of 3 steps (EDCG, BSSL, SKFP) vs. min _ sup.
Extensive experiments are also performed to examine the scalability of LMFP. For comparison, we use scalability evaluation methodology [24], that is to say, the size of datasets grows by duplicating the uncertain graphs in the database.
Fig. 10 plots the execution time of the three algorithms with respect to the number of duplications. As we can see, the execution time increases with the increase of the number of duplications. The increase is basically linear for all algorithms, LMFP increases more slowly due to our methods of dividing edges and building search space layered (The execution times of 3 steps are respectively shown in Fig. 9).

Execution time vs. number of duplications.
We also measured the memory usage. The result is shown in Fig. 11. The memory usage of 3 algorithms increases almost linearly to the increasing of the number of duplications. We can find the memory usage of MUSE is larger than the other 2 algorithms. In 3 algorithms LMFP has the smallest memory usage.

Memory usage vs. number of duplications.
In this paper we present a layered and efficient mining algorithm for maximal frequent patterns in uncertain graph database. There are two key ideas behind the proposed methods. The first is to divide each edge of every certain graph (converted from equivalent uncertain graph) in its frequency to guarantee that the edge selected each time is the most frequent, avoiding a large number of candidate subgraphs. The second is to search always from upper to lower, from left to right, avoiding extensive subgraph isomorphism tests. The evaluation of our approach demonstrates the significant cost savings with respect to the state-of-the-art approach not only on the real-world dataset but also on synthetic uncertain graph databases.
Footnotes
Acknowledgments
This work was supported by General Project of Hunan Provincial Education Department (NO. 17C0397) and Youth key Project of Hunan Institute of Engineering (NO. XJ1504).
