An efficient mining algorithm for maximal frequent patterns in uncertain graph database

Abstract

Mining maximal frequent patterns is significant in many fields, but the mining efficiency is often low. The bottleneck lies in the large number of candidate subgraphs and the extensive subgraph isomorphism tests. In this paper we propose an efficient mining algorithm. There are two key ideas behind the proposed method. The first is to divide the edges of every certain graph (converted from the equivalent uncertain graph) into sets and build a search tree, which avoids generating too many candidate subgraphs. The second is to search the tree built in the first step in order, which avoids extensive subgraph isomorphism tests. The evaluation of our approach demonstrates significant cost savings with respect to the state-of-the-art approach, not only on real-world datasets but also on synthetic uncertain graph databases.

Keywords

Uncertain graph, maximal frequent pattern, data mining

1 Introduction

With the emergence of big data, more and more people are concerned with data mining. One of its key techniques is to represent the data visually. For example, the data objects are represented as nodes and the relationships among them are represented as edges. In this way data and relations are abstracted into graphs, and data mining is therefore transformed into graph mining.

Formerly, frequent subgraph pattern mining often focused on certain graphs (whose data is precise and complete), on which the association probability between every 2 vertices is 1. However, in the real world the association among vertices is uncertain due to errors or delays in data transmission or acquisition in the measurement system. Uncertainty is an intrinsic property of the real world. Fig. 1 shows how a real-world network is converted into a certain graph and then into a corresponding uncertain one.

Fig. 1

G₁, G₂ and G₃ represent PPI network, certain graph and uncertain graph respectively.

In many fields (e.g. wireless sensor networks (WSN), protein-protein interaction (PPI) networks, intelligent transportation systems, social networks and road networks), a lot of key information is contained in frequent patterns, so work efficiency can be greatly improved if frequent subgraph patterns are mined effectively. For example, by mining frequent patterns in a PPI network we can predict the relationships among proteins more effectively [1], and by mining frequent patterns in a WSN we can design routing protocols faster [2].

Given an uncertain graph database D as shown in Fig. 2, how can we mine frequent patterns in it efficiently? Before answering this question, we need the possible world semantic model [3, 4]. An uncertain graph G with n edges has 2ⁿ possible worlds. For example, the uncertain graph G₁ in Fig. 2 implicates 8 possible worlds, as shown in Fig. 3. The subgraph pattern g₁ in Fig. 2 is embedded in the implicated graphs g₅ and g₈ in Fig. 3; that is to say, g₁ is subgraph isomorphic to g₅ and g₈. Hence the expected support of g₁ on G₁ is the sum of the existence probabilities [5] of the possible worlds g₅ and g₈, namely 0.42. Similarly, the expected support of g₁ on G₂ is 0.42, so the expected support of g₁ in D is 0.42 + 0.42 = 0.84. If a subgraph pattern is regarded as frequent whenever its expected support is greater than a user-specified threshold, say 0.82, then g₁ in Fig. 2 is without doubt a frequent subgraph pattern. But note that g₂ in Fig. 2 also satisfies this rule (0.9 > 0.82). Is g₂ also a frequent subgraph pattern? It is not, because it occurs in D only once. Hence it is necessary to explain the essence of “frequent”. In practice a frequent pattern means that some data signals occur relatively many times, so “frequent” in a certain graph database means that the number of occurrences of the pattern in the database is relatively large. In an uncertain graph database the situation changes. Some patterns with lower occurrence frequencies (numbers of occurrences) have higher existence probabilities, while others with higher occurrence frequencies have lower existence probabilities. As mentioned above, uncertainty in the existence probability arises from errors and delays in data transmission or acquisition in the measurement system, so patterns with lower existence probability may still carry key information because of their higher occurrence frequencies. In short, for an uncertain graph database, “frequent” means first of all a higher occurrence frequency; that is to say, the frequent patterns mined from an uncertain graph database should first guarantee a higher occurrence frequency.

Fig. 2

Subgraph pattern in uncertain graph database.

Motivated by the above issues, we propose an efficient algorithm (LMFP) to mine maximal frequent subgraph patterns in an uncertain graph database. There are two key ideas behind the proposed method. The first is to divide the edges of every certain graph (converted from the equivalent uncertain graph) by frequency, which guarantees that each selected edge is the most frequent one and avoids generating too many candidate subgraphs. The second is to always search from upper to lower layers and from left to right, which avoids extensive subgraph isomorphism tests.

There are mainly three steps as below:

Divide graph edges into sets: All the uncertain graphs are transformed to certain graphs and graph edges are divided into several categories in descending order of their frequency on certain graph.

Build layered search tree: Build a three-dimensional search tree according to the edge sets divided in the first step.

Search in layered tree of space: Search from upper to lower layers, from left to right in three-dimensional layered tree of space until K maximal frequent patterns are mined.

In conclusion, the main contributions of this paper are as follows.

Scan the database only once, which reduces the running time greatly.

Design an edge set division algorithm that sorts edges in descending order of frequency, which reduces the processing time from exponential complexity to quadratic complexity.

Build a three-dimensional layered search tree with edge sets. This guarantees the efficiency and orderliness in search.

Design a layered search algorithm to guarantee that subgraph mined each time is necessarily frequent, avoiding extensive subgraph isomorphism tests.

Design a formula for calculating the threshold of the minimum layer number to improve the search efficiency.

The rest of the paper is organized as follows: Section 2 reviews the related work. Section 3 defines the problem of mining maximal frequent patterns in an uncertain graph database. Section 4 presents our proposed LMFP algorithm. We present the experimental results in Section 5 and conclude the paper in Section 6.

2 Related work

The frequent pattern mining has been widely studied.

2.1 Certain frequent subgraph pattern mining

There are many frequent subgraph pattern mining algorithms on certain graphs, which can be classified mainly into 2 categories. One is the level-wise approach, which is based on the framework of the Apriori algorithm [6] and uses a “bottom up” strategy, such as AGM and FSG. The other is the pattern-growth approach, which is based on the framework of FP-Growth [7], such as gSpan, FFSM and CloseGraph. In this approach the number of occurrences of every graph in the dataset is counted and an FP-tree structure is built by inserting instances. Then the graphs are sorted in descending order of their occurrence frequency in the database.

Recently Vandana Bhatia [8] proposed a novel approximate subgraph mining algorithm implemented on distributed graph environment. The algorithm uses a novel two-step optimization to prune performing subgraph, overcoming the challenges of performing frequent subgraph mining on a massive large graph.

2.2 Uncertain frequent itemset mining

Until now a lot of research has been done on mining frequent itemsets, not only in uncertain databases [9–13] but also in uncertain data streams [14–15]. Similarly, uncertain frequent itemset mining is mainly classified into two categories, the level-wise approach [10, 17] and the pattern-growth approach [13, 20]. U-Apriori [10] was the first algorithm proposed to mine uncertain frequent itemsets. Each time it scans the given database, U-Apriori generates candidate patterns of length k + 1 from patterns of length k. Hence U-Apriori has to scan the database many times. MBP [17] extracts valid patterns from uncertain databases with relatively high efficiency by applying the Poisson cumulative distribution function. Another algorithm, IMBP [16], reduces the number of candidate itemsets, which shortens the running time but loses accuracy.

Unlike the level-wise approach, the pattern-growth approach scans the database only 2 times and generates no candidate patterns during the whole process. UF-Growth [19], a variation of FP-Growth, performs its mining operations on a UF-tree. In a UF-tree, each node stores not only the item but also its occurrence frequency and expected support. In this way, UF-Growth is more efficient than the FP-tree mining algorithm.

2.3 Uncertain subgraph

Yuan employed a filtering-and-verification framework to speed up the pattern matching on big uncertain graphs [21]. Chen Lin constructed the cycle index of an uncertain random graph and presented a method to calculate the cycle index of an uncertain random graph [22]. Xiulian Gao proposed the concept of α-connected graph, α-connectedness index and then showed the computing method [23].

2.4 Uncertain frequent subgraph pattern mining

Zou et al. [24, 25] first proposed uncertain frequent subgraph pattern mining, then a new concept named expected support was proposed to evaluate the subgraph frequency [26]. Papapetrou et al. [27] proposed an index used to reduce the number of comparisons, which increases the mining efficiency. Mohamed Moussaoui et al. [28] proposed a novel mining approach on possibilistic flexible graph. The approach uses two similarity measures to approximate graph distance and to get possibilistic information of the semantic vertices, which can reduce time consumption.

3 Problem definition

The main symbols and notations used in the paper are summarized in Table 1. We use the uppercase letter D for an uncertain graph database and G for an uncertain graph of D. We use the lowercase letter g to represent the certain graph converted from G. Hence G_{in_D} denotes an uncertain graph G in the uncertain graph database D, and g_{conby_G} denotes a certain graph g converted from the uncertain graph G. We use g ⊆_T g′ to denote that g is subgraph isomorphic to g′. The minimum support threshold is denoted by min_sup and the expected support of g on D is denoted by sup(g, D). G ⇒ g denotes that a certain graph g is implicated by the uncertain graph G. We use p(e) to denote the existence probability of edge e.

Table 1
Symbols and Definition

Symbols	Definition
G_{in_D}	an uncertain graph G in uncertain graph database D
g_{conby_G}	a certain graph g converted from uncertain graph G
g ⊆_T g′	g is subgraph isomorphic to g′
min_sup	minimum support threshold value
sup(g, D)	expected support of g on D
p(e)	existence probability value of e
G ⇒ g	a certain graph g is implicated by uncertain graph G
$p(\bar{e} \| e)$	existence probability value of e or nonexistence probability value of e
$e_{i}^{m}$	edge e_i with frequency m
E^m	edge set E in which all the edges have the same frequency m
|G|	the number of edges on graph G

3.1 Uncertain graph model

Definition 1 (Uncertain edge). As for an edge e of uncertain graph G, if e ∈ G, then p (e) ∈ (0, 1].

Obviously, if the existence probability of all the graph edges is 1, the uncertain graph is converted into a certain graph.

As shown in Fig. 1, G₁ is a protein-protein interaction (PPI) network, in which each rectangular box denotes a protein, each line denotes an interaction between 2 proteins and the line thickness indicates the interaction strength between the 2 proteins. G₂ is a certain graph converted from G₁, in which each vertex represents the corresponding protein and each edge represents the corresponding protein interaction. G₃ is an uncertain graph converted from G₂, in which the association probability of each edge (a substitute for line thickness) indicates the interaction strength between 2 proteins.

Definition 2 (Uncertain graph). Let V, E and Σ denote respectively the set of nodes, edges and labels, then system G = ((V, E) , Σ, L, P) is an uncertain graph, where L : V → 2^Σ is a function assigning labels to vertices or edges and P denotes the probability associated with the edges. Here we assume that G is an undirected graph and there is no influence between 2 edges, which exists widely in practice [1 , 29].

Definition 3 (Possible worlds). For an uncertain graph G = (V, E, P), g = (V′, E′) is a possible world of G if the following conditions are satisfied.

$\forall v \in V, v \in V^{'}$ (1)

$\forall e \in E^{'}, e \in E$ (2)

Obviously, an uncertain graph G with n edges has 2ⁿ possible worlds, as shown in Fig. 3. If a certain graph g is implicated by G [25], namely G ⇒ g, the existence probability of the possible world g is calculated over all edges e of G by $P (G \Rightarrow g) = \prod_{e \in g} P (e) \prod_{e \notin g} (1 - P (e))$ .

Fig. 3

Possible worlds of G₁ associated with a probability.

Example 1. As shown in g₄ of Fig. 3, there are only 2 edges, e₂ and e₃. The existence probabilities of e₂ and e₃ are both 0.6 while the nonexistence probability of e₁ is 0.3. The existence probability of possible world g₄ is P (g₄) = 0.6 × (1 - 0.7) × 0.6 = 0.108.
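The computation in Example 1 can be reproduced with a minimal Python sketch. The edge names and the dictionary `EDGE_PROB` are assumptions chosen to match the probabilities quoted in Example 1 (p(e₁) = 0.7, p(e₂) = p(e₃) = 0.6); Fig. 3 itself is not reproduced here.

```python
from itertools import product
from math import isclose, prod

# Assumed edge probabilities, taken from the numbers quoted in Example 1.
EDGE_PROB = {"e1": 0.7, "e2": 0.6, "e3": 0.6}


def world_probability(present, edge_prob):
    """P(G => g): multiply p(e) for present edges and 1 - p(e) for absent ones."""
    return prod(p if e in present else 1.0 - p for e, p in edge_prob.items())


def possible_worlds(edge_prob):
    """Yield every edge subset (2^n possible worlds) with its probability."""
    edges = list(edge_prob)
    for mask in product([False, True], repeat=len(edges)):
        present = frozenset(e for e, keep in zip(edges, mask) if keep)
        yield present, world_probability(present, edge_prob)


# Possible world g4 of Example 1: e2 and e3 exist, e1 does not.
p_g4 = world_probability({"e2", "e3"}, EDGE_PROB)
worlds = list(possible_worlds(EDGE_PROB))
total = sum(p for _, p in worlds)
```

Here `p_g4` evaluates to approximately 0.108, and the probabilities of the 8 possible worlds sum to 1, as expected for a probability distribution over possible worlds.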

Definition 4 (Subgraph isomorphism). A certain graph g = ((V, E) , Σ, L) is subgraph isomorphic to another one g′ = ((V′, E′) , Σ′, L′), denoted by g ⊆_T g′, if they satisfy the following conditions.

$\forall v \in V, f (v) \in V^{'}$ (3)

$\forall v \in V, l (v) = l^{'} (f (v))$ (4)

$\forall (u, v) \in E, (f (u), f (v)) \in E^{'}, l (u, v) = l^{'} (f (u), f (v))$ (5)

Here f is an injection from V to V′. l, l′ are the vertex label assignment functions respectively in g, g′.

Example 2. Graph g₂ is subgraph isomorphic to graph g₁ as shown in Fig. 4.

Fig. 4

Graph g₂ is subgraph isomorphic to graph g₁.

Subgraph isomorphism testing is an NP-complete problem [30], and therefore reducing the number of subgraph isomorphism tests can decrease the computational cost greatly.
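To make Definition 4 concrete, the following Python sketch checks subgraph isomorphism by brute force on small graphs. It is only an illustration under stated assumptions: the function name and the dictionary-based graph encoding are hypothetical, graphs are undirected, and edge labels (the second part of condition (5)) are ignored. Its running time grows factorially with the graph size, which is exactly why the tests are worth avoiding.

```python
from itertools import permutations


def is_subgraph_isomorphic(g_labels, g_edges, h_labels, h_edges):
    """Return True if an injection f from g's vertices into h's vertices
    preserves vertex labels (condition 4) and edges (condition 5).

    Assumption of this sketch: edge labels are ignored.
    """
    g_nodes = list(g_labels)
    h_edge_set = {frozenset(e) for e in h_edges}
    # Each permutation of len(g_nodes) distinct vertices of h is one injection f.
    for image in permutations(h_labels, len(g_nodes)):
        f = dict(zip(g_nodes, image))
        if any(g_labels[v] != h_labels[f[v]] for v in g_nodes):
            continue
        if all(frozenset((f[u], f[v])) in h_edge_set for u, v in g_edges):
            return True
    return False


# Hypothetical example: a labelled triangle and two candidate patterns.
triangle_labels = {1: "A", 2: "B", 3: "C"}
triangle_edges = [(1, 2), (2, 3), (1, 3)]
edge_ab = is_subgraph_isomorphic({"x": "A", "y": "B"}, [("x", "y")],
                                 triangle_labels, triangle_edges)
edge_aa = is_subgraph_isomorphic({"x": "A", "y": "A"}, [("x", "y")],
                                 triangle_labels, triangle_edges)
```

The pattern with labels A and B is found in the triangle, while the pattern with two A-labelled vertices is not, because the triangle contains only one vertex labelled A.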

Definition 5 (Expected support). For a certain subgraph g_{conby_G_{in_D}} and each uncertain graph G_i in D, given the probability p(g, p, G_i), if g is subgraph isomorphic to G_i, then p = p(g, p, G_i), otherwise p = 0. The expected support of g in D is defined as follows. $sup (g, D) = \sum_{G_{i} \in D} p$ (6)

Theorem 1. The expected support of g_{conby_G_{in_D}} can be calculated by sup(g, D) = p(g, p, G_i) × k, where k is the number of occurrences of the graph g in D.

Proof. We assume that there are K edges on G_i and j edges on g. Firstly, we prove that g_{conby_G_{in_D}} is subgraph isomorphic to G_i with the same probability p(g, p, G_i), namely $\prod^{j} p (e_{j})$ .

Summing over the $2^{K - j}$ possible worlds of G_i that contain g, and writing $p (\bar{e}_{t} \| e_{t})$ for $p (e_{t})$ when edge $e_{t}$ exists in the possible world and for $p (\bar{e}_{t})$ when it does not, we get

$\begin{aligned} p (g, p, G_{i}) &= \sum_{1}^{2^{K - j}} \Big( \prod^{j} p (e_{j}) \prod_{t = j + 1}^{K} p (\bar{e}_{t} \| e_{t}) \Big) \\ &= \sum_{1}^{2^{K - j - 1}} \Big( \prod^{j} p (e_{j}) \prod_{t = j + 1}^{K - 1} p (\bar{e}_{t} \| e_{t}) \Big) \big( p (e_{K}) + p (\bar{e}_{K}) \big) \\ &= \sum_{1}^{2^{K - j - 2}} \Big( \prod^{j} p (e_{j}) \prod_{t = j + 1}^{K - 2} p (\bar{e}_{t} \| e_{t}) \Big) \big( p (e_{K - 1}) + p (\bar{e}_{K - 1}) \big) \\ &= \dots \\ &= \prod^{j} p (e_{j}) \, p (\bar{e}_{j + 1}) + \prod^{j} p (e_{j}) \, p (e_{j + 1}) \\ &= \prod^{j} p (e_{j}) \end{aligned}$

Each step uses $p (e_{t}) + p (\bar{e}_{t}) = 1$ to eliminate one edge that does not belong to g. Proved.

Secondly, we prove sup(g, D) = p(g, p, G_i) × k. By Definition 5, and since g occurs in k graphs of D, $sup (g, D) = \sum_{G_{i} \in D} p = \sum_{g \subseteq_{T} G_{i}} p (g, p, G_{i}) = \prod^{j} p (e_{j}) \cdot k$ . Proved.

Definition 6 (Frequent pattern). Given expected support threshold min _ sup, if sup(g, D) ⩾ min _ sup, g is a frequent subgraph pattern in D.

Definition 7 (Maximal frequent patterns). For an uncertain frequent subgraph pattern g_{conby_G_{in_D}}, if sup(g ∪ e) < min_sup holds for every edge e ∈ D with e ∉ g such that g ∪ e is a connected graph, then g is a maximal frequent pattern.

Example 3. As shown in Fig. 2, given min_sup = 0.8, sup(g₁) = p(e₃) × p(e₁) × k = 0.6 × 0.7 × 2 = 0.84 > min_sup.

No matter which edge is added to g₁, its support is always less than the given min _ sup. So g₁ is a maximal frequent pattern.
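Theorem 1 and Example 3 can be checked numerically with a short Python sketch. It is a simplification under stated assumptions: edges are matched by name rather than by subgraph isomorphism, the third edge e₂ with probability 0.6 is taken from Example 1, and the function names are hypothetical. The first function sums over all possible worlds that contain the pattern, as in the proof; the second applies the closed form of Theorem 1.

```python
from itertools import product
from math import isclose, prod

# Assumed edge probabilities of one uncertain graph (values from Examples 1 and 3).
EDGE_PROB = {"e1": 0.7, "e2": 0.6, "e3": 0.6}


def containment_probability(pattern_edges, edge_prob):
    """Sum the probabilities of all possible worlds containing every pattern edge."""
    others = [e for e in edge_prob if e not in pattern_edges]
    base = prod(edge_prob[e] for e in pattern_edges)
    total = 0.0
    for states in product([True, False], repeat=len(others)):
        rest = prod(edge_prob[e] if s else 1.0 - edge_prob[e]
                    for e, s in zip(others, states))
        total += base * rest
    return total


def expected_support(pattern_edges, edge_prob, k):
    """Closed form of Theorem 1: product of the pattern's edge probabilities times k."""
    return prod(edge_prob[e] for e in pattern_edges) * k


g1 = ["e1", "e3"]
by_enumeration = containment_probability(g1, EDGE_PROB)
sup_g1 = expected_support(g1, EDGE_PROB, k=2)
```

Enumeration gives approximately 0.42, which equals p(e₁) × p(e₃), and with k = 2 the closed form gives approximately 0.84, the value obtained in Example 3.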

Theorem 2. All the sub patterns of the maximal frequent patterns are also frequent.

Proof. If g is a maximal frequent pattern, $sup (g, D) = \sum_{G_{i} \in D} p = \sum_{g \subseteq_{T} G_{i}} p (g, p, G_{i}) = \prod^{j} p (e_{j}) \cdot k \geqslant min\_sup$ . If g′ is one of the sub patterns of g, obtained by removing m ⩾ 1 edges, then, because every p(e) ⩽ 1, $sup (g^{'}, D) = \prod^{j - m} p (e_{j}) \cdot k \geqslant \prod^{j} p (e_{j}) \cdot k \geqslant min\_sup$ . Obviously, g′ is also frequent. Proved.

In this paper we study how to mine K maximal frequent patterns in the uncertain graph database. This problem can be formally defined as follows.

Input: An uncertain graph database D ={ G₁, G₂, . . . , G_n }, the expected support threshold of min _ sup and the number of maximal frequent patterns K.

Output: K maximal frequent patterns.

4 Algorithm for maximal frequent patterns in uncertain graph database

The algorithm mainly consists of three steps:

Divide graph edge sets (EDCG): All the uncertain graphs are transformed to certain graphs and graph edges are divided into several categories in descending order according to their frequency on certain graph.

Build layered search tree (BSSL): Build a three-dimensional layered search tree according to the edge sets divided in the first step.

Search in layered tree space (SKFP): Search from upper to lower layers, from left to right in three-dimensional layered tree of space until K maximal frequent patterns are mined.

4.1 Divide edge into sets

4.1.1 Graph edges are divided into several categories in descending order according to their frequency on certain graph

In this step, all the uncertain graphs are transformed to certain graphs; that is, the existence probability of each edge of every uncertain graph is set to 1. Then the edges of the certain graphs are divided into several categories in descending order of their frequency. Here an edge e_i with frequency m (its number of occurrences in the database) is denoted by $e_{i}^{m}$ . All the $e_{i}^{m}$ make up an edge set, denoted by E^m. That is, $E^{m} = \{e_{1}^{m}, e_{2}^{m}, \dots, e_{i}^{m}\}$ . If the edges $e_{1}^{m}, e_{2}^{m}, \dots, e_{t}^{m}$ in E^m have the same mother graphs, that is, $\{e_{1}^{m}, e_{2}^{m}, \dots, e_{t}^{m}\} = G_{1} \cap G_{2} \cap \dots \cap G_{p}$ , then they make up an edge subset $E_{h}^{m}$ . Hence $E^{m} = \{e_{1}^{m}, e_{2}^{m}, \dots, e_{i}^{m}\}$ can be transformed into $E^{m} = E_{1}^{m} \cup E_{2}^{m} \cup \dots \cup E_{h}^{m}$ , where the edges within each of $E_{1}^{m}, E_{2}^{m}, \dots, E_{h}^{m}$ share the same mother graphs. Similarly, from each $E_{h}^{m}$ we can derive the sets of the next lower frequency, $E_{h}^{m - 1} = E_{h 1}^{m - 1} \cup E_{h 2}^{m - 1} \cup \dots \cup E_{hp}^{m - 1}$ . Repeat the steps above until the edge set at the frequency threshold, namely $m - f = P = \mathrm{int} (min\_sup / \prod_{k}^{m} \prod_{i = 1}^{i^{'}} p (e_{i}) - 1)$ (a strict proof is given later), is derived. This is summarized in Algorithm 1.

Algorithm 1. Edges are divided on certain graph (EDCG)

Input: Uncertain database D ={ G₁, G₂, . . . , G_n }

Output: Several edge sets.

All uncertain graphs are transformed into certain graphs and the edges are divided into sets $E^{m} = \{e_{1}^{m}, e_{2}^{m}, \dots, e_{i}^{m}\}$ . That is, all the edges whose number of mother graphs is m are classified into a set E^m.

The edges of set E^m are redivided. If $\{e_{1}^{m}, e_{2}^{m}, \dots, e_{t}^{m}\} = G_{1} \cap G_{2} \cap \dots \cap G_{p}$ $(G_{1}, G_{2}, \dots, G_{p} \in D)$ , that is, $e_{1}^{m}, e_{2}^{m}, \dots, e_{t}^{m}$ are from the same mother graphs, then $e_{1}^{m}, e_{2}^{m}, \dots, e_{t}^{m}$ are classified into a subset $E_{1}^{m}$ of E^m. Similarly, the set E^m can be divided into $E^{m} = E_{1}^{m} \cup E_{2}^{m} \cup \dots \cup E_{h}^{m}$ .

As for $E_{1}^{m} = \{e_{1}^{m}, e_{2}^{m}, \dots, e_{t}^{m}\}$ , if $\{e_{1}^{m - 1}, e_{2}^{m - 1}, \dots, e_{t}^{m - 1}\} = G_{1} \cap G_{2} \cap \dots \cap G_{p - 1}$ $(G_{1}, G_{2}, \dots, G_{p - 1} \in D)$ , then $e_{1}^{m - 1}, e_{2}^{m - 1}, \dots, e_{t}^{m - 1}$ are classified into a set $E_{11}^{m - 1}$ . Keep searching in the graphs G₁, G₂, . . . , G_p until the edge set $E_{i}^{m - 1} = E_{i 1}^{m - 1} \cup E_{i 2}^{m - 1} \cup \dots \cup E_{ih}^{m - 1}$ is derived.

Repeat the 2nd and 3rd steps until E¹ is derived.

This is shown as below.

Algorithm 1. EDCG
Input: uncertain database D ={ G₁, G₂, . . . , G_n }
Output: several edge sets
Begin
Initialization: certain graphs ← all uncertain graphs in D, empty set E^m
for each edge $e_{i}$ do
    if (frequency of e_i == m) then
        E^m ← {e_i} ∪ E^m
    end if
end for
for each edge $e_{i}^{m} \in E^{m}$ do
    if ( $\{e_{1}^{m}, e_{2}^{m}, \dots, e_{t}^{m}\} = G_{1} \cap G_{2} \cap \dots \cap G_{p}$ ) then
        $E_{h}^{m} \leftarrow \{e_{1}^{m}\} \cup \{e_{2}^{m}\} \cup \dots \cup \{e_{t}^{m}\}$
    end if
end for
// P = m - f is the frequency threshold of Section 4.1.1, fixed before the loop starts
while m ⩾ P do
    for each $E_{i}^{m} \in E^{m}$ do
        if ( $\{e_{1}^{m - 1}, e_{2}^{m - 1}, \dots, e_{t}^{m - 1}\} = G_{1}^{'} \cap G_{2}^{'} \cap \dots \cap G_{p}^{'}$ ) then
            $E_{i}^{m - 1} \leftarrow \{e_{1}^{m - 1}\} \cup \{e_{2}^{m - 1}\} \cup \dots \cup \{e_{t}^{m - 1}\}$
        end if
    end for
    m = m - 1
end while
End
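The first two steps of EDCG (grouping edges by frequency, then by mother graphs) can be sketched in Python as follows. This is a simplified illustration, not the full algorithm: it omits the recursive refinement to lower frequencies and the threshold P, the function name `divide_edges` is hypothetical, and the toy database `D` is invented for the example and is not the database of Fig. 5.

```python
from collections import defaultdict


def divide_edges(database):
    """Group edges by frequency, then by the exact set of graphs containing them.

    `database` maps a graph name to the set of edge identifiers it contains;
    existence probabilities are ignored, i.e. every uncertain graph is treated
    as a certain graph. Returns {frequency: {mother graphs: sorted edge list}},
    with frequencies in descending order.
    """
    mothers = defaultdict(set)
    for name, edges in database.items():
        for e in edges:
            mothers[e].add(name)
    by_freq = defaultdict(lambda: defaultdict(list))
    for e, graphs in mothers.items():
        by_freq[len(graphs)][frozenset(graphs)].append(e)
    return {
        m: {gs: sorted(es) for gs, es in by_freq[m].items()}
        for m in sorted(by_freq, reverse=True)
    }


# Hypothetical toy database (not Fig. 5).
D = {"G1": {"a", "b", "c"}, "G2": {"a", "b", "d"}, "G3": {"a", "d"}}
edge_sets = divide_edges(D)
```

In this toy database, edge `a` forms E³ (mother graphs G1, G2, G3), E² splits into the subsets {`b`} (from G1, G2) and {`d`} (from G2, G3), and E¹ contains only `c`. Because the sketch collects the mother graphs of each edge in a single pass over the database, it does not need the pairwise edge comparison assumed in the complexity analysis of EDCG.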

Example 4: The edges in Fig. 5 are divided as follows. We get $E^{3} = E_{1}^{3} \cup E_{2}^{3}$ . Here $E_{1}^{3} = \{e_{2}, e_{3}\}$ and $E_{2}^{3} = \{e_{5}\}$ , because e₂ and e₃ have the same mother graphs G₁, G₂, G₃, while e₅ has the mother graphs G₂, G₃, G₄. In $E_{1}^{3}$ , that is, from the same mother graphs G₁, G₂, G₃, we get $E^{2} = E_{11}^{2} \cup E_{12}^{2}$ . Here $E_{11}^{2} = \{e_{1}\}$ because it is derived from G₁, G₂, and $E_{12}^{2} = \{e_{4}, e_{5}\}$ because it is derived from G₂, G₃. From graph G₃ we get edge e₆ with frequency 1, so $E_{121}^{1} = \{e_{6}\}$ . In $E_{2}^{3}$ , from the same mother graphs G₂, G₃, G₄, we get $E^{2} = E_{21}^{2}$ . Here $E_{21}^{2} = \{e_{2}, e_{3}, e_{4}\}$ , because it is derived from the same mother graphs G₂, G₃. Finally, we get $E^{1} = E_{211}^{1} \cup E_{212}^{1}$ . Here $E_{211}^{1} = \{e_{1}\}$ , $E_{212}^{1} = \{e_{6}\}$ .

Fig. 5

Uncertain graph database D.

Time complexity analysis: Since the frequency of each edge is computed by comparing the edges with each other, the time complexity of algorithm EDCG is O(n²), where n denotes the number of edges in the uncertain graph database D.

4.2 Build search space layered

In this step, we build a layered search space. The key idea is that all the edge sets from the same mother graphs are placed in the same three-dimensional space, and no edges connect different spaces, even if the spaces have vertices in common. Within one three-dimensional space, edge sets with different occurrence frequencies lie on different layers; if such layers have vertices in common, those vertices are connected by virtual edges.

This can be summarized in Algorithm 2.

Algorithm 2. Build search space layered (BSSL)

Input: Several edge sets

Output: Search graph sequence G = (G₁, G₂, . . . , G_n′)

Put $E_{i_{1}}^{m} (i_{1} = 1, 2, \cdot \cdot \cdot n)$ in the number m layer of three-dimensional space i₁.

Put $E_{i_{1} i_{2}}^{m - 1}$ in the number m - 1 layer and let space i₂ be embedded in space i₁. If there are common vertexes in space i₁ and i₂, connect them with virtual edge.

Repeat the 2nd step until all the edge sets are in search space.

From upper to lower layer, if $E_{i_{1}}^{m}, E_{i_{1} i_{2}}^{m - 1}, \dots, E_{i_{1} i_{2} \dots i_{k}}^{m - n}$ are connected by virtual edges, then merge them into a graph, denoted by g_{i₁}.

Let g_{i₁} be encoded with the minimum DFS code [31]. Note that the minimum DFS code proceeds from upper to lower layer according to the sequence $E_{i_{1}}^{m}, E_{i_{1} i_{2}}^{m - 1}, \dots, E_{i_{1} i_{2} \dots i_{k}}^{m - n}$ .

Since the maximum value of i₁ is n, there may be n′ graphs g_{i₁} (here n′ ⩾ n). Transform g_{i₁} back to the corresponding uncertain graph G and we get the search graph set G = { G₁, G₂, . . . , G_n′ }.

This can be shown as below.

Algorithm 2. BSSL
Input: Several edge sets
Output: Search graph sequence G = (G₁, G₂, . . . , G_n′)
Begin
// P = m - f is the frequency threshold of Section 4.1.1, fixed before the loop starts
while m ⩾ P do
    for each $E_{i}^{m} \in E^{m}$ do
        for each edge $e_{j}^{m} \in E_{i}^{m}$ do
            put $e_{j}^{m}$ on layer No. m
        end for
    end for
    m = m - 1
end while
for each $V_{i}, V_{j} \in (E_{i_{1}}^{m}, E_{i_{1} i_{2}}^{m - 1}, \dots, E_{i_{1} i_{2} \dots i_{k}}^{m - n})$ do
    if (V_i, V_j are in different layers and i = j) then
        connect $V_{i}$ and $V_{j}$ through a virtual edge
    end if
end for
merge $E_{i_{1}}^{m}, E_{i_{1} i_{2}}^{m - 1}, \dots, E_{i_{1} i_{2} \dots i_{k}}^{m - n}$ into a graph, denoted by g_{i₁}
for each g_i do
    encode g_i with minimum DFS code
    transform g_i back to the corresponding uncertain graph G_i
end for
End
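The layering and virtual-edge steps of BSSL can be illustrated with a small Python sketch for a single search space. This is a rough approximation under stated assumptions: the function name `build_layers`, the tuple encoding of edges and the vertex names are hypothetical, a vertex shared by several layers is linked only between consecutive layers in which it appears, and the minimum DFS encoding and the transformation back to uncertain graphs are omitted.

```python
from collections import defaultdict


def build_layers(edges):
    """Place the edges of one search space on layers and link shared vertices.

    `edges` is an iterable of (edge_id, u, v, frequency). Returns
    (layers, virtual_edges): `layers` maps a frequency to the edge ids on that
    layer, highest layer first; `virtual_edges` lists (vertex, upper, lower)
    for every vertex that appears on two consecutive layers of its own.
    """
    layers = defaultdict(list)
    vertex_layers = defaultdict(set)
    for edge_id, u, v, freq in edges:
        layers[freq].append(edge_id)
        vertex_layers[u].add(freq)
        vertex_layers[v].add(freq)
    virtual = []
    for vertex, freqs in vertex_layers.items():
        ordered = sorted(freqs, reverse=True)
        for upper, lower in zip(ordered, ordered[1:]):
            virtual.append((vertex, upper, lower))
    return dict(sorted(layers.items(), reverse=True)), sorted(virtual)


# Hypothetical space with vertices A-D; frequencies are the edge superscripts.
space = [("e2", "A", "B", 3), ("e3", "B", "C", 3),
         ("e1", "A", "C", 2), ("e6", "C", "D", 1)]
layers, virtual_edges = build_layers(space)
```

In this example e₂ and e₃ lie on layer 3, e₁ on layer 2 and e₆ on layer 1. Vertex A is linked between layers 3 and 2, and vertex C between layers 3 and 2 and between layers 2 and 1, so all three layers can be merged into one connected graph.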

Example 5. The search graph space shown in Fig. 6 is built from the edge sets divided from Fig. 5. As for E³ ={{ e₂, e₃ } , { e₅ }}, E² ={{ e₁, e₄, e₅ } , { e₂, e₄, e₃ }}, E¹ ={{ e₆ } , { e₁, e₆ }}, because {e₂, e₃}, {e₁, e₄, e₅}, {e₆} are derived from the same mother graphs G₁, G₂, G₃ and {e₅}, {e₂, e₄, e₃}, {e₁, e₆} are derived from the other mother graphs G₂, G₃, G₄, we put {e₂, e₃}, {e₁, e₄, e₅}, {e₆} in the space C₁ and {e₅}, {e₂, e₄, e₃}, {e₁, e₆} in the other space C₂. From Algorithm 2 we get 2 search graph sets, $G_{C_{1}} = \{e_{3}^{3} e_{1}^{2} e_{2}^{3}, \; e_{2}^{3} e_{4}^{2} e_{5}^{2} e_{6}^{1}\}$ and $G_{C_{2}} = \{e_{5}^{3} e_{4}^{2} e_{2}^{2} e_{1}^{1}, \; e_{3}^{2} e_{6}^{1}, \; e_{5}^{3} e_{6}^{1}\}$ .

Time complexity analysis. In uncertain graph database D ={ G₁, G₂, . . . , G_n }, as for $E_{i_{1}}^{m} (i_{1} = 1, 2, \cdot \cdot \cdot h)$ , the minimum value h of $E_{i_{1}}^{m} (i_{1} = 1, 2, \cdot \cdot \cdot h)$ is $C_{n}^{1}$ (Choose only one edge from n graphs) while the maximum value is $C_{n}^{n / 2}$ . Hence average time complexity is $\frac{2}{n} \sum_{i = 1}^{n / 2} O (C_{n}^{i})$ .

As for $E_{i_{1} i_{2}}^{m - 1} (i_{2} = 1, 2, \cdot \cdot \cdot h^{'})$ , time complexity is $\frac{2}{n} \sum_{i = 1}^{\frac{n}{2}} O (C_{n}^{i}) \times \frac{2}{m} \sum_{i = 1}^{\frac{m}{2}} O (C_{m}^{i})$ . Hence total time complexity is $\frac{2}{n} \sum_{i = 1}^{n / 2} O (C_{n}^{i}) + \frac{2}{n} \sum_{i = 1}^{n / 2} O (C_{n}^{i}) \times \frac{2}{m} \sum_{i = 1}^{m / 2} O (C_{m}^{i}) + \dots + \frac{2}{m - f + 1} \sum_{i = 1}^{(m - f + 1) / 2} O (C_{m - f + 1}^{i}) \times \frac{2}{m - f} \sum_{i = 1}^{(m - f) / 2} O (C_{m - f}^{i})$ .

Theorem 3. In the search space, the expected support of graph g is $sup (g, D) = \prod_{i = 1}^{n} p (e_{i}) \times min (d)$ . Here min(d) denotes minimum number of layer in search space, namely the minimum frequency of all edges of g.

Proof. As stated in Theorem 1, sup(g, D) = p(g, p, G_i) × k. Here k denotes the number of occurrences of the graph G_i in D. Since $p (g, p, G_{i}) = \prod_{i = 1}^{n} p (e_{i})$ , $sup (g, D) = \prod_{i = 1}^{n} p (e_{i}) \times k$ . Since graph G_i occurs k times in D and g ⊆_T g′ (G_i ⇒ g′), all the edges of g necessarily occur at least k times in D; that is to say, the minimum frequency of all edges of g is k.

Put g in search space and we can get min(d) = k. Hence $sup (g, D) = \prod_{i = 1}^{n} p (e_{i}) \times min (d)$ . Proved.

Example 6. In search space shown in Fig. 6, compute the expected support of g_{conby_G₁} in Fig. 5.

Fig. 6

Build search space layered.

$\sup (g, D) = p (e_{2}) \times p (e_{1}) \times p (e_{3}) \times min (d) = 0.6 \times 0.7 \times 0.8 \times 2 = 0.672$ .
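The formula of Theorem 3 is easy to evaluate in code. The following Python sketch assumes that each edge is given as a (probability, frequency) pair; the function name `layered_support` and the frequencies 3, 2, 3 (the layer numbers of e₂, e₁, e₃ in the search space) are illustrative assumptions.

```python
from math import isclose, prod


def layered_support(edges):
    """sup(g, D) = product of edge probabilities times the minimum layer number.

    `edges` is a list of (probability, frequency) pairs for the edges of g.
    """
    probabilities = [p for p, _ in edges]
    min_layer = min(f for _, f in edges)
    return prod(probabilities) * min_layer


# Edges e2, e1, e3 of Example 6 with assumed layer numbers 3, 2, 3.
sup_example6 = layered_support([(0.6, 3), (0.7, 2), (0.8, 3)])
```

The result is approximately 0.672, in agreement with Example 6. Because every probability is at most 1 and the minimum layer number can only decrease when an edge is added, the support never increases as a pattern grows.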

4.3 Search for K maximal frequent patterns

In this step we choose one graph G_i from the graph set G = { G₁, G₂, . . . , G_n } obtained by Algorithm 2, add one edge of G_i to the subgraph g at a time in the order of the minimum DFS code and compute the value of sup. If sup(g ∪ e) < min_sup, then the frequent subgraph g cannot be extended by e and is taken as a maximal frequent subgraph. Backtrack and keep searching until K maximal frequent patterns are obtained.

This can be summarized in Algorithm 3.

Algorithm 3. Search for K maximal frequent patterns (SKFP)

Input: Graph set G = { G₁, G₂, . . . , G_n } and empty set of maximal frequent patterns, named Q.

Output: K maximal frequent patterns set.

Choose G_i out of graph set G = { G₁, G₂, . . . , G_n }.

Choose an edge of G_i in the order of minimum DFS code and add it to subgraph of g.

If sup(g ∪ e) < min_sup, then the frequent subgraph g cannot be extended by e and is taken as a maximal frequent subgraph. Add it to Q.

If not all of edges of G_i are visited, then backtrack and turn to the step 2, else if not all the K maximal frequent patterns are mined, turn to the step 1. If K maximal frequent patterns are all mined, then end.

This is shown as below.

Algorithm 3. SKFP
Input: Graph set G = { G₁, G₂, . . . , G_n }, empty set Q
Output: K maximal frequent patterns set
Begin
  // m is the number of maximal frequent patterns in Q,
  // n is the number of edges on g
  for each G_i ∈ G do
    if n ⩽ |G_i| then // |G_i| is the number of edges of G_i
      add an edge e of G_i to g
      if sup(g ∪ e) < min_sup then
        add g to Q, m = m + 1
        if m ⩾ K then End
        end if
        backtrack through an edge
      end if
      n = n + 1
    end if
  end for
End
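The search in Algorithm 3 can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the authors' implementation: each G_i is assumed to be a linear edge sequence already sorted in search order, every edge is a hypothetical `(name, probability, frequency)` tuple, backtracking across alternative branches is omitted, and duplicate patterns are not merged. The second sequence in the usage example is invented purely for illustration.

```python
from math import prod


def skfp(graph_set, min_sup, k):
    """Collect up to k edge sequences that are frequent but cannot be extended.

    graph_set: list of edge sequences; each edge is (name, probability, frequency)
    and is assumed to be already ordered (e.g. by minimum DFS code).
    """
    found = []
    for sequence in graph_set:
        g = []  # current subgraph, grown one edge at a time
        for edge in sequence:
            candidate = g + [edge]
            sup = prod(p for _, p, _ in candidate) * min(f for _, _, f in candidate)
            if sup < min_sup:  # g is frequent, g plus this edge is not: stop
                break
            g = candidate
        if g:
            found.append([name for name, _, _ in g])
            if len(found) >= k:  # K patterns collected, end the search
                return found
    return found


graphs = [
    # Probabilities from Example 6, frequencies assumed from Example 7.
    [("e3", 0.8, 3), ("e1", 0.7, 2), ("e2", 0.6, 3)],
    # Hypothetical sequence: the second edge pushes the support below min_sup.
    [("a", 0.9, 2), ("b", 0.5, 1)],
]
print(skfp(graphs, min_sup=0.6, k=5))  # [['e3', 'e1', 'e2'], ['a']]
```

The early `break` is what keeps the number of candidates small in this sketch: once an extension falls below min_sup, no longer prefix of the same sequence is examined, since adding edges can only lower both the probability product and the minimum frequency.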

Example 7. For the search graph sets $G_{C_{1}} = \{e_{3}^{3} e_{1}^{2} e_{2}^{3},\ e_{2}^{3} e_{4}^{2} e_{5}^{2} e_{6}^{1}\}$ and $G_{C_{2}} = \{e_{5}^{3} e_{4}^{2} e_{2}^{2} e_{1}^{1},\ e_{3}^{2} e_{6}^{1},\ e_{5}^{3} e_{6}^{1}\}$, from Algorithm 2 we obtain the values of sup(g, D) shown in Table 2. Here $e_{m}^{n}$ denotes that the frequency of edge e_m is n.

Table 2
The value of sup(g_{conby_G}, D)

G	subgraph g	sup(g_{conby_G}, D)	sequence number
$e_{3}^{3} e_{1}^{2} e_{2}^{3}$	$e_{3}^{3} e_{1}^{2} e_{2}^{3}$	0.8 × 0.7 × 0.6 × 2 = 0.672	1
$e_{2}^{3} e_{4}^{2} e_{5}^{2} e_{6}^{1}$	$e_{2}^{3} e_{4}^{2} e_{5}^{2}$	0.8 × 0.7 × 0.6 × 2 = 0.672	2
	$e_{2}^{3} e_{4}^{2} e_{5}^{2} e_{6}^{1}$	0.8 × 0.7 × 0.6 × 0.9 × 1 = 0.6048	3
$e_{5}^{3} e_{4}^{2} e_{2}^{2} e_{1}^{1}$	$e_{5}^{3} e_{4}^{2} e_{2}^{2}$	0.8 × 0.7 × 0.6 × 2 = 0.672	4
	$e_{5}^{3} e_{4}^{2} e_{2}^{2} e_{1}^{1}$	0.8 × 0.7 × 0.6 × 0.7 × 1 = 0.4704	5
$e_{3}^{2} e_{6}^{1}$	$e_{3}^{2} e_{6}^{1}$	0.8 × 0.9 × 1 = 0.72	6
$e_{5}^{3} e_{6}^{1}$	$e_{5}^{3} e_{6}^{1}$	0.7 × 0.9 × 1 = 0.63	7

Let min_sup = 0.6. Sequence numbers 1, 2, 3, 4, 6 and 7 then satisfy sup(g) ⩾ min_sup. Since numbers 3, 6 and 7 have frequency 1, that is, they occur in D only once, they should not be chosen as maximal frequent patterns (in practice we prune the branches on the lower layers). Hence the maximal frequent patterns are numbers 1, 2 and 4, as shown in Fig. 7.

Fig. 7

Maximal frequent patterns with min_sup = 0.6.
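The selection in Example 7 amounts to two filters applied to the rows of Table 2, which the following short Python sketch makes explicit. The tuple layout and the function name `select_patterns` are illustrative assumptions; the sup values are copied as listed in Table 2, and the frequency of each row is read from the last factor of its product.

```python
def select_patterns(rows, min_sup):
    """Return (frequent, maximal) sequence numbers.

    rows: (sequence number, sup value as listed in Table 2, frequency).
    A row is kept as maximal only if it is frequent and occurs more than once.
    """
    frequent = [n for n, sup, _ in rows if sup >= min_sup]
    maximal = [n for n, sup, freq in rows if sup >= min_sup and freq > 1]
    return frequent, maximal


table_2 = [
    (1, 0.672, 2),
    (2, 0.672, 2),
    (3, 0.6048, 1),
    (4, 0.672, 2),
    (5, 0.4704, 1),
    (6, 0.72, 1),
    (7, 0.63, 1),
]

frequent, maximal = select_patterns(table_2, min_sup=0.6)
print(frequent)  # [1, 2, 3, 4, 6, 7]
print(maximal)   # [1, 2, 4]
```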

Time complexity analysis. Since each edge may be visited, the time complexity of Algorithm 3 is O(n), where n denotes the number of edges in D.

5 Experiment result

In this section we present extensive experimental results and a performance evaluation of LMFP. Since each subgraph returned by the algorithms is necessarily frequent, we design experiments only to evaluate efficiency and scalability. We use one dataset with real uncertain edges and three datasets with synthetic uncertain edges to examine the performance of LMFP. We then compare it with two state-of-the-art approaches, UGRAP [27] and MUSE [25].

Our methods are implemented on a Windows 7 machine with a single-core 2.5 GHz CPU and 4 GB RAM. All programs are implemented in Visual C++ 6.0.

5.1 Datasets

Protein-Protein Interactions (PPI). Each PPI from STRING database¹ is a protein interaction network for different organisms. Vertices represent proteins, edges represent protein-to-protein interactions and labels of edges are assigned by COG functions⁵. Since protein-to-protein interactions are uncertain, PPI is a real uncertain graph database.

Developmental Therapeutics Program (DTP). DTP contains chemical data, compound sets and AIDS antiviral screen data, and can therefore be used to discover and develop new cancer therapeutic agents. In these datasets nodes and edges represent atoms and bonds respectively, and the labels of nodes correspond to the types of atoms. The probability value of each edge is assigned by a random number generator, ranging from 0 to 1. The uncertain graphs in the database consist of 25 vertices and 27 edges on average.

The Cancer Genome Atlas (TCGA). The TCGA dataset comprises more than two petabytes of genomic data, which helps the cancer research community to improve the prevention, diagnosis and treatment of cancer. In this database a node corresponds to a gene and an edge to a relationship between 2 genes. The labels of nodes correspond to the types of cancer with which the gene is associated. To make it an uncertain graph database, the conditional existence probabilities of edges were synthesized following the normal distribution N(0.5, 0.4). The uncertain graphs in the database consist of 80 vertices and 168 edges on average.

The Human Gene Database (GeneCards). The GeneCards database provides comprehensive information on all annotated and predicted human genes, including genomic, transcriptomic, proteomic, genetic, clinical and functional information. Here a node corresponds to a gene and an edge to the functional relatedness between genes. The label of a node corresponds to the protein encoded by expression of the gene. We assigned probability values to edges using a random number generator, ranging from 0 to 1. The average numbers of vertices and edges are 60 and 147 respectively.
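The two ways of synthesizing edge probabilities described above can be sketched in Python as follows. This is only an illustration under assumptions that the text does not state: the second parameter of N(0.5, 0.4) is treated as a standard deviation, draws outside (0, 1] are rejected and redrawn, and the function names and the fixed seed are hypothetical.

```python
import random


def uniform_probabilities(num_edges, rng):
    """One probability per edge, drawn uniformly from [0, 1)."""
    return [rng.random() for _ in range(num_edges)]


def normal_probabilities(num_edges, rng, mean=0.5, spread=0.4):
    """One probability per edge, drawn from a normal distribution.

    Assumption: `spread` is a standard deviation, and values that fall
    outside (0, 1] are discarded and drawn again.
    """
    probs = []
    while len(probs) < num_edges:
        p = rng.gauss(mean, spread)
        if 0.0 < p <= 1.0:
            probs.append(p)
    return probs


rng = random.Random(42)  # fixed seed so that a run can be repeated
dtp_like = uniform_probabilities(27, rng)   # 27 edges, as in an average DTP graph
tcga_like = normal_probabilities(168, rng)  # 168 edges, as in an average TCGA graph
print(len(dtp_like), len(tcga_like))  # 27 168
```

Rejecting out-of-range draws is one possible choice; clipping to the interval would be another, and the two give different distributions near 0 and 1.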

The characteristics of these datasets above are shown in Table 3.

Table 3
Datasets Summary

Dataset name	Dataset property	Dataset category	Number of graphs	Average edges per graph	Average nodes per graph	Average existence probability per edge	Number of distinct labels
PPI¹	real-world with real edge probability	protein interactions for different organisms	612	368	107	0.37	960
DTP²	real-world with synthetic edge probability	AIDS antiviral screening dataset	1631	27	25	0.41	28
TCGA³	real-world with synthetic edge probability	genomic information of cancer	825	168	80	0.38	38
GeneCard⁴	real-world with synthetic edge probability	human genes dataset of predicted information	1006	147	60	0.44	870

Table 3 characterizes each dataset by its property, category, number of graphs, average edges per graph, average nodes per graph, average existence probability per edge and number of distinct labels. The datasets therefore differ in four respects: the authenticity of the edge probabilities, the number of graphs, the size of the graphs and the number of distinct labels. PPI is the only dataset with real edge probabilities, while the other three have synthetic ones. DTP has the largest number of graphs, while its average numbers of edges and nodes are the smallest. GeneCard has the highest average existence probability per edge (0.44). According to Table 3, PPI contains the largest number of distinct labels (960), while DTP contains the smallest (28).

5.2 Results

As shown in Section 4, every result of LMFP is necessarily a maximal frequent pattern, so we do not need to evaluate precision. Because we only find K maximal frequent patterns, recall does not need to be evaluated either.

We therefore performed extensive experiments to evaluate efficiency, scalability and memory usage. For comparison, we use 2 state-of-the-art approaches, UGRAP [27] and MUSE [25]. The default parameter values ɛ, δ and min_sup of MUSE were set to 0.1, 0.1 and 0.3 respectively. The UGRAP index was set to construct 2 Bloom filters per graph, covering paths of length 2 and 3. The default parameter values K and min_sup of LMFP were set to 20 and 0.3 respectively. The experimental results are shown below.

5.2.1 Efficiency

We first investigated the time efficiency of LMFP on the 4 datasets with respect to the threshold min_sup and the parameter K. Fig. 8(a) shows that on PPI the execution time of the 3 algorithms decreases as min_sup varies from 0.3 to 0.8. This is because the higher min_sup is, the fewer edges the resulting frequent subgraphs contain, so the search time decreases.

Fig. 8

Execution time vs. support threshold min_sup for the datasets (a) PPI, (b) DTP, (c) TCGA and (d) GeneCard.

The execution time also increases rapidly when min_sup is less than 0.4, because the number of edges in the frequent subgraphs increases markedly. Similarly, the execution time decreases as min_sup increases on the other 3 datasets, which have synthetic edge probabilities, as shown in Fig. 8(b), (c) and (d). LMFP has the shortest execution time on all 4 datasets, and the difference becomes more apparent when min_sup is less than 0.4.

Fig. 9 shows the execution time of the 3 steps (EDCG, BSSL, SKFP) of our algorithm. As expected, the total execution time decreases as min_sup becomes higher. Generally speaking, step 1 takes longer than steps 2 and 3, because more time has to be spent dividing edges into sets.

Fig. 9

Execution time of 3 steps (EDCG, BSSL, SKFP) vs. min_sup.

5.2.2 Scalability

Extensive experiments were also performed to examine the scalability of LMFP. For comparison, we follow the scalability evaluation methodology of [24]; that is, the size of the datasets grows by duplicating the uncertain graphs in the database.
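The duplication-based setup can be sketched as a small Python timing harness. The names `duplicate_database` and `time_miner` are hypothetical, the graphs are treated as opaque objects, and `miner` stands for any callable that takes a database; the sketch does not reproduce the measurement code used for the experiments.

```python
import time


def duplicate_database(graphs, copies):
    """Return a database holding `copies` copies of every uncertain graph.

    The copies share the same graph objects, which is sufficient as long as
    the miner does not modify its input.
    """
    return [g for _ in range(copies) for g in graphs]


def time_miner(miner, graphs, copies_list):
    """Run `miner` on each enlarged database and record (copies, seconds)."""
    timings = []
    for copies in copies_list:
        db = duplicate_database(graphs, copies)
        start = time.perf_counter()
        miner(db)
        timings.append((copies, time.perf_counter() - start))
    return timings


# Usage with a stand-in miner that only counts the graphs.
for copies, seconds in time_miner(len, ["G1", "G2", "G3"], [1, 2, 4]):
    print(copies, f"{seconds:.6f}")
```

Duplicating graphs increases the database size without changing the set of distinct patterns, so the growth in running time reflects the data volume rather than the pattern complexity.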

Fig. 10 plots the execution time of the three algorithms with respect to the number of duplications. As we can see, the execution time increases with the number of duplications. The increase is basically linear for all algorithms, but LMFP increases more slowly owing to our methods of dividing edges and building the layered search space (the execution times of the 3 steps are shown in Fig. 9).

Fig. 10

Execution time vs. number of duplications.

We also measured the memory usage; the result is shown in Fig. 11. The memory usage of the 3 algorithms increases almost linearly with the number of duplications. MUSE uses more memory than the other 2 algorithms, and LMFP has the smallest memory usage of the three.

Fig. 11

Memory usage vs. number of duplications.

6 Conclusion

In this paper we present a layered and efficient mining algorithm for maximal frequent patterns in uncertain graph databases. There are two key ideas behind the proposed method. The first is to divide the edges of every certain graph (converted from the equivalent uncertain graph) by frequency, which guarantees that the edge selected each time is the most frequent and avoids a large number of candidate subgraphs. The second is to always search from upper to lower layers and from left to right, which avoids extensive subgraph isomorphism tests. The evaluation of our approach demonstrates significant cost savings with respect to the state-of-the-art approaches, not only on the real-world dataset but also on synthetic uncertain graph databases.

Acknowledgments

This work was supported by General Project of Hunan Provincial Education Department (NO. 17C0397) and Youth key Project of Hunan Institute of Engineering (NO. XJ1504).

References

1. Asthana, King O.D., Gibbons F.D. and Roth F.P., Predicting protein complex membership using probabilistic network reliability, Genome Research 14(6) (2004), 1170–1175.

2. Ghosh, Ngo H.Q., Yoon and Qiao, On a routing problem within probabilistic graphs and its application to intermittently connected networks, Infocom IEEE International Conference on Computer Communications (2007), 1721–1729.

3. Suciu and Dalvi, Foundations of probabilistic answers to queries, ACM SIGMOD International Conference on Management of Data (1) (2005), 125–153.

4. Dalvi and Suciu, Management of probabilistic data: Foundations and challenges, ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (5) (2007), 1–12.

5. Yuan, Wang and Chen, Efficient subgraph search over large uncertain graphs, Proceedings of the VLDB Endowment 4(11) (2011), 876–886.

6. Agrawal and Srikant, Fast Algorithms for Mining Association Rules, 20th International Conference on Very Large Data Bases (1994), 487–499.

7. Han, Pei, Yin and Mao, Mining frequent patterns without candidate generation: A frequent-pattern tree approach, Data Mining and Knowledge Discovery 8(1) (2004), 53–87.

8. Bhatia and Rani, Ap-FSM: A parallel algorithm for approximate frequent subgraph mining using Pregel, Expert Systems With Applications 106 (2018), 217–232.

9. Bernecker, Kriegel H.P., Renz, Verhein and Zuefle, Probabilistic frequent itemset mining in uncertain databases, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2009), 119–127.

10. Chui C.K., Kao and Hung, Mining frequent itemsets from uncertain data, Lecture Notes in Computer Science (2007), 47–58.

11. Leung C.K.S. and Hao, Mining of frequent itemsets from streams of uncertain data, International Conference on Data Engineering (2009), 1663–1670.

12. Sun, Cheng, Cheung D.W. and Cheng, Mining uncertain data with probabilistic guarantees, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2010), 273–282.

13. Wang, Feng and Wu, Uds-fim: An efficient algorithm of frequent itemsets mining over uncertain transaction data streams, Journal of Software 9(1) (2014), 44–56.

14. Zhang, Li and Yi, Finding frequent items in probabilistic data, Proceedings of the ACM SIGMOD International Conference on Management of Data (2008), 819–831.

15. Cormode and Garofalakis, Sketching probabilistic data streams, Advances in Database Technology-11th International Conference on Extending Database Technology (2008), 145–190.

16. Sun, Lim and Wang, An approximation algorithm of mining frequent itemsets from uncertain dataset, International Journal of Advancements in Computing Technology 4(3) (2012), 42–49.

17. Wang, Cheung D.W.L., Cheng, Lee S.D. and Yang X.S., Efficient mining of frequent item sets on large uncertain databases, IEEE Transactions on Knowledge and Data Engineering 24(12) (2012), 2170–2183.

18. Leung C.K.S., Mateo M.A.F. and Brajczuk D.A., A tree-based approach for frequent pattern mining from uncertain data, Lecture Notes in Computer Science (2008), 653–661.

19. Leung C.K.S., Carmichael C.L. and Hao, Efficient mining of frequent patterns from uncertain data, Proceedings of IEEE International Conference on Data Mining (2007), 489–494.

20. Lin C.W. and Hong T.P., A new mining approach for uncertain databases using CUFP trees, Expert Systems with Applications 39(4) (2012), 4084–4093.

21. Yuan, Wang, Chen and Ning, Efficient pattern matching on big uncertain graphs, Information Sciences 339 (2016), 369–394.

22. Chen, Peng, Rao and Rosyida, Cycle index of uncertain random graph, 34 (2018), 4249–4259.

23. Gao, Guo, Yin and Yu, The computation on α-connectedness index of uncertain graph, Cluster Computing (2017), 123–134.

24. Zou, Li, Gao and Zhang, Frequent subgraph pattern mining on uncertain graph data, International Conference on Information and Knowledge Management, Proceedings (2009), 583–592.

25. Zou, Li, Gao and Zhang, Mining frequent subgraph patterns from uncertain graph data, IEEE Transactions on Knowledge and Data Engineering 22(9) (2010), 1203–1218.

26. , Zou and Gao, Mining frequent subgraphs over uncertain graph databases under probabilistic semantics, VLDB Journal 21(6) (2012), 753–777.

27. Papapetrou, Ioannou and Skoutas, Efficient discovery of frequent subgraph patterns in uncertain graph databases, ACM International Conference Proceeding Series (2011), 355–366.

28. Moussaoui, Zaghdoud and Akaichi, A New Framework of Frequent Uncertain Subgraph Mining, Procedia Computer Science 126 (2018), 413–422.

29. Jiang, Tu, Chen and Sun, Network motif identification in stochastic networks, Proceedings of the National Academy of Sciences of the United States of America 103(25) (2006), 9404–9409.

30. Valiant L.G., The complexity of computing the permanent, Theoretical Computer Science 8(2) (1979), 189–201.

31. Gaur, Shastri and Biswas, Graph-Based Substructure Pattern Mining, International Conference on Advanced Computer Theory and Engineering (2008), 865–869.