Abstract
Graph pattern matching has been widespread used for protein structure analysis, expert finding and social group selection, ect. Recently, the study of graph pattern matching using the abundant attribute information of vertices and edges as constraint conditions has attracted the attention of scholars, and multi-constrained simulation has been proposed to address the problem in contextual social networks. Actually, multi-constrained graph pattern matching is an NP-complete problem and the fuzziness of constraint variables may exist in many applications. In this paper, we introduce a multi-fuzzy-constrained graph pattern matching problem in big graph data, and propose an efficient first-k algorithm Fuzzy-ETOF-K for solving it. Specifically, exploration-based method based on edge topology is adopted to improve the efficiency of edge connection, and breadth-first bounded search is used for edge matching instead of shortest path query between two nodes to improve the efficiency of edge matching. The results of our experiments conducted on three datasets of real social networks illustrate that our proposed algorithm Fuzzy-ETOF-K significantly outperforms existing approaches in efficiency and the introduction of fuzzy constraints makes our proposed algorithm more efficient and effective.
Introduction
In recent years, Big data become increasing important in acadamic and industry, It has been widely studied by scholars. Big data are generated from multi-source, heterogeneous and autonomous data [1], and usually have the characteristics of distributed and decentralized control. Furthermore, extensive application scenario makes it more changeable and huge in scale, which brings great challenges to its processing and analysis [2]. Therefore, how to represent and mine useful information become a major issue in big data research. As an important data structure, graphs are used to represent big data with complex relationships. We will call it big graph later, such as biological networks, social networks, and knowledge graphs. In order to conduct effective analysis over those graphs, various queries have been investigated, such as reachability [3], shortest path query [4], frequent subgraph mining [5, 6, 7, 8], subgraph matching [13, 10, 14, 33, 9], keyword search[11] and Graph Pattern Matching (GPM) [16, 23, 35, 41, 12], etc.
Early, for the search of graphs, researchers studied the reachability and the shortest path query in graphs. Subgraph queries for small-scale static graphs have also been extensively studied [13, 14]. However, with the continuous increase of data size and the increasing complexity of data relationships, the query and analysis of graph data are facing new challenges. Subgraph matching, which refers to finding subgraphs that is isomorphic to a given pattern graph
However, this isomorphic subgraph matching does not adapt well to some applications in social networks, such as expert finding[16], social group discovery[17], personalized recommendation [18, 19, 20, 21], ect. In order to extend the application of graph pattern matching in social networks, Fan et al. [22] proposed bounded simulation, which was based on binary correspondence between nodes and relaxes strict edge-to-edge mathcing to length bounded path matching. But they still didn’t take the rich information on the nodes and edges in the big graph data to get better query results. Therefore, Liu et al. [23] proposed Multi-Constrained Graph Pattern Matching (MC-GPM) problem based on Contextual Social Graph (CSG), which required matching more effective results by multiple constraints on vertices and edges. In addition, there are often fuzzy attribute constraints that are not considered in existing MC-GPM methods, such as the social influence of participants, the degree of trust between social participants, ect.
In this paper, we present the problem of Multi-Fuzzy-Constrained Graph Pattern Matching (MFC-GPM), which makes MC-GPM more suitable for applications in general big graph data. In addition, based on the Baseline algorithm proposed by Liu et al.[23], we propose a more efficient Edge Topologically Ordered First-K(ETOF-K) algorithm, and expand it to Fuzzy-ETOF-K alogrithm for solving MFC-GPM problem. Finally, we experiment on three real-world social networks of different sizes to compare the efficiency of our algorithm with two existing algorithms, which validates that our proposed algorithm significantly outperforms existing approaches and introducing fuzzy to MC-GPM is necessary.
The rest of this paper is organized as follows. We first review the related work on GPM in Section 2. Then, the introduction of necessary concepts and the definition of MFC-GPM are presented in Section 3. Section 4 proposes our Edge Topologically Ordered First-K algorithm. Section 5 presents our experimental sets and results, and Section 6 concludes this paper.
Related work
Graph pattern matching has developed for many years. At first, It is often used for isomorphic matching tasks, such as protein property detection [15]. It requires matching subgraphs to have the same topology structure as the pattern graph. However, the continuous increase of data scales and the increasingly complex relationship between data, the traditional graph pattern matching method based on isomorphism cannot meet the applications of emerging fields due to its high algorithm complexity. Simulation-based graph pattern matching [22, 23, 24, 35, 36] has attracted the attention of researchers in recent years because of its ability to return matching results in cubic time.
Isomorphism-based GPM
Isomorphism-based graph pattern matching is to find subgraphs that are isomorphic to the pattern graph, it is also called subgraph matching. The earliest Ullmann [13] proposed a depth-first-based matching algorithm, which is a brute-force enumeration search method. In order to improve the efficiency of the algorithm, Cordella et al. [14] proposed the VF2 algorithm by improving the pruning strategy of the Ullmann algorithm. More pruning strategies based on pattern graph semantics and structure in [25, 26] were studied. But isomorphic graph pattern matching is an NP-complete problem, in order to further improve the efficiency of matching, Yan et al. [27] proposed a GIndex algorithm based on frequent subgraph mining and indexing. This algorithm significantly improves the performance of graph pattern matching. Path-based indexing algorithm was also proposed by Shasha et al. [28]. More approaches about index-based matching can be found in the literature [29, 30, 31].
In recent years, with the rapid growth of data scale and the increasing complexity of data relationships, the above mentioned methods can not adapt to the exact matching of pattern graph on very large scale graph data. Afrati et al.[32] proposed a distributed parallel computing method based on map-reduce. They decomposed the pattern graph into serveral subgraphs, and then obtained the final matching subgraphs through distributed computing and join operations. However, the connection operation is very computationally intensive, which seriously affects the efficiency of the matching. Shao et al.[33] proposed a parallel computing framework PSgL which obtained matching results by iterative computation of intermediate matching results. In addition, incremental-based matching [18, 26, 30] and top-k algorithms [34] have also been studied in response to the dynamic growth of graphs and the need for real-time applications, respectively.
Simulation-based GPM
Although the graph pattern matching based on isomorphism can be used to detect isomorphic structures of chemical substances and predict the interaction between proteins [15], and it can also be applied to expert discovery [16], social group query [17], personalized recommendation[18, 19], etc. However, isomorphic-based matching is too strict for the applications of social networks. Henzinger et al.[24] proposed graph simulation, which obtained a set of simulated nodes by calculating the similarity of nodes, and the computation can be completed in polynomial time. But this is still based on edge-to-edge matching, and its flexibility does not meet the needs of new applications such as in social networks. Fan et al.[22] proposed bounded simulation, which requires that the edges satisfy the bounded length path matching and can be completed in cubic time. Then, in order to adapt to specific applications, some improved methods based on bounded simulation have been proposed, such as graph pattern view[35], resource-bounded query[36] ect. Moverover, a strong simulation[37] is proposed for matching the pattern graph topological structure that was not well retained by bounded simulation. Exactly, strong simulation uses the duality to ensure the topological relationship between node and it’s adjacent points, and then further secure it with the locality of the pattern graph. However, all the existing work has failed to use the abundant vertices and edges information contained in the big graph data. Shemshadi et al.[38] considered multi-label information on vertices but did not consider the constraints on the edges. Liu et al.[23] proposed the MC-GPM problem, put constraints on nodes and edges to match more accurate results, and proposed a Baseline algorithm based on exploration-based method and a heuristic algorithm(HAMC) based on compressed index of data graph. MC-GPM is useful for contextual social networking applications, such as crowdsourcing travel [39, 21, 39], social network based e-commerce[40], ect. Shi et al. [41] and Liu et al.[42] also studied the top-k algorithm and the parallel algorithm. However, these tasks have not substantially improved the performance of the Baseline algorithm and the HAMC algorithm, but only use more computing resources in exchange for shortening the matching time. In this paper, we will propose an ETOF-K algorithm to improve the performance of the algorithm from the aspects of edge matching and edge joining, respectively. And the general mode of multi-fuzzy-constrained graph pattern matching in big graph data is proposed.
Multi-fuzzy-constrained graph pattern matching (MFC-GPM)
In this section, first, we will define of graph and the pattern graph in the pattern graph matching. Then the definition of the graph simulation is given. Finally, we will introduce our proposed MFC-GPM based on graph simulation.
Data graph and pattern graph
Data graph
A data graph is a labeled directed graph and can be represented by
A pattern graph is defined as a labeled directed graph and is denoted as
Data graph and pattern graph in a CSG.
where
Graph Simulation proposed by Henzinger et al.[24]. They used it to calculate the similarity of the nodes for verification and refinement of the reaction system. Fan et al.[22]improved the graph simulation, proposed bounded simulation, and applied it to graph pattern matching. The definition of the graph simulation is as follows:
Consider a data graph
for all for each pair
for each edge
Then
It should be noted that this type of graph simulation matching results not only include nodes that match the vertices in the pattern graph, but also include nodes on the edge matching path. In addition,
The simulation-based graph pattern matching is more flexible than isomorphism-based and can complete matching queries in polynomial time, so it is more suitable for social group discovery[22], personalized recommendation [18, 19, 20, 21], etc. in social networks with higher real-time requirements. However, big graph data usually contains a large amount of vertex and edge information, which can be used to match more accurate results. Liu et al. [23] proposed MC-GPM problem and proposed two matching method based on MCS. MC-GPM can be used for crowd-sourcing travel [39], social network-based e-commerce [40], etc. However, in the practical application of big graph data, there are usually constraints that are inherently fuzzy, such as the social influence on the vertex and the social trust on the edge in a CSG. Obviously, MC-GPM does not take this fuzzy property of constraints into account. As a result, some actual useful match results be lost. For example, the results that only one attribute is slightly below the constraint and other constraints are well satisfied are lost. Therefore, in this paper, the MFC-GPM problem is proposed, in which all constraints are set to a membership function. For constraints without fuzzy property, set its membership value to be equal to 1. Moreover, we can set different membership functions for different constraints. But for the sake of illustration, in this paper, all constraints are set to the same membership function.
When a vertex
Consider a data graph
for all for each pair
for each edge
then
In this section, the ETOF-K algorithm will be introduced first. Then, the multi-constrained edge matching and the topologically ordered edge connection are explained in detail.
Methodology
MC-GPM is an NP-complete problem. In order to solve this problem effectively, Liu et al.[23], proposed two First-K algorithms based on MCS. First, they proposed a Baseline algorithm based on bounded simulation, which performs the edge connection based on the depth-first traversal of the graph, and the edge matching is performed by the shortest path query between nodes. Then they proposed a heuristic algorithm (HAMC) based on the compression and indexing of strongly linked subgraphs (they called Strong Social Component, SSC) in data graph, which also uses the depth-first traversal method of the graph for edge connection. But for the edge matching, the query is first performed in the SSC index. If it exists, the edge matching result is returned and if it does not exist, the Dijkstra algorithm is used to search in the data graph.
In this paper, An Edge Topologically Ordered First-K algorithm has been proposed based on Baseline algorithm. First, ETOF-K algorithm performs edge matching by breadth-first bounded search based on the current matching vertex, instead of matching the shortest path between two vertices. This will greatly reduce the query of the shortest path between unnecessary vertices. Second, the edge topologically ordered sequence is used to edge connection processing instead of the graph depth-first search. Because when we match an edge, if there is still an edge pointing to its starting vertex that has not yet matched, then the current matching may be invalid. Therefore, the topologically ordered edge join method can achieve pruning by filtering the vertices before matching the edges starting from it.
Pattern and Data Graph for Illustrating Edge Connection (Note that we omit the attributes and attribute constraints on the nodes and edges in the graph, which are the same as in Fig. 1).
For fast return matching results, the ETOF-K algorithm proved to be very effective by our experiments. It is a process of edge-joining while edge matching. First, we need to get the indegree of all the vertices in the pattern graph, and then use the topological sorting algorithm to get the order of accessing the edges. We call it the topologically ordered sequence of edges. Then we perform edge matching and join operations based on this sequence.
Multi-constrained edge pattern matching is the main part of our algorithm. It is usually a query in the data graph to match the edges or paths that satisfy all the constraints defined on the edges in the pattern graph. In practical applications, it usually means to query whether a set of specified conditions can be satisfied between two participants. In this paper, A pattern edge matching algorithm based on candidate node breadth-first depth bounded search is proposed. First, we conduct a breadth first depth bounded search from a candidate vertex
[h] Muti-Constrained Edge Pattern Matching, MC-EPM
The detailed steps of the algorithm are shown in Algorithm 1, where path
For different applications, the multiple constraints defined on the edge may be different. For example, in CSG, the multiple constraints defined on the edges are often social influence factors, social trust, and social relationships; in medical big data, it can be reliability determined by the doctor’s experience, data quality determined by data sources and data imbalance, ect. In addition, for our proposed MFC-GPM, whether a path satisfies multiple constraints will be determined by its membership function and corresponding membership value defined on different constraints, such as Example 2.
Exploration-based topologically ordered edge connection
Our proposed ETOF-K algorithm is a exploration-based method for graph pattern matching, which aims to quickly answer a GPM query. The matching process can be described in two parts. In the previous section, we have already introduced the edge matching process. Therefore, in this section, the edge connection method will be introduced. For our proposed MC-GPM algorithm ETOF-K, before performing edge matching and connection, it is necessary to obtain the indegree of all vertices indegree[
[tb] ETOF-K Algorithm
In Algorithm 2,
[tb] Edge Connection, EC
In Algorithm 3, the matching results
Experiments
In this section, we conduct experiments on three real-world social graphs. The details of the three datasets are shown in Table 1. The Baseline and HAMC algorithm have been implemented based on source code of Liu et al.[23] and an algorithm, we call it the Baseline2 algorithm, has been implemented by using our edge matching method instead of the original one of the Baseline algorithm. Then, our proposed ETOF-K algorithm has been implemented by ourself. Finally, we compare the efficiency of the Baseline, HAMC, Baseline2 and ETOF-K algorithms. The effectiveness of introducing fuzzy constraints into MC-GPM is proved by comparing the ETOF-K algorithm and the Fuzzy-ETOF-K algorithm.
The detail information of three real-world data graph
The detail information of three real-world data graph
The number of matching results 
The datasets used for our experiments contain only vertices and edges, which are available at snap.stanford.edu. The aggregated attributes of the vertices and edges mentioned earlier can be mined from the existing social networks, which is another very challenging problem. In our experiments, without loss of generality, we randomly generate them with the function rand() in SQL. In Exp-1, the constraint
Since MC-GPM is an NP-complete problem, it is impossible to get all the matching results in
The number of matching results 
The number of matching results 
Exp-1
This experiment is to investigate the efficiency of our ETOF-K algorithm by comparing the number of matching results returned in the same time of four algorithms.
| T |
T |
T |
T |
T |
T |
T |
T |
T |
T |
T |
T |
T |
|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3429 |
| HAMC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3429 |
| Baseline2 | 22334 | 48498 | 77008 | 105241 | 132615 | 156850 | 189364 | 218323 | 243366 | 271907 | 296954 | 329934 | 354038 |
| ETOF-K | 62543 | 147842 | 261741 | 359301 | 491191 | 637043 | 744322 | 876327 | 1004723 | 1124644 | 1171168 | 1247555 | 1347302 |
| T 1400 | T 1500 | T 1600 | T 1700 | T 1800 | T 1900 | T 2000 | T 2100 | T 2200 | T 2300 | T 2400 | T 2500 | ||
| Baseline | 3429 | 3429 | 3429 | 3429 | 3429 | 3429 | 3429 | 3429 | 5322 | 5322 | 5322 | 5322 | |
| HAMC | 3429 | 3429 | 3429 | 3429 | 3429 | 3429 | 3429 | 3429 | 5322 | 5322 | 5322 | 5322 | |
| Baseline2 | 382500 | 400533 | 416104 | 444975 | 470518 | 498989 | 529692 | 558707 | 586977 | 618085 | 646753 | 673075 | |
| ETOF-K | 1449619 | 1537764 | 1668963 | 1827056 | 1955824 | 2007173 | 2073885 | 2150678 | 2224766 | 2290701 | 2377292 | 2464516 |
Details of the number of results returned at different time points on the DBLP dataset
Details of the number of results returned at different time points on the Pokec dataset
As shown in Figs 3–5. (1) The Baseline and HAMC algorithms return almost the same number of results at all statistical time for all three datasets. This is because the matching process between the HAMC algorithm and the Baseline algorithm is almost the same, except that the HAMC algorithm can speed up the matching process of the partial edges by using the compression and indexing of the SSC subgraph. However, as a whole, the HAMC algorithm first queries the index file when performing edge matching, so the overall efficiency is not as good as the Baseline algorithm. (2) Baseline2 algorithm that replaces the shortest path query’s edge matching method with our edge matching method can get more matching results than Baseline and HAMC algorithms at same statistical time. This proves that our vertex-based breadth-first bounded edge matching method is more efficient than the previous edge matching method. (3) And the number of matching results obtained by our ETOF-K algorithm is much higher than the previous three algorithms at three datasets. This proves that the edge matching process based on edge topologically ordered sequences can effectively prun the edge matching process, thus improving the matching efficiency.
In addition, the statistics of the specific number of matching subgraphs in Tables 4–4.
This experiment is to investigate the efficiency and effectiveness of introduce fuzzy constraints to our ETOF-K algorithm. Due to the fact that we cannot get all match subgraph, the efficiency of the two algorithms is compared by using the method in Exp-1 and the effectiveness of two algorithms is compared by the number of results that concerned to the same first edge in the topological ordered sequence of pattern edges after running 2500 minutes.
As shown in Figs 6 and 8, on the Epinions and Pokec datasets, the Fuzzy-ETOF-K algorithm can return more matching results than the ETOF-K algorithm at all the same statistical times. On the DBLP dataset, the number of results returned by the Fuzzy-ETOF-K algorithm and the ETOF-K algorithm is similar in the first 30 hours, and the ETOF-K algorithm is superior to the Fuzzy-ETOF-K algorithm, as Fig. 7. This is because fuzzy constraints require membership calculation for each constraint condition, so the access of Fuzzy-ETOF-K algorithm to candidate nodes is slightly slower than that of ETOF-K algorithm as a whole. These experimental results prove that although Fuzzy-ETOF-K algorithm is not better shown in than ETOF-K algorithm in all cases, but it is not much worse than ETOF-K algorithm even if it is poor. This also proves that the influence of the fuzzy calculation on the matching efficiency is slightly less than the effect of the effective matching result discard on the matching efficiency.
In addition, as shown in Table 5, after 2500 minutes of algorithm execution, on the Epinions dataset,
The number of matching results 
The number of matching results 
The number of matching results 
The comparison of the number of results between ETOF-K and Fuzzy-ETOF-K on three datasets
we can get the two matching results of the first edge and all the 136626 matching subgraphs associated with the two matching edges by ETOF-K algorithm and 175289 matching subgraphs associated with the same two matching edges by introducing fuzzy into ETOF-K algorithm. It is 28.90% more effective than ETOF-K. Similarly, we can see that the Fuzzy-ETOF-K algorithm is 20.83% and 31.16% more efficient than ETOF-K on the DBLP dataset and on the Pokec dataset, respectively. This proves that it is necessary and more effective to introduce fuzzy constraints into MC-GPM.
In this paper, we presented a Multi-Fuzzy-Constrained Graph Pattern Matching (MFC-GPM) problem and then proposed a Fuzzy-ETOF-K algorithm to solve it. For Multi-Constrained Graph Pattern Matching (MC-GPM) problem, we proposed an ETOF-K algorithm to improve its efficiency and conducted validation experiments on three real social network datasets by comparing Baseline, HAMC, Baseline2 and ETOF-K algorithms. Furthermore, we proved the necessity and effectiveness of introducing fuzzy constraints into MC-GPM problem by comparing ETOF-K and Fuzzy-ETOF-K algorithms.
In the future, we will further study and improve the Fuzzy-ETOF-K algorithm, and consider the dynamic graph and super-large graph pattern matching problem in combination with parallel computing and distributed computing techniques. Research combined with practical applications is also under consideration.
Footnotes
Acknowledgments
This work has been supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000901, the National Natural Science Foundation of China under Grant No. 91746209, and the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education under Grant No. IRT17R32.
