Abstract
We propose an efficient, approximate algorithm to solve the problem of finding frequent subgraphs in large streaming graphs. The graph stream is treated as batches of labeled nodes and edges. Our proposed algorithm finds the set of frequent subgraphs as the graph evolves after each batch. The computational complexity is bounded to linear limits by looking only at the changes made by the most recent batch, and the historical set of frequent subgraphs. As a part of our approach, we also propose a novel sampling algorithm that samples regions of the graph that have been changed by the most recent update to the graph. The performance of the proposed approach is evaluated using five large graph datasets, and our approach is shown to be faster than the state of the art large graph miners while maintaining their accuracy. We also compare our sampling algorithm against a well known sampling algorithm for network motif mining, and show that our sampling algorithm is faster, and capable of discovering more types of patterns. We provide theoretical guarantees of our algorithm’s accuracy using the well known Chernoff bounds, as well as an analysis of the computational complexity of our approach.
Introduction
With the proliferation of sources that produce graph data such as social networks, citation networks, civic utility networks, networks derived from movie data, and internet trace data, it has now become important to design unsupervised learning algorithms for real world graph data. Real world graph data is dynamic, sometimes evolving at very high speeds, and is rich with information at the nodes and edges. Unsupervised pattern discovery algorithms allow the discovery of important structural motifs in the data. In this paper, we address the problem of finding subgraph patterns with frequency above a user-defined threshold (formally called the frequent subgraph discovery problem).
Frequent subgraph mining has many applications. Frequent subgraph patterns define normative substructures in a graph, thereby allowing summarization of the graph. A graph can be modeled as a distribution on its component frequent subgraphs. Frequent subgraph sets can then be used as generative models for large networks [20]. Frequent subgraphs can be used to compress graphs by computing a frequency based compression metric and replacing the subgraphs with the highest values by a single node. We can use frequent subgraphs to monitor the vulnerability of a cyber-network by inspecting the change in the set of frequent subgraphs over time.
While useful, the problem of finding frequent subgraphs computationally lies in the space of NP-Hard. This is because a computationally complete solution of frequent subgraph mining for a single large graph requires solving the maximum independent set counting problem which is NP-Hard. While this can still be tractable for static graphs, it quickly becomes intractable for dynamic graphs, especially, if we model the graph’s evolution as a series of static snapshots. Given that most real world graphs densify over time [21], we assume that our graph data has streaming updates in the form of node and edge additions. Also, since most real world graphs have data associated with the nodes and edges, we allow the nodes and edges to be labeled.
We explain the scenario in which our system operates as follows. We assume the graph grows in batches of updates, where each update consists of new nodes and edges that are being added to the network. We do not allow the presence of multi-edges. As each batch of updates is added to the graph, our approach first samples regions of the graph being changed, and then estimates the frequent subgraphs present in the entire graph as each batch of updates is added to the graph. A crucial requirement for our approach is that it reports the frequent subgraphs present in a timely manner. The contribution of this work is an approach that samples regions of change, and discovers frequent subgraphs in large graphs with streaming updates to the graph. The sampling approach thus makes the subgraph isomorphism problem more tractable.
The rest of this paper is organized as follows. In the next section, we discuss the related work. Then in the following section, we define some of the terms related to frequent subgraph mining that we use through the rest of the paper. We also describe our problem statement. After that, we describe the intuition behind our proposed approach called StreamFSM, as well as provide a pseudocode version of the algorithm. We then discuss the datasets that we have used to evaluate our approach. Then we present the evaluation and the experimental results. Finally we summarize the paper and discuss our future work.
Related work
Previous research efforts have mostly been concentrated on developing frequent subgraph mining algorithms for static graphs. Frequent subgraph discovery algorithms can be categorized into either complete or heuristic discovery algorithms. Complete algorithms like SIGRAM [18] find all the frequent subgraphs that appear no less than a user specified threshold value. Heuristic algorithms like SUBDUE [10] discover only a subset of all frequent subgraphs. The first work done on discovering frequent subgraphs in single large graphs was done by Cook and Holder [10] and the algorithm was called SUBDUE. The focus of the algorithm however was in discovering the subgraph which compressed the input graph the most. However, parameters of this algorithm could be tuned to make it output frequent subgraphs as well. SUBDUE is a heuristic discovery algorithm. SIGRAM developed by Kuramochi and Karypis [17] is a complete discovery algorithm which finds frequent subgraphs from a large sparse graph.
The first frequent substructure based algorithm was designed by Inokuchi et al. [14] and was inspired by the Apriori algorithm for frequent itemset mining [32]. Designed for the graph transaction scenario, it was called AGM and the basic idea behind it was to join two size ‘
Some work has also been done on subgraph mining for dynamic graphs or streaming graphs. Bifet et al. [6] compared several sliding window approaches to mining streaming graph transactions for closed frequent subgraphs using a core set of closed frequent subgraphs as a compressed representation of all the closed frequent subgraphs discovered in the past. Aggarwal et al. [1] propose two algorithms for finding dense subgraphs in large graphs with streaming updates. However they assume that the updates to the graph come in the form of edge disjoint subgraphs. Wackersreuther et al. [27] proposed a method for finding frequent subgraphs in dynamic networks, where a dynamic network is basically the union of time based snapshots of a dynamic graph. Our work is distinguished from all these works as we attempt to find subgraphs in a large graph which has streaming updates.
Berlingerio et al. [5] devise a way to find graph evolution rules, which are patterns that obey certain minimum support and confidence in evolving graphs. They propose an algorithm called GERM that finds evolution rules from the frequent patterns present in the graph. However, in their approach they assume that they have the entire dynamic graph in the form of snapshots, where each snapshot represents the graph at a certain point of time. This is distinguished by our work where we do not have the entire graph all at once, and only see batches of updates to the graph. One of the challenges that Berlingerio et al. [5] face, similar to ours, was that the presence of high degree nodes in real world graphs greatly increases the processing time of the transaction mining component. They use a user-defined parameter to restrict the number of edges in a pattern. We take a similar approach to theirs: restricting the proportion of the neighborhood that is sampled by using a user-defined parameter to restrict the degree of the nodes.
Frequent subgraphs have been applied in solving related problems in dynamic networks. Lahiri and Berger-Wolff [19] apply a slightly modified concept of FSG, and use it as an interestingness criteria to extract regions of graphs from a stream of graph transactions (which represent snapshots of a dynamic graph), which they then use to predict interactions in the whole network. The definition of frequent subgraph is slightly different in this paper as they constrain vertex labels to be unique. They have converted the problem of FSG into a supervised learning problem, which does not really discover new and unseen FSGs in later stages. In our scenario, however, we continuously discover frequent subgraphs as batches stream in. Unlike the approach taken by Lahiri and Berger-Wolff, our approach can detect changes to the set of frequent subgraphs in the graph. In [34], Zhu et al. propose SpiderMine to probabilistically find the top k-large frequent subgraph patterns from a single large networks. Their focus is more on finding larger frequent subgraphs patterns. Our work on the other hand focuses more on finding frequent subgraphs in dynamic graphs, and our approach finds both large and small patterns depending on the parameters in the sampling phase. More recently, Braun et al. [8] propose a novel data structure called DS-Matrix, that allows computation of frequent patterns in a dense graph stream in a single pass. Their approach, however, requires that the possible edges and nodes be known before-hand, and cannot handle an infinite stream of nodes and edges. Our approach, unlike theirs, is an online approach which can handle an infinite stream.
Kashtan et al. [16] propose a sampling method that randomly samples n-nodes in order to extract connected subgraph samples that are of order ‘n’. Their approach consists of repeatedly sampling subgraphs of order ‘n’ until the desired sample size is reached. The concentration of different subgraphs are then estimated, which is used to find motifs in the network. The StreamSampleNeighborhood sampling scheme that we use in our approach uses a snowball sampling like scheme that is better for discovering star-shaped patterns in addition to other types of patterns. Wernicke [28] modified the sampling scheme due Kashtan et al. [16] in order to correct the sampling bias. The TIES algorithm [2] also used a sampling that is similar to ours. However, the objective of TIES is to sample a subgraph that is representative of the entire graph. In contrast, our sampling scheme samples a subset of the neighborhood around an incoming edge that is representative of the neighborhood.
Several graph analytics frameworks have been developed to support parallel/distributed processing of big data graphs [4]. Kang et al. [15] introduced PEGASUS, a big graph mining system built on top of MapReduce. PEGASUS was able to compute various properties of billion-node graphs using the distributed and scalable matrix-based capabilities of the MapReduce-Hadoop platform. Malewicz et al. [24] developed Pregel, which takes a distributed, vertex-centric approach to graph processing where each vertex can pass messages to its neighboring vertices. Low et al.’s GraphLab [23] is also a vertex-centric approach, but is targeted toward shared memory architectures. GraphLab supports an update function based on a vertex and its neighbors that can update the data associated with these vertices. Xin et al.’s GraphX [29] is built on top of the Spark distributed processing platform. GraphX takes an edge-centric approach by representing graphs as a set of edges in Spark’s Resilient Distributed Dataset (RDD). Finally, Yan et al. [30] developed the Blogel framework that takes a block, or subgraph, centric approach to big graph analytics, where processing involves distributed message passing between blocks. All of these frameworks support fast processing on large graphs, but tend to be optimized for finding global graph properties and require the entire graph to be loaded into the platform (i.e., no streaming).
Still, some specific approaches have been built on top of these frameworks to solve the frequent subgraph mining problem on big data graphs. Liu et al.’s MapReduce-based Pattern Finding (MRPF) method [22] and Hill et al.’s approach [11] use multiple MapReduce passes to find subgraph pattern extensions and their frequencies. Aridhi et al.’s approach [3] also utilizes MapReduce, but after first partitioning the graph, then mapping partitions to local frequent subgraphs, and then reducing these results to a set of globally frequent subgraphs. While fast and able to handle big data graphs, these approaches still require the entire graph be available. Carefully designed streaming graph processing can overcome this limitation while still scaling to large graphs.
Definitions
In this section we define the notation and terms related to our problem, and thus formally define the problem of finding frequent subgraphs. It is important to note that we do not allow multi-edges, even if the multi-edges have different labels. Keeping this in mind, we define the following terms:
In a streaming graph, a vertex is defined as a pair,
In a streaming graph, an edge is defined as a 4-tuple
A graph is defined as a 4-tuple,
Now that we have defined the basic graph data type we will be using, we move on to definitions for dynamic graphs that are more specific to our application, including what defines a batch of updates and how batches of updates taken together define the streaming graph.
Batch, graph snapshot, and graph stream.
A batch of node/edge updates to a streaming graph can be defined as
A graph stream
Conceptually, every batch of updates adds new nodes and/or new edges to the graph under consideration. Our approach is designed to handle an infinitely evolving graph, hence it can theoretically accept an infinite number of batches of updates. We do not consider the situation in which nodes/edges are deleted or modified. This is because keeping track of changes to the count of a subgraph in case of deletions/modifications will require storage of every instance of a subgraph that contributes to the count. This would increase the space complexity of our approach. In Fig. 1, we see the graph stream
A graph snapshot
For instance, in Fig. 1, the snapshot
Now that we have defined graph streams and snapshots and the general structure of the graph data, we move on to defining the graph patterns that we are looking for.
A subgraph pattern of a streaming graph snapshot is defined as
A subgraph pattern
A canonical code of a graph G, denoted by Canon(G) is an ordering of the nodes and edges of the graph such that if two graphs
Computing the canonical code for a graph is at least as computationally hard as computing the graph isomorphism. However, once computed, the code can be stored in memory so that recurrent isomorphism tests are reduced to string matching.
The problem of finding frequent subgraph patterns exists in two different graph data type settings. One type is where the graph data represents a single large graph. In this scenario, frequency of a subgraph pattern is defined as the maximum number of non-overlapping instances of the pattern. It is necessary to count only the non-overlapping instances in order to maintain the downward closure property of subgraph counts. The other scenario is where the graph data represents a collection of smaller edge-disjoint graphs called graph transactions. In this scenario, the frequency of a subgraph pattern is defined as the number of graph transactions that it is present in. Multiple occurrences of a pattern are counted only once.
A set of graph transactions, or database of graphs is defined as
A frequent subgraph
The problem of finding frequent subgraphs in a streaming/dynamic graph can therefore be formally defined as the problem of continuously finding the set of all frequent subgraphs
Subgraph isomorphism is NP-Complete, hence our approach uses a heuristic sampling algorithm to make subgraph isomorphism more tractable.
While the problem of finding frequent subgraphs in a large static graph is hard, continuously finding frequent subgraphs in graphs with streaming updates is harder. This is because the set of frequent subgraphs has to be updated as each batch of node/edge updates arrives. In most domains, such as social and cyber networks where the graphs mostly grow in size, a snapshot based approach of finding frequent subgraphs would not be efficient. This is because in a snapshot based approach we would be performing a lot of redundant computation. An incremental pattern growth based approach would be inaccurate as the frequency of the subgraphs discovered, towards the beginning of the graph’s lifetime, will not be very high. Many subgraphs will gradually increase in frequency over time, and such subgraphs will be eliminated from the initial set of frequent subgraphs. Thus a complete approach would require re-computation of frequent subgraphs with every new batch of updates to the graph, and would be very inefficient.
The intuition behind our approach is that as each edge arrives in the stream of edges, it makes a change to the distribution of the frequent subgraphs only in its local neighborhood. Therefore we first propose a novel sampling algorithm for sampling the neighborhood of an incoming edge. The pseudocode for our sampling algorithm is given in Algorithm 4. This sampling is done in order to extract neighborhoods around the incoming edge, and is performed after the current batch of edges have been added to the graph. The sampling is seeded at the new edges. If a new edge
Graph
In our approach the maximum number of nodes to select, adjacent to the nodes of the incoming edge, is dependent on the domain. Domains that have a high frequency of large star-shaped patterns, where the edges in the star have similar labels, require some pruning. In other domains we may be able to select all the nodes in the immediate neighborhood of the edge. This pruning is done by tuning the variable
StreamSamplingNeighborhood on sample graph with 
Set of node, edge updates UFreq. threshold
These extracted neighborhoods now become graph transactions, which are input to a graph transaction miner like gSpan [31]. This sampling scheme of course affects both the accuracy and the execution time. Selecting a higher number of neighbors increases the running time and the accuracy, while selecting a lower number of neighbors does the reverse. The graph transaction miner will find the frequent subgraphs, their frequencies, as well as their canonical labels. We use the canonical labels as keys to store the subgraphs and their current frequencies in a hash table data structure.
We repeat the procedure detailed in the above paragraph for every batch of node and edge updates to the graph. The hash table is updated with the counts of the subgraphs that are frequent for each batch of updates. Subgraphs that have been discovered previously have their counts incremented by their frequency in the set of graph transactions derived from the current batch of updates. Newly discovered subgraphs have their counts initialized in the hash table by their currently discovered frequency. After every batch is processed the subgraphs that are frequent so far are reported.
There are two important points to note at this stage. First, in the single large graph setting the frequency of a subgraph is the number of non-overlapping instances of that subgraph in the large graph. However, in the graph transaction setting, frequency is defined as the number of transactions that contain at least one instance of that subgraph. Hence it is difficult to translate the notion of a subgraph being ‘frequent’ in a single large graph to that subgraph being ‘frequent’ in a set of transactions that only reflect certain regions of the graph. Second, certain subgraphs may become frequent over time as the graph evolves. These subgraphs would be frequent for the overall graph but not be frequent in any one set of transactions. In order to address these issues, we use a very low frequency threshold for our graph transaction based frequent subgraph miner. While that increases our space requirements, it ensures that we do not miss subgraphs whose frequency increases gradually. We call our algorithm StreamFSM and the pseudo-code of the StreamFSM algorithm is given in Algorithm 2.
In the following lines we explain the pseudo-code of the algorithm. First, we initialize the graph
Before we evaluate our algorithm empirically, we discuss certain theoretical results on the accuracy and speed of our approach. Hence, in this section we first present theorems on the structural limitations of the patterns that can be discovered. Next we derive an expression for accuracy from the Chernoff bounds, and finally we discuss the time complexity of our algorithm.
Analysis of sampling algorithm
With regards to the StreamSampleNeighborhood algorithm, we have the following theorems on the limitations of the patterns that can be discovered by our approach. We focus on finding bounds for the diameter and degree as these two metrics are representative of the size and density of a subgraph. We need these theorems to prove the structural limitations of the patterns found using the StreamFSM algorithm.
.
The minimum possible diameter of the discovered patterns is
Proof..
The smallest discovered pattern can be a single vertex. Hence, the minimum diameter is
.
The lower bound on the maximum diameter of the discovered patterns is
Proof..
The sampling algorithm selects the incoming edge and M edges from both end-points of the incoming edge. Hence the lower bound on the maximum diameter of the discovered patterns is 3, provided each neighbor has at least one neighbor. ∎
.
The upper bound on the maximum diameter of the discovered patterns is the length of the longest Hamiltonian path on the sampled graph. Hence the maximum diameter is greater than or equal to
Proof..
The longest Hamiltonian path covers every vertex exactly once. If a chain graph is formed from this path, the diameter of the chain will be the length of the chain. No longer path, which includes every vertex exactly once, is possible in this graph. Hence, this will be the upper bound on the maximum diameter of a pattern discovered from this sample. If the graph sample is a Hamiltonian graph, then this diameter will be
Hence the upper bound on the maximum diameter of the discovered pattern is
.
The maximum degree of a node in a subgraph pattern can be
Proof..
The node that connects to every other node in the graph transaction will have a degree of
The structures that can be possibly discovered by StreamFSM are limited by the above theorems. The limitations are not too restrictive and we will be able to discover star graphs, cyclical graphs, line graphs, and chain graphs, as well as other arbitrary shapes. However, there is a restriction on the diameter of the graph and degree of the node which is due to the sampling. Also because we only sample one hop, there is a chance we will not be able to discover bushy tree patterns. If we want to discover larger shapes, then the value of
We use the well known Chernoff bounds to provide accuracy bounds for the StreamFSM algorithm. There are two cases when sampling may be required. In the first case, if the batches get too large, then StreamFSM may have to drop certain edges in order to maintain reasonable processing times. Thus we have Case 1.
The Chernoff bound tells how close the actual count of an edge in the sample is to the expected count in the sample. This is defined as accuracy of the sample and is given by
As in Case 1 above, we get the lower bound of the confidence value from Eq. (4),
where
where
From Eq. (7), we can get the accuracy for a certain desired confidence level. The expression for that is given as,
For the HETREC dataset, with
Let the number of batches be
where
A DFS min code is the smallest DFS code of a graph in lexicographical order.
Depending on the characteristics of the graph, and the parameters set for StreamFSM, different terms in the time complexity expression would be dominant. The longer we process the graph stream, the value of
There are two possible cases in terms of worst case behavior. First, considering the worst case value of
While the runtime of StreamFSM will depend on the gSpan frequency, it can also depend on the size of the batch that we are processing. Individual batches will be processed much faster than the entire graph. Hence this will allow StreamFSM to keep up with the stream-rate. While processing the graph one batch at a time,
Known substructures embedded in (a) Artificial graph 1, and (b) Artificial graph 2.
This section describes the datasets that we use to evaluate the efficiency and accuracy of our approach. We use five large graph datasets. The first two are artificial graphs, and the final three are real-world datasets. The artificial graphs are created using a graph generator called Subgen that takes as input a pre-defined subgraph, the percentage coverage of the subgraph, the order and size of the desired graph, and distributions of the node and edge labels. It then creates the artificial graph with the number of instances required to meet the desired percentage coverage of the graph, and then randomly connects all the instances. The artificial graphs contain multiple occurrences of a known frequent subgraph. The size and order of the artificial graphs is chosen to ensure the correct proportion of instances of the known frequent subgraphs. We choose a large number of embeddings (83%) in order to make sure that any patterns that might be formed by the random patterns have a low frequency. The labels for the random edges are chosen from a pre-defined distribution over the nodes and edges. The pre-defined distribution mirrors the distribution of the labels in the embedded subgraph. After creating the artificial graphs, we randomize the edges of the graph in order to make sure that subsequent edges in the graph do not form instances of frequent subgraphs. We then partition the edges serially into batches as described below. The partitioning can result in breaking a frequent subgraph pattern across batches. One of the reasons for choosing a high frequency is so that the embedded subgraph pattern can have high frequency even after losing some instances to the partitioning.
Artificial graph 1: We created an artificial graph with 10,000 vertices and 15,000 edges where instances of the substructure in Fig. 3a comprise 83% of the graph, and the rest of the edges randomly connect the instances of the substructure. We choose 83% coverage so that the embedded pattern remains the dominant pattern after Subgen has added random edges. No other models of noise were added. The graph is then divided into 15 parts containing approximately 1000 edges each. The choice of 15 batches is simply to keep the number of edges close to 1000 in every batch. Each part represents a batch of edge and vertex updates to the graph.
Artificial graph 2: We created an artificial graph with 10,000 vertices and 16,911 edges where instances of the substructure in Fig. 3b comprise 83% of the graph, and the rest of the edges randomly connect the instances of the substructure. We choose 83% coverage so that the embedded pattern remains the dominant pattern after Subgen has added random edges. No other models of noise were added.The graph is then divided into 16 parts containing approximately 1000 edges each. The choice of 16 parts is simply to keep the number of edges close to 1000 in every batch. Each part represents a batch of edge and vertex updates to the graph.
Twitter: The raw Twitter4
Sample of Twitter graph.
Hetrec: The Hetrec 2011 [9] dataset uses the MovieLens 10M5
ArnetMiner: The ArnetMiner citation dataset [26] is a citation network dataset with 629,814 papers and more than 632,752 citations. We create a graph where the nodes can be either of ‘paper’, ‘author’, or ‘venue’ types. As with the HETREC dataset, the nodes represent unique papers, authors or venues. Their labels however are not unique. The edges can be either ‘paper to author’, ‘paper to venue’, and ‘paper to paper’ types. The resulting graph has 2,501,424 nodes, and 7,542,887 edges. We take the first 1.5
Summary of graph datasets
Hetrec graph after first two batches with node labels showing names of actors, directors, and movies.
Arnet graph after first batch with node labels.
The five datasets chosen allow us to evaluate our approach along several different graph types. The artificial datasets contain two different kinds of patterns, and the real world datasets contain low label diversity. A summary of the datasets in given in Table 1.
We evaluate the StreamFSM approach based on the following questions:
How does the run-time vary as we vary In a scenario where we are getting continuous batches of updates, can StreamFSM report the current set of frequent subgraphs in a timely manner? Can StreamFSM process the data stream at speeds higher than the stream-rate? Is StreamFSM accurate in terms of finding known substructures? How does StreamFSM compare against the state-of-the-art large graph miners? How relevant are the patterns found? What is the effect of varying the frequency threshold of the graph transaction miner on the run-time and accuracy? How does StreamFSM perform in terms of time and memory when working on graphs that are of the scale of millions of nodes and edges?
We answer these questions using the datasets described in Section 6. Our experimental settings are as follows: We implement our algorithm and interface to gSpan in C++. We use a Ubuntu 14 system with 4th Generation Intel Core i7-4770K Processor with 3.50 GHz base frequency, 8 MB Cache, and 16 GB DDR3 SDRAM (1600 MHz) memory.
In order to evaluate the runtime as we increase
Total running time (in secs) vs. number of sampled neighbors (extract bit reset after every batch).
Total running time (in secs) vs. number of sampled neighbors (extract bit not reset).
Fig. 7. Figure 8 shows the total runtime in seconds taken by the four datasets when the extract bit for every edge is never reset, ensuring that an edge cannot be extracted as part of a transaction more than once. As we can see this mode of the algorithm results in significant speed-up.
The second question, verifying that we can report the frequent subgraphs on time as each batch comes in, can be answered by measuring the average response time per batch in seconds, where the response time is defined as the time required to process a single batch of updates. We plot the average response time per batch against the neighborhood sample parameter in Figs 9 and 10. From Fig. 10, we can see that the average response time remains roughly the same for all the datasets except in the case of the HETREC dataset, where the running time decreases and then increases gradually. The reason for this is because, when
Average batch running time vs. number of neighbors sampled (extract bit reset after every batch).
Average batch running time vs. number of neighbors sampled (extract bit not reset).
We also look at the runtime per batch as an arbitrarily long graph streams in. In order to simulate such a graph we stream the Twitter data in thrice in the form of B1, B2,
Runtime per batch as TwitterExtended streams in.
Maximum running time vs. evolution time (of graph data)
The comparison between actual times and running time for evaluating our ability to keep up with the stream-rate can only be done for the real world datasets since the artificial graphs do not have any timing information. We perform these comparisons with the Twitter and Hetrec dataset by listing the total running time, when
We measure accuracy, thus answering the fourth question, by verifying that for the two artificial graphs the embedded patterns are discovered by the end of the stream of updates. Looking at the output of the frequent subgraph miner, the embedded subgraphs are indeed discovered as frequent subgraphs by the last batch of updates. For Artificial graph 2, however, since the patterns are in the form of a star shaped structure, the number of neighbors parameter directly affects the accuracy of the graph. As the number of neighbors,
Comparison of 1-edge subgraph counts in Twitter
We also compare the average running times of our algorithm versus the running times taken by two publicly available large graph miners, GERM [5] and SUBDUE [10] on the Twitter dataset. We list these in Table 4. As we see our algorithm outperforms GERM and SUBDUE on this dataset. This dataset is designed to be a disadvantage for frequent subgraph miners as it contains only one type of node label, and two types of edge labels. We also run SUBDUE on Artificial Graph 1 to compare the results with StreamFSM. For Artificial graph 1, SUBDUE finds 2306 instances of the embedded substructure, whereas StreamFSM (with
Comparison with GERM and SUBDUE using Twitter
Runtime while using Kashtan et al.’s sampling.
In order to compare the efficacy of our sampling algorithm, we compare it with the sampling scheme described by Kashtan et al. [16]. The sampling algorithm in [16] uses a random edge as seed and samples a random neighbour based region around the incoming edge. Their random neighbor sampling takes the form of a random walk with random restarts until the desired number of nodes
Frequent subgraph from (a) Twitter (b) Hetrec.
We picked some of the patterns from the output of the frequent subgraph miner that seemed to be interesting in terms of both frequency and size in order to discuss their relevance to the domains. The patterns from Twitter and HETREC are shown in Fig. 13. The pattern in Fig. 13a can be described as follows. It contains a single user who mentions 11 other users in one or more tweets, and can thus be considered to have a link with the other users. We find 1192 instances of this pattern. This kind of a pattern can be considered to be a characteristic pattern for a social network like Twitter. Of course, this pattern is probably not the only pattern that is characteristic of Twitter as a whole, but can definitely be considered to be a characteristic of this particular dataset, as well as one of the patterns characteristic of Twitter. The pattern in Fig. 13b from HETREC, contains actors denoted by ‘A’, and movies denoted by ‘M’ connected by a link labeled ‘acted_by’. The pattern shows a single actor who has acted in 9 movies, and is connected by a single movie to three other actors.There are 597 instances of this pattern. One of the interesting aspects about the dataset that this pattern illuminates is that there are at least 597 actors who have worked in 9 films and collaborated with 3 other actors. As we can see, these patterns are quite relevant to the domain of the graph.
We investigate the effect of varying the gSpan (or any graph transaction miner) frequency on the runtime of the algorithm and the accuracy of the results. In general, we see that the runtime decreases as we increase the gSpan frequency, as potentially gSpan has to evaluate fewer substructures. We study the potential change in accuracy by looking at the change in the number of frequent subgraphs discovered. With the exception of the Artificial graph 1, the number of frequent subgraphs discovered decreases. This decrease would result in missing structures that might become potentially frequent. Conceptually, there is a relationship between the gSpan frequency, and the StreamFSM threshold. As the StreamFSM threshold is increased, one might be able to choose a higher gSpan frequency with less reduction in accuracy. We present the results in Figs 14 and 15.
Total running time vs. gSpan frequency.
Number of frequent subgraphs discovered vs. gSpan frequency.
We also use the Twitter data to investigate the number of frequent subgraphs that we get as we vary the StreamFSM frequency
Number of frequent subgraphs discovered vs. gSpan frequency as StreamFSM frequency changes.
Total runtime vs. number of neighbors sampled (extract bit reset after every batch).
Total memory vs. number of neighbors sampled (extract bit reset after every batch).
Total runtime vs. number of neighbors sampled (extract bit not reset after every batch).
Finally, we evaluate the time and memory required by our algorithm while processing graphs that are on the order of millions of nodes and edges. We use the ArnetMiner graph for this purpose. The results of runtime versus the value of
We also show the time and memory usage when the ExtractBit is never reset in Figs 19 and 20. We see a similar trend as when the ExtractBit is not reset, except the increase starts after
Total memory vs. number of neighbors sampled (extract bit not reset after every batch).
While StreamFSM is able to process the ArnetMiner graph, it makes the assumption that the entire data can fit in memory. For an infinite stream of nodes and edges, this assumption can easily lead to the memory being filled up very quickly. It can also lead to longer runtimes. This problem can be alleviated by using a sliding window scheme on the stream of nodes and edges.
In this paper, we propose an algorithm called StreamFSM that is capable of continuously finding the current set of frequent subgraphs in a streaming labeled graph. Our algorithm is capable of doing this, without having to recompute all the frequent subgraphs from scratch, by only looking at the regions in the graph that have been changed due to the current batch of updates. We evaluate our algorithm on the 2 artificial graphs and 3 real world graphs and show that our algorithm is capable of processing the data streams at speeds higher than the stream rate, as well as give accurate results. We compare our approach with the state-of-the-art in frequent subgraph miners for static graphs and show that our algorithm outperforms them in terms of interestingness of results and execution time taken together. We also compare our sampling approach against the well known sampling approach developed by [16] and show that the runtime using this sampling algorithm is higher than our runtime, and the accuracy is worse for graphs with many star-shaped patterns. We then show that our algorithm scales similarly on very large graphs as the number of neighbors sampled increases. Finally, we analyze the algorithm and provide time complexity results as well as results on theoretical accuracy bounds using the well known Chernoff bounds.
The drawback of our algorithm is mainly in terms of several parameters that have to be tuned in order to get the optimal performance in terms of time and accuracy/interestingness of results. Also our approach relies on storing the entire graph in main memory. We plan to address the first issue by developing a scheme that automatically selects and/or adapts the parameters according to the graph statistics. In order to handle the second problem, we plan to use a sliding window on the streaming graph. We are currently looking into using a fixed size sliding window and plan to extend this to adaptive sliding windows.
Footnotes
Acknowledgments
We would like to acknowledge Diane J. Cook and Matthew E. Taylor at Washington State University for reviewing the paper and providing many helpful suggestions. This material is based upon work supported by the National Science Foundation under Grant No. 1646640.
