Abstract
Generally, big interaction networks keep the interaction records of actors over a certain period. With the rapid increase of these networks users, the demand for frequent subgraph mining on a large database is more and more intense. However, most of the existing studies of frequent subgraphs have not considered the temporal information of the graph. To fill this research gap, this article presents a novel temporal frequent subgraph-based mining algorithm (TFSBMA) using spark. TFSBMA employs frequent subgraph mining with a minimum threshold in a spark environment. The proposed algorithm attempts to analyze the temporal frequent subgraph (TFS) using a Frequent Subgraph Mining Based Using Spark (FSMBUS) method with a minimum support threshold and evaluate its frequency in a temporal manner. Furthermore, based on the FSMBUS results, the study also tries to compute TFS using an incremental update strategy. Experimental results show that the proposed algorithm can accurately and efficiently compute all the TFS with corresponding frequencies. In addition, we also apply the proposed algorithm on a real-world dataset having artificial time information that confirms the practical usability of the proposed algorithm.
Introduction
Data mining is the process of extracting statistically significant and useful knowledge or information from the data in databases [1]. The data may exist in many forms such as vectors, tables, texts, images, and so on. It can also be presented by different means, such as structured data and semi-structured data. Structured data and semi-structured data are more suitable and naturally fit for graph representation. Mining of the graph data is referred to as graph mining or graph-based data mining. It is an important research area in the domain of data mining and has attracted much attention in recent years. Frequent subgraph mining (FSM) is an important aspect of graph mining. The main aim of FSM is to extract all the frequent subgraphs in a given data set, whose occurrence counts are above a specified threshold [2]. One of the main challenges in FSM is the information mining of graphs with dynamic nature, which is called temporal FSM.
Many real-world graph applications contain temporal information. For example, telecommunication companies record massive amounts of phone calls and Short Message Services (SMS) every day, where each phone call or SMS record contains attributes about the sender, the recipient, and the time when the phone call was made, or the SMS was transmitted. As another example, online social networking companies keep logs about the interactions between users and the time when each interaction occurred. We can take advantage of temporal diverse information networks (DIN) to capture this type of relational structure. In DIN, nodes represent individuals of different entity types and edges stand for several kinds of correlation. Besides, each edge is related to a time interval to record the durations that two individuals come across and interact with, where each interval entails a starting timestamp and an end timestamp. Alternatively, considering real-world social activities typically follow a particular intervallic property, such as a day for human life and a year for a conference venue. This kind of data overtime period should be integrated into DIN. Hence, we regard social activities in a specific time period as a temporal snapshot by gathering networks graphs (
The main contributions of this article are summarized as:
A new framework for graph mining strategy is studied called the TFS mining framework. A novel TFS mining algorithm is proposed using a spark environment. The proposed algorithm is applied on a real-world dataset having artificial time information that confirms the practical usability of the proposed algorithm. Extensive experiments have been performed on different data sets including real-world data set. Experimental results of the proposed algorithm are compared with non-TFS mining in order to show its efficiency and accuracy.
The rest of the paper is organized as follow: Section 2 presents related work, Section 3 describes the graph preliminaries, Section 4 discusses the proposed method for mining frequent substructure patterns in temporal graph dataset, and Section 5 presents performance evaluation of the proposed method. Finally, Section 6 concludes the paper with future work.
Generally, several types of processes are characterized by a set of actors that interact in a network over a period. In many real-world applications (social and communication networks), one of the common objectives is the dynamic behavior modeling of the component entities (people). Recently, a considerable number of studies have focused on the analysis of the practical applications, including the evolution of bibliographic databases [3, 4, 5, 6, 7], influence and information spread in a social network [8, 9, 10], and the web and other information networks [11, 12]. Most of the existing studies employed various graph mining algorithms, such as GRAMI [13].
The GRAMI algorithm is one of the most effective single-machine algorithms for mining frequent subgraphs on a single graph. It turns the problem of mining frequent subgraphs into the problem of constraint satisfaction problem (CSP). However, the GRAMI algorithm is a stand-alone algorithm, which has low efficiency in large-scale graph operation and is difficult to support the mining of frequent subgraphs with low support. Moreover, the existing algorithms for FSM of a single graph in a distributed environment [14, 15] are for the directed graph and need to specify the number of vertices of subgraphs. It does not support the mining of subgraph growth patterns.
Most of the other distributed frequent subgraph mining algorithms [16, 17] are based on the MapReduce framework [18], which need to read and write disks many times during the iterative calculation, resulting in a large amount of input and output (I/O), serialization and deserialization costs, and also cannot be transplanted to the single graph for mining. Based on the above defects, recently, Frequent Subgraph Mining Based Using Spark (FSMBUS) [19] and Spark-based Single Graph Mining (SSIGRAM) [20] algorithms are proposed, which are based on Apache Spark Framework. But the above-mentioned spark-based algorithms lack the temporal information attributes of the temporal graph.
However, most of the existing graph mining algorithms mentioned above do not consider temporal information in a graph, which fails to exploit the temporal attributes for detecting important temporal patterns. This work attempts to analyze the TFS using an FSMBUS algorithm with a minimum support threshold and temporally evaluate its frequency. Furthermore, based on the FSMBUS results, this article also tries to compute TFS using an incremental update strategy.
Graph preliminaries
In this section, some basic concepts of graph theory and notations used in the paper are defined.
Graph-1 is a subgraph but Graph-2 is not subgraph of Graph (a). 
TFS mining framework. Vertex RDD (
The structure of CSP domain.
In this section, we propose an efficient TFSBMA using spark. The TFS parallel mining method from the temporal dataset is mainly designed based on FSM using spark distributed architecture illustrated in Fig. 2. It has two significant phases: the initial phase and the mining phase. The initial phase is also called the preparation phase, where the system loads graph data having time attribute from Hadoop Distributed File System (HDFS), divides the edges by time series, makes single Resilient Distributed Datasets (RDDs) for the same time attribute, and compute frequent edges. The mining phase has two sub-phases FSM and temporal frequency calculation phase. The FSM phase receives the frequent edges RDD (
The main tasks of the initial phase that need to be completed are generating multiple
Meanwhile, this algorithm is applied to the mining of undirected graphs, such as on
Mining phase
It is the main phase of our framework, which consists of FSM and temporal frequency calculation. In the FSM phase, all frequent subgraphs are calculated iteratively, including candidate generation of searching the data fields for corresponding candidate subgraphs and its support calculation. In the temporal frequency calculation, first the frequency is assigned to each frequent subgraph calculated for the time periods in the previous step, after that, the cumulative frequency is calculated for all generated subgraph with respect to the time period.
Frequent subgraphs of a social interaction network
Frequent subgraphs of a social interaction network
In FSM, we have used the FSMBUS [17] mining strategy due to its efficiency and accuracy. First, all the candidate frequent subgraphs are generated in the
Temporal frequency calculation
According to the FSM step, all frequent subgraphs are efficiently calculated and stored in the lists
When the search is completed, the results are combined in one RRD for the next search of
The flow of temporal frequency calculation.
During the frequency calculation, If the data item (frequent subgraph) frequently appears in the data flow and the last acquiring time is close to the current time node, then it means that the frequency of this data item is high at the current time. It is more likely to become frequent, which forms the basic idea of dynamic frequency-time accumulation calculation based on time accumulation. The details of the calculation process are as follows.
The number of counting times divided by time granularity into different time periods of data flow from the initial time to the current t time, which is represented by
Algorithm 1 shows the whole process of temporal FSM and how it is conducted in the spark framework, in which temporal graph datasets and minimum support for frequent subgraphs are given as input. After initializing inputs, the graph is created with the combination of vertices for each edge data in a specified time period. In the lines (5–7), all frequent edges are calculated for the generated graph
The temporal frequency is calculated by using Algorithm 2. It shows the temporal frequency calculation process for every frequent subgraph data generated in the mining step. It will take
Experimental results and discussion
In this section, we will describe the experimental process and results along with a comparative discussion.
Experimental setup and data
The main issue addressed in this article is the TFSs mining in a single graph using the spark framework. The whole experimental setup is given as: a host PC with Intel Core i7-7700HQ CPU at 2.40 GHz and 20 GB RAM with Ubuntu 19.04 operating system and spark version 2.3.1 with the Scala version 2.11. The spark cluster is built on it with one master node and two worker nodes with the same configuration for the master node with 8 G memory and each worker node with a 2 G memory.
The main graph data involved in our work is a single large graph. To show the accuracy and efficiency of our solution followed by using different experimental processes on simulation data generated by random function using in java, which containing temporal information. Furthermore, the study categorized the whole simulation data into 11 years from 2000 to 2010, having 100,000 vertices with 100 distinct labels and combine the total number of edges (relation) are 1,080,298 with high density, and the characteristics of the annual data are summarized in Table 2.
Simulation database
Simulation database
In order to verify the effectiveness of temporal frequency, we compare the results of frequent subgraphs (non-temporal frequency) and frequent subgraphs with the temporal frequency (which is also called TFSs) of each year with last year correspondingly. We computed the frequent subgraphs using three different minimum support values for each year, and the corresponding results are shown in Table 3 with four columns. Each row represents the number of frequent subgraphs that occurred each year with respect to the minimum support value. The support values 5, 6, and 7 are respectively used in experiments and placed in the heading of the table.
Frequent subgraph for every year
Frequent subgraph for every year
Temporal frequent subgraphs TFSs.
The data from 2000 to 2010 is extracted by using FSM techniques for each year, then the temporal frequency is calculated for every two years correspondingly. For example, in the temporal frequency calculation for 2001, we take the frequent subgraphs of 2000 and searched that how many of them have occurred in the 2001 subgraphs, and this process is repeated for different data calculated by different minimum support. All the results are illustrated in Fig. 5a, where the x-axis consists of time periods with values from 2001–2010, and the y-axis shows the number of temporal subgraphs occurrences of the corresponding year data. As shown in the graph, we can see the dramatic change when the minimum support value is decreased, which increased the number of frequent subgraphs as well as TFSs exponentially. It can also be found in Table 3 that the average number of frequent subgraphs data calculated with the minimum support-7 is less than the data calculated with support-6 and much less than support-5. However, the average accuracy of the TFSs of minimum support-7 and support-6 (2.69% and 8.39%) is very low with respect to data calculated with minimum support-5, which is 17.01%. It is evident that with less minimum support, more frequent subgraphs may result in temporal frequency, which will turn in results for better future prediction.
On the other hand, Fig. 5b shows the results of the TFSs for all years, where the y-axis represents the number of TFSs, and the x-axis represents the frequency of the resultant data with different three line for different minimum support. As shown in Fig. 5, the cumulative number of subgraph occurrences is very high for all three lines representing different minimum support data at the frequency “1”, which shows the non-TFSs because these subgraphs occurred only once in the whole time period. We can also see that most of the TFSs have a frequency equal to “2”. In addition, the number of occurrences is decreasing when the temporal frequency is increasing as shown in Fig. 5b.
Figure 6 illustrates the results of the contrast effects of TFSs and non-TFSs from 2006 to 2010, where the y-axis indicates the number of TFSs and non-TFSs occurrences, and the x-axis consists of time periods with values from 2006–2010, respectively. The contrast experiments of the forecast results of TFSs and non-TFSs are conducted for every year with the compression of the last five years to that year such as mining for 2006, we have used 2001 to 2005, for mining 2007, we used 2002 to 2006. Similarly, the same mining procedure for the rest of the years.
Contrast effects of TFSs and non-TFSs.
We have used the “20” threshold values, which is the lowest minimum support value for every five years, and the threshold value “5” is used for each year. As can be seen from Fig. 6, the number of TFSs is gradually increased, which predicted that the temporal frequency is more in the 2011 dataset. It also shows that the number of TFSs appears less than the number of non-TFS which indicates that many meaningless subgraphs are excluded. The average number of subgraphs in temporal frequency prediction of five groups’ data is 59.91% less than non-temporal frequency prediction. Therefore, the prediction effect of temporal frequency is better in the future for decision making, because it accumulates the time to make the frequent subgraph mining data more practical and useful.
The scalability experiments of the proposed TFSBMA were conducted using Spark is to evaluate the time effect of executors in the Spark cluster and built on a host configured with Ubuntu19.04, Intel Core i7-7700HQ CPU at 2.80 GHz with 20 GB RAM and four additional virtual nodes are set up with the same configuration of 2 GB memory. The computing nodes are parallel connected to the master node. The time performance of the algorithm is calculated for combined temporal frequency for all years with the support values (5, 6, 7), respectively. The experimental results are shown below in Fig. 7a and b.
Where Fig. 7a shows the results of all support values and Fig. 7b shows only support-6 and support-7 for better understanding the difference of executions. As shown in Fig. 7a, under the threshold values for temporal frequency calculation, the run time taken by the algorithm is decreasing, and the number of executors is increasing. As the number of executors increases, the time consumed by the algorithm decreases, and the curve starts to slow down. Also, for support-6 and support-7, which can be seen clearly in Fig. 7b, due to the parallel process among computing nodes causes an increase in time overhead. At the same time, the execution number is the same, and the threshold for temporal frequency calculation is different. However, the running time still continuously decreases as the number of core processes is increasing for all worker nodes, and the amount of the data calculation remains the same, which verifies the efficiency of a distributed environment.
Time performance of spark executors.
The reasons why the proposed algorithm performed better than non-temporal can be attributed to the following two aspects:
The proposed algorithm is based on Spark, which enables us to process a large amount of data, beyond what can fit on a single machine with a high-level library. Spark’s design and interface are unique, and it is one of the fastest systems of its kind. Since the proposed algorithm can find frequent subgraphs using the lowest threshold value over a specific interval, therefore, we get more frequent subgraphs, which can lead to better future prediction. As a result, the proposed algorithm performs better than non-temporal FSM.
Mining frequent subgraphs from the graph-based structure is a visual representation of structural information. New methodologies are discovered to explore the structural relationships of a large graph, especially when the graph is dynamic. In this paper, we proposed a new framework and a novel algorithm TFSBMA for TFS mining using spark for large temporal graphs. Our framework consists of two phases: the initial phase, and the mining phase, where the initial phase is the data preparation phase, and the mining phase is responsible for FSM and temporal frequency calculation. Extensive experiments were conducted in different ways to predict the possible events for the future. The effectiveness and performance of the proposed framework and algorithm are evaluated by performing different contrast experiments. Compared to the non-temporal frequency frequent subgraphs, the proposed algorithm has higher accuracy for future prediction.
In this work, model temporal datasets were used for experimental results, which is the cause of a small percentage of prediction in future data. In the future, it can be improved with real temporal graph datasets which can achieve the actual percentage in the data for better future prediction. On the other hand, FSM with the lowest minimum support is a significant factor for better future prediction in a temporal way. In the mining phase, when a subgraph involves a node with a very big degree, the FSM phase algorithm will not go to the next iteration until the executor finishes calculating the support of this subgraph. In the future, we plan to design a strategy that decomposes the evaluation task for this type of subgraph to all executors, which may accelerate the search speed further.
