Abstract
We propose in this paper an efficient heuristic method to learn a set of classification rules from a set of graph objects. Graph classification has various real-life applications, however, this is a very challenging problem due to the intrinsic complex structure of graphs. The proposed rule constructing method is based on two lines of research. The first line of research is on Boosting [11] in which a weak-hypothesis is regarded as a rule and is assigned with a real-valued confidence. In our research, a rule is comprised by a set of subgraphs that maximize an objective function in each round of boosting. The second line of research is on utilizing the poset order of the Formal Concept Lattice of subgraphs to accelerate the process of generating rule candidates. The learned rule set is compact, comprehensible and obtains high classification accuracy on tested datasets.
Introduction
We consider in this paper the problem of predicting the label
where
This problem, often called graph classification, has attracted a lot of attention over the last twenty years. Applications of graph classification range from drug discovery (e.g., predicting whether or not a chemical compound has the desired biological activity, or is toxic or non-toxic), image analysis (e.g., image segmentation and object recognition), to biological network analysis, and social network (e.g., classifying users based on their feeds on Facebook, Twitter, etc.) [16, 1, 30].
The main challenge in classifying graphs is how to make sense out of graph structures. Existing methods, whether explicitly or not, use substructures such as walks, paths, subtrees, and subgraphs to represent graphs for the learning purpose [2, 13, 28]. Graph kernel methods, for example, represent a graph as a very high dimensional (and often sparse) vector of indicators of substructures, and then compute the dot product between two feature vectors by a recursive algorithm [13]. In general, it is impossible to generate all possible substructures beforehand, and therefore, methods that require to take all substructures into account are not always beneficial. Additionally, in the domain of graph classification, the prediction accuracy is not the unique goal. It is often required in various practical applications that the prediction rules are comprehensible to human. In the case of graph kernel methods for which features are not explicitly computed, the prediction rules are certainly not easy to interpret.
Boosting [11], a very successful classification algorithm that produces a linear combination of weak classifiers (a.k.a. base learners), has been extended for graph classification with improved accuracy and comprehensibility. Kudo et al. [21] proposed the first boosting algorithm for graph based on AdaBoost in which subgraph-based decision stumps were used as weak learners. In order to avoid the exhaustive enumeration of all subgraphs which is usually impractical, the authors proposed an efficient search-space pruning strategy based on the upper bound of the gain. It was shown experimentally that the proposed graph boosting algorithm leads to good accuracy and efficiency in classifying graph objects. However, this AdaBoost-based algorithm can still take many iterations to converge. Saigo et al. [29] proposed a mathematical programming boosting method that can build the prediction rule with fewer iterations compared with the AdaBoost-based method [21]. In the proposed algorithm called gBoost, single subgraphs were also used as weak learners.
Boosting-based algorithms for graph classification typically require a long training phase. In the algorithms of Kudo et al. [21] and Saigo et al. [29] the search spaces are usually very large despite the fact that they used branch-and-bound techniques and only decision stumps that are connected subgraphs were searched for. To further reduce the size of the search space, the minimum support constraint or the maximum subgraph size constraint have been used. These constraints, however, may results in losing of some optimal subgraphs. Another issue that makes boosting-based algorithms for graph classification hard is that we have to call the subgraph mining procedure for every boosting iteration because weights of examples must be recalculated.
In the above mentioned work, the feature space of subgraphs is completely constructed by subgraph mining [35], and weak learners are chosen as individual connected subgraphs. In some problems, choosing a subgraph as a weak learner can be a good solution and leads to good classification performance. However, it is not always the case that an individual connected subgraph is the optimal choice. In many complex graph classification problems, a weak learner of a subgraph can be extremely weak due to its limited discriminative ability in separating two or more classes. Boosting these poor decision stumps thus often leads to very long training phase because little training error decreases at each boosting step. Therefore, it should be a better idea to construct weak learners based on a group of subgraphs in cases no individual subgraph having enough discrimination power and a group of subgraphs may jointly providing higher discrimination power.
An example of five graphs in two classes: positive (+) and negative (-). Below are three frequent subgraph patterns 
For example, let us consider the graph set given in Fig. 1 where the first two graphs are positive and the remaining three graphs are negative. Three frequent subgraph patterns
We propose a method to build a set of graph classification rules where each rule antecedent is formed by a conjunction of subgraph conditions. In other words, the IF part of each rule is satisfied by the co-occurrence of subgraph patterns instead of a single subgraph pattern. We apply the confidence-rated boosting method to find optimal rule at each round of boosting and to reduce weights of graph examples covered by the rules generated so far. The key challenge in finding combined subgraphs is the exponentially-large search space,
A number of algorithms have been proposed for the graph classification problem [15, 18, 8, 21, 25, 34, 31, 29, 17, 19, 10]. One of the pioneering algorithms for graph classification is SubdueCL [15]. The algorithm searches for subgraphs present in positive examples and absent in the negative examples using a greedy, heuristic search strategy. There is an important line of research for the graph classification problem involving the development subgraph kernels [18, 14, 27, 23]. Kernel-based methods take all subgraphs into account to represent a graph as a very high dimensional vector of indicators of subsgraphs. However, using all subgraphs is not always efficient and effective. Therefore, many approaches restrict the set of subgraph features to frequently occurring, frequent closed, or class-correlated subgraphs [8, 3]. Another important line of research for the graph classification problem is based on the boosting approach [21, 29, 10]. Kudo et al. proposed to combine AdaBoost and the optimal pattern search algorithm. Saigo et al. proposed to learn from graph data using mathematical programming based on LP-Boost [7]. Fei et al. introduced an algorithm that considers the structure relationships of base learners in the functional space. A common point of these algorithms is that they use individual subgraphs as base learner. There are algorithms using different strategies for searching for optimal subgraphs [34, 31, 17].
There are also graph classification algorithms proposed under the semi-supervised setting [20, 19, 37]. In [20], subgraph features are selected from both labeled and unlabeled data using a feature evaluation criterion. The method proposed in [37] deals with the situation where only positive and unlabeled graphs are available. In [19] a process of active sample selection together with feature selection is performed.
In parallel, there are methods that represent graphs in a pairwise manner using similarity or distance measures [4, 33, 26]. Bunke and Shearer [4] proposed to use the maximum common subgraph of two graphs to compute the similarity between them. Wallis et al. [33] proposed to use the union of the two graphs and consider also the sizes of the graphs for computing the distance. Nguyen and Yamamoto [26] proposed a similarity measure based on the concept lattice, taking into account the lattice structure explicitly. The measures can be used in supervised, semi-supervised, unsupervised learnings and in matching graphs.
The rest of the paper is organized as following. In Section 2, we introduce a method for finding optimal rule based on the confidence-rated boosting method. In Section 3, we describe the Formal Concept Lattice for graphs and an heuristic method for rule growing based on the lattice. In Section 4, we describe our experiments on some real-world graph datasets to verify the effectiveness of the proposed method. We summarize the paper and provide directions for future work in Section 5.
Finding optimal graph classification rules
We build an ensemble of weighted rules using boosting. Each rule in the rule set is associated with a real number confidence rating. We generate only rules that predict membership in the positive class. That means, only rules with positive confidence rating are generated. For predicting membership in the negative class, we use a single default rule which is a rule with a negative confidence rating. This approach is similar to SLIPPER [5] and many other rule learners, resulting in a more comprehensible rule set.
The boosting technique can be considered as generalization of set covering. After a rule is found, covered examples are given lower weights instead of being removed from the training set. By this, a new rule will focus on examples which are hardest to classify by rules found so far. In propositional rule learning, one of the first approaches to do so was SLIPPER. Our proposed method relates to SLIPPER in that both are based on confidence-rated boosting, which is generalization of AdaBoost. The main difference between our method and SLIPPER is in the rule growing details. In SLIPPER conditions to be added to a rule are readily available in the training data, but for graph data rule conditions must be mined for. We propose to use the partial ordering in the lattice structure of graphs when growing rules. This method drastically reduces the complexity of calculations.
A generalized version of AdaBoost with real valued predictions
A generalized version of AdaBoost with real valued predictions
Given: (
Train weak learner using distribution
Get weak hypothesis
Choose
Update:
The generalized boosting algorithm is given in Algorithm 1 where
where
where
Equation (4) implies that an optimal classification rule based on the confidence-rated boosting method has a natural trade-off between accuracy (the proportion of the positive examples satisfied by a rule to the total number of examples that the rule satisfies) and coverage (the fraction of examples that satisfy the rule). This has practical use in building classification rule sets because we do not need to check rules having too small coverage (too small support set), thus this can help to save a lot of time. Particularly, in the graph mining domain, finding subgraphs of small supports often requires a lot of calculation time.
Equation (4) shows us how to pick the optimal classification rule in a set of rule candidates. However, it does not tell us how to construct the rule candidate set. In the next section, we introduce a method for dealing with this problem, that is, using an iceberg concept lattice of subgraphs to find optimal graph classification rules. Our method is an heuristic one that takes into account the partial order of sets of subgraphs on the lattice. Even though in this paper we discuss and implement only a method for two-class problem, the method can extends easily to multiple classes.
Let
The term Formal Concept Analysis (FCA) was introduced by Ganter and Wille in 1984 [12], and builds on applied lattice and order theory. FCA is a principled way of constructing a concept hierarchy from a collection of objects and their attributes. In [26], the authors proposed to apply FCA theory to graph data. We give a brief summary of the Formal Concept Lattice for graph data here, following what given in [26].
A formal concept is a pair
For graph data, each graph instance (for example, a chemical compound) is regarded as an object and all subgraphs of this graph are treated as attributes. Let
Let
The set of all formal concepts of graphs endowed with the
As pointed out in [26], the problem of constructing a complete lattice is NP-Complete. In almost every case, it is possible to construct an upper part of the lattice only (called an iceberg lattice) using frequent closed subgraphs. A subgraph is frequent if its support is not less than a given threshold
The formal concept lattice of the dataset in Fig. 1. a The entire concept lattice. b An iceberg part of 
The lattice structure, once built, could facilitate pattern understanding, knowledge discovery, and interactive exploration of complex graph data. In [26] the authors proposed to learn a similarity measure between two graph objects based on the iceberg lattice. In this paper we propose to learn a set of classification rule for graphs using the lattice. Although both methods are based on the iceberg lattice for the learning purpose, there is difference in how the lattice is used. In [26] the iceberg lattice is built once and unchanged during the learning process, in the method proposed in this paper the lattice can be extended accordingly at each round of rule learning. In this sense our method is somewhat similar to the method of [29] where the search space is stored in memory and progressively extended.
In machine learning, a classification rule takes the following form:
where each
We propose to search for graph classification rules on an iceberg lattice of subgraphs. Due to the closed property of formal concepts (each intent of a concept is a closed set and each subgraph in the intent is also a closed subgraph), for any subset
As mentioned previously, Eq. (4) implies that we do not need to scan the complete formal concept lattice when searching for rules because optimal rules must have large enough support sets, and thus it is appropriate to use an iceberg lattice for this task. The main problem here is that we do not know beforehand how deep an iceberg lattice should be in order for it to contain all the optimal classification rules. This is equivalent to that we do not know how to set the value of the parameter minsupp for the tasks of subgraph mining and lattice construction. Our heuristic method bases on the following assumption. Suppose that the used iceberg lattice is deep enough. There is a high possibility that optimal rules (corresponding to concepts having maximal
In the “standard theory” of FCA, the following theorem means that the intersection of extents of nodes in a lattice is always an extent of a node in the lattice. The intersection of intents of nodes in a lattice is always an intent of a node in the lattice.
.
(
Theorem 1 suggests that we can extend a lattice using the intersection of extents or intents of nodes in the lattice. However, in the “expanded version" of FCA for graph data this theorem no longer holds as shown in [26]. Specifically, the intersection of intents of nodes in a lattice of graph data is not always an intent of a node in the lattice.
.
(
The Join operator
If
The distance between two formal concepts
Let
Our method works as follows. Given an iceberg concept lattice. At each round of the boosting algorithm, we find
Stopping criterion
In the graph learning field, a good stopping criterion should: i) limit the maximum amount of time spent per each iteration; ii) stop if the error is no longer decreasing or decreasing too slowly; iii) stop if the support (the number of supporting graphs) of the rule is less than a given threshold
Putting all the above discussion together, Algorithm 2 gives our method for graph classification rule finding.
Finding rules based on the iceberg lattice
Finding rules based on the iceberg lattice
For each concept
The Meet (
The Meet (
We call this rule learning algorithm for graph classification as LGC (for
To avoid the computation of exact classification rules that can be expensive, we propose to find and use approximate rules corresponding to approximate intents of the optimal concepts instead. These approximate intents are found by using the following approximate Meet operator (denoted
The approximate Meet operator does not require to find maximal common subgraphs of graphs in the set
An example of the Meet operator and the approximate Meet operator.
We call this relaxed version of the LGC algorithm as aLGC (for
In this section, we evaluate the performance of the proposed algorithms LGC and aLGC by comparing them with some state-of-the-art algorithms for graph classification. LGC and aLGC are compared to gBoost [29], an efficient method for graph classification and regression based on boosting. gBoost adopts a mathematical programming-based learning method and uses single connected subgraphs as decision stumps. LGC and aLGC are also compared to CORK [31], another efficient method for graph classification which is not based on boosting. In CORK, features are also single connected subgraphs which are found by using a submodular selection criterion in a greedy forward manner.
Several statistics of the used datasets
Several statistics of the used datasets
Size: number of graphs in the dataset, MV: maximum number of vertices, AV: average number of vertices, ME: maximum number of edges, AE: average number of edges.
For the experiments, we use the datasets shown in Table 1. This includes one antiviral screen (AIDS) dataset, four anti-cancer screen datasets (NCI), and one Dobson and Doig (DD) molecular dataset. The AIDS dataset [32] originally contains 43,850 chemical compounds divided into three groups: 423 active (CA) compounds, 1,081 moderately active (CM) compounds and 42,346 inactive (CI) with respect to the HIV virus. This dataset can be found at the website
NCIs [32] are anti-cancer screen datasets collected from the PubChem website
DD is a data set of 1,178 protein structures [9]. Each protein is represented by a graph, in which the nodes are amino acids and two nodes are connected by an edge if they are less than 6 Angstroms apart. The edge labels are omitted. The DD dataset is divided into two classes: 691 enzymes and 487 non-enzymes. Among all the datasets, DD is the most densely connected one with an average size of 284 vertices and 715 edges.
PTC dataset (can be collected from
We use the CloseGraph algorithm [36] to generate frequent closed subgraphs at
Training time comparison of LGC, aLGC, and gBoost
Classification accuracies.
Classification AUC values.
For measurement, we use accuracy (ACC) as the performance measurement in our experiments because accuracy is an effective metric to assess the algorithm performance for balanced datasets. For better comparing the algorithms we also use the area under the ROC curve (AUC) [22]. The ACC values are shown in Fig. 4 and the AUC values are shown in Fig. 5. The standard deviations for all datasets except for PTC are less than 3% for all the algorithms. For the PTC dataset, standard deviations are larger, between 6% to 8%. As can be seen in the figures, LGC, aLGC and gBoost outperform CORK, both in accuracy and in AUC, for all the tested datasets. This result shows the advantage of boosting-based approaches over the greedy feature selection approach based on submodular quality criterion adopted by CORK. aLGC and gBoost give almost similar results on four cancer datasets, however aLGC gives better results on AIDS, DD and PTC datasets. LGC outperforms both aLGC and gBoost in ACC and in AUC, although the differences are not very large except for DD and PTC datasets. The result is encouraging and shows the effectiveness of selected features of co-occurrence subgraphs of our methods.
Next, we evaluate the runtime efficiency of the algorithms by comparing their training time. Table 2 shows the runtimes of LGC, aLGC and gBoost. We do not report the training times of CORK because the current release of CORK requires very long running time on tested datasets, especially on the Dobson and Doig (DD) dataset. As can be seen in the table, both LGC and aLGC are much faster than gBoost, an efficient global inductive method, on all four datasets. Since LGC, aLGC and gBoost all use the set of subgraph patterns generated by a mining algorithm for the learning purpose, their runtime behaviors are mainly determined by the used mining algorithm. As discussed previously gBoost takes longer time since it needs to call the mining algorithm multiple times with a low value of minsupp in order to obtain a good predictive rule set. LGC also calls the mining algorithm multiple times, however, the first call of the mining algorithm to build the iceberg lattice consumes most of the running time, other calls run on a very small dataset that is the graph set of the extent of optimal concepts with high value of minsupp (100% of the set size). aLGC is fastest because it calls the mining algorithm just once.
Varying minimum support thresholds – accuracy values.
Both LGC and aLGC are approximate algorithms and their performances depend on the size of the iceberg lattice, but to different extents. We test the performance of LGC and aLGC with difference values of minsupp to see how the prediction performance changes as minsupp increases. Figure 6 reports the test results on the NCI1 dataset. From the figure, we can see that accuracy values of aLGC algorithm decrease as minsupp increases from 5% to 13% in all datasets. Accuracy values of the LGC algorithm also decrease as minsupp increases but not as much as those of aLGC. However, we observe here that for both LGC and aLGC, accuracy decreases more at higher values of minsupp. This manifests the approximate nature of our proposed approach.
We have proposed in this paper a method for finding a set of classification rules for connected, node-and-edge-labeled, undirected graphs. Optimal rules are found based on the Confidence-rated Boosting method. The searching space is limited to the formal concept lattice constructed from the training graphs. The proposed approximation methods exploit the partial order among sets of subgraphs to accelerate the optimal rule finding process in each round of boosting. Experimental results show that the proposed methods achieve equal or better classification accuracy when compared to two state-of-the-art algorithms while saving considerable training time. For future work, we plan to utilize the minimal graph generators of the equivalent classes found in the lattice to construct rules which may results in rules of more generalization power.
Footnotes
Acknowledgments
This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant 102.05-2013.37.
