Summarizing significant subgraphs by probabilistic logic programming

Abstract

Although recent advances in significant subgraph mining enable us to find subgraphs that are statistically significantly associated with the class variable in graph databases, the resulting subgraphs are difficult to interpret because of their massive number and their propositional representation. Here we represent graphs by probabilistic logic programming and solve the problem of summarizing significant subgraphs by structure learning of probabilistic logic programs. Learning probabilistic logical models leads to a much more interpretable, expressive and succinct representation of significant subgraphs. We empirically demonstrate that our approach can effectively summarize significant subgraphs while maintaining high accuracy.

Keywords

Subgraph summarization, significant subgraph mining, statistical significance, probabilistic logic programming, logic programs with annotated disjunctions

1. Introduction

Pattern mining [1] is the process of finding multiplicative combinations of features (variables), or patterns, from a dataset, and has been actively studied as one of the central topics of data mining [42]. Various types of patterns have been used in applications. Examples include itemsets [13], which are combinations of binary features originally used in market basket analysis to find frequently co-purchased items, and sequences [24], used in DNA sequence analysis and customer behavior analysis. In this paper we focus on subgraph mining [14, 40], whose task is to find frequently occurring subgraphs in a collection of graphs, often called a graph database. Since a graph is a fundamental data structure and a wide range of graph-structured data is available, such as chemical compounds in PubChem [7] and protein structures in PDB [6], subgraph mining has been studied as an important branch of pattern mining to analyze graph-based data.

As an extension of the original subgraph mining problem, significant (discriminative) subgraph mining [19, 33], which tries to find subgraphs enriched in one class relative to another class, has recently attracted considerable attention. For example, in drug discovery, each chemical compound is modeled as a graph and a collection of graphs is divided into two classes, case and control, and one can find subgraphs that are enriched in the case group but not in the control group. Significant subgraph mining finds all subgraphs that are statistically significantly associated with the class variable while correcting for multiple testing to ensure rigorous control of the FWER (family-wise error rate). This means that significant subgraph mining can rigorously control the probability of detecting one or more false positive subgraphs, which is indispensable in drug discovery and other scientific fields such as biology and medicine.

However, the challenge of significant subgraph mining is that it often produces millions of significant subgraphs in a propositional representation, which makes the obtained subgraphs hard to interpret. How to summarize such a massive number of significant subgraphs in a principled way is still an open problem. Although pattern compression has been studied [1, Chapter 8], with the application of the MDL (Minimum Description Length) principle [38] to find representative patterns, and ILP-based approaches have been used [11], none of the existing methods has been successfully applied to significant subgraph mining.

Our goal in this paper is to summarize significant subgraphs using Probabilistic Logic Programming to achieve better interpretability of massive subgraphs.

Probabilistic Logic Programming (PLP) is gaining popularity due to its ability to represent relational domains with many entities connected by complex and uncertain relationships. First-order logic is a powerful language to represent complex relational information, thanks to its intrinsic expressivity, while probability is the standard way to represent uncertainty in knowledge. One of the most fertile approaches to PLP is the distribution semantics [30], which underlies several languages such as the Independent Choice Logic [25], PRISM [31], Logic Programs with Annotated Disjunctions (LPADs) [37] and ProbLog [10]. Various algorithms for learning the parameters and structure of probabilistic logic programs written in these languages have been proposed, such as PRISM [32], LFI-ProbLog [12] and EMBLEM [4] (for parameter learning), and Sem-CP-Logic [20], CLP(BN) [8] and SLIPCOVER [5] (for structure learning).

In this paper, the key to mining a compact general representation of a collection of significant subgraphs is to use the state-of-the-art structure learning algorithm SLIPCOVER for probabilistic logic programs, which enables us to encode a set of significant subgraphs as a probabilistic logical model written in the language of LPADs. The advantages of building such a symbolic model are:

1. storing a logic program that encodes significant subgraphs is significantly cheaper than storing all subgraphs, which are often large and numerous; a single logical rule can describe many significant subgraphs at once;
2. a first-order logic-based representation of a collection of subgraphs is declarative and comprehensible by humans, and much more expressive than a propositional representation;
3. probability allows the management of uncertainty in complex application domains (such as the biological ones).

This paper provides the first application of Probabilistic Logic Programming to the problem of significant subgraph mining from standard biochemical datasets. The effectiveness of the approach, that we call LIPS for “Learning sIgnificant Plp Subgraphs”, is empirically verified by showing that the precision and recall of the LPADs learnt by SLIPCOVER are better than those of a baseline of probabilistic logic programs with fixed probabilities. Since our objective is not classifying graphs for predictive analysis but finding interpretable summarized representations of significant subgraphs for descriptive analysis, existing graph classification approaches (e.g. [15]) cannot be applied to our task.

The paper is organized as follows. Section 2 provides the necessary background about probabilistic logic programming and subgraph mining. Section 3 introduces the proposed approach based on PLP. Section 4 experimentally evaluates the method. Section 5 presents related work and Section 6 concludes the paper.

2. Background

2.1 Probabilistic Logic Programming

We assume that the reader is familiar with basic notions of First-Order Logic (FOL).

In this paper we rely on Probabilistic Logic Programming under the distribution semantics [30] for representing uncertain relational information. We consider Logic Programs with Annotated Disjunctions (LPADs) for their general syntax and we do not allow function symbols; for the treatment of function symbols see [29]. LPADs [37] allow one to encode “alternatives” in the head of clauses in the form of a disjunction, in which each atom is annotated with a probability.

These programs consist of a finite set of annotated disjunctive clauses $C_{i}$ of the form:

$\displaystyle h_{i1}:\Pi_{i1};\ldots;h_{in_{i}}:\Pi_{in_{i}}\ \texttt{:-}\ b_{i1},\ldots,b_{im_{i}}.$

Here, $b_{i1},\ldots,b_{im_{i}}$ are logical literals which form the body of $C_{i}$ , denoted by $\textit{body}(C_{i})$ , while $h_{i1},\ldots h_{in_{i}}$ are logical atoms and $\{\Pi_{i1},\ldots,\Pi_{in_{i}}\}$ are real numbers in the interval [0,1] such that $\sum_{k=1}^{n_{i}}\Pi_{ik}\leqslant 1$ . Note that if $n_{i}=1$ and $\Pi_{i1}=1$ the clause corresponds to a non-disjunctive clause. Otherwise, if $\sum_{k=1}^{n_{i}}\Pi_{ik}<1$ , the head of the annotated disjunctive clause implicitly contains an extra atom null that does not appear in the body of any clause and whose annotation is $1-\sum_{k=1}^{n_{i}}\Pi_{ik}$ . The grounding of an LPAD $\mathcal{L}$ is denoted by $\textit{ground}(\mathcal{L})$ .

An atomic choice is a triple $(C_{i},\theta_{j},k)$ where $C_{i}\in\mathcal{L}$ , $\theta_{j}$ is a substitution that grounds $C_{i}$ and $k\in\{1,\ldots,n_{i}\}$ identifies a head atom of $C_{i}$ . In other words, it represents the selection of the $k$ -th atom from the head of the ground clause $C_{i}\theta_{j}$ . It corresponds to an assignment $X_{ij}=k$ , where $X_{ij}$ is a multi-valued random variable which corresponds to $C_{i}\theta_{j}$ . A set of atomic choices $\kappa$ is consistent if only one head is selected from a ground clause. In this case it is called a composite choice. The probability $P(\kappa)$ of a composite choice $\kappa$ is computed by multiplying the probabilities of the individual atomic choices, i.e. $P(\kappa)=\prod_{(C_{i},\theta_{j},k)\in\kappa}\Pi_{ik}$ . A selection $\sigma$ is a composite choice that, for each clause $C_{i}\theta_{j}$ in $\textit{ground}(\mathcal{L})$ , contains an atomic choice $(C_{i},\theta_{j},k)$ . It identifies a world $w_{\sigma}$ of $\mathcal{L}$ , i.e., a normal logic program defined as $w_{\sigma}=\{(h_{ik}\leftarrow\textit{body}(C_{i}))\theta_{j}\mid(C_{i},\theta_{j},k)\in\sigma\}$ . Since selections are composite choices, the probability of the worlds is $P(w_{\sigma})=P(\sigma)$ . We denote by $S_{\mathcal{L}}$ the set of all selections and by $W_{\mathcal{L}}$ the set of all worlds of a program $\mathcal{L}$ .

We consider only sound LPADs, where each possible world has a total well-founded model, so $w_{\sigma}\models Q$ means that the query $Q$ is true in the well-founded model of the program $w_{\sigma}$ . The probability of a query $Q$ given a world $w$ is $P(Q|w)=1$ if $w\models Q$ and 0 otherwise. The probability of $Q$ is then:

$\displaystyle P(Q)=\sum_{w\in W_{\mathcal{L}}}P(Q,w)=\sum_{w\in W_{\mathcal{L}}}P(Q|w)P(w)=\sum_{w\in W_{\mathcal{L}}:w\models Q}P(w).$ (1)

Example 1. The following LPAD $\mathcal{L}$ encodes the development of an epidemic or pandemic:

$\displaystyle C_{1}=\text{epidemic:0.6; pandemic:0.3 :- flu(X), cold}.$
$\displaystyle C_{2}=\text{cold:0.7}.$
$\displaystyle C_{3}=\text{flu(david)}.$
$\displaystyle C_{4}=\text{flu(robert)}.$

This LPAD models the fact that if somebody has the flu and the climate is cold, an epidemic arises with probability 0.6, a pandemic arises with probability 0.3, and no event happens (the implicit atom $\mathit{null}$ ) with probability 0.1. There is uncertainty about the climate: it may be cold with probability 0.7, but David and Robert certainly have the flu.

Clause $C_{1}$ has two groundings, $C_{1}\theta_{1}$ with $\theta_{1}=\{\textit{X/david}\}$ and $C_{1}\theta_{2}$ with $\theta_{2}=\{\textit{X/robert}\}$ so there are two random variables $X_{11}$ and $X_{12}$ . $\mathcal{L}$ has 18 possible worlds, the query $Q=\textit{epidemic}$ is true in 5 of them and its probability is obtained as $\textit{P(epidemic)}=$ 0.6 $\cdot$ 0.6 $\cdot$ 0.7 $+$ 0.6 $\cdot$ 0.3 $\cdot$ 0.7 $+$ 0.6 $\cdot$ 0.1 $\cdot$ 0.7 $+$ 0.3 $\cdot$ 0.6 $\cdot$ 0.7 $+$ 0.1 $\cdot$ 0.6 $\cdot$ 0.7 $=$ 0.588.
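The computation above can be reproduced by brute-force enumeration of the worlds. The following Python sketch is only illustrative: it hard-codes the three ground probabilistic clauses of this example and is not the BDD-based inference used in practice; the names `C1_HEAD`, `C2_HEAD` and `epidemic_probability` are introduced here for illustration.

```python
from itertools import product

# Head alternatives of each ground clause as (atom, probability) pairs;
# "null" is the implicit atom carrying the leftover probability mass.
C1_HEAD = [("epidemic", 0.6), ("pandemic", 0.3), ("null", 0.1)]
C2_HEAD = [("cold", 0.7), ("null", 0.3)]


def epidemic_probability():
    """Enumerate all selections and sum P(w) over worlds where epidemic holds."""
    n_worlds = 0
    n_true = 0
    prob = 0.0
    # One atomic choice for C1{X/david}, one for C1{X/robert}, one for C2.
    for (h_david, p_david), (h_robert, p_robert), (h_c2, p_c2) in product(
        C1_HEAD, C1_HEAD, C2_HEAD
    ):
        n_worlds += 1
        cold = h_c2 == "cold"
        # flu(david) and flu(robert) are certain facts, so each grounding
        # of C1 fires as soon as cold is true in the world.
        if cold and "epidemic" in (h_david, h_robert):
            n_true += 1
            prob += p_david * p_robert * p_c2
    return n_worlds, n_true, prob


n_worlds, n_true, p_epidemic = epidemic_probability()
print(n_worlds, n_true, round(p_epidemic, 3))
```

Enumeration is exponential in the number of ground probabilistic clauses, so it is only viable for toy programs such as this one.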

Figure 1.

The problem setting of significant subgraph mining.

The semantics associates one random variable with every grounding of a clause. In some domains, this may result in too many random variables, so we may introduce an approximation at the level of the instantiations, at the expense of accuracy in modeling the domain. A typical compromise is to consider the grounding of variables in the head only: in this way, a ground atom entailed by two separate ground instances of a clause is assigned the same probability, all other things being equal, as a ground atom entailed by a single ground clause, while in the standard semantics the former would have a larger probability as more evidence is available for its entailment. In the approximate semantics clause $C_{1}$ of Example 1 is associated with a single random variable $X_{1}$ . In this case $\mathcal{L}$ has 6 worlds, the query epidemic is true in 1 of them and its probability is $\textit{P(epidemic)}=$ 0.6 $\cdot$ 0.7 $=$ 0.42.

An efficient technique for computing the probability of a query consists of building a Binary Decision Diagram (BDD), representing the disjunction of its explanations, and performing inference over it. Inference can be performed with a dynamic programming algorithm that is linear in the size of the BDD [10]. Algorithms that adopt such an approach for inference include [26, 27, 28]. BDDs can be built in practice by highly efficient software packages such as CUDD.¹

¹ Available at http://vlsi.colorado.edu/~fabio/CUDD/.

2.2 Significant subgraph mining

In significant subgraph mining, each graph is defined as a triple $G=(V,E,\phi)$ composed of the vertex set $V$ , the edge set $E\subseteq V\times V$ , and the label mapping $\phi:V\cup E\to\Sigma$ with the range $\Sigma$ of vertex and edge labels. A graph $H=(V^{\prime},E^{\prime},\phi^{\prime})$ is a subgraph of $G$ , denoted by $H\sqsubseteq G$ , if $V^{\prime}\subseteq V$ , $E^{\prime}\subseteq(V^{\prime}\times V^{\prime})\cap E$ , and $\phi^{\prime}(A)=\phi(A)$ for all $A\in V^{\prime}\cup E^{\prime}$ are satisfied. Given two collections of graphs $\mathcal{G}$ and $\mathcal{G}^{\prime}$ with $|\mathcal{G}|=n$ and $|\mathcal{G}^{\prime}|=n^{\prime}$ , we assume $n\leqslant n^{\prime}$ without loss of generality. $\mathcal{G}$ and $\mathcal{G}^{\prime}$ represent two different classes of graphs that we want to distinguish. In Fig. 1, we have four graphs in $\mathcal{G}$ and also four graphs in $\mathcal{G}^{\prime}$ , and colors of vertices and line types of edges denote their labels.

For each subgraph $H\sqsubseteq G$ with $G\in\mathcal{G}\mathop{\cup}\mathcal{G}^{\prime}$ , we test the statistical association between the occurrence of $H$ and the class membership, where the null hypothesis is that the occurrence of the subgraph $H$ is independent of the class membership of $G$ . More precisely, we measure the statistical association between two binary random variables: the indicator vector of the class membership of graph $G$ and the occurrence/absence of the subgraph $H$ in each graph $G$ in the $\mathcal{G}$ and $\mathcal{G}^{\prime}$ databases.

Let $x$ and $x^{\prime}$ be the frequencies of $H$ in $\mathcal{G}$ and $\mathcal{G}^{\prime}$ , respectively. That is,

$\displaystyle x=|\{G\in\mathcal{G}\mid H\sqsubseteq G\}|,\quad x^{\prime}=|\{G\in\mathcal{G}^{\prime}\mid H\sqsubseteq G\}|.$

Then the occurrence of $H$ can be represented as the following 2 $\times$ 2 contingency table:

Class	Occurrences	Non-occurrences	Total
$\mathcal{G}$	$x$	$n-x$	$n$
$\mathcal{G}^{\prime}$	$x^{\prime}$	$n^{\prime}-x^{\prime}$	$n^{\prime}$
Total	$x+x^{\prime}$	$(n-x)+(n^{\prime}-x^{\prime})$	$n+n^{\prime}$

For example, for the subgraph shown in Fig. 1, $x=$ 4, $x^{\prime}=$ 0, and $n=n^{\prime}=$ 4. The association between the two binary random variables is measured by the $p$ -value of Fisher’s exact test, which is the probability of observing an association at least as strong as the observed one assuming that the null hypothesis is true, that is, that occurrences of the subgraph and classes are statistically independent. A false positive occurs if we reject the null hypothesis while it is true. We say that a subgraph $H$ is statistically significant if its $p$ -value is smaller than a predetermined significance level $\alpha$ .
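As an illustration of the test on the contingency table above, the following Python sketch computes a one-sided Fisher $p$ -value from the hypergeometric distribution. The choice of the one-sided alternative (enrichment in $\mathcal{G}$ ) is an assumption made here for simplicity, and the function name `fisher_one_sided` is hypothetical.

```python
from math import comb


def fisher_one_sided(x, x_prime, n, n_prime):
    """One-sided Fisher p-value: P(X >= x) with all table margins fixed.

    x, x_prime: occurrences of H in the two classes; n, n_prime: class sizes.
    """
    total = x + x_prime                  # column margin: total occurrences of H
    denom = comb(n + n_prime, total)     # ways to place the occurrences
    upper = min(n, total)                # largest count attainable in class G
    # math.comb(a, b) returns 0 when b > a, so impossible tables add nothing.
    tail = sum(comb(n, i) * comb(n_prime, total - i) for i in range(x, upper + 1))
    return tail / denom


# Subgraph of Fig. 1: x = 4, x' = 0, n = n' = 4.
p_value = fisher_one_sided(4, 0, 4, 4)
print(p_value)
```

For the subgraph of Fig. 1 only the observed table is at least as extreme as itself, so the value equals $1/\binom{8}{4}=1/70$ .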

The technique of significant subgraph mining proposed in [19, 33] finds all subgraphs that are statistically significantly associated with the class variable, that is, for which the above null hypothesis is rejected, while correcting for multiple testing to ensure rigorous control of the FWER (family-wise error rate). The FWER is the probability that at least one subgraph is a false positive in the set of all subgraphs (hypotheses). Since the FWER approaches one as the number of tested subgraphs grows, even if the false positive rate of each individual subgraph is controlled at the significance level $\alpha$ , the significant subgraph mining technique performs multiple testing correction by decreasing the per-test significance level so that the condition $\mathrm{FWER}\leqslant\alpha$ is guaranteed.
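The growth of the FWER with the number of hypotheses can be illustrated numerically. The sketch below assumes, purely for illustration, that the tests are independent (which does not hold for overlapping subgraphs); the function name `fwer_independent` is introduced here and does not come from the cited works.

```python
def fwer_independent(per_test_level, num_tests):
    """P(at least one false positive) for independent tests under their nulls."""
    return 1.0 - (1.0 - per_test_level) ** num_tests


alpha = 0.05
for m in (1, 10, 100, 1000):
    uncorrected = fwer_independent(alpha, m)       # each test run at level alpha
    corrected = fwer_independent(alpha / m, m)     # each test run at level alpha / m
    print(m, round(uncorrected, 4), round(corrected, 4))
```

With the per-test level divided by the number of tests, the FWER stays below $\alpha$ , which is the effect that the correction of the significance level aims for.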

3. The proposed method: LIPS

We provide here a detailed description of our proposed method, called LIPS (Learning sIgnificant Plp Subgraphs), in three subsequent steps, which summarize a given set of significant subgraphs as a probabilistic logic program.

3.1 First-order logic representation of subgraphs

Here we introduce the first step, which provides a FOL representation of a given set of subgraphs.

Interestingly, the significant subgraph mining technique [33] always produces the set of testable subgraphs as an intermediate result for the FWER control. Mathematically, a testable subgraph is defined as follows. The tight lower bound $\psi(H)$ of the $p$ -value for a subgraph $H$ is given as

$\displaystyle\psi(H)=\left\{\begin{array}{ll}\binom{n}{x}\bigg/\binom{n+n^{\prime}}{x}&\quad\text{if }0\leqslant x+x^{\prime}\leqslant n,\\ 1\bigg/\binom{n+n^{\prime}}{x}&\quad\text{otherwise}.\end{array}\right.$

This bound was first considered in [35] and used in [33]. Assume that the set of subgraphs is sorted in increasing order according to the lower bound $\psi$ , resulting in the sequence

$\displaystyle\psi(H_{1})\leqslant\psi(H_{2})\leqslant\psi(H_{3})\leqslant\dots.$

Let $k$ be the natural number such that

$\displaystyle k\cdot\psi(H_{k})<\alpha\quad\text{and}\quad(k+1)\cdot\psi(H_{k+1})\geqslant\alpha.$

The subgraphs $H_{1},H_{2},\ldots,H_{k}$ are defined to be testable subgraphs and each testable subgraph $H_{i}$ , $i\in\{1,2,\dots,k\}$ is statistically significant if its actual $p$ -value is smaller than $\alpha/k$ [34].
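The selection of $k$ can be sketched in a few lines of Python. The sketch assumes that the lower bounds $\psi(H)$ have already been computed for all candidate subgraphs and are strictly positive; the function name `count_testable` and the numeric values are hypothetical and serve only as an illustration, not as the implementation of [33].

```python
def count_testable(psi_values, alpha):
    """Return k with k * psi(H_k) < alpha and (k + 1) * psi(H_{k+1}) >= alpha.

    psi_values: lower bounds psi(H) of the candidate subgraphs, in any order.
    Returns 0 if not even the first subgraph satisfies the condition.
    """
    k = 0
    for rank, psi in enumerate(sorted(psi_values), start=1):
        # rank * psi is increasing along the sorted sequence, so the first
        # violation marks the end of the testable prefix.
        if rank * psi < alpha:
            k = rank
        else:
            break
    return k


# Hypothetical lower bounds for five candidate subgraphs.
psis = [0.02, 0.001, 0.5, 0.002, 0.01]
alpha = 0.05
k = count_testable(psis, alpha)
corrected_level = alpha / k if k > 0 else None
print(k, corrected_level)
```

In this toy input the first three subgraphs are testable, and each of them would be declared significant if its actual $p$ -value were below $\alpha/3$ .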

Given two sets of graphs $\mathcal{G}$ and $\mathcal{G}^{\prime}$ , let $\mathcal{T}$ and $\mathcal{S}$ be the sets of testable subgraphs and significant subgraphs respectively, produced by significant subgraph mining. Since we always have the relationship $\mathcal{S}\subseteq\mathcal{T}$ , we formulate the problem of summarization as discriminating positive instances (significant subgraphs) $\mathcal{S}$ from negative instances (testable but non-significant subgraphs) $\mathcal{T}\setminus\mathcal{S}$ .

Our method takes as input a set of testable subgraphs, which includes positive and negative instances, and transforms them into a set of corresponding logical interpretations (sets of ground facts). Let $G=(V,E,\phi)\in\mathcal{T}$ be a graph such that $V=\{v_{1},v_{2},\dots,v_{m}\}$ , $E=\{\{v_{i_{1}},v_{j_{1}}\},\{v_{i_{2}},v_{j_{2}}\},\dots,\{v_{i_{l}},v_{j_{l}}\}\}$ , and $\phi:V\cup E\to\Sigma$ assigns labels to vertices and edges with the range $\Sigma$ . Its logical representation is defined as:

$\displaystyle\mathtt{node}(v_{1},\phi(v_{1})).$
$\displaystyle\mathtt{node}(v_{2},\phi(v_{2})).$
$\displaystyle\dots$
$\displaystyle\mathtt{node}(v_{m},\phi(v_{m})).$
$\displaystyle\mathtt{edge}(v_{i_{1}},v_{j_{1}},\phi(\{v_{i_{1}},v_{j_{1}}\})).$
$\displaystyle\mathtt{edge}(v_{i_{2}},v_{j_{2}},\phi(\{v_{i_{2}},v_{j_{2}}\})).$
$\displaystyle\dots$
$\displaystyle\mathtt{edge}(v_{i_{l}},v_{j_{l}},\phi(\{v_{i_{l}},v_{j_{l}}\})).$

In the above representation, node/2 describes a node by means of the node’s id and the node’s label (node(id,label)), while edge/3 describes an edge by means of the two nodes linked by the edge and the edge’s label (edge(id1,id2,label)).

For example, if $G=(V,E,\phi)$ is given as $V=$ {0, 1, 2, 3, 4}, $E=$ {{0, 1}, {1, 2}, {2, 3}, {3, 4}}, node labels are given as $\phi(0)=$ 3, $\phi(1)=$ 3, $\phi(2)=$ 3, $\phi(3)=$ 3, $\phi(4)=$ 6, and edge labels as $\phi(\{0,1\})=$ 47, $\phi(\{1,2\})=$ 47, $\phi(\{2,3\})=$ 47, $\phi(\{3,4\})=$ 50, the resulting logical interpretation for $G$ is:

$\displaystyle\mathtt{node(0,3).}$
$\displaystyle\mathtt{node(1,3).}$
$\displaystyle\mathtt{node(2,3).}$
$\displaystyle\mathtt{node(3,3).}$
$\displaystyle\mathtt{node(4,6).}$
$\displaystyle\mathtt{edge(0,1,47).}$
$\displaystyle\mathtt{edge(1,2,47).}$
$\displaystyle\mathtt{edge(2,3,47).}$
$\displaystyle\mathtt{edge(3,4,50).}$

Finally, an additional predicate active/0 is used to discriminate between positive and negative instances: the fact active. is added to the logical interpretation of each positive instance, and the fact neg(active). to that of each negative instance. It is clear that each graph $G$ and its logical representation are in a one-to-one relationship.
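The transformation from a labeled subgraph to its logical interpretation is mechanical and can be sketched as follows. This is a minimal illustration and not the code used in our experiments; the function name `graph_to_interpretation` and the dictionary-based input format are assumptions introduced here.

```python
def graph_to_interpretation(node_labels, edge_labels, significant):
    """Turn a labeled subgraph into a list of ground facts (as strings).

    node_labels: dict mapping node id -> node label.
    edge_labels: dict mapping (id1, id2) -> edge label.
    significant: True for a positive instance, False for a negative one.
    """
    facts = [f"node({v},{label})." for v, label in sorted(node_labels.items())]
    facts += [
        f"edge({u},{v},{label})." for (u, v), label in sorted(edge_labels.items())
    ]
    # Target predicate: positive instances get active, negatives neg(active).
    facts.append("active." if significant else "neg(active).")
    return facts


# The example graph of this subsection, treated here as a positive instance.
nodes = {0: 3, 1: 3, 2: 3, 3: 3, 4: 6}
edges = {(0, 1): 47, (1, 2): 47, (2, 3): 47, (3, 4): 50}
interpretation = graph_to_interpretation(nodes, edges, significant=True)
print("\n".join(interpretation))
```

Whether the example graph is actually significant is not stated in the text; the flag is set to `True` only to show the form of the added fact.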

3.2 Learning a representation of significant subgraphs by probabilistic logic programming

Given the set of significant subgraphs $\mathcal{S}$ and the set of testable subgraphs $\mathcal{T}$ , LIPS learns rules that can probabilistically discriminate significant subgraphs $\mathcal{S}$ from non-significant subgraphs $\mathcal{T}\setminus\mathcal{S}$ by means of a compact probabilistic logic program (in particular, an LPAD). In other words, the problem of summarizing significant subgraphs is converted into a structure learning problem for probabilistic logic clauses that impose constraints on the labels and connection structure of the original subgraphs. We illustrate an overview of LIPS in Fig. 2. To learn LPADs we employ the SLIPCOVER algorithm, which is a state-of-the-art learning algorithm and has been successfully applied in various relational domains [5]. The reason for this choice is that graph-structured datasets characterized by links between nodes are inherently relational. We briefly summarize SLIPCOVER in the following and give a detailed execution example in the next subsection for a better understanding of its behavior.

Figure 2.

Overview of the proposed method LIPS.

The input consists of a set of logical interpretations $I$ , i.e. sets of ground facts as seen in Subsection 3.1, corresponding to the testable subgraphs which need to be discriminated. The algorithm is targeted at discriminative learning, that is, the user has to indicate which predicates of the domain are targets, i.e. the ones for which we are interested in good predictions. The interpretations must also contain negative facts for target predicates. The ground atoms for the target predicates represent the positive and negative examples (queries) for which Binary Decision Diagrams will be built, encoding the disjunction of their explanations.

SLIPCOVER is built upon a two-phase search strategy: (1) a beam search in the space of clauses in order to find a set of promising clauses and (2) a greedy search in the space of theories. In the first phase the beam for each target predicate is initialized with a number of bottom clauses built as in Progol [21], which are repeatedly refined according to a “language bias”. The second phase starts from an empty theory which is iteratively extended with one target clause at a time from those generated in the previous phase. If the log-likelihood (LL) of the new theory does not increase, SLIPCOVER discards the clause, otherwise it adds it to the current theory.

BDDs are employed to efficiently perform the parameter learning phase of the LPAD, i.e. to compute the optimum probabilities for the clauses’ heads. This is done by the algorithm EMBLEM [4], based on an Expectation Maximization (EM) cycle. Both parameter and structure learning use the log-likelihood of the data as the guiding heuristic to find the best parameters and the best theory. This guarantees that the final LPAD returned by SLIPCOVER locally maximizes the (log-)likelihood with respect to the set of positive and negative examples (subgraphs) for the target predicate(s). LIPS is shown in Algorithm 1, while a simplified version of SLIPCOVER (relevant for the understanding of the method) is shown in Algorithm 2.

Algorithm 1. Function LIPS

function LIPS( $\mathcal{G},\mathcal{G}^{\prime},\textit{target}$ )
    $(\mathcal{T},\mathcal{S})=$ MineSignificantSubgraphs( $\mathcal{G},\mathcal{G}^{\prime}$ )
    $\mathcal{L}=$ Slipcover( $I,\textit{target}$ )    ▷ Input interpretations $I=\mathcal{T}\cup\mathcal{S}$ ; $\mathcal{L}$ : learned LPAD
    return $\mathcal{L}$
end function

Algorithm 2. Function SLIPCOVER

function SLIPCOVER( $I,\textit{target}$ )
    $\textit{IBs}=$ InitialBeams( $I,\textit{target}$ )    ▷ Beam search returns a set of beams, one for each target predicate
    $\textit{TC}\leftarrow[]$    ▷ TC: list of promising clauses with target predicate in the head
    for all $\textit{Beam}\in\textit{IBs}$ do
        $\textit{Steps}\leftarrow 1$
        $\textit{NewBeam}\leftarrow[]$
        repeat
            while Beam is not empty do
                Remove the first BC from Beam    ▷ BC: Bottom Clause
                $\textit{Refs}\leftarrow$ ClauseRefinements(BC)    ▷ Find all refinements Refs of BC
                for all $\textit{Cl}\in\textit{Refs}$ do    ▷ Cl: refined clause
                    $(\textit{LL'},\textit{Cl'})\leftarrow$ EMBLEM(Cl)    ▷ Parameter learning: updates Cl’s parameters and computes the log-likelihood LL’ of the new clause Cl’
                    $\textit{NewBeam}\leftarrow$ Insert(Cl’, LL’, NewBeam)
                    $\textit{TC}\leftarrow$ Insert(Cl’, LL’, TC)
                end for
            end while
            $\textit{Beam}\leftarrow\textit{NewBeam}$
            $\textit{Steps}\leftarrow\textit{Steps}+1$
        until $\textit{Steps}>NI$    ▷ NI: max number of iterations
    end for
    $\mathcal{L}\leftarrow\emptyset$ , $\textit{LL}_{\mathcal{L}}\leftarrow-\infty$    ▷ Greedy search: initial LPAD empty
    repeat
        Remove the first couple $(\textit{Cl},\textit{LL})$ from TC
        $(\textit{LL'},\mathcal{L}^{\prime})\leftarrow$ EMBLEM( $\mathcal{L}\cup\{Cl\}$ )
        if $\textit{LL'}>\textit{LL}_{\mathcal{L}}$ then
            $\mathcal{L}\leftarrow\mathcal{L}^{\prime}$ , $\textit{LL}_{\mathcal{L}}\leftarrow\textit{LL'}$
        end if
    until TC is empty
    return $\mathcal{L}$
end function

3.3 Execution example

The analyzed domains are graphs of chemical compounds, with nodes labeled according to the atom type (predicate $\mathtt{node/2}$ ) and edges that represent the bonds (predicate $\mathtt{edge/3}$ ). Significant subgraph mining produces testable subgraphs including significant ones from a given set of graphs, and our method summarizes them. The target predicate is $\mathtt{active/0}$ , where significant subgraphs are encoded as active and testable but not significant subgraphs are encoded as non-active. A language bias is given to specify modeh/modeb declarations for building the bottom clauses and their refinements. Such declarations are templates for literals in the head or body of a clause [21]. For all the considered domains the language bias has been defined as follows.

$\displaystyle\mathtt{output(active/0).}$
$\displaystyle\mathtt{input(node/2).}$
$\displaystyle\mathtt{input(edge/3).}$
$\displaystyle\mathtt{modeh(1,active).}$
$\displaystyle\mathtt{modeb(*,node(-node,-label)).}$
$\displaystyle\mathtt{modeb(*,edge(+node,-node,-label)).}$
$\displaystyle\mathtt{modeb(*,edge(-node,+node,-label)).}$
$\displaystyle\mathtt{modeb(*,edge(+node,+node,-label)).}$

output indicates the target predicate, while input indicates the other predicates of the domain, together with their arity. The modeh declaration indicates that at most 1 occurrence of $\mathtt{active}$ can be used in the clauses’ heads. The modeb declarations indicate that any number of occurrences (*) of the $\mathtt{node}$ and $\mathtt{edge}$ predicates can be used to build the clauses’ bodies. Each different graph is represented by a different input FOL interpretation.

The output of SLIPCOVER consists of LPAD theories whose clauses predict the target predicate with a given probability $\Pi_{i1}$ , as a function of a specific configuration of typed nodes connected by typed edges. $\Pi_{i1}$ indicates the probability of the first head atom (the only one, since we specified modeh(1,active)) for each clause $i$ . An example of LPAD (composed of a single clause) returned by the algorithm on one of the domains (MUTAG_5) is:

$\displaystyle\mathtt{active:0.0411023:-node(A,B),edge(A,C,D),node(C,B),edge(C,E,D),edge(C,F,D),}$
$\displaystyle\quad\mathtt{node(E,B),node(F,B).}$

The clause states that a subgraph in which nodes A, C, E, F are of type B, and in which there are direct edges of type D between nodes A and C, C and E, and C and F, is significant (active), as illustrated in Fig. 3. The figure shows how several subgraphs can be compactly represented by a single clause. The value $\Pi_{11}=$ 0.0411023 represents the probability with which each grounding satisfying the clause body (i.e., an explanation) makes the subgraph active. This means that the more explanations there are, the higher the probability of describing a significant subgraph pattern, according to the formula for the probability of the union of compatible events ( $P(A\cup B)=P(A)+P(B)-P(A\cap B)$ for two events A and B). In this example, querying the probability of two positive (active) interpretations in the test set yields $P=$ 0.2226 for both, while two negative interpretations (labelled as neg(active)) in the test set yield probabilities $P=$ 0.0411 and 0.0805, which are much smaller than 0.2226. The clause has thus assigned larger probabilities to the positive instances (significant subgraphs).
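For a single-clause LPAD, the effect of multiple explanations can be made concrete: if the clause body has $k$ independent groundings, each making the head true with probability $\Pi_{11}$ , the union formula generalizes to $1-(1-\Pi_{11})^{k}$ . The Python sketch below evaluates this expression; the function name `prob_active` is hypothetical, and the values of $k$ are chosen here only because they reproduce the reported probabilities, since the number of explanations of each test interpretation is not stated in the text.

```python
def prob_active(head_probability, num_explanations):
    """Probability that at least one of k independent explanations succeeds."""
    return 1.0 - (1.0 - head_probability) ** num_explanations


PI_11 = 0.0411023  # head probability of the clause learned on MUTAG_5

for k in (1, 2, 6):
    print(k, round(prob_active(PI_11, k), 4))
```

Under this reading, the probabilities 0.0411, 0.0805 and 0.2226 coincide with one, two and six explanations, respectively.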

4. Experimental validation

In this section we empirically evaluate the performance of LIPS on standard graph datasets. These datasets include: MUTAG, NCI1 and NCI109.²

² These datasets are available at https://www.bsse.ethz.ch/mlcb/research/machine-learning/graph-kernels/weisfeiler-lehman-graph-kernels.html.

The MUTAG dataset consists of graphs representing 188 chemical compounds, and the task is to predict whether each compound shows mutagenicity. The NCI1 and NCI109 datasets consist of graphs representing two balanced datasets of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines respectively. For each dataset, we first enumerated the set of testable subgraphs by the significant subgraph mining technique [33].³

³ Implementation is available at https://github.com/BorgwardtLab/significant-subgraph-mining.

We created different versions of testable subgraphs from these datasets with a different maximum number $N\in\{3,5,10\}$ of subgraph nodes. In Table 1, the suffix “_N” following the dataset’s name denotes the maximum number $N$ of nodes. For example, MUTAG_5 means that the number of nodes of every testable subgraph is less than or equal to 5. More information about the characteristics of the datasets is shown in Table 1.

Table 1

Characteristics of graph datasets. #Pos ex denotes the number of positive examples (significant subgraphs), while #Neg ex denotes the number of negative examples (testable but non-significant subgraphs). The number of ground facts (column 4) does not include facts for the target predicate. #Graphs is the total number of graphs in each original dataset.

Datasets	#Pos ex	#Neg ex	#Ground facts	#Graphs
MUTAG_5	8	221	2121	188
MUTAG_10	1054	2277	67065	188
NCI1_3	121	254	2338	4110
NCI1_10	83,300	133,606	4,620,649	4110
NCI109_3	118	250	2293	4127

Figure 3.

The left graph corresponds to $\mathtt{node(A,B),edge(A,C,D),node(C,B),edge(C,E,D),edge(C,F,D),node(E,B)}$ , $\mathtt{node(F,B)}$ . If the domain contains three node labels (white, grey, black) and two edge labels (straight, zigzag) then there could be six possible subgraphs (instances).

For training and testing we employed $k$ -fold cross-validation, with $k$ depending on the dataset.

In order to verify the effectiveness of the approach, we tested the same LPADs learnt by SLIPCOVER where parameters were replaced with fixed and randomly chosen values, i.e. we tested the same clauses with non-optimized parameters on the datasets. All experiments were performed on GNU/Linux machines with an Intel Xeon Haswell E5-2630 v3 (2.40 GHz) CPU with 128 GB RAM.

4.1 Learning

In order to reach a compromise between accuracy (performance) and learning time, we tuned the following SLIPCOVER settings: the type of semantics (standard or simplified), the limit on the number of different solutions retrieved when computing the probability of a query, the maximum number of theory search iterations, the maximum number of clause search iterations, the size of the beam, and the maximum number of variables in a rule.

For MUTAG_10 and NCI1_10, we employed a 10-fold cross-validation due to the large number of positive and negative examples. For NCI1_3 and NCI109_3 we employed a 5-fold cross-validation. For MUTAG_5 we employed a 4-fold cross-validation due to the presence of only 8 positive examples. This information is reported in Table 2.

As an example, we report below the output of the learning algorithm on one fold of the MUTAG_10 dataset, an LPAD composed of 3 probabilistic logical clauses:

active:0.00245023 :-
  node(A,B),edge(C,A,D),node(E,B),edge(A,F,D),
  node(G,B),edge(A,E,D),edge(G,H,D),edge(H,I,D),
  node(H,B),edge(I,J,D),node(J,B),node(F,B).

active:0.00372679 :-
  node(A,B),edge(C,A,D),node(E,B),edge(A,F,D),
  node(G,B),edge(A,E,D),edge(G,H,D),edge(H,I,D),
  node(C,B),edge(I,J,D),node(H,B),node(I,B),
  node(J,B),node(F,B).

active:0.00488592 :-
  node(A,B),edge(C,A,D),node(E,B),edge(A,F,D),
  node(G,B),edge(A,E,D),edge(G,H,D),edge(H,I,D),
  node(C,B),node(H,B),node(I,B).
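To illustrate how such an LPAD assigns a probability to a subgraph, the Python sketch below combines the three clause parameters with a noisy-or. It assumes one independent random choice per clause, so each clause contributes at most once even if its body matches in several ways; the function name `prob_active` and the Boolean encoding of body satisfaction are hypothetical.

```python
from math import prod

# Clause probabilities copied from the learned LPAD above.
CLAUSE_PROBS = [0.00245023, 0.00372679, 0.00488592]


def prob_active(body_holds, probs=CLAUSE_PROBS):
    """Noisy-or over the clauses whose body is satisfied by the subgraph.

    body_holds[i] is True when the body of clause i matches the subgraph.
    Assumes one independent choice per clause (an assumption of this sketch).
    """
    return 1.0 - prod(1.0 - p for p, holds in zip(probs, body_holds) if holds)


# A subgraph that satisfies the bodies of the first and third clauses.
p = prob_active([True, False, True])
```

A subgraph that satisfies no clause body receives probability 0, and every additional satisfied body can only increase the probability of `active`.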

4.2 Test

We computed the Area Under the Precision-Recall Curve (AUC-PR) for the probabilistic logic programs learned by SLIPCOVER and for the programs with the same structure but fixed parameters (“baseline”).

The Precision-Recall Curves have been obtained by collecting the testing examples, together with the probabilities assigned to them in testing by the LPADs, in a single set and then building the curves with the method reported in [9].
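The pooling step can be sketched in Python. Note that this sketch computes average precision, a step-wise approximation of the area under the PR curve, and not the interpolation method of [9]; the function names and the `(probability, is_positive)` encoding are assumptions made for illustration, and ties between probabilities are broken by input order.

```python
def pool_folds(folds):
    """Merge the (probability, is_positive) pairs of all test folds into one list."""
    return [example for fold in folds for example in fold]


def average_precision(scored):
    """Average precision of examples ranked by decreasing probability.

    This is a simple stand-in for AUC-PR, not the method of [9].
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, is_pos in ranked if is_pos)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    true_pos = 0
    total = 0.0
    for rank, (_, is_pos) in enumerate(ranked, start=1):
        if is_pos:
            true_pos += 1
            total += true_pos / rank  # precision at this recall level
    return total / n_pos


# Two toy test folds with probabilities assigned by a model.
folds = [[(0.9, True), (0.2, False)], [(0.8, True), (0.1, False)]]
score = average_precision(pool_folds(folds))
```

Pooling before ranking means that probabilities from different folds are compared on a single scale, which matches the description of collecting all test examples in a single set.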

Table 2 compares LIPS with the baseline. For LIPS it reports the area under the PR curve, the SLIPCOVER learning time, the number of LPAD clauses and the number of body literals per clause, all averaged over the folds; for the baseline it reports the area under the PR curve averaged over the folds. The comparison shows that LIPS learned a more accurate summarization of significant subgraphs than the baseline with fixed parameters on every dataset, and did so in a short time on all datasets except NCI1_10. Even on NCI1_10, which comprises more than 200,000 examples, the AUC-PR obtained exceeds that of the baseline.

As for the number of clauses and body literals of the learned LPADs, Table 2 shows that we obtain a very concise description of significant subgraphs, with fewer than 8 clauses on average in every analyzed domain and with short clauses in most cases.

Tackling the summarization of significant subgraphs with the SLIPCOVER system is possible because the tested domains are relational and because the algorithm learns in a discriminative setting; we exploited this setting to distinguish significant from non-significant subgraphs by means of a target predicate (active) in the first-order logic representation.

Table 2
Results of the experiments comparing LIPS with a baseline of LPADs with fixed parameters for each dataset, in terms of average Area Under the PR Curve (AUC-PR), SLIPCOVER average execution time (in seconds), average number of LPAD clauses and average number of body literals per clause. Column “Folds” specifies the number of folds used for cross-validation

Dataset	Folds	Baseline AUC-PR	LIPS AUC-PR	LIPS Time (s)	LIPS Clauses	LIPS Literals per clause
MUTAG_5	4	0.66	0.82	156.15	1	7.25
MUTAG_10	10	0.61	0.73	0.38	3.9	12.28
NCI1_3	5	0.40	0.41	33.86	5.6	3
NCI1_10	10	0.38	0.48	58889.45	2.3	2
NCI109_3	5	0.34	0.45	35.67	7.2	3.56

5. Related work

Significant pattern mining was first achieved in [35] in the context of itemset mining [2, 3] using Tarone's testability trick [34], and was further developed in [19, 36] by incorporating the Westfall-Young permutation test [39]. Such methods can find all statistically significant patterns in a database while rigorously controlling the FWER. Recently, significant pattern mining has been applied to various types of data, including interval data [17] and datasets with categorical covariates [23]. A software library for significant pattern mining is available [18], as is an efficient parallelized implementation [41].

Sugiyama et al. [33] applied significant pattern mining to graph-structured data using subgraph mining algorithms [22, 40] and established significant subgraph mining. The technique can find all statistically significant subgraphs while controlling the FWER. However, since significant subgraph mining tends to generate a huge number of significant subgraphs, summarizing them for further analysis in applications is a challenging task. This paper addresses the problem using probabilistic logic programming for the first time. The ProbLog language was employed for local query mining in probabilistic biological databases [16], but with a goal different from that of significant subgraph mining.

6. Conclusions

We proposed the first method to find a compact general representation of a collection of significant subgraphs in the form of (probabilistic) logic programs. This was achieved by applying SLIPCOVER, a structure learning algorithm for probabilistic logic programs, to standard graph-based datasets. The key idea is to formulate the summarization of significant subgraphs as the classification of testable subgraphs into significant and non-significant ones, which allows learning algorithms for probabilistic logic programs to be applied. Experiments show that we can massively compress the set of significant subgraphs with reasonably high precision and recall. Since significant pattern summarization is a problem that arises not only in subgraph mining but in pattern mining in general, including itemset mining and sequence mining, our approach of combining significant pattern mining with probabilistic logic programming is relevant to a wider range of applications of significant pattern mining. For instance, the approach might be applied in chemoinformatics, structural biology, and precision medicine. In the future, we plan to apply other structure learning algorithms for logic programs that target large knowledge bases in order to reduce the computational time.

Footnotes

Acknowledgments

This work has been done in the context of a research agreement between the University of Ferrara (Italy) and the National Institute of Informatics (Tokyo, Japan). Mahito Sugiyama has been supported by JSPS KAKENHI Grant Numbers JP16K16115, JP16H02870, and JST, PRESTO Grant Number JPMJPR1855, Japan. Elena Bellodi has been partially supported by the Italian National Group of Computing Science (GNCS-INDAM).

References

1. Aggarwal C.C. and Han, editors. Frequent Pattern Mining, Springer, 2014.

2. Agrawal, Imieliński and Swami, Mining association rules between sets of items in large databases, In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 1993, pp. 207–216.

3. Agrawal and Srikant, Fast algorithms for mining association rules, In Proceedings of the 20th International Conference on Very Large Data Bases, 1994, pp. 487–499.

4. Bellodi and Riguzzi, Expectation Maximization over Binary Decision Diagrams for probabilistic logic programs, Intelligent Data Analysis 17(2) (2013), 343–363.

5. Bellodi and Riguzzi, Structure learning of probabilistic logic programs by searching the clause space, Theory and Practice of Logic Programming 15(2) (2015), 169–212.

6. Berman H.M., Westbrook, Feng, Gilliland, Bhat T.N., Weissig, Shindyalov I.N. and Bourne P.E., The protein data bank, Nucleic Acids Research 28 (2000), 235–242.

7. Bolton E.E., Wang, Thiessen P.A. and Bryant S.H., PubChem: integrated platform of small molecules and biological activities, Annual Reports in Computational Chemistry 4 (2008), 217–241.

8. Costa V.S., Page, Qazi and Cussens, CLP(BN): constraint logic programming for probabilistic knowledge, CoRR, abs/1212.2519, 2012.

9. Davis and Goadrich, The relationship between precision-recall and ROC curves, In Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 233–240.

10. De Raedt, Kimmig and Toivonen, ProbLog: A probabilistic Prolog and its application in link discovery, In Proceedings of the 20th International Joint Conference on Artificial Intelligence, volume 7, 2007, pp. 2462–2467.

11. Finn, Muggleton, Page and Srinivasan, Pharmacophore discovery using the inductive logic programming system progol, Machine Learning 30(2) (1998), 241–270.

12. Gutmann, Thon and De Raedt, Learning the parameters of probabilistic logic programs from interpretations, In Gunopulos, Hofmann, Malerba and Vazirgiannis, editors, European Conference on Machine Learning and Knowledge Discovery in Databases, volume 6911 of LNCS, Springer, 2011, pp. 581–596.

13. Han, Pei and Yin, Mining frequent patterns without candidate generation, In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, pp. 1–12.

14. Inokuchi, Washio and Motoda, An apriori-based algorithm for mining frequent substructures from graph data, In Principles of Data Mining and Knowledge Discovery, volume 1910 of LNCS, Springer, 2000, pp. 13–23.

15. Jin, Young and Wang, Graph classification based on pattern co-occurrence, In Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009, pp. 573–582.

16. Kimmig and De Raedt, Local query mining in a probabilistic Prolog, In Proceedings of the 21st International Joint Conference on Artificial Intelligence, volume 2, 2009, pp. 1095–1100.

17. Llinares-López, Grimm D.G., Bodenham D.A., Gieraths, Sugiyama, Rowan and Borgwardt K.M., Genome-wide detection of intervals of genetic heterogeneity associated with complex traits, Bioinformatics 31(12) (2015), i240–i249.

18. Llinares-López, Papaxanthos, Roqueiro, Bodenham and Borgwardt, CASMAP: detection of statistically significant combinations of SNPs in association mapping, Bioinformatics 12 (2018).

19. Llinares-López, Sugiyama, Papaxanthos and Borgwardt K.M., Fast and memory-efficient significant pattern mining via permutation testing, In Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2015, pp. 725–734.

20. Meert, Struyf and Blockeel, Learning ground CP-Logic theories by leveraging Bayesian network learning techniques, Fundamenta Informaticae 89(1) (2008), 131–160.

21. Muggleton, Inverse entailment and Progol, New Generation Computing 13 (1995), 245–286.

22. Nijssen and Kok J.N., A quickstart in frequent structure mining can make a difference, In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 647–652.

23. Papaxanthos, Llinares-López, Bodenham and Borgwardt K.M., Finding significant combinations of features in the presence of categorical covariates, In Advances in Neural Information Processing Systems, volume 29, 2016, pp. 2271–2279.

24. Pei, Han, Mortazavi-Asl, Pinto, Chen, Dayal and Hsu M.-C., PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth, In Proceedings of the 17th International Conference on Data Engineering, 2001, pp. 215–224.

25. Poole, The Independent Choice Logic and beyond, In De Raedt, Frasconi, Kersting and Muggleton, editors, Probabilistic Inductive Logic Programming, volume 4911 of LNCS, Springer Berlin Heidelberg, 2008, pp. 222–243.

26. Riguzzi, Speeding up inference for probabilistic logic programs, The Computer Journal 57(3) (2014), 347–363.

27. Riguzzi and Swift, Tabling and answer subsumption for reasoning on logic programs with annotated disjunctions, In International Conference on Logic Programming, volume 7 of LIPIcs, 2010, pp. 162–171.

28. Riguzzi and Swift, The PITA system: Tabling and answer subsumption for reasoning under uncertainty, Theory and Practice of Logic Programming 11(4–5) (2011), 433–449.

29. Riguzzi and Swift, Well-definedness and efficient inference for probabilistic logic programming under the distribution semantics, Theory and Practice of Logic Programming 13(2) (2013), 279–302.

30. Sato, A statistical learning method for logic programs with distribution semantics, In Proceedings of the 12th International Conference on Logic Programming, 1995, pp. 715–729.

31. Sato, A glimpse of symbolic-statistical modeling by PRISM, Journal of Intelligent Information Systems 31(2) (2008), 161–176.

32. Sato and Kameya, Parameter learning of logic programs for symbolic-statistical modeling, Journal of Artificial Intelligence Research 15 (2001), 391–454.

33. Sugiyama, Llinares-López, Kasenburg and Borgwardt K.M., Significant subgraph mining with multiple testing correction, In Proceedings of the 2015 SIAM International Conference on Data Mining, 2015, pp. 37–45.

34. Tarone R.E., A modified Bonferroni method for discrete data, Biometrics 46(2) (1990), 515–522.

35. Terada, Okada-Hatakeyama, Tsuda and Sese, Statistical significance of combinatorial regulations, Proc Natl Acad Sci USA 110(32) (2013), 12996–13001.

36. Terada, Tsuda and Sese, Fast Westfall-Young permutation procedure for combinatorial regulation discovery, In 2013 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2013, pp. 153–158.

37. Vennekens, Verbaeten and Bruynooghe, Logic Programs With Annotated Disjunctions, In International Conference on Logic Programming, volume 3131 of LNCS, Springer, 2004, pp. 195–209.

38. Vreeken, van Leeuwen and Siebes, KRIMP: Mining itemsets that compress, Data Mining and Knowledge Discovery 23(1) (2011), 169–214.

39. Westfall P.H. and Young S.S., Resampling-based multiple testing: Examples and methods for p-value adjustment, John Wiley & Sons, 1993.

40. Yan and Han, gSpan: Graph-based substructure pattern mining, In Proceedings of 2002 IEEE International Conference on Data Mining, 2002, pp. 721–724.

41. Yoshizoe, Tsuda and Terada, MP-LAMP: parallel detection of statistically significant multi-loci markers on cloud platforms, Bioinformatics 34(17) (2018), 3047–3049.

42. Zaki M.J. and Meira, Jr., Data Mining And Analysis, Cambridge, 2016.


2 These datasets are available at https://www.bsse.ethz.ch/mlcb/research/machine-learning/graph-kernels/weisfeiler-lehman-graph-kernels.html.
