SparkHINlog: Extension of SparkDatalog for heterogeneous information network

Abstract

Real world data is often interconnected, forming large and complex heterogeneous information networks (HINs) with multiple types of objects and links such as bibliographic network (DBLP) and knowledge bases (YaGo). Querying meta-paths requires exploration of path instances which can be computational cost in large HINs. However, existing meta-path based studies mostly focus on analytical applications of meta-paths, rather than systems to query meta-paths efficiently in large HINs. To bridge this gap, in this work we present SparkHINlog, a system based on Apache Spark, to handle meta-paths queries efficiently on large scale HINs. In SparkHINlog we propose an algorithm to not only translate meta-paths to Datalog rules, but also to manage the working memory area of Datalog efficiently to increase the scalability of SparkHINlog. To avoid the computing overhead of join operation to discover path instances when evaluating these rules, we leverage Motif Finding, a powerful tool of GraphFrames Library. With motif finding, SparkHINLog can speed up the time to evaluate the rules by path finding on graph instead on joining two relations. We conduct experimental comparisons with SparkDatalog, the state-of-the-art large-scale Datalog system, and verify the efficacy and effectiveness of our system in supporting meta-path queries.

Keywords

Bibliographic network datalog rules heterogeneous information networks meta-path spark graphframes

1 Introduction

Heterogeneous Information Network (HIN) is logical networks containing multiple types of objects and multiple types of links denoting different relations [1]. Real world data is often represented as a HIN, for example, bibliographic networks. In bibliographic networks, vertices have several vertex types such as authors, papers, venues, topics, words. These vertex types are connected via various link types such as write relation (author–[writes]⟶paper), cites relation (paper–[cites]⟶paper), publishedAt relation (paper-[publishedAt]⟶venue), to name but a few.

One of the widely-used concepts in mining HINs is meta-path, which explains how two vertices are related. A meta-path is a sequence of vertex types and link types connecting two given vertices in a HIN [2]. For example, author–[writes] → paper is a meta-path between two vertex types author and paper, capturing relationship that an author writes a paper. Another example is author–[cites] → paper ← [writes]–author. This meta-path explains the relationship between two authors that the former author cites one of the papers of the latter. In HIN research, meta-path plays an important role in many analytic applications such as Similarity Measure (PathSim [2], HeteSim [3], Clustering [4], Classification [5]). A common operation of these studies is to evaluate meta-path queries on two vertices, which explore all path instances connecting the vertices pair of interest regarding a given meta-path. Evaluating these recursive queries can be computational cost on large scale HINs. While extant meta-path based studies mostly focus on analytical applications of meta-paths [6] there is a need for a tool to query meta-paths efficiently in large HINs.

The growing demand in large scale graph analysis has encouraged recent scalable technologies on graph data sources, in particular Datalog systems. Datalog is a declarative logic programming language based on Prolog, which is used as a recursive query language in deductive databases. In graph databases, Datalog can support various recursive queries such as finding all pairs of vertices connected by some paths in a graph, or finding all vertices connected by some paths to a given source vertex. For its high-level semantics enabling natural expression of graph analysis, Datalog has been widely studied in distributed settings of large-scale monotonous graphs with single vertex types and link types [7, 8].

In this paper we propose to leverage Datalog to query meta-paths efficiently in large HINs. First we translate meta-paths to Datalog rules. We then study how to evaluate these rules efficiently. Datalog rules are building blocks of a Datalog program. A Datalog rule has a form of < head>:-< body>, in which <head>is a predicate, and <body>is a set of predicates. A predicate is often considered as a relation among set of objects. For example, in a bibliographic network, we have a predicate Write(X, P) describing the relation that author X writes paper P. Likewise, WrittenBy(P, Y) describes the “reverse” relation that paper P is written by author X. As link types in HIN also describe relations between two vertex types, we can use predicates to represent link types. Hence, Datalog rules can capture the same relation described by meta-paths.

Let us consider the meta-path author–[cites] → paper ← [writes]–author as an example. We can translate author–[cites] → paper and paper ← [writes]–author to two predicates Write(X, P) and WrittenBy(P, Y) respectively. The whole meta-path is therefore corresponding to the following Datalog rules: CoAuthor(X,Y):=Write(X,P), WrittenBy(P,Y). CoAuthor(X, Y) is an Intensional Database (IDB) predicate, which is generated by joining relation Write(X,P) with relation WrittenBy(P,Y) based on the common attribute P. This rule shares the same co-authorship relation as the meta-path: author X and author Y are co-author if they write a common paper.

Sometimes the relation in HIN can be recursive, which is denoted as (. . → . .) ^k. For example, in Bibliographic Network the meta-path (P → P) ^k represents an indirect citation between two papers. A path instance of this meta path is p₁ → p₂ → … p_k where paper p₁ cites paper p₂, paper p₂ cites p₃, … paper p_k-1 cites paper p_k. We can also use Datalog rules to represent this recursive relation.

After translating meta-paths to Datalog rules, we study how to evaluate these Datalog rules efficiently. In traditional Datalog systems, Datalog rules with IDB predicates can be computational cost due to the join operation among the relations in the body of the rules. We observe that in HIN context all tuples satisfying Datalog rules would also form paths in graphs. Hence we cast the problem of evaluating Datalog rules to path finding in graphs. In particular, we utilize the motif finding function of GraphFrames Library to find paths in the graph instead of joining two relations. GraphFrames Library is a package in Apache Spark [9], a system supporting large-scale analytics in distributed environment. Therefore, we can execute finding path tasks efficiently in large-scale HINs.

To the best of our knowledge, no existing research has addressed how to use Datalog System to process meta-paths of HINs. We hold that a HIN is a graph. Therefore, we would like to focus on using Datalog for HIN analysis. Moreover, HIN is often large-scale graph. We leverage the GraphFrames Library of Apache Spark to improve the performance of HIN. To this end, we develop SparkHINlog which can analyze efficiently large scale HIN. Our contributions through SparkHINlog are as follows:

We propose a systematic way of translating meta-path to Datalog rules. Based on these Datalog rules, SparkHINlog can support fundamental meta-path related tasks such as finding the instances of meta paths, evaluating PathSim [2], one of the essential meta-path based similarity scores in HIN analysis.

SparkHINlog can process recursive query. We use this capability to process the recursive queries in HIN such as the influence of author or paper through citation, the propagation of topics through citation, etc. We address querying recursive meta-paths by converting them to recursive Datalog rules. Besides, SparkHINlog is the combination of Datalog and Spark GraphFrames in a distributed environment, so it can utilize the capability of Datalog recursive rules and distributed computing of Spark GraphFrames for large scale HIN analysis.

We also enhance SparkHINlog to make the system scalable in working with many Datalog rules in large scale HIN. Specifically, SparkHINlog can scan the rules to decide whether a relation is retained in working memory of Datalog or not. If a relation is not used in the body part of any future rules (rules have not yet been evaluated), they will not be retained in the working memory area of SparkHINlog. This action will free the memory space for unused relations. Therefore, the system can increase the capability of scalability when working with many Datalog rules.

SparkHINlog is capable of processing large scale graph by using motif finding of GraphFrames Library. Motif finding allows us to find paths in the Graph efficiently on distributed computing environment.

The rest of this paper is organized as follows 1) Introduction 2) Preliminaries 3) Methodology 4) Experiment and Discussion 5) Related Work and 6) Conclusion and Future Work.

2 Preliminaries

2.1 Heterogeneous information network

According to [2], we have the following definitions:

Definition 1. An Heterogeneous information network is defined as a directed graph G = (V, E) with an object type mapping function φ : V → A and a link type mapping function ψ : E → R, where each object v ∈ V belongs to one particular object type φ (v) ∈ A, and each link e ∈ E belongs to a particular relation ψ (e) ∈ R and the types of objects |A|>1 or the types of relations |R|>1.

Example 1. DBLP is a typical heterogeneous information network. We use LDA topic modeling to discover [10] topics of documents in DBLP. So, our DBLP HIN has vertex types and link types as shown in Tables 1 and 2.

Table 1
Vertex types of the DBLP HIN

No. VertexTypes Descriptions

1 AUTHOR Authors of papers

2 PAPER Papers

3 TOPIC Topics of papers

4 VENUE Venues at which papers are published

5 WORD Keywords of papers

No.	VertexTypes	Descriptions
1	AUTHOR	Authors of papers
2	PAPER	Papers
3	TOPIC	Topics of papers
4	VENUE	Venues at which papers are published
5	WORD	Keywords of papers

Table 2

Link types of the DBLP HIN

No.	Relationship Types	Descriptions
1	CITES	PAPER – CITES – PAPER
2	CONTAINS	TOPIC – CONTAINS – WORD
3	PUBLISHEDAT	PAPER – PUBLISHEDAT – VENUE
4	RELEVANT	PAPER – RELEVANT – TOPIC
5	WRITES	AUTHOR – WRITES – PAPER

Definition 2. Network schema. The network schema is a meta template for a heterogeneous network G = (V, E) with the object type mapping φ : V → A and the link mapping ψ : E → R, which is a directed graph defined over object types A, with edges as relations from R, denoted as T_G = (A, R).

Example 2. Bibliographic network schema. For a bibliographic network defined in Example 1, the network schema is shown in Fig. 1. Links exist between authors and papers denoting the writing or written-by relations, between venues and papers denoting the publishing or published_at relations, between papers and topic denoting the topic of paper, and between papers, denoting citing or cited-by relations, between papers and topic and between topic and words.

Fig.1

Bibliographic network.

Definition 3. A meta-path P is a path defined on the graph of network schema T_G = (A, R), and is denoted in the form of $A_{1} \overset{R 1}{\to} A_{2} \overset{R 2}{\to} \overset{Rl}{. . \to} A_{1 + 1}$ , which defines a composite relation R = R°R2° … °R1 R between type A₁ and A_l+1, where ° denotes the composition operator on relations.

The length of P is the number of relations in P. Further, we say a meta-path is symmetric if the relation R defined by it is symmetric. Path count is the number of path instances p between x and y following. Path count is defined as P : s (x, y) = | {p : p ∈ P} |.

Definition 4. The commuting matrix MP of the meta-path P $A_{1} \overset{R 1}{\to} A_{2} \overset{R 2}{\to} \overset{Rl}{. . \to} A_{1 + 1}$ , is defined as: $MP = W_{A_{1} A_{2}} {xW}_{A_{2} A_{3}} \dots \dots {xW}_{A_{L - 1} A_{L}}$ where W_{A
_L-1
A
_L} denotes the relation matrix from A_1-1 to A₁

Different meta-paths carry different semantics. Meta-path A → P → A expresses the information “Two authors cooperate on a paper”. Meta-path A → P → V ← P ← A expresses the information “Two authors write a paper published on the same venue”. Meta-path P → , P → P means “Paper p1 cites paper p2 and paper p2 cites paper p3”.

Definition 5. PathSim is a meta-path based similarity measure. Given a symmetric meta path P, PathSim between two objects of the same type x and y is: $S (x, y) = \frac{2 \times | {p_{x \to y} : p_{x \to y} \in P} |}{| {p_{x \to x} : p_{x \to x} \in P} | + | {p_{y \to y} : p_{y \to y} \in P} |}$

Where | {p_x→y : p_x→y ∈ P} | is the number of path instances of meta-path P between entity x and entity y. | {p_x→x : p_x→x ∈ P} | is the number of path instances of meta-path P starting from y and | {p_y→y : p_y→y ∈ P} | is the number of path instances of meta-path P starting from y [2].

2.2 Datalog

Datalog is a Prolog-based nonprocedural query language which is first developed on the platform of relational database [11]. In this platform, a fact is an EDB (extensional database). It is a physical relation in relational database. An IDB (intensional database) is a derived relation and is generated by using a Datalog rules. Datalog has facts and rules. A fact is the same as tuple in a relation. A fact has the following structure <predicate name> (<list of constants>).

Example 3. The following sets of facts in HIN of DBLP

Vetex(326,Paper,“Enriching_Expressive_Power.”)

Vertex(327,Paper,“A_Lattice_Model.”)

Vertex(3753,Author,“Li_Gong”)

Vertex(3754,Author,“Xiaolei_Qian”)

Edge(52,80)

Edge(52,81)

Edge(52,82)

In relational database terms, each fact would be the same as the relation of database.

Besides of facts, Datalog has rules. Rule is a way to derive new facts. A rule has the following form <head>:-<body>.

Assume we have a rule: Q:- P. It means that if P is true then Q is true.

Normally, Datalog will translate the Datalog rules to relational operations for generating data. Some examples of facts and rules are listed as follows:

Facts

vertex(vertexId,vertexType,name)

edge(src,dst)

Relational Algebra: name $= \prod_{name} (vertex)$

Datalog Rule:

names(Name):-Vertex(vertexId,vertexType,name) Relational Algebra:

PaperName = $\prod_{name} (σ_{vertexType =^{″} {Paper}^{″}} (Vertex))$ Datalog Rule:

Paper(vertexid):- Vetex(vertexId,vertexType, name), vertexType="Paper"

Author(vertexid):-Vetex(vertexId,vertexType, name), vertexType="Author"

Venue(vertexid):- Vetex(vertexId,vertexType, name), vertexType="Venue"

Topic(vertexid):- Vetex(vertexId,vertexType, name), vertexType="Topic"

Word(vertexid):- Vertex(vertexId,vertexType, name), vertexType="Word"

Relational Algebra:

Write(src,dst)=Edge(src,dst) raisebox - 1pt⟕ Author(src) raisebox - 1pt⟕ Paper(dst)

Datalog Rule:

Write(src,dst):-Edge(src,dst), Author(src), Paper(dst)

In SparkHINlog, we used motif finding of Spark GraphFrames Library to access graph and generate data. SparkHINlog uses path query instead of join operation in Datalog.

Recursion in Datalog

Datalog have recursive rules [12]. In HIN, we use Datalog recursive rules to calculate the CitingSet. CitingSet of a paper p is a set of all papers that paper p cited directly or indirectly. In other words, the CitingSet of paper p is a set of papers which has an influence on paper p. We can compute the CitingSet of paper 2394 denoted CitingSet(2394) as follows:

Path(X,Y):- Edge(X,Y)

Path(X,Y):- Edge(X,Z), Path(Z,Y)

CitingSet(X,Y)=Path(X,Y),X=2394

With these rules, we can discover the path instances of all paths which connect paper 2394 to any papers in CitingSet(2394). We can use this recursive rule to find the influence of these papers to paper 2394. We will discuss this operation in the experiment section of this paper.

2.3 Spark datalog

Apache Spark is an open source software for distributed computing environment. It was developed at the University of California, Berkeley’s AMPLab Apache Spark in 2012 to improve the limitations in the Map Reduce cluster computing paradigm [13]. Apache Spark is a general platform for large-scale data analytics. Apache Spark can run with one master and several worker computers. Apache Spark has GraphX library which can process large scale graph on distributed computing environment.

Recently Datalog has found new applications in Graph Analysis [8]. We focus on SparkDatalog and enhance SparkDatalog to HIN analysis. Based on Apache Spark, we can improve the performance of rule evaluation by using a partition strategy for distributing the graph into several partitions. The join operation will process in distributed computing in each partition then grouping for the final result of join operation. SparkDatalog has a working memory area to store the temporary instances of relation during the evaluation process. Latter, we will develop an algorithm to manage efficiently this working memory area.

2.4 Motif finding of GraphFrames library

GraphFrames Library is a package of Apache Spark which provides DaraFrames query based Graphs [9]. GraphFrames Library provides powerful tools for processing large scale graph. Motif finding is a power tool to discover structural patterns in large scale graph.

In this research, we use motif finding to find the instances of a given meta-path. A meta-path is translated to Datalog rules. When evaluating the body predicates, we used motif finding of GraphFrames Library to find the path connecting these vertices.

GraphFrames Library uses simple Domain Specific Language (DSL) for expressing the structural queries. Motif pattern contains names of vertices and edges. The basic pattern is an edge, for example edge (a)–[e] → (p) where a is an author, p is an paper. Each pattern can be joined together to generate a path.

Example the pattern (a)–[e1] → (p); (p)–[e2] → (v) specifies a path from vertex a to vertex p then from vertex p to vertex v.

Motif finding can be executed by using Scala programming language and Apache Spark GraphFrames Library. The columns of DataFrames result can be specified by using the select phrase as follows:

val motif = g.find(“(a)–[e1] → (p); (p)–[e2] → (v)”)

.filter(“a.type=”Author” and

p.type=”Paper” and v.type=”Venue”)

.select(“a.id”,“b.id”,”c.id”)

When evaluating the Datalog Rules: APA(A,P,V):-AP(A,P), PV(P,V)

In traditional Datalog, we need the instances of relation AP and relation PV. Then we use join operation to create relation APA.

In GraphFrames Library, the instances of relation AP and relation PV are the edges of HIN satisfying the Motif Finding pattern. The join operation of two relations can be replaced by motif finding of GraphFrames Library for path finding. The attributes of DataFrames which is the result of motif finding can be selected by using the select phrase as above.

Using motif finding, we can save memory space for generating body relations Moreover, the time for Motif Finding is faster than the time for join operation based on the good partition strategy of GraphFrames Library. We will conduct experiment to show the result of processing.

With Scala programming language, we have the following code for motif finding:

val g = Graphs(sqlContext) # Get example graph # Search for pairs of vertices with edges in both directions between them.

val querypat="(a) -[e] → (b); (b)-[e2] → (a)"

val motifs = g.find(querypat)

motifs.show()

We developed an algorithm to translate the meta-path to Datalog rules, then Datalog rules to motif pattern of motif finding.

Example 4. With meta-path A → P ← A, we have the following Datalog rules.

WrittenBy(P,X):- Write(X,P)

CoAuthor(X,Y):- Write(X,P), WrittenBy(P,Y), X < > Y

These Datalog rules can be translated to motif pattern of motif finding as follows:

The motif pattern for relation Write(X,P)

querypat=“(a)–[w1] → (p)”

filterpat=“a.type=‘author’ and b.type=‘paper”’

val motifs = g.find(querypat).filter(filterpat)

motifs.show()

The rule: CoAuthor(X,Y):- Write(X,P),

WrittenBy(P,Y), X< >Y

This rule can be translated to motif finding:

querypat=“(a)–[w1] → (p); (p)–[w2] → (b)”

filterpat=“a.type=’author’ and b.type=’author’ and

p.type=’paper’ and a.id <> b.id”

val motifs = g.find(querypat).filter(filterpat)

motifs.show()

With these rules, we can process Datalog rules by using motif finding of Spark GraphFrames in distributed environment.

3 Methodology

3.1 SparkHINlog

We built our proposed SparkHINlog with the following requirements:

SparkHINlog can work with large scale graph for HIN

SparkHINlog can work with graph instead of relation by using Spark GraphFrames Library

The join operation on two relations in SparkDatalog will be enhanced by a good repartition for improving the performance of motif finding of GraphFrames Library.

The operations of SparkHINLog are as follows:

The input on SparkHINLog is meta-path such as A → P → A, A → P → V ← P ← A, A→ P → T ← P ← A …

The input meta-path is translated to Datalog Rules and store in Datalog Program.

The Datalog engine will evaluate the Datalog rules in Datalog Program.

SparkHINLog has following characteristics:

Use GraphFrames Library for processing large scale HIN.

Extract EDB and IDB predicates from HIN graph by using the motif finding of GraphFrames Library.

Manage smartly the temporary memory for deductive process in which EDB and IDB relation will be generated when needed and release when they aren’t used in another rules any more. This process will help to reduce the memory space for storing the temporary EBD, IDB.

Use a partition strategy of GraphFrames Library to reduce the execution time for join operation in deductive process

The structure of SparkHINlog is in Fig. 2.

Fig.2

Structure of SparkHINlog.

In a Datalog program, predicates can be either EDB or IDB where EDB is an extensional database and IDB is an intensional database. IDB is a relation generated by rules during evaluation process.

The algorithm to evaluate Datalog Programs can be summarized as follows.

If there is no recursive rule, we can pick an order to evaluate the IDB predicates, so that all the predicates in the body of its rules have already been evaluated.

Example 5. Given two Datalog rules as follows:

Rule 1: RAP(x,y):- AP(y,x).

Rule 2: APA(x,p,y):- AP(x,p),RAP(p,y)

In Rule 1, the predicate in rule body is predicate AP(y,x). It is an EDB predicate. The predicate RAP(x,y) is IDB, it is evaluated by swapping the value of attribute x and attribute y of AP.

In Rule 2, the predicate in the head is APA(x,y), it is an IDB predicate. The predicate RAP(p,y) and predicate AP(x,p) are evaluated. Then relation AP(x,p) will join with relation RAP(p,y) to generate the instance of APA(x,p,y) relation.

Let’s consider Datalog rule:

APV(A,P,V):- AP(A,P), PV(P,V)

With traditional Datalog, we need the instances of relation AP(A,P) and PV(P,V). Then we join relation AP(V,A) with relation PV(P,V). This operation needs the memory space for the instances of relation AP(V,A) with relation PV(P,V) if they are large scale relation, and the time for joining two large scale relations AP(V,A) and PV(P,V) will be high. When using the GraphFrames Library, the instance of relation AP(V,A) and PV(P,V) are the edges of HIN graph. The result of joining AP(A,P) and PV(P,V) is the set of path instances of path A → P → V. We use the motif finding of GraphFrames Library to generate these path instances as follows:

val motifs = g.find(g,” (a) - [e1] - > (p); (p) - > [e2] - > (v)”).

filter(“a.type=”Author” and

p.type=”Paper” and c.type=”Venue”)

.select (“a.id”,”p.id”,”v.id”)

If Datalog program contains recursive rules as follows:

path (x,y):- edge (x,y)

path(x,y):- edge(x,z), path(z,y)

SparkDatalog will repeatedly execute the join operation between relation edge(x,y) and relation path(x,y) until relation path(x,y) is an empty set. These operations are executed by motif finding of GraphFrames Library. In the experiment section, we will discuss this improvement to reduce the execution time.

3.2 Algorithm to translate meta path to Datalog rules and motif patterns

We proposed an algorithm to translate the meta-path to Datalog rules as follows:

Given a meta-path P and the reverse meta-path p^-1. Meta-path $A_{i} \overset{R}{\to} A_{j}$ expresses a binary relation R (A_i, A_j) of x in A_i and y in A_j.

The reverse of meta-path $A_{i} \overset{R}{\to} A_{j}$ is meta-path $A_{i} \overset{R^{- 1}}{\to} A_{j}$ . This meta-path expresses a binary relation R^-1 (A_j, A_i). We can use Datalog rules to generate the relation for a reverse meta-path. $R^{- 1} (A_{j}, A_{i}) : - R (A_{i}, A_{j})$

Example 6. Meta-path A → P expresses relation author “write” paper. The reverse of A → P meta-path is P ← A. It means that paper is written by author. The following Datalog rule can be used to generate relation PA as follows:

PA (P, A) : - RAP (A, P)

We use the join operation to process the composition of two relations.

With meta path $A_{i} \overset{R_{1}}{\to} A_{j} . . \overset{R_{2}}{\to} A_{k}$ . We join two relations R1 (A_i, A_j) and R2 (A_j, A_k) to generate the result as follows:

R1R2 (A_i, A_j, A_k) = R (A_i, A_j) raisebox - 1pt⟕ with R2(A_j, A_k)

Example 7. with meta-path A⟶P←A, we can translate this meta-path to the following Datalog rules:

APA (A, P, B) : - AP (A, P) raisebox - 1pt⟕ PA (P, B)

There are two cases for this meta-path.

Case 1. un-symmetric meta path

Given a meta path $A_{1} \overset{R_{1}}{\to} A_{2} . . . \overset{R_{2}}{\to} A_{l}$ . If string A₁A₂ . . . A_l is not symmetric, we can translate this meta-path by using the following steps.

For each A_iA_i+1, we generate a predicate A_iA_i+1 (A_i, A_i+1). The result predicate is the join of consecutive pair relations as follows: $\begin{matrix} A_{1} A_{2} . . . A_{k - 1} A_{k} (A_{1}, A_{2}, . . ., A_{k}) = A_{1} A_{2} (A_{1}, A_{2}) \\ A_{2} A_{3} (A_{2}, A_{3}) . . ., A_{k - 1} (A_{k - 1}, A_{k}) \end{matrix}$

Example 8. With meta path A⟶P⟶V, we generate the following Datalog rules to generate the result relation.

APV (A, P, V) : - AP (A, P) , PV (P, V)

Case 2. symmetric meta-path

Given a meta-path $A_{1} A_{2} . . . A_{l}^{'} A_{2}^{'} A_{3}^{'} . . . A_{l}^{'}$ or (PP^-1)

Where P = A₁A₂ . . . A_l and $P^{- 1} = A_{l} A_{2}^{'} A_{3}^{'} . . . A_{l}^{'}$ .

This meta-path is symmetric if and only if $A_{1} = A_{l}^{'}$ , $A_{2} = A_{l - 1}^{'}$ , ... We consider only the symmetric meta- path of the length is an odd number. Some typical symmetric meta-paths with the length is an odd number are as follows: A → P → A, A → P → V ← P ← A, A → P → T ← P ← A, P → V ← P, ... They play an important role in computing the similarity measure of same type objects.

For the symmetric meta-path, we need only to compute a half meta-path. In this case, we compute meta-path P. Then we use the reverse of P to generate meta-path P^-1.

The meta-path is decomposed into two smaller meta-paths based on the middle point of meta-path. This operation will be processed recursively until we have a length 2 meta-path. This meta-path will be an EDB or the reverse meta-path of EDB.

An example of translating meta-path A → P → V ← P ← A to Datalog rules is shown in Fig. 3. Meta path A → P → V ← P ← A, gives two smaller meta-paths A → P → V and V ← P ← A.

Fig.3

Post order tree traversal for generating Datalog rules.

Since meta-path P satisfies case 2, we generate the Datalog rules and stop.

APV (A, P, V) : - AP (A, P) , PV (P, V)

We use recursive process to decompose P into two more smaller meta- paths and do the same things we mentioned above.

Example 9. For meta path A → P → V ← P ← A, we decompose this meta-path into 2 smaller meta-paths. They are meta-path P = A → P → V and meta-path P^-1 = P ← V ← A.

For meta-path P = A → P → V, we have the following Datalog rule:

APV (A, P, V) : - AP (A, P) , PV (P, V)

For meta-path P^-1 = P ← V ← A, we have the following Datalog rule:

VPA (V, P2, A2) : - APV (V, P2, A2)

Then we merge two above rules to generate the last Datalog rules as follows:

APVPA (A, P, V, P, A) : - APV (A, P1, V) ,

VPA (V, P2, A2)

We use post-order tree traversal to collect the Datalog rules as shown in Fig. 3.

Example 10. An example of translation of meta-path APTW to Datalog rules and motif pattern is as follows:

The length of meta-path APTW is 4 (even), so we have:

AP(a0,p) is EDB (a0) - [AP] - > (p) PT(p,t1) is EDB (p) - [PT] - > (t1) TW(t1,w) is EDB (t1) - [TW] - > (w) APTW(a0,p,t1,w):-AP(a0,p),PT(p,t1),TW(t1,w) The EDB relations are generated by motif finding of GraphFrames Library. In this example EDB AP(a0,p) is generated by (a0)–[AP]⟶(p), relation PT(p,t1) is generated by (p)–[PT] → (t1), relation TW(t1,w) is generated by (t1)–[TW] → (w). We can use the recursive Datalog rules to compute the meta path (P → P) ^k. In this case, “cite” relation is expressed by using the meta-path P → P. This meta-path will be repeated by using recursive rules of Datalog as follows:

T1(x,y):- Cite(x,y).

T1(x,y):- Cite(x,z),T1(z,y).

We use this rule to find the influence of paper through CitingSet.

3.3 Manage smartly the working memory area of Datalog

Datalog uses a working memory area for storing all the generated relations in the evaluation process. With the following Datalog program in Table 3:

Table 3
Datalog Program

Rule no Rule

1 RAP(x, y):- AP(y, x)

2 RPV(x, y):- PV(y, x)

3 T1(x, p, v):- AP(x, p), PV(p, v)

4 T2(v, p, y):- RPV(p, v), RAP(v, y)

5 T3(x, p1, v, p2, y):- T1(x, p1, v), T2(v, p2, y),

EValue(x)CompareNe()EValue(y)

Rule no	Rule
1	RAP(x, y):- AP(y, x)
2	RPV(x, y):- PV(y, x)
3	T1(x, p, v):- AP(x, p), PV(p, v)
4	T2(v, p, y):- RPV(p, v), RAP(v, y)
5	T3(x, p1, v, p2, y):- T1(x, p1, v), T2(v, p2, y),
	EValue(x)CompareNe()EValue(y)

We evaluate rules of Datalog program step by step and check the size of SparkHINlog’s working area as shown in Table 4.

Table 4

Working memory area of the Datalog

Rule	Relations in working	Relations in working
no	memory area before	memory area after
	evaluating rules	evaluating rules
1	AP, PV.	AP, PV, RPV
2	AP, PV, RPV	AP, PV, RPV,RAP
3	AP, PV, RPV, RAP	AP, PV, RPV, RAP, T1
4	AP, PV, RPV, RAP,T1	AP, PV, RPV, RAP, T2, T1
5	AP, PV, RPV, RAP, T2, T1	AP, PV, RPV, RAP, T2, T1, T

We recognize that the size of working memory area is increased rapidly because the generation of the result of the predicates in the head of the Datalog rules.

We proposed an algorithm to manage the working memory area of SparkHINlog as follows:

Step 1. Check the Datalog Program and draw the dependency graph of Datalog rules

Step 2. Scan the dependency graph and check whether relations in the head and in the body of the current rule are in the body of future rules (rules have not yet been evaluated). If they are in future rules, they will be retained in working memory area. If they are not in the body of any future rules, they will be flushed out of the working memory area.

The steps of checking are as follows:

Check Rule 1. RAP(x, y):- AP(y, x)

We have the future rules of rule 1 are rules: 2,3,4,5

The head predicate of this rules contains predicate RAP. The predicate RAP is in the body of future rule, so it will be retained in the working memory area. The body of this rule contains AP. This predicate is in the body of future rules (rule 3), so it will be retained in the working memory area.

Check Rule 2. RPV(x, y):- PV(y, x)

We have the future rules of rule 2 are rules 3,4,5

The head predicate of this rules contains predicate RPV. The predicate RPV is in the body of future rule (rule 3,4,5), so it will be retained in the working memory area. The body of this rule contains PV. This predicate is in the body of future rules, so it will be retained in the working memory area.

Check Rule 3. T1(x, p, v):- AP(x, p), PV(p, v)

We have the future rules of rule 3 are rules: 4,5. The head predicate of this rules contains predicate T1. The predicate T1 is in the body of future rules (rule 4, rule 5), so it will be retained in the working memory area. The body of this rule contains AP, PV. These predicates are not in the body of any future rules, so it will be flushed out of working memory area.

Check Rule 4. T2(v, p, y):- RPV(p, v), RAP(v, y)

We have the future rules of rule 4 is rule 5.The head predicate of this rules contains predicate T2. The predicate T2 is in the body of future rule (rule 5), so it will be retained in the Working Memory area. The body of this rule contains RAP, RPV. These predicates are not in the body of any future rule (rule 5), so it will be flushed out of working memory area. The smart management of working memory is shown inTable 5.

Table 5

Smart Management of Working Memory

Rule	Relations in working	Relations in working
no	memory area before	memory area after
	evaluating rules	evaluating rules
1	AP, PV.	RPV, AP, PV
2	RPV, AP, PV	RPV, RAP, AP, PV
3	RPV, RAP, AP, PV	RPV, RAP, T1
4	RPV, RAP, T1	RPV, RAP, T2, T1
5	T2, T1	T2, T1, T

Based on this algorithm, we can save the working memory area in each step of evaluation process and enhance the capability of scalability.

4 Experiment and discussion

4.1 Configuration of the distributed system

We run our program which was developed on GraphFrames Library on Spark cluster with one master and five worker computers. The process of knowledge graph will be executed on worker computers and the result will be sent back to the master worker for summary. Normally, they are join operation. We repartition the edges of graph so that edges of the same source vertices are in the same partition. After repartioning, the join operation can be executed without moving the edges among partitions. So, the time execution of join operation will be reduced.

4.2 Data set

We use a subset of DBLP Citation network with 117,210 vertices and 529,968 citation relationships. We downloaded this data set from Aminer website (https://aminer.org/citation) and store DBLP in Neo4j graph database. We used Topic Modeling LDA to discover the topic of paper [10]. We run probabilistic Topic Model LDA with 30 topics to generate topics of papers.

4.3 Translate meta-path to Datalog rules

This is the result of our proposed algorithm to translate meta-path to Datalog rules with motif pattern.

=Translate meta-path to Datalog rules=

Meta-path APTW has even length (length = 4)

AP(a0,p) is EDB

(a0) - [AP] - > (p)

PT(p,t1) is EDB

(p) - [PT] - > (t1)

TW(t1,w) is EDB

(t1) - [TW] - > (w)

APTW(a0,p,t1,w):-AP(a0,p),PT(p,t1),TW(t1,w)

==Translate meta-path to Datalog rules=

Meta-path APA is a palindrome of length 3

PA(p,a):-AP(a,p)

AP(a,p) is EDB

(a) - [AP] - > (p)

APA(a0,p,a2):-AP(a0,p),PA(p,a2)

APA(a0,p):(a0) - [AP] - > (p) PA(p,a2):(p) - [AP] - > (a2)

==Translate meta-path to Datalog rules====

Meta-path APVPA is a palindrome of length 5

PA(p,a):-AP(a,p)

AP(a,p) is EDB

(a) - [AP] - > (p)

VP(v,p):-PV(p,v)

PV(p,v) is EDB

(p) - [PV] - > (v)

APVPA(a0,p1,v,p2,a3):- APV(a0,p1,v),

VPA(v,p2,a3)

APV(a0,p,v2):-AP(a0,p),PV(p,v2)

AP(a0,p):(a0) - [AP] - > (p)

PV(p,v2):(p) - [PV] - > (v2)

VPA(v0,p,a2):-VP(v0,p),PA(p,a2)

VP(v0,p):(v0) - [VP] - > (p)

PA(p,a2):(p) - [PA] - > (a2)

4.4 Using SparkHINLog to compute the path instances of meta-path A → P ← A and pathSim

RAP(x,y):- AP(y,x).

T(x,p,y):- AP(x,p),RAP(p,y).

T1(x,p,y):- T(x,p,y),x=3785.

T2(x,p,y):- T(x,p,y),y=3786.

T3(x,p,y):- T(x,p,y),x=3785,y=3786.

=Compute meta path A⟶P←A

PathSim=

Result: T1: (ArraySeq(3785, 337, 3785)), (ArraySeq(3785, 337, 3786))

Computed 2 results.

(T1 relation, number of rows=,=,2)

Result: T2: (ArraySeq(3785, 337, 3786)), (ArraySeq(3786, 337, 3786))

Computed 2 results.

(T3 relation, number of rows=,,2)

Result: T3: (ArraySeq(3785, 337, 3786))

Computed 1 results.

(T3 relation, number of rows=,1)

APA PathSim 0.5000

In this case, the value of | {p_x→y : p_x→y ∈ P} | = |T3|=1 $| {p_{x \to x} : p_{x \to x} \in P} | = | T 1 | = 2$ $| {p_{y \to y} : p_{y \to y} \in P} | = | T 2 |$ $\begin{matrix} PathSim (3785, 3786) & = \frac{2 \times 1}{2 + 2} = 0.5 (3785, 3786) \\ = \frac{2 \times 1}{2 + 2} = 0.5 \end{matrix}$

4.5 Using SparkHINLog to explore the influence of papers by using the recursive Datalog rules

declare Path(int v, int dist aggregate Min).

Path(x, d):- IsSource(s), Edge(s, x, d).

Path(x, d):- Path(y, da), Edge(y, x, db), d = da + db.

The result of the Datalog program is the CitingSet of paper 2394 and the impact of paper to paper 2394. The impact depends on the path length from paper 2394 to current paper. We use tree level of impact such as strong (length = 1). middle (length = 2), weak (length> =3) impact. Some results are as follows:

(source paper=,2394)

2394 “DIAL:_A_Programming_Language” 1980

8459:Brian_Berkowitz 8460:Michael_Hammer 402, 1

402 “System_R:_Relational_Approach” 1976 (path=,List(2394, 402))

2394- 8459:Brian_Berkowitz

8460:Michael_Hammer

402-

3943:W._Frank_King_III Strong impact

—————————————

1906, 2

1906 “SEQUEL:_A_Structured_English.” 1974

(path=,List(2394, 1255, 1906))

2394- 8459:Brian_Berkowitz

8460:Michael_Hammer

1255- 5976:Joachim_W._Schmidt

1906- 7399:Raymond_F._Boyce

7400:Donald_D._Chamberlin

Depending of the length of path from source paper to current paper, we discover the strength of impact.

In our experiment, if the path length is 1, then the impact will strong. If path length is 2, the impact will be middle and if the path length is 3 or greater, the impact will be weak.

4.6 Managing the working memory area

During evaluation process of Datalog on one Worker, the working memory area (WMA) contains the intermediate results. Since we work with large scale graph, we need to flush out of working memory area the relation which is not used in future rules. The size relations in working memory area in Megabytes (MB) of SparkDatalog and SparkHINlog are shown in Table 6.

Table 6
Working memory of SparkDatalog and SparkHINLog for one Worker

R# Spark Spark Spark Spark

Datalog Datalog HINLog HINLog

(in MB) (in MB)

1 AP, PV, RPV 0.4 RVP, PV 0.4

2 AP, PV, RPV, RAP 0.6 RPV, RAP, Ap, PV 0.6

3 AP, PV, RPV, RAP, T1 1.2 RPV, RAP, AP, PV 1.8

4 AP, PV, RPV, RAP, T2 1.6 RPV, RAP, T2, T1 1.2

5 AP, PV, PV, RPV, RAP, T2, T1, T 2.0 T2, T1, T 1.0

R#	Spark	Spark	Spark	Spark
1	AP, PV, RPV	0.4	RVP, PV	0.4
2	AP, PV, RPV, RAP	0.6	RPV, RAP, Ap, PV	0.6
3	AP, PV, RPV, RAP, T1	1.2	RPV, RAP, AP, PV	1.8
4	AP, PV, RPV, RAP, T2	1.6	RPV, RAP, T2, T1	1.2
5	AP, PV, PV, RPV, RAP, T2, T1, T	2.0	T2, T1, T	1.0

The comparison of working memory area of SparkDatalog and SparkHINlog is shown in Fig. 4. In the beginning, both versions have AP, PV relation in the working memory.

Fig.4

The usage of working memory area of SparkHINlog and SparkDatalog.

After evaluating rule 1, AP and PV relations are flushed out of memory. APV relation is retained in the working memory area. After evaluating rule 2, APV and VPA relation are still retained in working memory area. After evaluating rule 3, AP and PV relation are flushed out of working memory area.

We hold that SparkHINlog can reduce the working memory area with smart management algorithm.

4.7 Enhance the performance of join operation

The chart is created by executing a recursive rule for studying the paper influence. We do repartition HIN graph such as the vertices are sorted by vertextype (Author, Paper,Venue, Topic, Word) and edges are sorted by vertexID of source vertex.

We check the performance of GraphFrames Library with Motif Finding comparing with the join operation of the traditional SparkDatalog using GraphX.

Path(x, y):- Cite(x, y)

Path(x, y):- Cite(x, z), Path(z, y)

The time execution of join operation of Cite(x,y) and path(x, y) in recursive rules. It is repeated 15 times until Path(x,y) reaches empty set. Each repeated join operation is executed with repartition (SparkHINlog) and without repartition (SparkDatalog).

We use motif finding of GraphFrames Library to find the path instances instead of joining two big relations. We hold that the time execution is decreased significantly as shown in 6th iteration of Fig. 5.

Fig.5

The comparison of time execution of rule evaluation of SparkHINlog and SparkDatalog.

5 Related work

Recently, there has been an increase of studies in expansion of Datalog. Kornellije Rabuzin developed a Deductive Graph Database- Datalog in Action, 2016 [14]. They proposed a Datalog on the platform of Graph Database instead on Relational Database. They used neo4j Graph Database to store Graph Data and use Cypher – Graph query language to generate the facts, EDB and IDB. SociaLite is a Datalog extension for efficient social network. SociaLite was developed by [7]. As a logic programming language, Datalog can process graph. The users can process graph by using the non- recursive or recursive Datalog rules. In [8] expanded Datalog. They proposed Spark Datalog which can work with a large scale database. The system addresses with a concise declarative specification of complex queries amenable to efficient evaluation. – In DatalogRA: Datalog with recursive aggregation in the Spark RDD Model, Marek Rogala, Jan HiddersJacek Srokay [15] developed a tool which extends a distributed computations platform. This tool was developed on the Apache Spark platform. It is possible to develop graph algorithms in a Datalog query language, and execute them on distributed computing environment.

6 Conclusions & future work

In this paper, we present our SparkHINlog which is the extension of SparkDatalog for HIN Analysis. Our SparkHINlog can translate meta-path to Datalog Rules. SparkHINlog can work with large scale HIN. SparkHINlog can use recursive Datalog rules to analyze large scale HIN. SparkHINlog used a smart management of working memory area of SparkDatalog. These predicates will be retained or flushed out of the working memory are depending on whether these predicated will be used in the future rules or not. We use motif finding of GraphFrames to find the path of graph instead of joining two relations. A repartition based on a hash algorithm has been proposed to improve the motif finding of GraphFrames Library. By conducting experiment of a subset of DBLP as Bibliographic network, we confirm the performance of SparkHINlog. In the future, we plan to expand SparkHINlog for large scale weighted HIN Analysis. We develop weighted PathSim and use SparkHINlog to calculate the weighted pathsim to measure the similarity between two objects in Weighted HIN.

Footnotes

Acknowledgments

This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCMC) under the grant number B2017-26-02.

References

Li ,

Shi ,

P.S.

Yu and

Chen , HRank: A Path based Ranking Framework in Heterogeneous Information Network, Web-Age Information Management, in: 15th International Conference, WAIM 2014, 2014.

Sun ,

Han ,

Yan ,

Yu and

Wu , Path-Sim: Meta path-based top-k similarity search in heterogeneous information networks, in VLDB, 2011.

Shi ,

Kong ,

Huang ,

P.S.

Yu and

Wu , HeteSim: A General Framework for Relevance Measure in Heterogeneous Networks, 2013. [Online]. Available: https://arxiv.org/abs/1309.7393. [Accessed 31 October 2018].

Kuck ,

Zhuang ,

Yan ,

Cam and

Han , Query-based outlier detection in heterogeneous information networks, in EDBT, 2015.

Ji ,

Sun ,

Danilevsky ,

Han and

Gao , Graph regularized transductive classification on heterogeneous information networks, in ECML/PKDD, 2010.

Shi ,

Li ,

Zhang and

Sun , A survey of heterogeneous information network analysis, IEEE Transactions on Knowledge and Data Engineering 29(1) (2017), 17–37.

Seo ,

Guo and

M.S.

Lam , SociaLite: Datalog Extensions for Efficient Social Network Analysis, in IEEE ICDE, 2014.

Shkapsky ,

Zeng and

Zaniolo , Graph Queries in a Next-Generation Datalog System, in Proceedings of the VLDB Endowment, 2013.

Drabas and

Lee , Learning PySpark, Packt, 2017.

10.

D.M.

Blei ,

A.Y.

Ng and

M.I.

Jordan , Latent dirichlet allocation, Journal of Machine Learning Research (2013), 993–1022.

11.

Bellomarini ,

Sallinger and

Gottlob , The Vadalog System: Datalog-based Reasoning for Knowledge Graphs, in VLDB Endowment, 2018.

12.

T.J.

Green ,

Huang ,

B.T.

Loo and

Zhou , Datalog and recursive query, Foundations and Trends in Databases 5(12) (2012).

13.

M.S.

Malak and

East , GraphX in Action, Manning, 2016.

14.

Rabuzin , Deductive Graph Database -- Datalog in Action, in International Conference on Computational Science and Computational Intelligence (CSCI), 2015.

15.

Rogala ,

Hidders and

Sroka , DatalogRA: Datalog with Recursive Aggregation in the Spark RDD Model, in GRADES ’16 Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems, 2016.

SparkHINlog: Extension of SparkDatalog for heterogeneous information network

Abstract

Keywords

1 Introduction

2 Preliminaries

2.1 Heterogeneous information network

Table 1 Vertex types of the DBLP HIN No. VertexTypes Descriptions 1 AUTHOR Authors of papers 2 PAPER Papers 3 TOPIC Topics of papers 4 VENUE Venues at which papers are published 5 WORD Keywords of papers

2.3 Spark datalog

2.4 Motif finding of GraphFrames library

3 Methodology

3.1 SparkHINlog

Table 3 Datalog Program Rule no Rule 1 RAP(x, y):- AP(y, x) 2 RPV(x, y):- PV(y, x) 3 T1(x, p, v):- AP(x, p), PV(p, v) 4 T2(v, p, y):- RPV(p, v), RAP(v, y) 5 T3(x, p1, v, p2, y):- T1(x, p1, v), T2(v, p2, y), EValue(x)CompareNe()EValue(y)

4.1 Configuration of the distributed system

4.2 Data set

4.3 Translate meta-path to Datalog rules

4.4 Using SparkHINLog to compute the path instances of meta-path A → P ← A and pathSim

4.5 Using SparkHINLog to explore the influence of papers by using the recursive Datalog rules

4.6 Managing the working memory area

6 Conclusions & future work

Footnotes

Acknowledgments

References

Table 1
Vertex types of the DBLP HIN

No. VertexTypes Descriptions

1 AUTHOR Authors of papers

2 PAPER Papers

3 TOPIC Topics of papers

4 VENUE Venues at which papers are published

5 WORD Keywords of papers

Table 3
Datalog Program

Rule no Rule

1 RAP(x, y):- AP(y, x)

2 RPV(x, y):- PV(y, x)

3 T1(x, p, v):- AP(x, p), PV(p, v)

4 T2(v, p, y):- RPV(p, v), RAP(v, y)

5 T3(x, p1, v, p2, y):- T1(x, p1, v), T2(v, p2, y),

EValue(x)CompareNe()EValue(y)