A flexible aggregation framework on large-scale heterogeneous information networks

Abstract

OLAP (On-line Analytical Processing) can provide users with aggregate results from different perspectives and granularities. With the advent of heterogeneous information networks that consist of multi-type, interconnected nodes, such as bibliographic networks and knowledge graphs, it is important to study flexible aggregation in such networks. The aggregation results of existing work are limited to one type of node, so existing methods cannot be applied to aggregation on multi-type nodes and relations in large-scale heterogeneous information networks. In this paper, we investigate the flexible aggregation problem on large-scale heterogeneous information networks, which is defined on multi-type nodes and relations. Moreover, by considering both attributes and structures, we propose a novel function based on graph entropy to measure the similarities of nodes. Further, we prove that the aggregation problem based on the function is NP-hard. Therefore, we develop an efficient heuristic algorithm for aggregation in two phases: informational aggregation and structural aggregation. The algorithm has linear time and space complexity. Extensive experiments on real-world datasets demonstrate the effectiveness and efficiency of the proposed algorithm.

Keywords

Aggregation, entropy, heterogeneous information networks, OLAP

1. Introduction

OLAP can provide users with aggregate results from different perspectives and granularities, which is one of the core techniques in data warehousing and data mining. Aggregation allows users to observe and model data in different dimensions and to perform drill-down, roll-up and other OLAP operations. It is the foundation of OLAP techniques. With the advent of graph data widely used to model real-world nodes and relations, aggregation on graph data becomes important for users to analyse graphs and mine valuable information.

Heterogeneous information networks are newly emerged graph data, which involve multi-type nodes and relations, such as bibliographic networks, social media networks and knowledge graphs. We give an example to illustrate heterogeneous information networks.

Example 1. Figure 1 shows a bibliographic network, which is a heterogeneous information network. The network contains three types of nodes (Paper, Author and Venue) and four types of relations among them: Cooperate relations between authors, Write relations between authors and papers, Cite relations between papers and Publish relations between papers and venues. In addition, each type of node has a set of attributes, that is, Paper (ID, Topic, Keywords), Author (ID, Institute, Field, Location) and Venue (ID, Year). The attribute values of the three types of nodes are given in Tables 1–3.

Figure 1.

Bibliographic network.

Table 1.

Attribute values of Author.

ID	Institute	Field	Location
1	Cornell University	IR, DM	NY
2	Columbia University	AI, DM	NY
3	University of California	DM, DB	CA
4	University of California, Berkeley	DM, DB	CA

Table 2.

Attribute values of Paper.

ID	Topic	Keywords
5	IR	Model, compression, index
6	AI	Temporal, probabilistic, dynamic
7	DM	Graph query, preserve, compression
8	DM	Graph mining, clustering
9	DB	Graph search, top-k
10	DB	Graph, query, pattern

Table 3.

Attribute values of Venue.

ID	Year
11	2013
12	2012
13	2014

Heterogeneous information networks contain much more information than homogeneous information networks, which contain one type of node and one type of relation. Therefore, applying OLAP techniques to heterogeneous information networks is crucial for users to observe and analyse the networks from different perspectives and granularities to mine potential valuable information. Thus in this paper, we investigate the problem of aggregation on multi-type nodes and relations on large-scale heterogeneous information networks.

Example 2. Next we give two aggregate queries over the bibliographic network. The aggregate functions are set as COUNT.

Query 1. Aggregate on Paper nodes, Cite relations, and the selected attribute of Paper is Topic.

Query 2. Aggregate on Paper and Author nodes, Write relations, the selected attribute of Paper is Topic, and the selected attribute of Author is Location.

Figure 2(a) displays the aggregate result of query 1. From Figure 2(a) we can see that Paper nodes in the same aggregate node have the same value of Topic; for example, nodes 7 and 8 both have the Topic ‘DM’. Meanwhile, they also have the same Cite relations with the other aggregate nodes of Paper. For example, they are both cited by ‘IR’ and ‘DB’ Papers. Nodes 9 and 10 are not in the same aggregate node although they both have the Topic ‘DB’. This is because node 9 cites a ‘DM’ Paper, while node 10 does not. Figure 2(b) gives the aggregate result of query 2. From Figure 2(b) we can see that Author nodes in the same aggregate node have the same value of Location and the same Write relations with the aggregate nodes of Paper; for example, nodes 3 and 4 are both in ‘CA’, and they both write papers on the ‘DM’ and ‘DB’ Topics. Similarly, Paper nodes aggregated together have the same value of Topic and the same Write relations with the aggregate nodes of Author.

Figure 2.

Aggregate results.

From the two examples, we can see that flexible aggregation on multi-type nodes and relations is meaningful, since it can help users to observe the networks from their interests and focuses. Therefore, we propose a flexible aggregation framework on large-scale heterogeneous information networks in this paper.

Aggregation on graphs has attracted some attention in recent years. The existing work can be divided into two categories: aggregation on homogeneous information networks [1–3] and aggregation on heterogeneous information networks [4, 5]. Chen et al. [1] study the aggregation problem on multiple homogeneous networks; their method cannot be used for aggregation on a single network. Zhao et al. [2] propose an aggregation algorithm on a single homogeneous network by leveraging the attributes of nodes, where nodes are aggregated according to the selected attributes. The problem is that it does not consider the relations between nodes. Tian et al. [3] investigate how to summarize a homogeneous network. Their method considers both attributes and structures, but it cannot aggregate different types of nodes and relations simultaneously. The aggregate results of existing work on homogeneous networks are all limited to one type of node. Yin et al. [4, 5] investigate the problem of aggregation on heterogeneous information networks, but their resulting aggregate graphs only contain one type of node, so they cannot handle aggregation queries with multi-type nodes and relations, such as query 2. Overall, the existing work cannot satisfy the demand for aggregation on multi-type nodes and relations. We need to construct a flexible framework to aggregate large-scale heterogeneous information networks.

In this paper, we investigate the flexible aggregation problem on large-scale heterogeneous information networks. In order to aggregate nodes with similar attributes and structures together, we employ graph entropy to measure the similarities of nodes. Since the aggregation involves two dimensions of networks, the informational dimension (node types, node attributes) and the structural dimension (edge types), we propose a two-phase aggregation algorithm: informational aggregation and structural aggregation.

The main contributions of this paper are:

We introduce a flexible aggregation framework on large-scale heterogeneous information networks, which can aggregate multi-type nodes and relations.

A novel function based on graph entropy is proposed, which is effective to measure the structural similarities of nodes with regard to different types of relations.

The aggregation problem based on the function is proved to be NP-hard. Furthermore, an efficient aggregation algorithm with two phases is proposed: informational aggregation and structural aggregation. The algorithm has linear time and space complexity.

Experiments on real-world datasets demonstrate the effectiveness of the algorithm by comparison with other algorithms. The good scalability of our algorithm has also been verified.

The remainder of this paper is organized as follows. Section 2 states related work. Section 3 defines the aggregation problem. Section 4 introduces the aggregation algorithm and Section 5 describes experiments. Section 6 draws conclusions.

2. Related work

In this section, we discuss the related work from four aspects: social data, multiple information sources, aggregation on information networks and heterogeneous information networks.

2.1. Social data

Recently, social data has become popular, for example, social media, online shopping, social networks, knowledge bases and so on. Many researchers have contributed to investigating social data.

Hu et al. [6], Kempe et al. [7], Azadeh et al. [8] and Jung [9, 10] have investigated the spread of influence in social networks. Hu et al. [6] model community and topic as latent variables to characterize topic diffusion at the community level. Kempe et al. [7] model the processes by which ideas and influence propagate through a social network. Azadeh et al. [8] propose the time-sensitive influence maximization problem, which takes into account the time dependence of the information value. Jung [9] proposes a robust information diffusion model to detect the malicious peers from which a risk has been generated on a P2P network. Jung [10] focuses on online social tagging systems to understand information propagation.

Jung [11], Lai et al. [12], Corbellini et al. [13] and Mao et al. [14] make recommendations to users by analysing social networks. Jung [11] proposes a recommendation method based on different types of influences (social, interest and popularity), using personal tendencies with regard to these factors to recommend photos on a photo-sharing website. Lai et al. [12] present an attribute reduction-based mining method to select long-tail user groups, which are employed as domain experts to provide more trustworthy information. Corbellini et al. [13] present a novel architecture and a corresponding implementation for designing distributed recommendation algorithms. The algorithms are expressed in terms of graph traversal operations defined by an API. Mao et al. [14] explore new methods of tag-based personalized recommendation without assuming that tags assigned by a user occur independently of each other. The new methods profile users using tag co-occurrence networks, upon which link-based node weighting methods are applied to refine the weights of tags.

Bohlouli et al. [15] and Vilares et al. [16] study the sentiments in social networks. Bohlouli et al. [15] mine customer needs and market-oriented productions by analysing the sentiments in social networks. Vilares et al. [16] use sentiment analysis to rank political leaders, parties and personalities for popularity. Ajao et al. [17] and Xu et al. [18] discuss the location research in social data. Ajao et al. [17] survey a range of techniques applied to identify the locations of Twitter users. Xu et al. [18] summarize popular information from massive tourism blogs to explore hot tourism locations.

2.2. Multiple information sources

Nowadays, data analysis often needs to integrate multiple information sources to obtain better results. Li et al. [19] and Wu et al. [20] investigate the problem of integrating multiple information sources in biological science. Li et al. [19] mine patterns across many biological networks. Such patterns include frequent dense subgraphs, frequent dense node sets, generic frequent patterns and differential subgraph patterns. Wu et al. [20] mine frequent coherent dense subgraphs across large numbers of massive networks to discover biological modules.

Wu et al. [21] and Boden et al. [22] study the problem of subgraph mining in multiple networks. Wu et al. [21] investigate the problem of finding the densest connected subgraph in dual networks, in which one network represents the physical world and the other network represents the conceptual world. Boden et al. [22] introduce the multi-layer coherent subgraph model, which defines clusters of nodes that are densely connected by relations with similar labels in a subset of the graph layers.

Furthermore, Dong et al. [23] address the problem of combining different layers of multi-layer graphs for improved clustering compared with using layers independently. Zhao et al. [24] propose three sampling methods on a hybrid social-affiliation network to estimate target graph characteristics. Cui et al. [25] incorporate multiple networks from the publication datasets, such as the paper citation network and author citation network, and analyse how people are citing papers by applying the logistic regression model to the paper citation network. This can obtain improved results of link prediction in the citation network. Nguyen and Jung [26] propose a privacy-aware framework for a social identity matching method across multiple social networks. Jung [27] finds the semantic relationships between the information sources in order to distribute user queries to a large number of sources.

2.3. Aggregation on information networks

Aggregation is the foundation of OLAP, which allows users to view the data from different perspectives and granularities. Traditional aggregation is implemented on a relational database [28, 29], which only focuses on the attributes of tuples. The existing aggregation work on graph data can be classified into two categories: aggregation on homogeneous information networks and aggregation on heterogeneous information networks.

Chen et al. [1, 30, 31] were the first to propose the idea of aggregation on graph data. They study the aggregation problem on multiple homogeneous networks, and their method cannot be used for aggregation on a single large network. Zhao et al. [2] were the first to study aggregation on a single homogeneous network. It is assumed that all nodes of the network have the same type and attributes. Their algorithm aggregates nodes according to their attributes and does not consider the structures, so the aggregate results show no difference from traditional aggregation. Wang et al. [32] extend the work of Zhao et al. [2] to a parallel environment. However, both of them are applied to homogeneous networks. Tian et al. [3] and Zhang et al. [33] propose aggregation algorithms on homogeneous graphs. They utilize both attributes and structures to summarize the networks into k groups. Since the aggregate results are restricted to one type of node and the number of groups is restricted to k, these algorithms cannot handle aggregate queries with multi-type nodes and relations on heterogeneous information networks.

Yin and Gao [4] were the first to propose aggregation on heterogeneous information networks. They compute iceberg aggregate graphs whose aggregate function values are larger than a specific value. Yin et al. [5] investigate the aggregation and partial materialization on heterogeneous information networks. Unfortunately, their aggregate results only contain one type of nodes, which cannot handle aggregate queries on multiple types of nodes and relations.

There is also some research work on clustering of heterogeneous information networks [34–39]. Zhou et al. [35] present a clustering algorithm on multiple meta-paths over one type of node in heterogeneous information networks. Sun et al. [36] address the problem of generating clusters for a specified type of node, as well as ranking information for all types of nodes based on these clusters. However, clustering is quite different from aggregation. Clustering results depend only on the networks and not on user-selected types of nodes and relations. Clustering therefore cannot help users to observe the networks from different perspectives and granularities. It only provides results from one perspective, which is fixed by the algorithm.

In summary, there is no existing work on the flexible aggregation problem of large-scale heterogeneous information networks. Thus we investigate this problem in this paper.

2.4. Heterogeneous information networks

There has been some research work on heterogeneous information networks. Similarity search is an important topic in heterogeneous information networks [40, 41]. Sun et al. [40] search for the top-k similar nodes, where nodes are connected by different types of relations in heterogeneous information networks. Shi et al. [41] measure the similarities between nodes of different types in heterogeneous information networks. Sun et al. [42, 43] study the link prediction problem in heterogeneous information networks. Sun et al. [42] study the problem of co-author relationship prediction. Sun et al. [43] investigate the problem of predicting when a certain relationship will happen. Yu et al. [44] and Gupta et al. [45] study the problem of subgraph query. Given a subgraph query, Yu et al. [44] search a given large information network and find a set of subgraphs that are structurally identical and semantically similar. Gupta et al. [45] compute the top-k similar subgraphs with respect to a query. Sun et al. [46] build a unified generative topic model that is able to consider both text and structure information for documents. Ji et al. [47] combine ranking and classification, since integrating ranking with classification generates more accurate classes. Shen et al. [48] focus on entity identification in web text networks, which can be formalized as heterogeneous information networks. Jayaram et al. [49, 50] and Yang et al. [51] propose to query structure-flexible data, that is, knowledge graphs, by example entity tuples, without requiring users to write complex graph queries.

3. Preliminaries

3.1. Basic concepts

Definition 1. (Heterogeneous information network). A heterogeneous information network is defined as a directed graph $G = (V, E, T, R, ϕ_{V}, ϕ_{E}, A, D, ϕ_{A})$ , where V is the node set, $E \subseteq V \times V$ is the edge set. T is the set of node types, and R is the set of edge types. $ϕ_{V} : V \to T$ is the node type mapping function and $ϕ_{E} : E \to R$ is the edge type mapping function. A is the attribute set of nodes and D is the domain of A. $ϕ_{A} : T \to A$ is the mapping function from node types to attributes.
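To make Definition 1 concrete, the following is a minimal Python sketch of one possible in-memory representation. The class name `HIN`, its fields and its methods are illustrative assumptions, not part of the paper; the node attributes come from Tables 1–3, whereas the two edges are hypothetical because the edges of Figure 1 are not listed in the text.

```python
from dataclasses import dataclass, field


@dataclass
class HIN:
    """Toy container for G = (V, E, T, R, phi_V, phi_E, A, D, phi_A)."""
    node_type: dict = field(default_factory=dict)   # phi_V: node id -> node type
    node_attrs: dict = field(default_factory=dict)  # node id -> {attribute: value}
    edge_type: dict = field(default_factory=dict)   # phi_E: (u, v) -> edge type

    def add_node(self, v, t, **attrs):
        self.node_type[v] = t
        self.node_attrs[v] = attrs

    def add_edge(self, u, v, r):
        # E is a subset of V x V, so both endpoints must already exist.
        assert u in self.node_type and v in self.node_type
        self.edge_type[(u, v)] = r

    @property
    def T(self):
        return set(self.node_type.values())

    @property
    def R(self):
        return set(self.edge_type.values())

    def phi_A(self, t):
        """Attribute names carried by the nodes of type t."""
        names = set()
        for v, vt in self.node_type.items():
            if vt == t:
                names |= set(self.node_attrs[v])
        return names


g = HIN()
g.add_node(1, "Author", Institute="Cornell University", Field="IR, DM", Location="NY")
g.add_node(5, "Paper", Topic="IR", Keywords="Model, compression, index")
g.add_node(11, "Venue", Year=2013)
g.add_edge(1, 5, "Write")     # hypothetical edge, for illustration only
g.add_edge(5, 11, "Publish")  # hypothetical edge, for illustration only
print(g.T, g.R, g.phi_A("Author"))
```

Keying `edge_type` by the ordered pair (u, v) mirrors the directed edge set of the definition and allows at most one typed edge per ordered pair.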

Definition 2. (Graph projection). Given a heterogeneous information network $G = (V, E, T, R, ϕ_{V}, ϕ_{E}, A, D, ϕ_{A})$, selected node types $Q = \{T_{1}, T_{2}, \dots, T_{l}\}$, $Q \subseteq T$, and selected edge types $L = \{R_{1}, R_{2}, \dots, R_{k}\}$, $L \subseteq R$, the graph projection of G on Q and L is a graph $G_{pj} = (V_{pj}, E_{pj})$, where $V_{pj} = \{v \mid v \in V, ϕ_{V} (v) \in Q\}$ is the node set and $E_{pj}$ is the edge set. For $\forall u, v \in V_{pj}$, $(u, v) \in E_{pj}$ iff $(u, v) \in E$ and $ϕ_{E} ((u, v)) \in L$.

It is easy to see that the graph projection of G on Q and L is an induced graph of G, in which the types of nodes belong to Q and the types of edges are in L. Figure 3 is a graph projection of the bibliographic network, in which Q = {Author, Paper} and L = {Write}. Given a heterogeneous information network $G = (V, E, T, R, ϕ_{V}, ϕ_{E}, A, D, ϕ_{A})$ and an attribute set $S = \{A_{1}, A_{2}, \dots, A_{k}\}$, $S \subseteq A$, for $\forall u, v \in V$, if $A_{i} (u) = A_{i} (v)$ for all $1 \leq i \leq k$, then we say S(u) = S(v).

Figure 3.

Graph projection.
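Definition 2 can be sketched in a few lines of Python. The function name `project`, the dictionary-based inputs and the toy edges below are assumptions made for illustration; only the node types and IDs follow Tables 1–3.

```python
def project(node_type, edge_type, Q, L):
    """Graph projection of G on node types Q and edge types L.

    node_type: node id -> type (phi_V); edge_type: (u, v) -> type (phi_E).
    """
    # V_pj keeps exactly the nodes whose type is selected.
    V_pj = {v for v, t in node_type.items() if t in Q}
    # E_pj keeps edges of a selected type whose endpoints both survive.
    E_pj = {(u, v) for (u, v), r in edge_type.items()
            if r in L and u in V_pj and v in V_pj}
    return V_pj, E_pj


node_type = {3: "Author", 4: "Author", 7: "Paper", 9: "Paper", 11: "Venue"}
edge_type = {  # hypothetical edges, for illustration only
    (3, 7): "Write", (4, 9): "Write", (3, 4): "Cooperate",
    (9, 7): "Cite", (7, 11): "Publish",
}
V_pj, E_pj = project(node_type, edge_type, Q={"Author", "Paper"}, L={"Write"})
print(sorted(V_pj), sorted(E_pj))
```

With Q = {Author, Paper} and L = {Write}, the Venue node and the Cooperate, Cite and Publish edges are dropped, which matches the setting described for Figure 3.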

Definition 3. (Graph partition). Given a heterogeneous information network $G = (V, E, T, R, ϕ_{V}, ϕ_{E}, A, D, ϕ_{A})$, selected node types $Q = \{T_{1}, T_{2}, \dots, T_{l}\}$, $Q \subseteq T$, with selected attributes $S = \{S_{1}, S_{2}, \dots, S_{l}\}$, $S_{i} \subseteq ϕ_{A} (T_{i})$, and selected edge types $L = \{R_{1}, R_{2}, \dots, R_{k}\}$, $L \subseteq R$, let $G_{pj} = (V_{pj}, E_{pj})$ be the graph projection of G on Q and L. The partition of $G_{pj}$ w.r.t. Q, S and L is a set of subgraphs $G_{p} = (G_{1}, G_{2}, \dots, G_{m})$, satisfying:

For $\forall G_{i} \in G_{p}$ , $G_{i} = (V_{i}, E_{i})$ , $G_{i}$ is a subgraph of $G_{pj}$ .

$⋃_{i = 1}^{m} V_{i} = V_{pj} .$

For $G_{i}, G_{j} \in G_{p}$ , $i \neq j$ , $V_{i} \cap V_{j} = \emptyset$ .

For $\forall u, w \in G_{i}$ , $\exists T_{j}$ , $ϕ_{V} (u) = ϕ_{V} (w) = T_{j}$ , $S_{j} (u) = S_{j} (w)$ .

For $\forall u, w \in G_{i}$ and $\forall G_{j}$ , if $\exists u^{'} \in G_{j}$ , $(u, u^{'}) \in E$ and $ϕ_{E} ((u, u^{'})) \in L$ , then $\exists w^{'} \in G_{j}$ , $(w, w^{'}) \in E$ and $ϕ_{E} ((w, w^{'})) \in L$ .

For $\forall u, v \in G_{i}$ , $(u, v) \in E_{i}$ , iff $(u, v) \in E$ and $ϕ_{E} ((u, v)) \in L$ .

The graph partition on a graph projection $G_{pj}$ of G is to partition the nodes of $G_{pj}$ according to the selected attributes and edge types. Figure 4 is a graph partition of Figure 3, where $S_{Author}$ = {Location}, $S_{Paper}$ = {Topic} and L = {Write}.

Figure 4.

Graph partition.

Definition 4. (Aggregate graph). Given a heterogeneous information network $G = (V, E, T, R, ϕ_{V}, ϕ_{E}, A, D, ϕ_{A})$, selected node types $Q = \{T_{1}, T_{2}, \dots, T_{l}\}$, $Q \subseteq T$, with selected attributes $S = \{S_{1}, S_{2}, \dots, S_{l}\}$, $S_{i} \subseteq ϕ_{A} (T_{i})$, and selected edge types $L = \{R_{1}, R_{2}, \dots, R_{k}\}$, $L \subseteq R$, let $G_{pj} = (V_{pj}, E_{pj})$ be the graph projection of G on Q and L, and let $G_{p} = (G_{1}, G_{2}, \dots, G_{m})$ be the graph partition of $G_{pj}$ w.r.t. Q, S and L. The aggregate graph of G on Q, S and L is a graph $G_{c} = (V_{c}, E_{c}, f_{1}, f_{2})$, where $V_{c}$ is the node set, $E_{c}$ is the edge set, $f_{1}$ is the aggregate function on $V_{c}$ and $f_{2}$ is the aggregate function on $E_{c}$, satisfying:

$| V_{c} | = | G_{p} |$ .

$\forall a \in V_{c}$ , a corresponds to a subgraph $G_{a} \in G_{p}$ .

$\forall a, b \in V_{c}$ , a corresponds to a subgraph $G_{a}$ and b corresponds to a subgraph $G_{b}$ , if $a \neq b$ , then $G_{a} \neq G_{b}$ .

$\forall a \in V_{c}$ , a corresponds to a subgraph $G_{a}$ , $f_{1} (a) = f_{1} (V_{a})$ .

$\forall a, b \in V_{c}$ , a corresponds to a subgraph $G_{a}$ and b corresponds to a subgraph $G_{b}$ , $(a, b) \in E_{c}$ , iff $\exists u \in V_{a}$ , $w \in V_{b}$ , $(u, w) \in E$ and $ϕ_{E} ((u, w)) \in L$ .

$\forall (a, b) \in E_{c}$ , a corresponds to a subgraph $G_{a}$ and b corresponds to a subgraph $G_{b}$ , $f_{2} ((a, b)) = f_{2} (\{(u, w) \mid u \in V_{a}, w \in V_{b}, (u, w) \in E, ϕ_{E} ((u, w)) \in L\})$ .

Figure 5 shows an aggregate graph on the bibliographic network. The selected node types are Q = {Author, Paper}, with selected attributes S = { $S_{Author}$ , $S_{Paper}$ }, where $S_{Author}$ = {Location} and $S_{Paper}$ = {Topic}, and the selected edge types are L = {Write}. Both $f_{1}$ and $f_{2}$ are COUNT. We call each $a \in V_{c}$ an aggregate node and each $e \in E_{c}$ an aggregate edge. To avoid confusion, we use ‘nodes’ to refer to nodes in heterogeneous information networks, and ‘aggregate nodes’ to refer to nodes in aggregate graphs. In real-world applications, aggregate functions can be selected freely, for example, COUNT and AVERAGE.

Figure 5.

Aggregate graph.
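As an illustration of Definition 4, the sketch below builds an aggregate graph from a given partition, with the aggregate functions passed in as parameters. The function name `aggregate_graph`, the string labels of the aggregate nodes and the Write edges are assumptions; the grouping of authors by Location and papers by Topic follows Tables 1 and 2.

```python
def aggregate_graph(partition, edge_type, L, f1=len, f2=len):
    """Build (V_c, E_c) with aggregate values from a node partition.

    partition: aggregate node label -> set of member node ids (V_a).
    f1 is applied to V_a, f2 to the set of original edges between V_a and V_b;
    len stands in for COUNT.
    """
    owner = {v: a for a, members in partition.items() for v in members}
    bundles = {}
    for (u, w), r in edge_type.items():
        if r in L and u in owner and w in owner:
            bundles.setdefault((owner[u], owner[w]), set()).add((u, w))
    V_c = {a: f1(members) for a, members in partition.items()}
    E_c = {ab: f2(edges) for ab, edges in bundles.items()}
    return V_c, E_c


partition = {
    "Author/NY": {1, 2}, "Author/CA": {3, 4},
    "Paper/DM": {7, 8}, "Paper/DB": {9, 10},
}
edge_type = {  # hypothetical edges, for illustration only
    (1, 7): "Write", (2, 8): "Write", (3, 7): "Write",
    (3, 9): "Write", (4, 8): "Write", (4, 10): "Write", (9, 7): "Cite",
}
V_c, E_c = aggregate_graph(partition, edge_type, L={"Write"})
print(V_c)
print(E_c)
```

An aggregate edge (a, b) appears exactly when at least one original edge of a selected type runs from a member of a to a member of b, and swapping `f2` for another function over the edge set changes only the reported value.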

3.2. Graph entropy

As shown in Definition 4, nodes are aggregated when they have the same values of the selected attributes and the same neighbours under the selected types of relations. Unfortunately, these constraints are often too strict in practice. The constraint of identical attribute values is easy to satisfy; however, the constraint of identical neighbours is difficult to satisfy. We therefore relax the definition and aggregate nodes that have the same values of the selected attributes and similar structures with respect to the selected types of relations. We propose a novel notion to measure structural similarity.

In information theory [52], entropy is a measure of uncertainty associated with a random variable. Shannon entropy [53] is proposed for measuring the uncertainty, disorder and confusion of the system in information theory. Dehmer proposes to quantify the complexity of a graph using Shannon’s definition of entropy such that the probability distribution is computed from node degrees [54]. The original definition of graph entropy is introduced by Korner [55], who quantifies the lower bound on the complexity of graphs. Graph entropy [56, 57] based on Shannon entropy has been widely used in graph mining. This measure quantifies entropy using the localized features of a graph’s nodes such as closeness centrality and degree centrality. These centrality measures, however, do not fully capture the complexity of a graph, that is, they are limited to local neighbourhoods, and the approach appears to be ad hoc owing to the arbitrary choice of distribution functions. Intuitively, entropy is a means to quantify the amount of uncertainty in a distribution. We employ graph entropy to measure the structural consistency of nodes. Next we propose a graph entropy definition which is a variant of the classic graph entropy called topological information content [58].

Definition 5 (graph entropy). Given a heterogeneous information network $G = (V, E, T, R, ϕ_{V}, ϕ_{E}, A, D, ϕ_{A})$ , $G_{c} = (V_{c}, E_{c}, f_{1}, f_{2})$ is an aggregate graph of G on Q, S and L. For $\forall a = (V_{a}, E_{a}), b = (V_{b}, E_{b}) \in V_{c}$ , the entropy from a to b is:

$$H_{b} (a) = \begin{cases} - \frac{| V_{b} (a) |}{| V_{a} |} \cdot \log_{2} \frac{| V_{b} (a) |}{| V_{a} |} & | V_{b} (a) | \neq 0 \\ 0 & | V_{b} (a) | = 0 \end{cases}$$

(1)

where $V_{b} (a) = \{v \mid v \in V_{a}, \exists w \in V_{b}, (v, w) \in E, ϕ_{E} ((v, w)) \in L\}$. The entropy of a is:

$$H (a) = \sum_{b \in V_{c}} λ_{a, b} H_{b} (a)$$

(2)

The entropy from a to b measures the structural similarity of the nodes in a with regard to b. $λ_{a, b}$ is the weight of the relations from nodes in a to b, reflecting the different importance of different types of relations. The smaller H(a) is, the more similar the structures of the nodes in a are. When H(a) = 0, for every aggregate node b, either all nodes in a have neighbours in b or none of them do.
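Equations (1) and (2) translate directly into code. The sketch below is only an illustration under stated assumptions: the function names, the dictionary inputs, the default weight of 1 for every $λ_{a, b}$ and the toy edges are not taken from the paper.

```python
import math


def entropy_to(V_a, V_b, edge_type, L):
    """H_b(a) from equation (1)."""
    if not V_a:
        return 0.0
    # V_b(a): members of a with at least one selected-type edge into b.
    linked = {u for (u, w), r in edge_type.items()
              if r in L and u in V_a and w in V_b}
    if not linked:
        return 0.0
    p = len(linked) / len(V_a)
    return p * math.log2(1 / p)  # equals -p * log2(p)


def entropy(a, groups, edge_type, L, weight=lambda a, b: 1.0):
    """H(a) from equation (2); weight(a, b) plays the role of lambda_{a,b}."""
    return sum(weight(a, b) * entropy_to(groups[a], groups[b], edge_type, L)
               for b in groups)


groups = {"a": {1, 2, 3, 4}, "b": {5, 6}}
edge_type = {(1, 5): "Write", (2, 5): "Write"}  # hypothetical edges
H_a = entropy("a", groups, edge_type, {"Write"})
H_b = entropy("b", groups, edge_type, {"Write"})
print(H_a, H_b)
```

Here half of the nodes in a have a neighbour in b, so H(a) = 0.5, while no node in b has an outgoing selected edge, so H(b) = 0.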

Example 3. As shown in Figure 6, three aggregate nodes a, b and c are displayed. Assuming that all the weights of relations are 1, the entropies of aggregate nodes are given as follows:

$$H (a) = H_{b} (a) + H_{c} (a) = \left(- \frac{3}{5} \log_{2} \frac{3}{5}\right) + \left(- \frac{3}{5} \log_{2} \frac{3}{5}\right) \approx 0.88$$

(3)

$$H (b) = H_{a} (b) + H_{c} (b) = \left(- \frac{4}{4} \log_{2} \frac{4}{4}\right) + 0 = 0$$

(4)

$$H (c) = H_{a} (c) + H_{b} (c) = \left(- \frac{5}{6} \log_{2} \frac{5}{6}\right) + 0 \approx 0.22$$

(5)

Figure 6.

Graph entropy.

The entropy of b is the lowest: all nodes in b have neighbours in a and none have neighbours in c, so they have the same structures. The entropy of a is the largest: three of its five nodes have relations with b, and three have relations with c. Graph entropy thus provides a powerful mechanism to aggregate nodes with similar structures.
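The arithmetic of Example 3 can be checked with a few lines of Python. The helper `h` is a hypothetical name; it takes the counts $| V_{b} (a) |$ and $| V_{a} |$ read off equations (3)–(5) and assumes all weights are 1, as in the example.

```python
import math


def h(linked, size):
    """One term of equation (1), given |V_b(a)| = linked and |V_a| = size."""
    if linked == 0:
        return 0.0
    p = linked / size
    return p * math.log2(1 / p)  # equals -p * log2(p)


H_a = h(3, 5) + h(3, 5)  # equation (3)
H_b = h(4, 4) + h(0, 4)  # equation (4)
H_c = h(5, 6) + h(0, 6)  # equation (5)
print(round(H_a, 2), round(H_b, 2), round(H_c, 2))  # prints: 0.88 0.0 0.22
```

The ordering H(b) < H(c) < H(a) agrees with the discussion of the example.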

When the graph entropy is 0, nodes in the same aggregate node have the same structures. As mentioned before, this constraint is too strict and leads to a large aggregate graph. We therefore take the size of the aggregate graph into account, balancing it against the structural similarity of nodes in the same aggregate node.

Definition 6 (C-function). Given a heterogeneous information network $G = (V, E, T, R, ϕ_{V}, ϕ_{E}, A, D, ϕ_{A})$, selected node types $Q = \{T_{1}, T_{2}, \dots, T_{l}\}$, $Q \subseteq T$, with selected attributes $S = \{S_{1}, S_{2}, \dots, S_{l}\}$, $S_{i} \subseteq ϕ_{A} (T_{i})$, and selected edge types $L = \{R_{1}, R_{2}, \dots, R_{k}\}$, $L \subseteq R$, let $G_{c} = (V_{c}, E_{c}, f_{1}, f_{2})$ be an aggregate graph of G on Q, S and L. The C-function of $G_{c}$ is:

$$F (G_{c}) = \sum_{i = 1}^{l} α_{i} \sqrt{\mathrm{NUM}_{T_{i}}} + \sum_{a \in V_{c}} H (a)$$

(6)

where $\mathrm{NUM}_{T_{i}} = | \{a \mid a \in V_{c}, \forall u \in V_{a}, ϕ_{V} (u) = T_{i}\} |$ is the number of aggregate nodes with type $T_{i}$, and $α_{i}$ is used to distinguish the importance of different types of nodes. An aggregate graph with fewer aggregate nodes is easier for users to understand. However, with fewer aggregate nodes the graph entropy increases, so the nodes within an aggregate node are less structurally similar.
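The trade-off in equation (6) can be seen in a small sketch. The function name `c_function`, the input layout (a type and a precomputed entropy per aggregate node) and the numbers are hypothetical and serve only to show how the two terms combine.

```python
import math
from collections import Counter


def c_function(group_type, group_entropy, alpha):
    """F(G_c) from equation (6).

    group_type: aggregate node -> node type T_i of its members;
    group_entropy: aggregate node -> H(a); alpha: node type -> alpha_i.
    """
    num = Counter(group_type.values())  # NUM_{T_i} for each selected type
    size_term = sum(alpha[t] * math.sqrt(n) for t, n in num.items())
    entropy_term = sum(group_entropy[a] for a in group_type)
    return size_term + entropy_term


# Hypothetical aggregate graph: four Author aggregate nodes, one Paper aggregate node.
group_type = {"a1": "Author", "a2": "Author", "a3": "Author",
              "a4": "Author", "p1": "Paper"}
group_entropy = {"a1": 0.5, "a2": 0.0, "a3": 0.0, "a4": 0.0, "p1": 0.25}
alpha = {"Author": 1.0, "Paper": 1.0}
F = c_function(group_type, group_entropy, alpha)
print(F)  # sqrt(4) + sqrt(1) + 0.75
```

Splitting an aggregate node raises the size term through the square root but can lower the entropy term, which is the balance the C-function is meant to capture.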

3.3. Problem definition

Problem: Flexible aggregation on heterogeneous information networks.

Input: A heterogeneous information network $G = (V, E, T, R, ϕ_{V}, ϕ_{E}, A, D, ϕ_{A})$, node types $Q = \{T_{1}, T_{2}, \dots, T_{l}\}$, $Q \subseteq T$, attributes $S = \{S_{1}, S_{2}, \dots, S_{l}\}$, $S_{i} \subseteq ϕ_{A} (T_{i})$, edge types $L = \{R_{1}, R_{2}, \dots, R_{k}\}$, $L \subseteq R$, aggregate functions $f_{1}$ and $f_{2}$.

Output: Aggregate graph $G_{c} = (V_{c}, E_{c}, f_{1}, f_{2})$ .

Objective: Minimize $F (G_{c})$ .

Theorem 1. The problem of flexible aggregation on heterogeneous information networks is NP-hard.

Proof. We use proof by restriction to prove the NP hardness of this problem.

The problem is in NP, because a non-deterministic algorithm only needs to guess an aggregation and check it in polynomial time. The aggregation can be generated by a polynomial time algorithm.

We impose the restrictions $α_{i} = 1$ and $λ_{a, b} = 1$, so that the C-function becomes $F (G_{c}) = \sqrt{| V_{c} |} + \sum_{a \in V_{c}} \sum_{b \in V_{c}} H_{b} (a)$. If we further fix the size of the aggregate graph $| V_{c} |$, minimizing the C-function reduces to minimizing the graph entropy $\sum_{a \in V_{c}} \sum_{b \in V_{c}} H_{b} (a)$. If, in addition, all aggregate nodes have equal size, the problem becomes the Minimum-Entropy Set Cover Problem [59], which is NP-hard. Hence the original problem is NP-hard.

4. Aggregation algorithm

In this section, we introduce an efficient aggregation algorithm for aggregate queries. Since the aggregation problem is NP-hard, it is intractable to give an exact solution in polynomial time. We propose a heuristic algorithm to aggregate networks with linear time cost. To distinguish nodes in semantics of attributes and structures, we design a two-phase aggregation framework, containing informational aggregation and structural aggregation.

Informational aggregation. In heterogeneous information networks, according to the selected types of nodes and attributes, partition nodes while guaranteeing that nodes aggregated together have the same types and attribute values.

Structural aggregation. Next, we make the nodes in the same aggregate node consistent in structure. We partition the aggregate nodes iteratively according to the structures of their nodes. The iteration does not stop until the C-function value reaches a minimum or the size of the aggregate graph exceeds a specific threshold.

4.1. Informational aggregation

This process depends on the selected node types and attributes. We first traverse the network to obtain the graph projection of G on Q and L, then partition the nodes of the graph projection according to S to obtain the graph partition. Nodes with the same type and the same values of S are aggregated together. Finally, we compute the aggregate values. Table 4 presents the pseudo-code of Algorithm 1.

Table 4.

Informational aggregation on heterogeneous information networks.

Algorithm 1:	Informational aggregation on heterogeneous information networks.
Input:	Network $G$ , node types $Q$ , attributes $S$ , edge types $L$ , aggregate functions $f_{1}$ , $f_{2}$ ;
Output:	Aggregate graph $G_{c} = (V_{c}, E_{c}, f_{1}, f_{2})$ .
Initialization:
1	Initialize $V_{c}$ and $E_{c}$ to be NULL;
Aggregate node:
2	for all $u \in V$ do
3	if $ϕ_{V} (u) \in Q$ , $ϕ_{V} (u) = T_{i}$ and aggregate node $u^{'}$ does not yet exist then
4	Construct an aggregate node $u^{'} \in V_{c}$ according to $T_{i}$ and $S_{i} (u)$ ;
5	Insert u into $u^{'}$ ;
6	Update the aggregate value $f_{1} (u^{'})$ ;
Aggregate edge:
7	for all $a, b \in V_{c}$ do
8	if $u \in a$ , $v \in b$ , $(u, v) \in E$ , $(a, b) \notin E_{c}$ and $ϕ_{E} ((u, v)) \in L$ then
9	Construct an aggregate edge $(a, b) \in E_{c}$ ;
10	Compute the aggregate value $f_{2} ((a, b))$ ;
11	return $G_{c} = (V_{c}, E_{c}, f_{1}, f_{2})$ .

In Algorithm 1, we first initialize the node and edge sets of the aggregate graph to be NULL (line 1). Then we traverse the heterogeneous information network G. For each node u in V, if the type of u belongs to the query types and no aggregate node with the same type and attribute values as u exists yet, we create $u^{'}$ and insert it into $V_{c}$ (lines 2–4). We then insert u into the aggregate node $u^{'}$ and update $f_{1} (u^{'})$ (lines 5–6). For each pair of aggregate nodes a and b, if there are nodes $u \in a, v \in b$ with $(u, v) \in E$ and the type of $(u, v)$ belongs to L, we create an edge $(a, b) \in E_{c}$ (lines 7–9) and compute its aggregate value $f_{2} ((a, b))$ (line 10). Finally, we return the informational aggregate graph $G_{c} = (V_{c}, E_{c}, f_{1}, f_{2})$ (line 11). The time complexity of Algorithm 1 is $O (| V | + | E |)$ , used for traversing G. The space complexity is $O (| V | + | E |)$ , consumed by the aggregate graph.
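As an illustration only, the steps of Algorithm 1 can be sketched in Python as follows. The data layout (a dictionary of nodes and a list of typed edges), the function name and the use of COUNT for both $f_{1}$ and $f_{2}$ are assumptions made for this sketch and are not part of the original pseudo-code. The sketch scans the edge list once instead of enumerating pairs of aggregate nodes, which is one way to stay within a single traversal of G.

```python
from collections import defaultdict


def informational_aggregation(nodes, edges, Q, S, L):
    """Hypothetical sketch of Algorithm 1 with COUNT as f1 and f2.

    nodes: {node_id: (node_type, {attribute: value})}
    edges: iterable of (u, v, edge_type)
    Q: set of selected node types
    S: {node_type: list of selected attribute names}
    L: set of selected edge types
    """
    members = defaultdict(set)  # aggregate-node key -> original nodes
    owner = {}                  # original node -> aggregate-node key

    # Aggregate nodes: same type and same values on the selected attributes.
    for u, (node_type, attrs) in nodes.items():
        if node_type not in Q:
            continue
        key = (node_type, tuple(attrs.get(s) for s in S.get(node_type, ())))
        members[key].add(u)
        owner[u] = key
    node_count = {key: len(group) for key, group in members.items()}

    # Aggregate edges: one pass over E, counting edges of the selected types.
    edge_count = defaultdict(int)
    for u, v, edge_type in edges:
        if edge_type in L and u in owner and v in owner:
            edge_count[(owner[u], owner[v])] += 1

    return dict(members), node_count, dict(edge_count)
```

With an empty attribute list for a node type, all nodes of that type fall into a single aggregate node, which corresponds to a query whose attribute set for that type is empty.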

4.2. Structural aggregation

Informational aggregation ensures that nodes in the same aggregate node have identical types and attribute values. However, this is not enough due to the complex structural characteristics of heterogeneous information networks. We can reduce the value of the C-function by decreasing graph entropy. The task of structural aggregation is to iteratively partition aggregate nodes according to their graph entropies. Structural aggregation faces three challenges: which aggregate node to partition; what strategy to use for the partition; and when the iteration should stop. Next we discuss how to tackle the three challenges.

4.2.1. Challenge 1

In order to decrease the value of the C-function, we choose the aggregate node with the largest graph entropy. This is reasonable, since the nodes in it have diverse structures; after partitioning, the structural similarities of nodes are improved. Meanwhile, to improve the readability of aggregate graphs, among aggregate nodes with the same graph entropy we first choose the one with the larger size, since dividing small aggregate nodes into smaller ones is not meaningful. Definition 7 measures this balance between readability and structural consistency. In each iteration, we choose the aggregate node with the largest partition level, defined as follows.

Definition 7 (partition level). Given a heterogeneous information network $G = (V, E, T, R, ϕ_{V}, ϕ_{E}, A, D, ϕ_{A})$ , $G_{c} = (V_{c}, E_{c}, f_{1}, f_{2})$ is an aggregate graph of G. For $a \in V_{c}$ , the partition level of a is:

$P (a) = \sqrt{| a |} \cdot H (a)$ (7)
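A small numerical sketch may help to see how equation (7) trades off size against entropy. The sizes and entropy values below are invented for illustration, and the function names are hypothetical; the graph entropy H(a) is taken as a given input here.

```python
import math


def partition_level(size, entropy):
    """P(a) = sqrt(|a|) * H(a), following equation (7)."""
    return math.sqrt(size) * entropy


def pick_node_to_partition(candidates):
    """candidates: {name: (size, entropy)}; returns the name with the largest P."""
    return max(candidates, key=lambda a: partition_level(*candidates[a]))


# Illustrative values only: "a" and "b" share the same entropy, "c" has a
# higher entropy but far fewer nodes than "a".
candidates = {"a": (100, 0.5), "b": (4, 0.5), "c": (9, 0.9)}
chosen = pick_node_to_partition(candidates)
```

In this toy setting the largest aggregate node is selected even though a smaller one has a higher entropy, because the square root of the size enters the score.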

4.2.2. Challenge 2

In each iteration, we select an aggregate node and partition it according to the structures of the nodes it contains. How do we partition the aggregate node? We could aggregate the nodes that have the same relations of L to the other aggregate nodes. However, this would result in high time cost. Moreover, it is almost impossible for nodes to have exactly the same relations of L to the other aggregate nodes. It is better to group nodes with similar relations together. In order to make the algorithm efficient, we design a heuristic strategy for this challenge. In each iteration, we divide the aggregate node a into two aggregate nodes according to the nodes’ neighbours in t, where $t = \arg\max_{b} λ_{a, b} H_{b} (a)$ . Intuitively, this means that we partition the nodes of the aggregate node with respect to a single aggregate node t. Nodes that have neighbours in t are placed in one aggregate node, and the others are placed in another.
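The binary split described above can be sketched as follows. The adjacency dictionary (assumed to contain only edges of the selected relation types), the precomputed weighted entropies and both function names are assumptions of this sketch.

```python
def choose_pivot(weighted_entropy):
    """weighted_entropy: {b: lambda_{a,b} * H_b(a)}; returns the arg max t."""
    return max(weighted_entropy, key=weighted_entropy.get)


def split_by_neighbours(a_members, t_members, adjacency):
    """Split aggregate node a into (nodes with a neighbour in t, the rest).

    adjacency: {node: set of neighbouring nodes}, restricted to the
    selected relation types (an assumption of this sketch).
    """
    t_set = set(t_members)
    with_neighbour = {u for u in a_members if adjacency.get(u, set()) & t_set}
    return with_neighbour, set(a_members) - with_neighbour
```

Each split only inspects the neighbours of the nodes in a, so its cost is bounded by the number of edges incident to a.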

4.2.3. Challenge 3

In view of C-function minimization, the size of aggregate graphs should be moderate: there is a trade-off between the size of the aggregate graph and the nodes’ structural similarities. From the users’ perspective, a large number of aggregate nodes is hard to analyse and summarize, so we should constrain the size of aggregate graphs. In the beginning, the C-function value decreases as aggregate nodes are partitioned. After that, as the number of aggregate nodes increases, the C-function value also increases. If we do not stop the partition, it will lead to a large aggregate graph. The iteration does not terminate until the C-function reaches the first maximal value or the size of the aggregate graph exceeds a specific threshold. Algorithm 2, shown in Table 5, gives the pseudo-code of structural aggregation. Next we make a detailed analysis.

Table 5.

Structural aggregation on heterogeneous information networks.

Algorithm 2:	Structural aggregation on heterogeneous information networks.
Input:	Network $G$ , informational network $G_{c} = (V_{c}, E_{c}, f_{1}, f_{2})$ , max size k, $α_{i}$ , $λ_{ij}$ ;
Output:	Aggregate graph $G_{c} = (V_{c}, E_{c}, f_{1}, f_{2})$ .
1	Initialize max heap δ to be NULL;
2	for all $a \in V_{c}$ do
3	for all $b \in V_{c} (a \neq b)$ do
4	Compute graph entropy from a to b, $H_{b} (a)$ ;
5	Compute graph entropy of a, $H (a) = \sum_{b \in V_{c}} λ_{T_{a}, T_{b}} H_{b} (a)$ ;
6	Compute partition level P(a);
7	Construct the max heap δ based on P(a);
8	repeat
9	Fetch aggregate node a with the largest value from δ;
10	Partition a into two aggregate nodes $a_{1}$ and $a_{2}$ ;
11	Delete a and insert $a_{1}$ and $a_{2}$ into $V_{c}$ ;
12	for all $b \in V_{c}$ do
13	Compute graph entropy from b to $a_{1}$ and $a_{2}$ ;
14	Compute graph entropy of $a_{1}$ and $a_{2}$ ;
15	Update δ;
16	Compute C-function $F (G_{c})$ ;
17	until $F (G_{c})$ reaches maximal value or $| V_{c} | = k$ ;
18	return $G_{c}$ .

In Algorithm 2, we first initialize an empty max heap (line 1). We then compute the partition levels of all aggregate nodes (lines 2–6), which costs $O (| V_{c} |^{2} \cdot | E |)$ time. Next we construct the max heap for aggregate nodes based on partition levels, which consumes $O (| V_{c} |)$ time (line 7). In each iteration, we select the aggregate node with the largest partition level from the max heap, partition it into two aggregate nodes and update $V_{c}$ (lines 9–11). We then update the graph entropies of nodes in $V_{c}$ (lines 12–14), update the max heap (line 15) and compute the C-function value $F (G_{c})$ of the aggregate graph (line 16). The time complexity of the above steps is $O (| V | + | E | + | V_{c} | + | V_{c} |^{2} \cdot | E |)$ . The iteration does not terminate until $F (G_{c})$ reaches the first maximal value or the size of the aggregate graph is k (line 17). The time complexity of Algorithm 2 is $O (min (t, k) (| V | + | E | + | V_{c} | + | V_{c} |^{2} \cdot | E |))$ , where t is the number of iterations. Since the constraint $| V_{c} | \ll | V |$ is reasonable, we have $t \ll | V |$ and $k \ll | V |$ . Therefore, the time complexity of Algorithm 2 is $O (| V | + | E |)$ . The max heap and the aggregate graph take $O (| V_{c} |)$ and $O (| V | + | E |)$ space, respectively, so the space complexity of Algorithm 2 is $O (| V | + | E |)$ . Overall, the aggregation algorithm costs $O (| V | + | E |)$ time and space.
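The heap-driven loop of Algorithm 2 can be illustrated with Python's `heapq` module, which provides a min heap, so partition levels are stored negated. This is a simplified sketch under several assumptions: the partition level and the split strategy are passed in as hypothetical callables `level` and `split`, the entropies of the other aggregate nodes are not recomputed after a split, and only the size threshold k is used as the stopping rule, since the C-function is defined outside this section.

```python
import heapq
from itertools import count


def structural_aggregation(groups, level, split, k):
    """Simplified sketch of the main loop of Algorithm 2.

    groups: list of frozensets (aggregate nodes from informational aggregation)
    level(group) -> partition level P of the group (hypothetical callable)
    split(group) -> (g1, g2), the two parts of the group (hypothetical callable)
    k: maximum number of aggregate nodes
    """
    tie = count()  # tie-breaker so the heap never has to compare frozensets
    heap = [(-level(g), next(tie), g) for g in groups]
    heapq.heapify(heap)  # linear in the number of aggregate nodes
    done = []            # aggregate nodes that cannot be split any further

    while heap and len(heap) + len(done) < k:
        _, _, a = heapq.heappop(heap)  # largest partition level first
        a1, a2 = split(a)
        if not a1 or not a2:
            done.append(a)
            continue
        for part in (a1, a2):
            heapq.heappush(heap, (-level(part), next(tie), part))

    return done + [g for _, _, g in heap]
```

A fuller version would refresh the partition levels of the neighbouring aggregate nodes after every split and evaluate the C-function once per iteration.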

5. Experimental evaluations

All experiments were done on a Microsoft Windows 7 machine with an Intel® Core i5-2400 CPU at 3.1 GHz and 4 GB main memory. Programs were compiled by Microsoft Visual Studio 2010.

5.1. Datasets and experimental setup

We used the real-world heterogeneous information network from Amazon [60]. The network form is given in Figure 7. It contains three types of nodes: Customer; Product; and Category. Four types of relations exist among nodes: Co-Purchase relations between products; Purchase relations between customers and products; Classify relations between products and categories; and Like relations between customers, which connect customers who have reviewed the same products. Each type of node has a set of attributes: Customer (Name, Purchase times); Product (Name, Rank, Score, Price, Reviews); and Category (Name). In the effectiveness evaluation, the dataset includes 53,183 customers, 5000 products and four categories, which are Music, Book, DVD and Video. There are 147 edges of Co-Purchase, 77,997 edges of Purchase, 7231 edges of Like and 5000 edges of Classify. We use COUNT as the aggregate function of both nodes and edges. The parameters are set as follows: $α_{Customer}$ = 1, $α_{Product}$ = 1, $α_{Category}$ = 1, $λ_{Product \leftrightarrow Product}$ = 15, $λ_{Customer \leftrightarrow Product}$ = 2, $λ_{Customer \leftrightarrow Category}$ = 1 and k = 20. Based on the distribution of the number of purchases, we assign an attribute Purchase-Power to Customer: customers who purchased 1 product are tagged Low; those who purchased 2–9 products are tagged Medium; and those who purchased more than 9 products are tagged High.
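The Purchase-Power tagging rule stated above is simple enough to write down directly. The function name is hypothetical, and rejecting counts below 1 is an assumption of this sketch, since the text does not say how customers without purchases are handled.

```python
def purchase_power(purchase_times):
    """Map a customer's number of purchased products to a Purchase-Power tag."""
    if purchase_times < 1:
        # Assumption: every customer in the dataset has at least one purchase.
        raise ValueError("purchase_times must be at least 1")
    if purchase_times == 1:
        return "Low"
    if purchase_times <= 9:
        return "Medium"
    return "High"
```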

Figure 7.

Amazon network.

5.2. Effectiveness evaluation

We first evaluate the performance of the aggregation algorithms. We compare our algorithm with that of Zhao et al. [2], which aggregates nodes with the same attribute values in homogeneous networks and does not consider the structure of graphs. In our experiments, we adapt the compared algorithm by applying it to each type of node in heterogeneous information networks, without taking the structures into consideration. In effect, the compared algorithm is equivalent to informational aggregation. Our experimental results demonstrate that our algorithm performs better by taking the structural aggregation into consideration, which can provide more accurate and implicit knowledge with a wealth of information.

In our experiments, we evaluate the effectiveness of the algorithms on queries 1 and 2.

Query 1. A set of node types Q = {Product, Category} with attributes $S = {S_{Product}, S_{Category}}$ and relation types L = {Co-Purchase, Classify}, where $S_{Product} = \emptyset$ and $S_{Category} = {Name}$ .

Query 2. A set of node types Q = {Customer, Product, Category} with attributes $S = {S_{Customer}, S_{Product}, S_{Category}}$ and relation types L = {Co-Purchase, Classify, Purchase}, where $S_{Customer}$ = {Purchase-Power}, $S_{Product} = \emptyset$ and $S_{Category} = {Name}$ .

Figure 8 shows the aggregate graph of query 1 by the compared algorithm. The values of nodes and edges represent their aggregate values. For clarity, we use dotted lines to represent the edges whose function values are below 10, and solid lines for the other edges. From Figure 8 we can observe that the largest category of products is books, music stands second, and DVD and videos last. The co-purchased DVD products are not co-purchased with music and DVD. The same situation exists in the co-purchased products of videos. Products of music and DVD are not co-purchased, nor are music and videos. Figure 9 presents the aggregate results of query 1 by our algorithm. From Figure 9 we can see that our algorithm presents a deeper result than the compared algorithm. Besides the information we can get from Figure 8, much richer information can be obtained from Figure 9. Books that were co-purchased with music products are not likely to be co-purchased with DVD and video products. Meanwhile, the co-purchased music products are not co-purchased with books, which may be co-purchased with DVD and video products. The aggregate results are interesting after considering structures.

Figure 8.

Aggregate graph of query 1 by the compared algorithm.

Figure 9.

Aggregate graph of query 1 by our algorithm.

Figure 10 presents the aggregate result of query 2 by the compared algorithm. Edges between customers and products with aggregate values below 1000 are displayed by dotted lines. Otherwise, customers and products are connected by solid lines. In Figure 10, we can see that more than 80% of customers have low Purchase-Power. Only a few customers have high Purchase-Power. Figure 11 presents the aggregate result of query 2 by our algorithm. After adding structural information, we can see that 25% of customers with low Purchase-Power only buy music products, which are never co-purchased with books. Customers who have purchased once tend to buy music products. All three types of customers have co-purchased books and music products. Book, music and DVD products are never co-purchased, nor are book, music and video products. Query 2 is a drill-down operation of query 1 by adding a node type Customer and a relation type Purchase. Similarly, query 1 is a roll-up operation of query 2 by removing the node type Customer and relation type Purchase. OLAP on heterogeneous information networks can help users to mine valuable information from different perspectives. For example, the complicated mutual relationships between products and categories can be observed from a coarse view such as in query 1, or in a detailed way such as in query 2.
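The drill-down and roll-up relationship between queries 1 and 2 can be made concrete by treating a query as a triple (Q, S, L). The tuple representation and the two helper functions are assumptions of this sketch and only manipulate the query specification, not the aggregate graph.

```python
def drill_down(query, node_type, attrs, relation):
    """Add a node type (with its attributes) and a relation type to a query."""
    Q, S, L = query
    return (Q | {node_type}, {**S, node_type: frozenset(attrs)}, L | {relation})


def roll_up(query, node_type, relation):
    """Remove a node type and a relation type from a query."""
    Q, S, L = query
    remaining = {t: a for t, a in S.items() if t != node_type}
    return (Q - {node_type}, remaining, L - {relation})


# Query 1 as given in the text: Product has no selected attributes.
query1 = (
    frozenset({"Product", "Category"}),
    {"Product": frozenset(), "Category": frozenset({"Name"})},
    frozenset({"Co-Purchase", "Classify"}),
)
# Query 2 is obtained by drilling down on Customer and Purchase.
query2 = drill_down(query1, "Customer", {"Purchase-Power"}, "Purchase")
```

Rolling up `query2` on Customer and Purchase returns a triple equal to `query1`.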

Figure 10.

Aggregate graph of query 2 by the compared algorithm.

Figure 11.

Aggregate graph of query 2 by our algorithm.

5.3. Efficiency evaluation

We first evaluate the runtime and C-function values by varying queries. We use four queries, and two of them have been described before. We give another two queries. Query 3 is the same as query 2 except for $S_{Product} = {Rank}$ . Query 4 is the same as query 3 except for L = {Co-Purchase, Like, Classify, Purchase}.

Figure 12 presents the runtime comparisons of different algorithms by varying queries. The x-axis represents the queries from 1 to 4 and the y-axis stands for the average runtime of queries. Each query is run 10 times and we compute the average runtime. Previous work only focuses on the attributes of nodes, while our work also considers the structures of networks. Therefore our method takes more time than the previous work. In query 1, the two algorithms cost almost the same time, while in query 2, our algorithm costs 13.2 s more than the compared algorithm. This is because query 2 adds a new relation type. Query 3 takes more time than query 2 since it adds a new attribute of Product. Query 4 differs from query 3 only in adding a new type of relation to the aggregation. The compared algorithm does not take relations into account, so it costs almost the same time on queries 3 and 4. In contrast, our algorithm takes more time on query 4 than on query 3. As we can see in the effectiveness evaluation, although our algorithm costs more time than the compared one, we can get much richer results.

Figure 12.

Runtime.

Figure 13 displays the comparisons of C-function values of different algorithms. The x-axis represents queries 1–4 and the y-axis shows the values of the C-function. Although the aggregation including structural aggregation costs more time, the C-function values of its aggregate graphs are much smaller than those of the compared algorithm. Nodes aggregated together have high similarities of attributes and structures. The compared algorithm only considers the attributes of nodes. The values of the C-function by the compared algorithm increase from query 1 to query 4, since the numbers of aggregate nodes increase. On the contrary, our algorithm performs much better: the values of the C-function decrease from query 1 to query 4. Our algorithm decreases the graph entropy by taking the structures into consideration.

Figure 13.

C-function.

Next we test the scalability of our algorithm by varying the sizes of datasets: 1000, 2000, 3000, 4000, 5000, 10,000, 15,000 and 20,000 products are selected, and nodes of other types corresponding to these products are added. The number of categories is always 4. The scalable datasets are displayed in Table 6. We use the four queries mentioned before to test the scalability. In our experiments, we set k = 20 on datasets 1–5, and k = 40 on datasets 6–8. The last column of Table 6 shows the runtime of the algorithm on different sizes of data. As we can see, runtime increases with increasing size of dataset. The total number of edges has a more obvious impact on runtime than the total number of nodes. More edges cost more time in structural aggregation: not only does each iteration last longer, but the number of iterations also increases. When the experimental data reaches about 500,000 nodes and edges (dataset 8), the runtime is slightly longer than 1 min. This runtime is acceptable for applying the algorithm to large datasets.

Table 6.

Scalability testing with various datasets.

Dataset	Product	Customer	Purchase	Co-purchase	Like	Classify	Total nodes	Total edges	Runtime (s)
1	1000	14,787	19,542	18	355	1000	15,791	20,915	4.65
2	2000	22,715	30,838	34	1348	2000	24,719	34,220	8.23
3	3000	32,352	45,564	62	2934	3000	35,356	51,560	9.87
4	4000	41,529	60,885	85	4254	4000	45,533	69,224	13.44
5	5000	53,183	77,997	147	7231	5000	58,187	90,375	16.13
6	10,000	93,119	149,306	484	16,072	10,000	103,123	175,862	26.51
7	15,000	126,467	210,427	996	27,853	15,000	141,471	254,276	43.27
8	20,000	158,213	278,288	1724	43,975	20,000	178,217	343,987	68.79
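The totals in Table 6 can be cross-checked with a few lines of arithmetic: total nodes are products plus customers plus the four categories, and total edges are the sum of the four relation counts. The row tuples below copy the table values; the variable and function names are hypothetical.

```python
# (product, customer, purchase, co_purchase, like, classify, runtime_s)
ROWS = [
    (1000, 14787, 19542, 18, 355, 1000, 4.65),
    (2000, 22715, 30838, 34, 1348, 2000, 8.23),
    (3000, 32352, 45564, 62, 2934, 3000, 9.87),
    (4000, 41529, 60885, 85, 4254, 4000, 13.44),
    (5000, 53183, 77997, 147, 7231, 5000, 16.13),
    (10000, 93119, 149306, 484, 16072, 10000, 26.51),
    (15000, 126467, 210427, 996, 27853, 15000, 43.27),
    (20000, 158213, 278288, 1724, 43975, 20000, 68.79),
]
CATEGORIES = 4  # the number of categories is fixed in all datasets


def totals(row):
    """Return (total nodes, total edges) for one row of Table 6."""
    product, customer, purchase, co_purchase, like, classify, _ = row
    return (product + customer + CATEGORIES,
            purchase + co_purchase + like + classify)


# Combined number of nodes and edges per dataset.
sizes = [sum(totals(row)) for row in ROWS]
```

For dataset 8 the combined size is 522,204 nodes and edges, with a runtime of 68.79 s.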

6. Conclusion and future work

In this work we have investigated the problem of aggregation on different types of nodes and relations in large-scale heterogeneous information networks. We prove that this problem is NP-hard. To solve the problem, we have presented an aggregation algorithm that has linear time and space complexity. The algorithm consists of two phases. In the first phase, we partition the nodes according to the selected types and attributes, so that the nodes aggregated together share the same types and attribute values. In the second phase, we further make sure the aggregated nodes have similar structures. We employ graph entropy to define a function that can effectively measure the structural similarities of nodes with regard to different types of relations.

From the experiments we can see that graph entropy can effectively help users aggregate the nodes with similar structures together. The experimental results show that a great improvement of C-function values is achieved by the proposed algorithm. Our aggregation algorithm performs much better than the compared algorithm, which only adopts informational aggregation. Additionally, our algorithm is also demonstrated to have good scalability on large datasets. When the experimental data reaches about 500,000 nodes and edges, the runtime is about 1 min. In summary, taking into account both attributes and structures is a sound approach to flexible aggregation on large-scale heterogeneous information networks.

As future work, we will extend our approach to knowledge graphs, which are schemaless and structureless. Aggregation on such graphs is not an easy task, especially for non-professional users: either no standard schema is available, or schemas become too complicated for users to fully grasp. It is difficult for them to specify the querying node and edge types explicitly. We will modify our algorithm to make it feasible on knowledge graphs.

Footnotes

Funding

This work is supported by the National Grand Fundamental Research 973 Program (grant number 2012CB316200), the Major Program of National Science Foundation (grant number 61190115), the General Program of Natural Science Foundation (grant number 61173023) and the Key Program of Natural Science Foundation (grant number 61532015).

References

1. Chen, Yan, Zhu, Han. Graph OLAP: Towards online analytical processing on graphs. In: Proceedings of international conference on data mining, Washington, DC: IEEE, 2008, pp. 103–112.
2. Zhao, Xin, Han. Graph cube: on warehousing and OLAP multi-dimensional networks. In: Proceedings of ACM SIGMOD international conference on management of data. New York: ACM, 2011, pp. 853–864.
3. Tian, Hankins, Patel. Efficient aggregation for graph summarization. In: Proceedings of ACM SIGMOD international conference on management of data. New York: ACM, 2008, pp. 567–580.
4. Yin, Gao. Iceberg cube query on heterogeneous information networks. In: Wireless algorithms, systems, and applications. Berlin: Springer, 2014, pp. 740–749.
5. Yin, Gao, Zou. Minimized-cost cube query on heterogeneous information networks. Journal of Combinatorial Optimization 2015, in press, https://link-springer-com.web.bisu.edu.cn/article/10.1007%2Fs10878–015–9967–6.
6. Yao, Cui. Community level diffusion extraction. In: Proceedings of international conference on management of data. New York: ACM, 2015, pp. 1555–1569.
7. Kempe, Kleinberg, Tardos. Maximizing the spread of influence through a social network. In: Proceedings of ACM international conference on knowledge discovery and data mining. New York: ACM, 2003, pp. 137–146.
8. Azadeh, Saraee, Mirzaei. Time-sensitive influence maximization in social networks. Journal of Information Science 2015; 41(6): 765–778.
9. Jung. Measuring trustworthiness of information diffusion by risk discovery process in social networking services. Quality and Quantity 2014; 48(3): 1325–1336.
10. Jung. Understanding information propagation on online social tagging systems: A case study on Flickr. Quality and Quantity 2014; 48(2): 745–754.
11. Lai, Liu. Recommendations based on personalized tendency for different aspects of influences in social media. Journal of Information Science 2015; 41(6): 814–829.
12. Jung. Attribute selection-based recommendation framework for short-head user group: An empirical study by MovieLens and IMDB. Expert Systems with Applications 2012; 39(4): 4049–4054.
13. Corbellini, Mateos, Godoy, Zunino, Schiaffino. An architecture and platform for developing distributed recommendation algorithms on large-scale social networks. Journal of Information Science 2015; 41(5): 686–704.
14. Mao. Profiling users with tag networks in diffusion-based personalized recommendation. Journal of Information Science 2015, in press, http://jis.sagepub.com/content/early/2015/10/12/0165551515603321.full.
15. Bohlouli, Dalter, Dornhöfer, Zenkert, Fathi. Knowledge discovery from social media using big data-provided sentiment analysis (SoMABiT). Journal of Information Science 2015; 41(6): 779–798.
16. Vilares, Thelwall, Alonso. The megaphone of the people? Spanish sentistrength for real-time analysis of political tweets. Journal of Information Science 2015; 41(6): 799–813.
17. Ajao, Hong, Liu. A survey of location inference techniques on Twitter. Journal of Information Science 2015; 41(6): 855–864.
18. Yuan, Qian. Where to go and what to play: Towards summarizing popular information from massive tourism blogs. Journal of Information Science 2015; 41(6): 830–854.
19. Huang. Pattern mining across many massive biological networks. In: Functional coherence of molecular networks in bioinformatics. Berlin: Springer, 2012, pp. 137–170.
20. Yan, Huang. Mining coherent dense subgraphs across massive biological network for functional discovery. Bioinformatics 2005; 1(1): 1–9.
21. Jin, Zhu. Finding dense and connected subgraphs in dual networks. In: Proceedings of IEEE international conference on data engineering. Washington, DC: IEEE, 2015, pp. 915–926.
22. Boden, Günnemann, Hoffmann. Mining coherent subgraphs in multi-layer graphs with edge labels. In: Proceedings of international conference on knowledge discovery and data mining. New York: ACM, 2012, pp. 1258–1266.
23. Dong, Frossard, Vandergheynst. Clustering with multi-layer graphs: A spectral perspective. IEEE Transactions on Signal Processing 2012; 60(11): 5820–5831.
24. Zhao, Lui, Towsley. A tale of three graphs: Sampling design on hybrid social-affiliation networks. In: Proceedings of IEEE international conference on data engineering. Washington, DC: IEEE, 2015, pp. 939–950.
25. Cui, Wang, Zhai. Citation networks as a multi-layer graph: Link prediction and importance ranking, http://snap.stanford.edu/class/cs224w-2010/proj2010/.
26. Nguyen, Jung. Privacy-aware matching online social identities for multiple social networking services. Cybernetics and Systems 2015; 46(1–2): 69–83.
27. Jung. Evolutionary approach for semantic-based query sampling in large-scale information sources. Information Sciences 2012; 182(1): 30–39.
28. Agarwal, Agrawal, Deshpande. On the computation of multidimensional aggregates. In: Proceedings of very large database conference. New York: ACM, 1996, pp. 506–521.
29. Gray, Chaudhuri, Bosworth, Layman, Reichart, Venkatrao. Data cube: A relational aggregation operator generalizing group-by, cross-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery 1997; 1(1): 29–53.
30. Chen, Yan, Zhu, Han. Graph OLAP: A multi-dimensional framework for graph data analysis. Knowledge and Information Systems 2009; 21(1): 41–63.
31. Chen, Zhu, Yan, Han, Ramakrishnan. InfoNetOLAP: OLAP and mining of information networks. In: Link mining: Models, algorithms and applications. Berlin: Springer, 2010, pp. 411–438.
32. Wang, Fan, Wang, Tan, Agrawal, Abbadi. Pagrol: PArallel GRaph OLap over large-scale attributed graphs. In: Proceedings of IEEE international conference on data engineering. Washington, DC: IEEE, 2014, pp. 496–507.
33. Zhang, Tian, Patel. Discovery-driven graph summarization. In: Proceeding of IEEE international conference on data engineering, Washington, DC: IEEE, 2010, pp. 880–891.
34. Sun, Norick, Han, Yan. Integrating meta-path selection with user-guided object clustering in heterogeneous information networks. In: Proceedings of ACM international conference on knowledge discovery and data mining. New York: ACM, 2012, pp. 1348–1356.
35. Zhou, Liu, Buttler. Integrating vertex-centric clustering with edge-centric clustering for meta path graph analysis. In: Proceedings of ACM international conference on knowledge discovery and data mining. New York: ACM, 2015, pp. 1563–1572.
36. Sun, Han, Zhao, Yin, Cheng. RankClus: Integrating clustering with ranking for heterogeneous information network analysis. In: Proceedings of international conference on extending database technology: Advances in database technology. New York: ACM, 2009, pp. 565–576.
37. Sun, Aggarwal, Han. Relation strength-aware clustering of heterogeneous information networks with incomplete attributes. Endowment of Very Large DataBases 2012; 5(5): 394–405.
38. Sun, Norick, Han, Yan. PathSelClus: Integrating meta-path selection with user-guided object clustering in heterogeneous information networks. ACM Transactions on Knowledge Discovery from Data 2013; 7(3): 1–23.
39. Sun, Han. Ranking-based clustering of heterogeneous information networks with star network schema. In: Proceedings of ACM international conference on knowledge discovery and data mining. New York: ACM, 2009, pp. 797–806.
40. Sun, Han, Yan. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. Endowment of Very Large Databases 2011; 4(11): 992–1003.
41. Shi, Kong. Relevance search in heterogeneous networks. In: Proceedings of international conference on extending database technology. New York: ACM, 2012, pp. 180–191.
42. Sun, Barber, Gupta. Co-author relationship prediction in heterogeneous bibliographic networks. In: International conference on advances in social networks analysis and mining. Washington, DC: IEEE, 2011, pp. 121–128.
43. Sun, Aggarwal, Chawla. When will it happen? Relationship prediction in heterogeneous information networks. In: Proceedings of ACM international conference on web search and data mining. New York: ACM, 2012, pp. 663–672.
44. Sun, Zhao, Han. Query-driven discovery of semantically similar substructures in heterogeneous networks. In: Proceedings of ACM international conference on knowledge discovery and data mining. New York: ACM, 2012, pp. 1500–1503.
45. Gupta, Gao, Yan. Top-k interesting subgraph discovery in information networks. In: Proceedings of IEEE international conference on data engineering. Washington, DC: IEEE, 2014, pp. 820–831.
46. Sun, Han, Gao. iTopicModel: Information network-integrated topic modeling. In: Proceedings of international conference on data mining. Washington, DC: IEEE, 2009, pp. 663–672.
47. Han, Danilevsky. Ranking-based classification of heterogeneous information networks. In: Proceedings of ACM international conference on knowledge discovery and data mining. New York: ACM, 2011, pp. 1298–1306.
48. Shen, Han, Wang. A probabilistic model for linking named entities in web text with heterogeneous information networks. In: Proceedings of ACM international conference on management of data. New York: ACM, 2014, pp. 1199–1210.
49. Jayaram, Khan. Towards a query-by-example system for knowledge graphs. In: Proceedings of workshop on graph data management experiences and systems. New York: ACM, 2014, pp. 1–6.
50. Jayaram, Khan. Querying knowledge graphs by example entity tuples. IEEE Transactions on Knowledge and Data Engineering 2013; 27(10): 2797–2811.
51. Yang, Sun. Schemaless and structureless graph querying. Endowment of Very Large Databases 2014; 7(8): 565–576.
52. Cover, Thomas. Elements of information theory, 2nd edn. Hoboken, NJ: Wiley, 2006, pp. 12–14.
53. Shannon. A mathematical theory of communication. Bell System Technical Journal 1948; 27(3): 379–423.
54. Dehmer. Information processing in complex networks: Graph entropy and information functionals. Applied Mathematics and Computation 2008; 201(1): 82–94.
55. Korner. Coding of an information source having ambiguous alphabet and the entropy of graphs. In: 6th Prague conference on information theory, 1973, pp. 411–425.
56. Dehmer, Mowshowitz. A history of graph entropy measures. Journal of Information Science 2011; 181(1): 57–78.
57. Shetty, Adibi. Discovering important nodes through graph entropy the case of enron email database. In: Proceedings of 3rd international workshop on link discovery. New York: ACM, 2005, pp. 74–81.
58. Rashevsky. Life, information theory, and topology. The Bulletin of Mathematical Biophysics 1955; 17(3): 229–235.
59. Halperin, Karp. The minimum-entropy set cover problem. Theoretical Computer Science 2005; 348(2): 240–250.
60. Leskovec, Adamic. The dynamics of viral marketing. ACM Transactions on the Web 2007; 1(1): 1–39.