W-MMP2Vec: Topic-driven network embedding model for link prediction in content-based heterogeneous information network

Abstract

Link prediction in heterogeneous information networks (HINs) is a challenging problem because of the complexity and diversity of node and link types, and meta-path-based link prediction in HINs still faces open challenges. Previous network-embedding approaches to link prediction in HINs focus mainly on node features rather than on the relations that already exist between nodes in the form of meta-paths; yet predicting a new link between two nodes without considering how they are already connected is hardly convincing. Moreover, recent HIN embedding models lack a thorough evaluation of the topic similarity between text-based nodes along given meta-paths. To tackle these challenges, we propose a novel topic-driven, multiple-meta-path-based HIN representation learning framework, namely W-MMP2Vec. Our model improves the quality of node representations by combining multiple meta-paths and by computing a topic similarity weight for each meta-path during network embedding learning in content-based HINs. To validate our approach, we apply the W-MMP2Vec model to several link prediction tasks in both content-based and non-content-based HINs (DBLP, IMDB and BlogCatalog). The experimental results demonstrate the effectiveness of the proposed model, which outperforms recent state-of-the-art HIN representation learning models.

Keywords

Meta-path, link prediction, heterogeneous information network, content-based HIN, network embedding

1. Introduction

Link prediction is one of the fundamental problems [1, 2, 3, 4, 5] of information network analysis and mining. It estimates the likelihood that a relationship exists between two nodes. In recent years, we have witnessed the proliferation of online networked data resources, such as the WWW, encyclopedias and knowledge graphs (Wikipedia, YAGO, Freebase, etc.), bibliographic networks (DBLP, DBIS, etc.) and social networks (Facebook, Twitter, Weibo, etc.). To date, most network analysis models share the basic assumption that all nodes and links are of the same type. This is the homogeneous information network (HoIN) approach, exemplified by authorship networks (sub-networks of DBLP or DBIS) and friendship networks (sub-networks of social networks such as Facebook or Twitter). However, most real-world networks contain multi-typed nodes and links, which carry more information and richer semantics about the nodes and their interconnections. Recently, heterogeneous information network (HIN) analysis and mining [2, 6] have been widely studied, driven by changing views of networked data and by developing trends in data mining. Most social and information networks are heterogeneous in nature, involving diverse types of nodes and of relationships between them. In a HIN, besides the direct links between two nodes, such as A[author]-P[paper] or P-V[venue] in the DBLP network, there are also rich semantic relationships between same-typed nodes, represented as indirect, sequential relations called meta-paths, such as A-P-A or A-P-V-P-A.

Information network embedding (INE). Recently, INE approaches [7, 8, 9] have been widely studied because of their potential for solving principal network mining tasks: node similarity search [10, 11, 12], clustering [13, 14], classification [6, 15], link prediction [3, 4, 5], etc. INE aims to embed a network’s nodes and links into a low-dimensional vector space while preserving the original network structure. With a good embedding model, network mining tasks such as link prediction can be completed effectively by applying off-the-shelf multidimensional distance metrics and machine learning models for vector spaces. Several well-known network representation learning models exist for different network types, including HoIN-based INE models such as LINE, DeepWalk and Node2Vec, and HIN-based INE models such as HIN2Vec [16], Metapath2Vec [17], Metagraph2Vec [18] and SHINE [4]. HoIN-based INE models treat all links between nodes as the same type and therefore cannot be applied to multi-typed link prediction in HINs. To capture the rich semantics of diverse links and nodes in HINs, several meta-path-based (HIN2Vec [16], Metapath2Vec [17], PME [3], etc.) and meta-graph-based (Metagraph2Vec [18]) embedding techniques have been proposed. These HIN-based INE models better capture the different semantics of links and nodes in complex heterogeneous networks.

Link prediction task with INE approach. In general, link prediction (for both HoINs [19, 20, 21, 22] and HINs [3, 16, 17, 18]) is viewed as a binary classification task: predicting whether a new link between two nodes will exist (value 1) or not (value 0) [1, 5, 7]. The classifier is trained on feature vectors built by applying multiple activation functions to the representation vectors of node pairs [3, 16] that already have the evaluated relation. The trained model is then used to predict whether this relation will occur in the future between candidate node pairs that do not currently have it. Several HIN-based prediction models have been applied successfully to estimate the possibility of new meta-path-based relations between same-typed nodes. However, challenges remain regarding how to combine dependent meta-paths in link prediction and how to evaluate the topic similarity between pairwise nodes in content-based HINs.

Figure 1.

Examples of how existing meta-paths might affect the future occurrence of other meta-paths.

1.1 Problem statements

1.1.1 Combination of multiple meta-paths in link prediction of HIN

In the context of HoINs, most recent link prediction models can capture only one type of relation between nodes of a single type, represented as one-hop relations [1, 5]. Common link prediction practices include the co-author relation (two authors write the same papers) or the co-worker relation (two authors work at the same university, institute or organization) in DBLP, both denoted as A-A, and the friendship relation between two users in social networks, denoted as U[user]-U. However, in real-world contexts, the relationships between authors in DBLP and between users in social networks are more complex than that. Predicting the existence of a specific relation by using different meta-paths between pairwise nodes in a HIN can provide more meaningful and more accurate outcomes than using a single type of meta-path. For example, two authors might be related through more than one relation type, each represented as a meta-path, such as A-P-A (co-author), A-O[organization]-A (co-worker), A-P-V-P-A (two authors submit their papers to the same venues/journals), A-P-K[keyword]-P-A (two authors share common keywords in their papers), etc. Each relation has its own meaning, distinct from the others, and the future occurrence of one relation between two nodes might be affected by the relations that currently exist between them. Take the two meta-paths A-P-A and A-P-V-P-A as an example, which describe two authors who are co-authors and who both submit their work to the same conferences/journals. These two relations mostly occur between authors who already have the co-worker (A-O-A) relation. Working at the same place makes it much easier to form other relations, such as working together on the same research (A-P-A) or inviting each other to participate in the same conferences (A-P-V-P-A) (as illustrated in Fig. 1A).

Another example is predicting friendship relations between users on social networks such as Facebook, Twitter and Weibo. The likelihood of a future friendship (U-U) might be influenced by the indirect relations that currently exist between two users, such as U-G[group]-U or U-C[comment]-P[post]-C-U. Users are more likely to become friends if they participate in several common groups/fan-pages or frequently comment on the same posts (as shown in Fig. 1B). We therefore assume that the occurrence of new links in a given network might depend on the links that already exist between nodes. In initial investigations on DBLP and on common social networks such as Facebook, we found that more than 68% of authors who already have A-O-A relations (who worked at the same organizations/universities) tend to have A-P-A (who co-worked on the same projects/papers) and A-P-V-P-A (who published their work at the same conferences/journals) relations. This suggests that the existence of an A-O-A relation makes the occurrence of A-P-A and A-P-V-P-A relations between two author nodes more likely. Similarly, we evaluated the public relationships of our friends on Facebook and found that more than 71% of people tend to have a friendship relation with each other if they both participated in common groups/fan-pages (U-G-U relation) and frequently commented on common posts (U-C-P-C-U relation). We thus assume that people more easily form new friendship relations (U-U) if they already have other relations such as U-G-U and U-C-P-C-U.

Therefore, training a prediction model for a specific relation between same-typed objects, such as co-operation between two authors (A-A) in bibliographic networks (DBLP, DBIS, etc.) or friendship between two users (U-U) in social networks (Facebook, Twitter, etc.), with only one type of meta-path is not sufficient in a heterogeneous network. Most HIN-based INE models, such as Metapath2Vec, MPBP and HIN2Vec, can apply only one type of meta-path to produce the embedding and therefore fail to preserve the multiple semantics of meta-path-based relations between two same-typed nodes. The main assumption of this paper is that predicting new links is more convincing when it depends on the links that already exist between the two nodes. For example, we predict co-authorship relations in DBLP (Fig. 2A), new movies that users are likely to watch and rate in movie networks (Fig. 2B), or new friendship relations between users in social networks (Fig. 2C). Our main motivation is to produce a new network embedding model that captures multiple meta-paths in the embedded space of the network’s nodes.

Figure 2.

Existent meta-paths between pairwise nodes can influence the link prediction task in HIN.

1.1.2 Topic similarity evaluation in link prediction of content-based HIN

Moreover, most real-world HINs have a large number of nodes that are text documents, such as papers in DBLP or comments and posts in social networks. These text-based nodes are important because they appear in most types of meta-path, such as A-P-A, A-P-V-P-A, U-C-P-C-U, etc. Such HINs can be called content-based HINs. From the viewpoint of text analysis and mining, every text-based node covers its own topics: papers in DBLP might cover topics such as “data mining” or “machine learning”; comments on social networks (Facebook, Twitter, etc.) might mention specific subjects such as “fashion”, “politics” or “entertainment”; and films in movie networks (IMDB, TMDB, Netflix, etc.) have descriptions that cover specific topics/genres such as “romantic”, “action”, “sci-fi” or “comedy”. The topics or subjects of content-based nodes have a critical impact on almost all network analysis and mining tasks. For example, the results of author similarity search via the meta-path A-P-V-P-A in DBLP can be improved by evaluating both the number of path instances and the topic similarity of the authors’ papers; the same holds for the topic similarity of users’ comments with the meta-path U-C-P-C-U in the Facebook social network. As with node similarity search, the output of link prediction in content-based HINs can also be improved by incorporating the topic similarity weight between content-based nodes along the evaluated meta-paths.

Topic-driven link prediction in content-based heterogeneous networks, such as bibliographic and social networks, has emerged as an interesting and major challenge in network analysis and mining. In general, topic similarity between nodes might lead to the occurrence of new relations. For example, two authors are likely to have a co-authorship relation if they are interested in the same research fields: authors such as “Jiawei Han”, “Jian Pei” and “Philip S. Yu” usually cooperate in the “data mining” field, while “Christopher D Manning”, “Richard Socher” and “Andrew Ng” frequently co-author “natural language processing” and “AI/ML” papers (as illustrated in Fig. 3). As another example, in link prediction on the IMDB movie network, users are likely to watch and rate new movies whose descriptions have themes/topics similar to those of the movies they previously watched and rated. These relations are represented by several meta-paths, such as U-M-A[actor]-M-U (two users watch and rate movies featuring the same actors) and U-M-G[genres]-M-U (two users watch and rate movies that belong to the same genres).

Figure 3.

Example of topic-driven meta-path-based link prediction in DBLP bibliographic network.

Most recent well-known meta-path-based HIN embedding models that can be applied to link prediction, such as Metapath2Vec, Metagraph2Vec, SHINE and PME, focus only on the relations between nodes rather than on other network attributes, such as topics. Therefore, a new embedding model is needed that captures not only the topological patterns of rich semantic relations but also the topic similarity between content-based nodes along given meta-paths.

1.2 Our contributions

This paper presents a novel integrated framework for a topic-driven, meta-path-based embedding model, namely W-MMP2Vec (Weighted Multiple Meta-path2Vec). The proposed W-MMP2Vec model concentrates on learning representations of nodes and their linking meta-paths in content-based HINs. It learns latent vectors of both the relationships between nodes, in the form of meta-paths, and the topic similarity of text-based nodes along the evaluated meta-paths. This work continues our previous work on topic-driven meta-path-based similarity measurement (W-PathSim [23], DW-PathSim [24], W-PathSim++ [25]) and network representation learning (W-Metapath2Vec [26], W-Metagraph2Vec [27]). Compared with previous HIN-based INE models, W-MMP2Vec preserves richer contextual information: it considers two same-typed nodes relevant not only through their existing relations but also through the topic similarity weights of their associated text-based nodes, such as papers of authors (DBLP, DBIS, etc.), comments or posts of users (Facebook, Twitter, etc.) and movie descriptions (IMDB, TMDB, etc.). Our major contributions can be summarized in three main points:

•
First, we propose a novel approach that applies the LDA [28] topic model to evaluate the topic distributions of text-based nodes in a given content-based HIN. The topic distributions of the text-based nodes are then used to compute topic similarity scores, which serve as the weights of meta-paths.
•
Second, we introduce the methodology of the W-MMP2Vec network representation learning model, which learns the representations of network nodes and their related meta-paths. The W-MMP2Vec model applies random walks to generate the training set and negative sampling to speed up model training. In the W-MMP2Vec model, a one-layer neural network trained with stochastic gradient descent is used to learn and optimize the network representation.
•
Finally, we conduct extensive empirical studies of the proposed W-MMP2Vec model through a comprehensive evaluation on multiple HIN analysis tasks, including link prediction and node similarity search. We conducted experiments on two real-world HIN datasets: the DBLP bibliographic network and the MovieLens 100K movie network. The experimental results show the superiority of our proposed model over recent state-of-the-art information network embedding models.

The rest of this paper has four main sections. In Section 2, we review related work and discuss the advantages and disadvantages of HIN-based network embedding models for link prediction. In Section 3, we briefly introduce the basic concepts of HINs and meta-path-based INE, and the methodology of our proposed W-MMP2Vec model. We present experimental studies and discussions in Section 4. Finally, we draw conclusions and present our future work in Section 5.

2. Related works and motivations

Originally, information network embedding (INE) [7, 8], or representation learning, was proposed as an approach for reducing the dimensionality of the distinctive features of a network’s nodes and links. The ultimate goal of INE is to learn low-dimensional latent factors that preserve the original structure of the overall network. Among the methodologies for network representation learning, neural network architectures are the most widely used and have received much attention from researchers because of their success in many practical implementations. Early INE work concentrated on learning the embedded structure of homogeneous information networks (HoINs), for example DeepWalk [19], LINE [20] and Node2Vec [22], which do not distinguish between the types of nodes and links. Moreover, these HoIN-based INE models preserve only the direct relations between nodes and cannot capture the rich semantic, sequential indirect relations between same-typed nodes in the form of meta-paths. For example, DeepWalk [19] and Node2Vec [22] use a neighborhood random walk mechanism to generate context nodes, which serve as the training set for learning the feature vectors of the network’s nodes. LINE [20] is similar and comes in two versions, LINE_1 and LINE_2, which capture 1-hop and 2-hop neighborhood nodes, respectively, to learn the network representation. To improve INE in heterogeneous information networks (HINs), several novel meta-path-based embedding approaches have been proposed to tackle the heterogeneity of complex information networks. The best known is Metapath2Vec [17], which uses a meta-path-based random walk mechanism to generate the training context for each target node. Like Metapath2Vec, other embedding models such as HIN2Vec [16], Metagraph2Vec [18], SHINE [4] and PME [3] were proposed as efficient network embedding models and have shown their effectiveness in dealing with the diversity of nodes and links in HINs. Besides the general information network embedding approaches, there is another sub-area of network embedding called knowledge graph embedding. Knowledge graphs [8] such as Wikipedia, YAGO and Freebase are a special type of heterogeneous network that contains multiple types of nodes/entities and labeled relations. The best-known approaches in this sub-area are the Trans-family models, including TransE [29], TransH [30] and TransR [31], which have been shown to be effective for knowledge graph representation learning.
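
The meta-path-based random walk mentioned above for Metapath2Vec-style models can be sketched roughly as follows. This is a simplified illustration, not the reference implementation: the toy graph, the helper name `meta_path_walk`, the uniform choice among neighbours and the convention that a node's type is the first letter of its identifier are all assumptions made for this example.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def meta_path_walk(adj: Dict[str, Set[str]], node_type: Dict[str, str],
                   meta_path: Sequence[str], start: str,
                   walk_length: int, rng: random.Random) -> List[str]:
    """Random walk that only steps to neighbours of the next type in `meta_path`.

    `meta_path` is assumed symmetric (first type == last type), so the
    pattern without its last element can be repeated cyclically.
    """
    if node_type[start] != meta_path[0]:
        raise ValueError("start node does not match the first meta-path type")
    pattern = list(meta_path[:-1])          # e.g. A-P-V-P-A -> [A, P, V, P]
    walk = [start]
    while len(walk) < walk_length:
        wanted = pattern[len(walk) % len(pattern)]
        candidates = sorted(n for n in adj[walk[-1]] if node_type[n] == wanted)
        if not candidates:                   # dead end: stop the walk early
            break
        walk.append(rng.choice(candidates))  # uniform choice (an assumption)
    return walk


# Hypothetical toy bibliographic network (undirected edges).
edges = [("a1", "p1"), ("a2", "p1"), ("a2", "p2"), ("a3", "p2"),
         ("p1", "v1"), ("p2", "v1")]
adj: Dict[str, Set[str]] = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
node_type = {n: n[0].upper() for n in adj}  # "a1" -> "A", "p1" -> "P", ...

walk = meta_path_walk(adj, node_type, ["A", "P", "V", "P", "A"], "a1", 9,
                      random.Random(0))
print(walk)
print([node_type[n] for n in walk])
```

The node sequence depends on the random seed, but the sequence of node types always follows the meta-path pattern; such walks would then serve as the training context of each target node.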

Returning to link prediction in HINs with INE, recent work such as Metapath2Vec/Metapath2Vec++ [17] uses a single meta-path to obtain multi-typed node representations: the framework uses only one type of meta-path to generate the training set for each target node. Metapath2Vec is therefore unable to capture multiple meta-path relations between same-typed nodes, which is important for handling the combination of multiple meta-paths in link prediction. Similarly, Metagraph2Vec uses only a specific meta-path/meta-graph to learn the latent spaces of nodes and links in a HIN. Moreover, recent HIN-based INE models concentrate mainly on the relationships between nodes and give little consideration to other node attributes such as topic similarity. To tackle these remaining challenges, we put forward a novel framework for obtaining effective node representations that takes full advantage of meta-path combination when generating the contextual neighbors of target nodes and applies topic similarity as the weight of each meta-path in content-based HINs.

3. Methodology

3.1 Preliminaries

In this part, we first define the basic concepts of information networks, meta-paths and network embedding/representation learning.

Definition 1. Information network (IN) [7, 10]: is defined as a directed/undirected graph, denoted as ${G=(V,E)}$ , where ${V}$ is a set of nodes ( $v\in V$ ) and $E\subseteq V\times V$ is a set of edges ( $e\in E$ ). There are two mapping functions ${\phi}$ and ${\psi}$ , with $\phi:V\to A$ , $\phi(v)\in A$ and $\psi:E\to R$ , $\psi(e)\in{R}$ , where ${A}$ is a set of node types and ${R}$ is a set of edge types. If $|A|>1$ or $|R|>1$ , the information network is called a heterogeneous information network (HIN); otherwise, it is called a homogeneous information network (HoIN).
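
As a small illustration of Definition 1, the following Python sketch classifies a toy network as a HIN or a HoIN by counting its distinct node and edge types. The function name `network_kind`, the dictionaries standing in for the two mapping functions and the toy data are assumptions made for this example only.

```python
from typing import Dict, Hashable, Tuple

Node = Hashable
Edge = Tuple[Node, Node]


def network_kind(node_type: Dict[Node, str], edge_type: Dict[Edge, str]) -> str:
    """Return 'HIN' if |A| > 1 or |R| > 1, otherwise 'HoIN' (Definition 1)."""
    node_types = set(node_type.values())  # A: the image of phi
    edge_types = set(edge_type.values())  # R: the image of psi
    return "HIN" if len(node_types) > 1 or len(edge_types) > 1 else "HoIN"


# Hypothetical bibliographic network: two authors, one paper, one venue.
phi = {"a1": "A", "a2": "A", "p1": "P", "v1": "V"}
psi = {("a1", "p1"): "writes", ("a2", "p1"): "writes",
       ("p1", "v1"): "published_in"}

# Hypothetical co-author network: one node type and one edge type.
phi_homo = {"a1": "A", "a2": "A"}
psi_homo = {("a1", "a2"): "coauthor"}

print(network_kind(phi, psi))            # HIN
print(network_kind(phi_homo, psi_homo))  # HoIN
```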

In heterogeneous network analysis and mining, the meta-path (Definition 2) plays an important role in defining the pattern of relation between two same-typed nodes. It is a principal concept in most HIN mining methods (PathSim [10], HeteSim [12], etc.) and in HIN-based embedding approaches such as Metapath2Vec [17], Metagraph2Vec [18], HIN2Vec [16] and PME [3], in contrast to embedding models such as DeepWalk [19], LINE [20], PTE [21] and Node2Vec, which do not rely on meta-paths.

Definition 2. Meta-path [10]: denoted as $\mathcal{P}$ , is a sequence of node types (the same or different) with a specific length ( $l$ ). The node types $A_{1},A_{2}\dots A_{l}$ of a given meta-path $\mathcal{P}$ , linked via one or multiple link types $R_{1},R_{2}\dots R_{l-1}$ , form a pattern, $\mathcal{P}=A_{1}\xrightarrow{R_{1}}A_{2}\dots\xrightarrow{R_{l-1}}A_{l}$ .
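
To make Definition 2 concrete, the sketch below enumerates the path instances that follow a given meta-path in a toy bibliographic network. The graph, the helper `path_instances` and the convention that a node's type is the first letter of its identifier are hypothetical and chosen only for illustration; link types are ignored for simplicity.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def path_instances(adj: Dict[str, Set[str]], node_type: Dict[str, str],
                   meta_path: Sequence[str], start: str) -> List[List[str]]:
    """Enumerate all paths from `start` whose node types follow `meta_path`.

    Nodes may be revisited, so A-P-A from an author also returns to itself.
    """
    if node_type[start] != meta_path[0]:
        return []
    paths = [[start]]
    for wanted in meta_path[1:]:
        # Extend every partial path by each neighbour of the wanted type.
        paths = [p + [n] for p in paths
                 for n in sorted(adj[p[-1]]) if node_type[n] == wanted]
    return paths


# Hypothetical toy network (undirected edges).
edges = [("a1", "p1"), ("a2", "p1"), ("a2", "p2"), ("a3", "p2"),
         ("p1", "v1"), ("p2", "v1")]
adj: Dict[str, Set[str]] = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
node_type = {n: n[0].upper() for n in adj}  # "a1" -> "A", "p1" -> "P", ...

apa = path_instances(adj, node_type, ["A", "P", "A"], "a1")
apvpa = path_instances(adj, node_type, ["A", "P", "V", "P", "A"], "a1")
print(apa)                                     # [['a1', 'p1', 'a1'], ['a1', 'p1', 'a2']]
print(sum(1 for p in apvpa if p[-1] == "a3"))  # 1
```

In this toy network, a1 and a3 share no paper, yet they are connected by one A-P-V-P-A instance through the common venue v1.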

Definition 3. Information network embedding (INE) [7]: given an IN represented as a graph ${G=(V,E)}$ , INE aims to learn a mapping function $f:V\to{\mathbb{R}}^{|V|\times d}$ , where ${\mathbb{R}}^{|V|\times d}$ is a matrix of $d$-dimensional vectors for all nodes ( $V$ ) and each row ${\mathbb{R}}_{v}$ corresponds to the latent feature vector of a target node ( $v$ ). The learned node vector representations can benefit multiple information network mining tasks through off-the-shelf multidimensional distance metrics and supervised learning techniques.

Recently, we have witnessed a tremendous rise of information network embedding (INE) (Definition 3) because of its potential applications in multiple disciplines. INE has changed the way primitive information network analysis and mining tasks, such as node similarity measurement, clustering, classification, recommendation and link prediction, are carried out. INE and other networked data embedding trends (graph embedding, knowledge graph embedding, etc.) were originally inspired by the Word2Vec [32] model of T. Mikolov et al., and aim to transform complex high-dimensional data structures into fixed low-dimensional vectors while preserving the features and structure of the original data. In recent years, several approaches to link prediction in HINs via INE have been proposed and have attracted much attention because they are more accurate than traditional methods. However, recent HoIN-based and HIN-based models encounter problems related to the combination of existing meta-paths between pairwise nodes and to the weights of meta-paths during network representation learning. These challenges are our main motivation for proposing the W-MMP2Vec model, which is described in detail in this section.

3.2 Topic similarity as meta-path weight in content-based HIN

For a given content-based HIN, every text-based node is considered a document, denoted as ( ${d}$ ). The LDA topic model is then applied to the set of documents ( $D$ , $d\in D$ ) to extract the probabilistic topic distributions, denoted as $\text{Prob}(z_{i}|d_{j})=\theta^{d_{j}}_{z_{i}}$ , with $z_{i}\in Z$ and $i\in\{1,\dots,|Z|\}$ , where each $z_{i}$ is a latent topic and $\theta^{d_{j}}$ is the topic distribution of document $d_{j}$ . Each document or text-based node is now represented as a ${|Z|}$ -dimensional topic distribution vector, denoted as $d=[\text{Prob}(z_{1}|d),\dots,\text{Prob}(z_{|Z|}|d)]=[\theta^{d}_{z_{1}},\ldots,\theta^{d}_{z_{|Z|}}]$ .

Figure 4.

Examples of text-based nodes $K$ and $K^{-}$ on different meta-paths of DBLP and IMDB networks.

Finally, the topic distributions of these text-based nodes are used to compute the topic similarity weight of the meta-paths in which the nodes occur. Consider two sets of text-based nodes $K$ and $K^{-}$ of the same size $n=|K|=|K^{-}|$ that occur in a specific meta-path ( $\mathcal{P}$ ). The text-based nodes ( $k\in K$ ) and ( $k^{-}\in K^{-}$ ) are located on opposite sides of the meta-path ( $\mathcal{P}$ ). For example, we have paper nodes in the meta-path A-P-V-P-A (the relation of two authors who submit their work to the same venues/journals) and in the meta-path A-P-P-A-P-P-A (the relation of two authors who frequently cite papers of the same authors) (Fig. 4B). Similarly, in the IMDB network, there are two movie nodes M1 and M2 (with movie descriptions as text-based content) in the meta-paths U-M-A-M-U (Fig. 4C), U-M-G-M-U, etc. The topic similarity weight of a specific meta-path ( $\mathcal{P}$ ), denoted as ${\textit{w\_topsim}}_{\mathcal{P}}$ , is calculated with cosine similarity, as shown in Eq. (1):

$\displaystyle\textit{w\_topsim}_{\mathcal{P}}=\textit{avg}\left(\sum^{n}_{i=1}\frac{\overrightarrow{k}\cdot\overrightarrow{k^{-}}}{\|\overrightarrow{k}\|\cdot\|\overrightarrow{k^{-}}\|}\right)\quad(k\in K,k^{-}\in K^{-})$ (1)

Where,

•

${\textit{w\_topsim}}_{\mathcal{P}}$ is the topic similarity weight of a specific meta-path $\mathcal{P}$ with sets of text-based nodes ${K}$ and ${{K}}^{{-}}$ located on opposite sides of $\mathcal{P}$ .

•

$\overrightarrow{{k}}$ and $\overrightarrow{{{k}}^{{-}}}$ are the ${|Z|}$ -dimensional topic distribution vectors of the text-based nodes ${k}\in{K}$ and ${k}^{-}\in K^{{-}}$ , respectively.
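
Eq. (1) can be read as the mean cosine similarity over the $n$ pairs of text-based nodes. A minimal Python sketch under that reading follows; pairing the $i$-th node of $K$ with the $i$-th node of $K^{-}$, the function names and the toy topic vectors are assumptions of this example, not details given in the text.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def w_topsim(K: Sequence[Sequence[float]],
             K_minus: Sequence[Sequence[float]]) -> float:
    """Average cosine similarity over index-aligned pairs (k_i, k_i^-)."""
    if not K or len(K) != len(K_minus):
        raise ValueError("K and K_minus must be non-empty and of equal size")
    return sum(cosine(k, km) for k, km in zip(K, K_minus)) / len(K)


# Hypothetical 3-topic distributions of the papers on each side of A-P-V-P-A.
K = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
K_minus = [[0.7, 0.2, 0.1], [0.8, 0.1, 0.1]]

weight = w_topsim(K, K_minus)
print(round(weight, 4))
```

The first pair has identical topic distributions and contributes a similarity of 1, while the second pair covers different dominant topics and pulls the weight of this meta-path instance down.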

The weight ${\textit{w\_topsim}}_{\mathcal{P}}$ plays an important role in meta-path-based HIN analysis as well as in the embedding mechanism of our proposed W-MMP2Vec model. Given a specific meta-path $\mathcal{P}$ with a set of text-based nodes, the topic similarity weight helps to evaluate how relevant two same-typed nodes are to each other, not only by the number of path instances but also by the topic similarity of their associated text-based nodes. Consider the meta-path A-P-V-P-A, which represents the relation of two authors who usually submit their papers to the same conferences/journals. A conference or journal might cover multiple topics/subjects, such as “data mining” or “AI/machine learning”. Counting the path instances between two authors along A-P-V-P-A is therefore not enough, because many well-known authors share the same set of venues/journals while their topics of interest are completely different. It is more meaningful to treat two authors who both work on the same topic, such as “data mining”, as more relevant to each other than two authors of whom one works on “data mining” and the other on “AI/machine learning”, even if they participate in the same venues or submit their work to the same journals. This is why the topic similarity weight of a meta-path is considered a critical factor for improving the accuracy of HIN analysis and mining.

3.3 W-MMP2Vec topic-driven representation learning model

3.3.1 Goals and background assumptions

Figure 5.

Learning objective of W-MMP2Vec model.

The ultimate goal of this paper is to learn vector representations of pairwise nodes in content-based HINs that support predicting the possibility that relations in the form of meta-paths will occur. As discussed previously, new relations that might occur between pairwise nodes must be predicted from the combination of multiple existing meta-paths together with the topic similarity weight of each meta-path. We formulate the link prediction problem in a given content-based HIN, denoted as a graph ${G=(V,E)}$ with a set of existing relations in the form of meta-paths $\mathcal{P}=\{\mathcal{P}_{1},\mathcal{P}_{2}\dots\mathcal{P}_{n}\}$ , as follows. Given a pair of same-typed nodes ( ${a}$ ) and ( ${b}$ ) ( $a,b\in V$ ) with $\phi(a)=\phi(b)$ that are linked via multiple meta-paths, denoted as $\mathcal{P}_{a\leadsto b}$ with $\mathcal{P}_{a\leadsto b}\subseteq\mathcal{P}$ , we predict the possibility that a new relation occurs between them in the form of a specific meta-path ( $\mathcal{P}_{i}$ ) that does not yet exist in $\mathcal{P}_{a\leadsto b}$ , i.e. $\mathcal{P}_{i}\notin\mathcal{P}_{a\leadsto b}$ (Fig. 5A), as shown in Eq. (2):

$\displaystyle\text{Prob}({\mathcal{P}}_{i}|\langle a,b,{\mathcal{P}}_{a\leadsto b}\rangle)\textit{ with }\phi(a)=\phi(b),\ {\mathcal{P}}_{i}\in\mathcal{P},\ {\mathcal{P}}_{a\leadsto b}\subseteq\mathcal{P},\ {\mathcal{P}}_{i}\notin{\mathcal{P}}_{a\leadsto b}$ (2)

Where,

•

${\mathcal{P}}_{{a}{\leadsto}{b}}$ is the set of existing relations between nodes ( ${a}$ ) and ( ${b}$ ) in the form of meta-paths.

•

${\mathcal{P}}_{{i}}$ is a relation between nodes ( ${a}$ ) and ( ${b}$ ) in the form of a meta-path that does not exist yet, ${\mathcal{P}}_{{i}}{\notin}{\mathcal{P}}_{{a}{\leadsto}{b}}$ .

•

$\text{Prob}({\mathcal{P}}_{{i}}|\langle{a,b,}{\mathcal{P}}_{{a}{\leadsto}{b}}\rangle)$ is the occurrence probability of ${\mathcal{P}}_{{i}}$ between ( ${a}$ ) and ( ${b}$ ) nodes.

3.3.2 Combination of multiple meta-paths for link prediction in HIN

Based on this idea, an intuitive approach is to build a network learning model that predicts a new target relationship in the form of a meta-path ${\mathcal{P}}_{{i}}$ between any given pair of nodes $\langle{a,b}\rangle$ . The W-MMP2Vec framework includes two main phases: training data generation and network representation learning via a single-hidden-layer neural network architecture. In the training data generation phase, all network nodes are encoded as one-hot vectors $\overrightarrow{{a}}$ and $\overrightarrow{{b}}$ of length $|{V}|$ . For the overall HIN, we thus have a one-hot matrix ${{M}}_{{V}}$ of size $|{V}|{\times}|{V}|$ that represents all nodes of the network. The one-hot matrix ${{M}}_{{V}}$ is used as the input of the neural network.
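
The one-hot encoding of the training data generation phase can be illustrated with a few lines of Python. The node identifiers, their ordering and the helper name `one_hot` are assumptions of this sketch.

```python
from typing import Dict, List


def one_hot(index: int, size: int) -> List[int]:
    """Return a vector of length `size` with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise IndexError("index out of range")
    return [1 if i == index else 0 for i in range(size)]


# Hypothetical node set V of a tiny HIN.
nodes = ["a1", "a2", "p1", "v1"]
node_index: Dict[str, int] = {name: i for i, name in enumerate(nodes)}

# M_V stacks the one-hot vectors of all nodes, so it has size |V| x |V|.
M_V = [one_hot(node_index[name], len(nodes)) for name in nodes]

print(M_V[node_index["p1"]])  # [0, 0, 1, 0]
```

With this node ordering, the matrix is simply the identity matrix of size $|V|$, one row per node.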

Then, given an embedding dimension ( ${d}$ ), the one-hot vectors of each node pair $\langle{a,b}\rangle$ are mapped to embedded vectors denoted as ${{X}}_{{a}}\overrightarrow{{a}}$ and ${{X}}_{{b}}\overrightarrow{{b}}$ , where ${{X}}_{{a}}$ and ${{X}}_{{b}}$ are $|{V}|{\times}{d}$ matrices that represent the latent embedding spaces of $\overrightarrow{{a}}$ and $\overrightarrow{{b}}$ , respectively. The existing relations in the form of meta-paths between $\langle{a,b}\rangle$ , ${\mathcal{P}}_{{a}{\leadsto}{b}}$ , are mapped into a separate latent space ${{X}}_{{\mathcal{P}}_{{a}{\leadsto}{b}}}$ , which is a $|\mathcal{P}|{\times}{d}$ matrix. The embedding of ${\mathcal{P}}_{{a}{\leadsto}{b}}$ in ${{X}}_{\mathcal{P}}$ is denoted as ${{X}}_{{\mathcal{P}}_{{a}{\leadsto}{b}}}\overrightarrow{{\mathcal{P}}_{{a}{\leadsto}{b}}}$ . Each meta-path ${\mathcal{P}}_{{(a}{\leadsto}{{b)}}_{{i}}}{\in}{\mathcal{P}}_{{a}{\leadsto}{b}}$ between ( ${a}$ ) and ( ${b}$ ) is likewise encoded as a one-hot vector $\overrightarrow{{\mathcal{P}}_{{a}{\leadsto}{b}}}$ of length $|\mathcal{P}|$ . Finally, the output of the model is a vector $\overrightarrow{{\mathcal{P}}_{{i}}}$ of length $|\mathcal{P}|$ that predicts the occurrence probability of ${\mathcal{P}}_{{i}}$ between ( ${a}$ ) and ( ${b}$ ). Figure 6 illustrates the node representation learning flow of the proposed W-MPP2Vec model. Inspired by the Trans-family models [29, 30, 31], we formulate the learning objective of the W-MPP2Vec model as follows (Fig. 5B), as shown in Eq. (3):

$\displaystyle X_{a}\overrightarrow{a}+X_{{\mathcal{P}}_{a\leadsto b}}\overrightarrow{{\mathcal{P}}_{a\leadsto b}}+\overrightarrow{{\mathcal{P}}_{i}}\approx X_{b}\overrightarrow{b}$ (3)
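As a small illustration of the notation used above, multiplying a one-hot node vector by a $|V|\times d$ embedding matrix simply selects one row of that matrix. The sketch below uses plain Python lists; the toy matrix values and the helper names `one_hot` and `embed` are assumptions made for illustration only and are not taken from the W-MPP2Vec implementation.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed(one_hot_vec, matrix):
    """Multiply a one-hot vector by a |V| x d matrix given as a list of rows."""
    d = len(matrix[0])
    return [sum(one_hot_vec[v] * matrix[v][k] for v in range(len(matrix)))
            for k in range(d)]


# Toy embedding matrix with |V| = 3 nodes and d = 2 dimensions (made-up values).
X = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]

a = one_hot(1, len(X))
embedded_a = embed(a, X)  # selects row 1 of X
```

In practice this product is usually implemented as a direct row lookup, which avoids materializing the one-hot vectors.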

Figure 6.

The node representation learning flow of W-MPP2Vec model.

Overall, W-MPP2Vec treats link prediction as a multi-class classification problem for each node pair $\langle{a,b}\rangle$ of the given HIN: three types of feature vectors, stored in ${{X}}_{{a}}$ , ${{X}}_{{b}}$ and ${{X}}_{{\mathcal{P}}_{{a}{\leadsto}{b}}}$ , form the input, and the classification model is optimized to obtain the occurrence probability vector $\overrightarrow{{\mathcal{P}}_{{i}}}$ at the output. By learning the representation of a node pair $\langle{a,b}\rangle$ together with its existing relations in the form of meta-paths ${\mathcal{P}}_{{a}{\leadsto}{b}}$ in order to predict a new meta-path ${\mathcal{P}}_{{i}}$ between ( ${a}$ ) and ( ${b}$ ), we exploit the combination of several meta-paths for link prediction in HINs, which can yield higher confidence and accuracy because relations naturally depend on one another. As in our earlier example of predicting the co-authorship relation (A-P-A) in DBLP, two authors are more likely to have a co-authorship relation if they already share other relations, such as A-O-A (working at the same place) or A-P-V-P-A (frequently participating in common venues). The same assumption applies to other HINs. For example, in the IMDB movie network, two users are likely to watch and rate the same new movie if they have already watched and rated other movies that share actors or belong to the same genres. This prediction task is represented by the U-M-U meta-path, where the two users already share other relations in the form of meta-paths such as U-M-A-M-U and U-M-G-M-U.

3.3.3 Topic-driven link prediction in content-based HIN

Finally, we apply the topic similarity between the sets of text-based nodes in each meta-path of ${\mathcal{P}}_{{a}{\leadsto}{b}}$ to improve the quality of the node representations. In a content-based HIN such as DBLP there are multiple meta-paths, and not all of them have a pair of text-based node sets ( ${K,}{{K}}^{{-}}$ ); A-P-A and A-O-A, for example, do not. Therefore, we distinguish two types of meta-path:

•
Weighted meta-path: a meta-path that has a pair of text-based node sets ( ${K,}{{K}}^{{-}}$ ), so that its topic similarity weight ( ${\textit{w\_topsim}}_{\mathcal{P}}$ ) can be calculated by Eq. (1). The weight of such a meta-path is a real number in the range [0, 1].
•
Binary meta-path: a meta-path that does not have a pair of text-based node sets ( ${K,}{{K}}^{{-}}$ ), so its weight is set to 1 if it exists between ( $a$ ) and ( $b$ ), and to 0 otherwise.
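The two weighting rules above can be sketched as a small helper that builds the weight vector for one node pair. This is only an illustrative sketch: the function name `metapath_weights`, the dictionary of precomputed topic similarities (assumed to come from Eq. (1)) and the example value 0.72 are hypothetical, and assigning weight 0 to a weighted meta-path that does not exist for the pair is our assumption.

```python
def metapath_weights(all_metapaths, existing, topic_sim):
    """Build the weight vector for one node pair.

    all_metapaths: ordered list of meta-path names (the set P).
    existing: set of meta-paths that currently link the pair.
    topic_sim: dict mapping a weighted meta-path to its precomputed
               topic similarity in [0, 1] (assumed to come from Eq. (1)).
    """
    weights = []
    for mp in all_metapaths:
        if mp not in existing:
            weights.append(0.0)            # meta-path absent for this pair (assumption)
        elif mp in topic_sim:
            weights.append(topic_sim[mp])  # weighted meta-path
        else:
            weights.append(1.0)            # binary meta-path
    return weights


# Hypothetical DBLP-style example: A-P-V-P-A carries a topic weight, A-O-A does not.
P = ["A-P-A", "A-O-A", "A-P-V-P-A"]
w = metapath_weights(P, existing={"A-O-A", "A-P-V-P-A"},
                     topic_sim={"A-P-V-P-A": 0.72})
```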

For each node pair $\langle{a,b}\rangle$ with a set of existing meta-paths ${\mathcal{P}}_{{a}{\leadsto}{b}}$ , each meta-path ${\mathcal{P}}_{{(a\leadsto b)}_{{i}}}{\in}{\mathcal{P}}_{{a}{\leadsto}{b}}$ is assigned a weight (its topic similarity weight, or 1). All calculated meta-path weights are then collected into a column vector, denoted as $\overrightarrow{{{W}}_{{\mathcal{P}}_{{a}{\leadsto}{b}}}}$ , in which each row holds the weight of the meta-path in the corresponding row of the matrix ${{X}}_{{\mathcal{P}}_{{a}{\leadsto}{b}}}$ . Next, we train the W-MPP2Vec model with a single-hidden-layer neural network (NN). The activation function of the W-MPP2Vec model computes the occurrence probability of meta-path ${\mathcal{P}}_{{i}}$ between ( ${a}$ ) and ( ${b}$ ), which are linked by the set of existing meta-paths ${\mathcal{P}}_{{a}{\leadsto}{b}}$ , denoted as $\text{Prob}({\mathcal{P}}_{{i}}|\langle{a,b,}{\mathcal{P}}_{{a}{\leadsto}{b}}\rangle)$ (see Eq. (2)). The activation function is defined as follows (Eq. (4)):

$\displaystyle\text{Prob}({\mathcal{P}}_{i}|\langle a,b,{\mathcal{P}}_{a\leadsto b}\rangle)=\textit{softmax}(X_{a}\overrightarrow{a}\cdot X_{b}\overrightarrow{b}\cdot\sigma(X_{{\mathcal{P}}_{a\leadsto b}}\overrightarrow{{\mathcal{P}}_{a\leadsto b}}\cdot\overrightarrow{W_{{\mathcal{P}}_{a\leadsto b}}}))$ (4)

where:

•
$\overrightarrow{W_{{\mathcal{P}}_{{a}{\leadsto}{b}}}}$ denotes the weight column vector of the meta-paths ${\mathcal{P}}_{{a}{\leadsto}{b}}$ between ( ${a}$ ) and ( ${b}$ ).
•
$\sigma$ denotes the sigmoid function, $\sigma(x)=\frac{1}{1+e^{-x}}$ , applied here to $x={{X}}_{{\mathcal{P}}_{{a}{\leadsto}{b}}}\overrightarrow{{\mathcal{P}}_{{a}{\leadsto}{b}}}{\cdot}\overrightarrow{W_{{\mathcal{P}}_{{a}{\leadsto}{b}}}}$ .

In general, from a given HIN we generate the training data, denoted as ${T}$ , which is a set of training tuples ( ${t}{\in}{T}$ ). Each training tuple ( ${t}$ ) has the form ${t=}\langle{a,b,}{\mathcal{P}}_{{a}{\leadsto}{b}},E({\mathcal{P}}_{{i}})\rangle$ , where $E({\mathcal{P}}_{{i}})$ takes a binary value (0 or 1) indicating the existence of meta-path ${\mathcal{P}}_{{i}}$ between ( ${a}$ ) and ( ${b}$ ). In the network representation learning phase, to optimize the parameters of the W-MPP2Vec model we define the objective function, denoted as ${\mathcal{O}}_{\langle{a,b,}{\mathcal{P}}_{{a}{\leadsto}{b}},E({\mathcal{P}}_{{i}})\rangle}$ , as follows (Eq. (5)):

$\displaystyle\mathcal{O}\propto\log\mathcal{O}=\sum_{t=\langle a,b,{\mathcal{P}}_{a\leadsto b},E({\mathcal{P}}_{i})\rangle,\,t\in T}\log{\mathcal{O}}_{\langle a,b,{\mathcal{P}}_{a\leadsto b},E({\mathcal{P}}_{i})\rangle}$ (5)

The objective ${\mathcal{O}}_{\langle{a,b,}{\mathcal{P}}_{{a}{\leadsto}{b}},E({\mathcal{P}}_{{i}})\rangle}$ of each training tuple is its likelihood, with ${\mathcal{O}}_{\langle{a,b,}{\mathcal{P}}_{{a}{\leadsto}{b}},E({\mathcal{P}}_{{i}})\rangle}=\left\{\begin{array}{ll}\text{Prob}({\mathcal{P}}_{{i}}|\langle{a,b,}{\mathcal{P}}_{{a}{\leadsto}{b}}\rangle), & E({\mathcal{P}}_{{i}})=1\\ 1-\text{Prob}({\mathcal{P}}_{{i}}|\langle{a,b,}{\mathcal{P}}_{{a}{\leadsto}{b}}\rangle), & E({\mathcal{P}}_{{i}})=0\end{array}\right.$ . Accordingly, ${\log}{\mathcal{O}}_{\langle{a,}{b,}{\mathcal{P}}_{{a}{\leadsto}{b}},E({\mathcal{P}}_{{i}})\rangle}$ is computed as: ${E}({\mathcal{P}}_{{i}})\log\text{Prob}({\mathcal{P}}_{{i}}|\langle{a,b,}{\mathcal{P}}_{{a}{\leadsto}{b}}\rangle){+}[{1-E}({\mathcal{P}}_{{i}})]\log[1-\text{Prob}({\mathcal{P}}_{{i}}|\langle{a,b,}{\mathcal{P}}_{{a}{\leadsto}{b}}\rangle)]$ . Finally, stochastic gradient descent (SGD) is applied to maximize the objective function ${\mathcal{O}}_{\langle{a,b,}{\mathcal{P}}_{{a}{\leadsto}{b}},E({\mathcal{P}}_{{i}})\rangle}$ : the weights of the latent spaces ${{X}}_{{a}}\overrightarrow{{a}}$ , ${{X}}_{{b}}\overrightarrow{{b}}$ and ${{X}}_{{\mathcal{P}}_{{a}{\leadsto}{b}}}\overrightarrow{{\mathcal{P}}_{{a}{\leadsto}{b}}}$ are adjusted according to the gradients of ${\log}{\mathcal{O}}_{\langle{a,b,}{\mathcal{P}}_{{a}{\leadsto}{b}},E({\mathcal{P}}_{{i}})\rangle}$ via the backpropagation process, where ${\eta}$ is the learning rate, as given in Eqs. (6)–(8):

$\displaystyle X_{a}\overrightarrow{a}=X_{a}\overrightarrow{a}+\eta\frac{\partial\log{\mathcal{O}}_{\langle a,b,{\mathcal{P}}_{a\leadsto b},E({\mathcal{P}}_{i})\rangle}}{\partial(X_{a}\overrightarrow{a})}$ (6)

$\displaystyle X_{b}\overrightarrow{b}=X_{b}\overrightarrow{b}+\eta\frac{\partial\log{\mathcal{O}}_{\langle a,b,{\mathcal{P}}_{a\leadsto b},E({\mathcal{P}}_{i})\rangle}}{\partial(X_{b}\overrightarrow{b})}$ (7)

$\displaystyle X_{{\mathcal{P}}_{a\leadsto b}}\overrightarrow{{\mathcal{P}}_{a\leadsto b}}=X_{{\mathcal{P}}_{a\leadsto b}}\overrightarrow{{\mathcal{P}}_{a\leadsto b}}+\eta\frac{\partial\log{\mathcal{O}}_{\langle a,b,{\mathcal{P}}_{a\leadsto b},E({\mathcal{P}}_{i})\rangle}}{\partial(X_{{\mathcal{P}}_{a\leadsto b}}\overrightarrow{{\mathcal{P}}_{a\leadsto b}})}$ (8)
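The per-tuple log-likelihood and the form of the update rule can be illustrated with a deliberately simplified example. The sketch below does not implement Eq. (4): it assumes a toy score $p=\sigma(x\cdot y)$ between two small vectors, for which the gradient of the log-likelihood with respect to $x$ is $(E-p)\,y$. The function names and the numeric values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def log_objective(prob, exists):
    """Per-tuple log-likelihood: E*log(p) + (1-E)*log(1-p)."""
    return exists * math.log(prob) + (1 - exists) * math.log(1.0 - prob)


def ascent_step(x, y, exists, eta):
    """One gradient-ascent update of x for the toy score p = sigmoid(x . y).

    For this toy score the gradient of the log-likelihood w.r.t. x is (E - p) * y.
    """
    p = sigmoid(dot(x, y))
    return [xi + eta * (exists - p) * yi for xi, yi in zip(x, y)]


# Toy vectors (made-up values) and a positive tuple (E = 1).
x, y = [0.1, -0.2], [0.4, 0.3]
before = log_objective(sigmoid(dot(x, y)), 1)
x_new = ascent_step(x, y, exists=1, eta=0.5)
after = log_objective(sigmoid(dot(x_new, y)), 1)  # larger than `before`
```

The update has the same shape as Eqs. (6)–(8): the current vector plus the learning rate times the gradient of the log-objective.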

The feedforward and backpropagation processes are repeated until the learning model converges (illustrated in Fig. 7). The pseudo code for the overall steps of our W-MPP2Vec model is given in Algorithm 1.

Algorithm 1. Pseudo code for undirected same-typed node random walk

Input: a given heterogeneous network, denoted as: ${G=(V,E)}$ , expected predictive relation ( ${\mathcal{P}}_{{i}}$ ), number of walks ( ${w}$ ), walk length ( ${l}$ ), negative batch size (neg_size), dimension of embedding vector ( ${d}$ ) and learning rate ( $\eta$ ).

Output: node embedding ${{X}}_{{V}}$ for each node ${v}{\in}{V}$

1: Set: training set $\bm{{T}}=[]$

2: Generate: pairwise nodes $\text{walks}=\{{\langle{a,b}\rangle}_{{1}},\dots{\langle{a,b}\rangle}_{{n}}\}$ by using random walk on ${G}$ with given parameters ( ${w}$ ) and ( ${l}$ ).

3: For pairwise node $\langle{a,b}\rangle$ in walks:

4: Generate: $t=\langle{a,b,}{\mathcal{P}}_{{a}{\leadsto}{b}},E({\mathcal{P}}_{{i}})\rangle$

5: Generate: negative entries ${\text{neg}}_{{t}}{\leftarrow}{t}$ from ${G}$ , where

${\text{neg}}_{{t}}=\{\langle a^{\prime},b^{\prime},\mathcal{P}^{\prime}_{a\leadsto b},E({\mathcal{P}}_{{i}})^{\prime}\rangle_{{1}},\dots,\langle a^{\prime},b^{\prime},\mathcal{P}^{\prime}_{a\leadsto b},E({\mathcal{P}}_{{i}})^{\prime}\rangle_{\text{neg\_size}}\}$ .

6: For meta-path: ${\mathcal{P}}_{{(a\leadsto b)}_{{i}}}$ in ${\mathcal{P}}_{{a}{\leadsto}{b}}$

7: Calculate: ${\text{w\_topsim}}({\mathcal{P}}_{{(a\leadsto b)}_{{i}}})$ (Eq. (1))

8: End for

9: Add: ${T}{\leftarrow}{t}$

10: End for

11: Set: iteration $=$ 0

12: While iteration $<$ max_iteration:

13: Sampling: $\text{Prob}({\mathcal{P}}_{{i}}|\langle{a,b,}{\mathcal{P}}_{{a}{\leadsto}{b}}\rangle)$ from each training tuple ( ${t,t}{\in}{T}$ ) (Eq. (4)).

14: Generate: embedding matrices ${{X}}_{{a}}\overrightarrow{{a}}$ , ${{X}}_{{b}}\overrightarrow{{b}}$ , ${{X}}_{{\mathcal{P}}_{{a}{\leadsto}{b}}}\overrightarrow{{\mathcal{P}}_{{a}{\leadsto}{b}}}$ with corresponding vector dimension ( ${d}$ ).

15: Update: ${{X}}_{{a}}\overrightarrow{{a}}$ , ${{X}}_{{b}}\overrightarrow{{b}}$ (Eqs (6) and (7))

16: Update: ${{X}}_{{\mathcal{P}}_{{a}{\leadsto}{b}}}\overrightarrow{{\mathcal{P}}_{{a}{\leadsto}{b}}}$ (Eq. (8))

17: Update: iteration $+=$ 1

18: End while

Figure 7.
Illustration of the feedforward and backpropagation processes of the W-MMP2Vec model.

To generate the training set for the W-MPP2Vec model, we apply a random walk mechanism (Algorithm 1, line 2) that is guided by the target meta-path ${\mathcal{P}}_{{i}}$ . As in other network embedding approaches, the random walks of W-MPP2Vec have two parameters: the number of walks ( ${w}$ ) and the walk length ( ${l}$ ). Starting at a specific node, the random walk captures, for each source node, an ${l}$ -number of related nodes that are connected to the source node via ${\mathcal{P}}_{{i}}$ within its ${w}$ -hop neighborhood. Using random walks to generate sets of node pairs, instead of enumerating all node pairs in a given HIN, is computationally efficient in terms of both resource usage and time complexity. For model learning, we adopt the negative sampling idea of the Word2Vec model [32] to speed up training. The random walk mechanism generates the positive samples that are used as input data for the neural network. Then, for each training tuple $\langle{a,b,}{\mathcal{P}}_{{a}{\leadsto}{b}},E({\mathcal{P}}_{{i}})\rangle$ , we use negative sampling to obtain corresponding entries of the form $\langle{a^{\prime},b^{\prime},}{\mathcal{P}}^{\prime}_{{a}{\leadsto}{b}},E({\mathcal{P}}_{{i}})^{\prime}\rangle$ , where ${a^{\prime}}$ , ${b^{\prime}}$ , the set of relations ${\mathcal{P}}^{\prime}_{{a}{\leadsto}{b}}$ and the target relation ${\mathcal{P}}_{{i}}$ are randomly selected from the given network. These negative entries indicate that other node pairs $\langle{a^{\prime},b^{\prime}}\rangle$ have specific sets of relations ${\mathcal{P}}^{\prime}_{{a}{\leadsto}{b}}$ but do not have the ${\mathcal{P}}_{{i}}$ relation (see Algorithm 1, line 5). For each training tuple we generate a number of negative entries given by the negative batch size parameter (neg_size), normally 5.
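The negative-entry generation step (Algorithm 1, line 5) can be sketched as follows. This is a minimal sketch under stated assumptions: the network is represented by a hypothetical dictionary from node pairs to the set of meta-paths that link them, sampling is uniform over node pairs, duplicates are allowed, and the names `sample_negatives` and `max_tries` are ours rather than the authors'.

```python
import random


def sample_negatives(nodes, relations, target, neg_size, rng, max_tries=10000):
    """Draw up to `neg_size` node pairs that do not have the `target` meta-path.

    nodes: list of same-typed node identifiers.
    relations: dict mapping frozenset({a, b}) to the set of meta-paths
               that currently link the two nodes.
    Returns tuples (a', b', existing_metapaths, 0), where 0 is the label E(P_i)'.
    """
    negatives = []
    for _ in range(max_tries):          # bounded, in case few negatives exist
        if len(negatives) == neg_size:
            break
        a, b = rng.sample(nodes, 2)     # two distinct nodes
        existing = relations.get(frozenset((a, b)), frozenset())
        if target not in existing:
            negatives.append((a, b, existing, 0))
    return negatives


# Toy author network (made-up relations); the target relation is A-P-A.
authors = ["a1", "a2", "a3", "a4"]
rels = {
    frozenset(("a1", "a2")): {"A-P-A", "A-O-A"},
    frozenset(("a1", "a3")): {"A-O-A"},
    frozenset(("a2", "a4")): {"A-P-V-P-A"},
}
negs = sample_negatives(authors, rels, "A-P-A", neg_size=5, rng=random.Random(0))
```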
4. Experiments and discussions

In this section, we perform comprehensive experimental studies to demonstrate the effectiveness of the proposed W-MPP2Vec model for multiple meta-path-based heterogeneous information network representation learning on fundamental network mining tasks, including link prediction and node similarity search. The experimental results show that the proposed framework outperforms recent state-of-the-art network embedding techniques, including DeepWalk [19], LINE [20] (LINE_1, LINE_2), Node2Vec [22], PTE [21], Metapath2Vec [17], Metagraph2Vec [18] and PME [3].

4.1 Experimental settings and dataset usage

4.1.1 Experimental dataset usage

We perform our experiments on three real-world datasets: the DBLP bibliographic network, the MovieLens100K movie network and the BlogCatalog social network. These networked datasets are:

•
DBLP bibliographic network (http://dblp.uni-trier.de/): widely regarded as the most well-known bibliographic network and commonly used for experimental and comparative studies of network analysis and mining models. For the content of the papers in DBLP, we collected over 1.5M paper abstracts from the AMiner repository (https://aminer.org/). We then used the LDA topic model to extract the distribution of topics (number of topics, $k=10$ ) over these 1.5M abstracts. These topic distributions are then used to compute the topic similarity weights of the related meta-paths. Overall, the DBLP network used for the experimental studies in this paper has about 1.5M papers, 2.2M authors, 5,560 venues/conferences and 1,605 journals.
•
MovieLens100K movie network (https://grouplens.org/datasets/movielens/): contains about 100K ratings by 1000 users for over 1700 movies belonging to 18 genres. We collected the descriptions of the 1.7K movies from the IMDB (https://www.imdb.com/) and TMDB (https://www.themoviedb.org/) repositories, and then applied the LDA topic model to extract topic distributions from these movie descriptions. The number of topics extracted from the 1.7K movies is equal to the number of indexed genres ( $k=18$ ). As with the DBLP dataset, these topic distributions are used to calculate the topic similarity weights of the specific meta-paths defined in our experiments.
•
BlogCatalog social network (http://socialcomputing.asu.edu/datasets/BlogCatalog): a social network dataset that has no text-based nodes. We use this non-content-based HIN to fairly evaluate the multiple meta-path combination aspect of W-MPP2Vec for link prediction in comparison with recent network embedding techniques. The dataset was developed and released by Arizona State University (ASU, https://www.asu.edu/) and contains over 10K users and 39 groups, which are linked via more than 300K user-user and 14K user-group relations.

For the content-based HINs (DBLP and MovieLens100K), we apply both the topic similarity weights of the evaluated meta-paths and the combination of multiple meta-paths to solve several network mining tasks, including link prediction, node similarity search and clustering. For the non-content-based social network BlogCatalog, we apply only the multiple meta-path combination aspect of W-MPP2Vec when training the node representations, in order to show the efficacy and efficiency of the proposed embedding framework.

4.1.2 Experimental settings

We test our W-MPP2Vec model against 7 state-of-the-art information network representation learning models: DeepWalk, LINE, Node2Vec, PTE, Metapath2Vec, Metagraph2Vec and PME. These models are set up as follows:

•
DeepWalk [19]: a basic network embedding model that captures the contextual nodes of each target node by using a uniform random walk mechanism over w-hop neighbors. DeepWalk learns ${|d|}$ -dimensional node vectors from the given information network and is also considered an edge2edge embedding approach. It is a HoIN-based INE model that treats all nodes and links as the same type. In this experiment, we applied DeepWalk to embed the nodes of the network via specific meta-paths only, which means that the given HINs are converted to HoINs (Hin2HoIN) with a single type of node and link (in the form of the evaluated meta-path). The vector dimension ( ${d}$ ) is set to the same value as for the other embedding models.
•
LINE [20]: similar to DeepWalk in using contextual nodes for node representation, LINE has two versions, LINE_1 and LINE_2, which use the first-order and second-order proximities of nodes in the given information network, respectively. LINE is also a HoIN-based INE model and cannot differentiate between types of nodes and links. For the experiments, LINE is also applied with Hin2HoIN and the same vector dimension ( ${d}$ ) as the other embedding models.
•
PTE [21]: an extended version of LINE that is designed for HINs and splits the heterogeneous network into multiple separate bipartite sub-networks. It then applies the embedding mechanism of LINE to learn the node representations. To set up PTE for the experiments in this paper, we apply its sub-network partitioning mechanism to the specific meta-path evaluated in each experiment. The same embedding vector dimension ( ${d}$ ) is used as for the other models.
•
Node2Vec [22]: a HoIN-based INE model that uses BFS (breadth-first sampling) and DFS (depth-first sampling), controlled by two parameters ( ${p,q}$ ), in its random walk mechanism to extract the contextual nodes of each target node. Node2Vec is considered a node2node embedding approach. For the experiments, Node2Vec is set up by applying Hin2HoIN with each evaluated meta-path, the same vector dimension ( ${d}$ ) as the other embedding models and the default model parameters ( ${p=1,q=1}$ ).
•
Metapath2Vec [17]: a HIN-based INE model that uses a meta-path to guide the random walk process that generates the contextual nodes of each target node in HINs. Metapath2Vec uses heterogeneous skip-grams and negative sampling on same-typed nodes to speed up the network representation training process. It has two versions, Metapath2Vec and Metapath2Vec++, which are mainly applied to homogeneous and heterogeneous network embedding, respectively. We used only the Metapath2Vec++ version in this experimental study. In the experiments, Metapath2Vec uses the same number of walks ( ${w}$ ), walk length ( ${l}$ ), negative sampling batch size (neg_size) and embedding vector dimension ( ${d}$ ) as the other models.
•
Metagraph2Vec [18]: similar to Metapath2Vec, Metagraph2Vec defines a meta-graph-based random walk mechanism for generating contextual nodes. Metagraph2Vec is considered better than Metapath2Vec at handling problems related to short walks in HINs. It also has two versions for homogeneous and heterogeneous network embedding, Metagraph2Vec and Metagraph2Vec++, respectively. We used only the Metagraph2Vec++ version in this experimental study. In the HIN embedding experiments, Metagraph2Vec uses the same number of walks ( ${w}$ ), walk length ( ${l}$ ), negative sampling batch size (neg_size) and embedding vector dimension ( ${d}$ ) as the other models.
•
PME [3]: a very recent HIN-based INE model for the link prediction task. PME learns latent representations of nodes ( ${{v}}_{{i}}$ ) and relations ( ${r}$ ) via a projected node embedding ${{{v}}^{{r}}_{{i}}=M}_{{r}}{\cdot}{{v}}_{{i}}$ . PME uses the constructed relation-specific projection matrices to predict the existence of relations in the form of meta-paths between node pairs. The PME model is set up with the same embedding vector dimension ( ${d}$ ) as the other models.

To sum up, all network embedding models are set up for the experiments with the global model parameters shown in Table 1.

Table 1

Experimental settings for network embedding

Embedding vector dimension ( ${d}$ )	Number of walks per node ( ${w}$ )	Negative sampling batch size (neg_size)	Walk length ( ${l}$ )
128	1000	5	120

4.2 Experimental results and discussions

In this part, we demonstrate comprehensive experiments on three fundamental network mining tasks (link prediction, node similarity search and clustering) by using different network embedding models.

4.2.1 Link prediction task

Experimental setup. Using the network embedding approach, we model link prediction in HINs as a binary classification task that predicts the existence of new relations, in the form of meta-paths, between pairs of nodes. The task is solved by multiple network embedding models on three datasets. Each dataset ( ${D}$ ) is divided into two parts: ${{D}}_{\textit{train}}$ and ${{D}}_{\textit{test}}$ . Several network embedding techniques are then applied to the training and testing parts to produce node representations, denoted as ${{X}}_{\textit{train}}$ and ${{X}}_{\textit{test}}$ , respectively. The node representations in ${{X}}_{\textit{train}}$ are fed to off-the-shelf classification models (such as Logistic Regression, SVM, etc.). To form the training set for the classification models, sets of node pairs are generated with corresponding binary labels: a node pair receives label 1 if it has the target relation ( ${\mathcal{P}}_{{i}}$ ), and 0 otherwise. We use the Hadamard product ( ${\circ}$ ) to form a training feature vector ( $\overrightarrow{{{f}}_{\textit{train}}}$ ) for each training node pair $\langle{a,b}\rangle$ , denoted as $\overrightarrow{{{f}}_{\textit{train}}}=\overrightarrow{{a}}{\circ}\overrightarrow{{b}}$ . The trained classification model is then used to predict the existence of relations between node pairs in ${{D}}_{\textit{test}}$ that do not yet occur in ${{D}}_{\textit{train}}$ . For all link prediction experiments, we use the logistic regression (LR) algorithm. As with the training feature vectors, each testing node pair is used to produce a testing feature vector ( $\overrightarrow{{{f}}_{\textit{test}}}$ ) via the Hadamard product ( ${\circ}$ ). Link prediction thus becomes a binary classification problem that predicts the label (1 if the pair has ${\mathcal{P}}_{{i}}$ , otherwise 0) for each testing feature vector ( $\overrightarrow{{{f}}_{\textit{test}}}$ ). To validate the outputs, the link prediction results of the trained classification model are compared against the set of existing relations in ${{D}}_{\textit{test}}$ . The method for splitting each dataset depends on its network structure and on its node and link features.
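The feature construction described above can be sketched in a few lines. The embeddings, node names and the helper names `hadamard` and `build_features` below are hypothetical; the resulting feature vectors and labels could then be passed to any logistic regression implementation.

```python
def hadamard(u, v):
    """Element-wise (Hadamard) product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [ui * vi for ui, vi in zip(u, v)]


def build_features(pairs, embeddings, positive_pairs):
    """Return (features, labels) for a list of node pairs.

    A pair is labelled 1 if it has the target relation P_i, otherwise 0.
    """
    features, labels = [], []
    for a, b in pairs:
        features.append(hadamard(embeddings[a], embeddings[b]))
        labels.append(1 if frozenset((a, b)) in positive_pairs else 0)
    return features, labels


# Toy 2-dimensional embeddings (made-up values); only (a, b) has the target relation.
emb = {"a": [1.0, 2.0], "b": [3.0, 0.5], "c": [0.0, 4.0]}
X, y = build_features([("a", "b"), ("a", "c")], emb, {frozenset(("a", "b"))})
```

Because the Hadamard product is symmetric, the feature vector of a pair does not depend on the order of its two nodes, which suits undirected relations such as A-P-A.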

Experimental metrics usage. To evaluate the performance of the network embedding models on the link prediction task, we use two main evaluation metrics: the F1 measure (Eq. (9)) and Mean Average Precision (MAP) (Eq. (10)).

$\displaystyle F1=2\frac{PR}{P+R}$ (9)

Where,

•

$P$ (precision): the precision of a given class in classification model, denoted as: $P=\frac{{TP}}{{TP+FP}}$ .

•

$R$ (recall): measure the true positive rate or sensitivity of a given class in classification model, denoted as: $R=\frac{{TP}}{{TP+FN}}$ .

$\displaystyle\textit{MAP}=\frac{1}{Q}\sum_{q=1}^{Q}\textit{AvgP}(q)$ (10)

Where,

•

${Q}$ denotes the total number of queries.

•

${\textit{AvgP}(q)}$ denotes the percentage of top-k ranking outputs for a query ( ${q}$ ) that hit the ground truth over k, denoted as: ${\textit{AvgP}}({q})=\sum^{{k}}_{{i=1}}{\frac{{p@k(q)}}{{i}}}$ .
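A short sketch of both metrics is given below. The F1 function follows Eq. (9) directly. For MAP, the sketch assumes the commonly used information-retrieval form of average precision (the mean of precision@i over the ranks i at which a relevant item appears, normalized by the number of hits found in the list), which may differ in detail from the AvgP expression above. All function names and example data are hypothetical.

```python
def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts (Eq. (9))."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_precision(ranked, relevant):
    """Mean of precision@i over the ranks i of `ranked` that hit `relevant`."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0


def mean_average_precision(queries):
    """MAP over a list of (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Toy example: two queries with made-up rankings.
queries = [(["x", "y", "z"], {"x", "z"}), (["p", "q"], {"q"})]
f1 = f1_score(tp=8, fp=2, fn=2)
map_value = mean_average_precision(queries)
```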

Figure 8.

Meta-graphs of different datasets which are used in experiments for Metagraph2Vec model.

Experiments on DBLP dataset. For the DBLP bibliographic network, we use several network embedding models to solve the co-authorship (A-P-A) prediction task. ${{D}}_{\textit{train}}$ and ${{D}}_{\textit{test}}$ are split according to the publication year of each paper. The training set ( ${{D}}_{\textit{train}}$ ) contains papers and their relationships with related nodes (authors, affiliations, venues, journals) from 1985 to 2005, and the testing set ( ${{D}}_{\textit{test}}$ ) contains papers published from 2006 to the present (2019). Multiple network embedding models are applied to embed the author nodes of both sets. For the HoIN-based INE models (DeepWalk, LINE (LINE_1, LINE_2), PTE and Node2Vec), we applied the A-P-A meta-path to form a HoIN with only author nodes and the authorship relation. The HIN-based INE models Metapath2Vec and PME also use the A-P-A meta-path as the main meta-path for training the network embedding model. Metagraph2Vec uses a meta-graph that combines two meta-paths (A-O-A and A-P-V-P-A) (Fig. 8A). To demonstrate the effectiveness of combining multiple meta-paths and of using topic similarities between text-based nodes as meta-path weights, W-MPP2Vec is implemented with two combined meta-paths ${{P}}_{{a}{\leadsto}{b}}$ : A-P-V-P-A and A-O-A, to predict the existence of ${\mathcal{P}}_{{i}}$ : A-P-A. The topic similarities of paper nodes are used as the weights for path instances of the meta-path A-P-V-P-A. Table 2 shows the average 10-fold cross-validation results for co-authorship link prediction on the DBLP bibliographic network with multiple network embedding techniques.

Table 2

Performance evaluation of co-authorship A-P-A link prediction on DBLP dataset in terms of MAP and F1 metrics

	MAP	F1
DeepWalk	0.16723	0.68293
LINE_1	0.13728	0.55621
LINE_2	0.14721	0.58271
PTE	0.28721	0.64213
Node2Vec	0.31828	0.66212
Metapath2Vec	0.42612	0.72812
Metagraph2Vec	0.43213	0.73261
PME	0.47261	0.74472
W-MMP2Vec	0.49218	0.75268

Table 3

Experimental outputs for co-authorship A-P-A link prediction on DBLP dataset with different training set size (%) in terms of Micro-F1 and Macro-F1

Training set (%)	10%	20%	30%	40%	50%	60%	70%	80%	90%
Metapath2Vec
Micro-F1	0.39827	0.43281	0.44291	0.45872	0.53827	0.56881	0.66882	0.71927	0.74982
Macro-F1	0.38445	0.42827	0.43682	0.44682	0.52212	0.55281	0.65281	0.72812	0.74621
Metagraph2Vec
Micro-F1	0.39281	0.41827	0.44681	0.47281	0.54927	0.56218	0.66217	0.73382	0.75281
Macro-F1	0.38271	0.42733	0.44291	0.46281	0.53281	0.54821	0.65812	0.74081	0.74821
PME
Micro-F1	0.44212	0.47928	0.52092	0.54782	0.59287	0.65281	0.69212	0.73912	0.78257
Macro-F1	0.43728	0.48271	0.52937	0.54281	0.58271	0.64332	0.68921	0.74291	0.77582
W-MPP2Vec
Micro-F1	0.45728	0.51728	0.56281	0.57928	0.60283	0.66782	0.70281	0.75212	0.79523
Macro-F1	0.47281	0.50281	0.55821	0.57281	0.61827	0.65829	0.69291	0.76821	0.78264

Figure 9.

Co-authorship A-P-A link prediction accuracy in terms of Micro-F1.

Figure 10.

Co-authorship A-P-A link prediction accuracy in terms of Macro-F1.

As shown by the experimental results in Table 2, the proposed W-MPP2Vec model achieves better node representations for the link prediction task than both HoIN-based and HIN-based state-of-the-art embedding models. Compared with the HoIN-based network embedding models (DeepWalk, LINE_1, LINE_2, PTE and Node2Vec), W-MPP2Vec achieves relative improvements of 12.03% and 23.27% in terms of the F1 measure and MAP, respectively. Against the HIN-based embedding models, W-MPP2Vec improves by about 15.5% and 3.37% (Metapath2Vec), 13.89% and 2.73% (Metagraph2Vec), and 4.14% and 1.06% (PME) in terms of MAP and the F1 measure, respectively. To test the stability of the network embedding models with different dataset sizes, we vary the size of the training set ( ${{D}}_{\textit{train}}$ ) of the DBLP network from 10% to 90% (as shown in Figs 9 and 10) and calculate the average performance in terms of Micro-F1 and Macro-F1, as shown in Table 3.
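The relative gains over the HIN-based baselines can be recomputed directly from Table 2 as (ours − baseline) / baseline. The short sketch below does this for MAP and F1; the dictionary `table2` simply copies the values from Table 2, and the helper name `relative_gain` is ours.

```python
# (MAP, F1) values copied from Table 2.
table2 = {
    "Metapath2Vec": (0.42612, 0.72812),
    "Metagraph2Vec": (0.43213, 0.73261),
    "PME": (0.47261, 0.74472),
    "W-MMP2Vec": (0.49218, 0.75268),
}


def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


ours_map, ours_f1 = table2["W-MMP2Vec"]
gains = {
    name: (relative_gain(ours_map, m), relative_gain(ours_f1, f))
    for name, (m, f) in table2.items()
    if name != "W-MMP2Vec"
}
```

Under this definition, the MAP gains come out at roughly 15.5%, 13.9% and 4.1%, and the F1 gains at roughly 3.4%, 2.7% and 1.1%, over Metapath2Vec, Metagraph2Vec and PME, respectively.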

Experiments on MoviesLen100K & BlogCatalog datasets. For each dataset, we split the network into two parts with different proportions: training ${{D}}_{\textit{train}}$ (from 10% to 90%) and testing ${{D}}_{\textit{test}}$ (from 10% to 30%). As in the experiments on the DBLP network, we train our prediction model on ${{D}}_{\textit{train}}$, and for ${{D}}_{\textit{test}}$ we remove all target relations to be predicted ( ${\mathcal{P}}_{{i}}$ ). Then, we apply the network embedding techniques to transform pairwise nodes of both ${{D}}_{\textit{train}}$ and ${{D}}_{\textit{test}}$ into feature vectors ( $\overrightarrow{{{f}}_{\textit{train}}}$ , $\overrightarrow{{{f}}_{\textit{test}}}$ ) via the Hadamard product. The LR classifier is trained on the feature vectors of ${{D}}_{\textit{train}}$ to produce the link prediction model, which then predicts the existence of the expected relations ${\mathcal{P}}_{i}$ in each network. For the MoviesLen100K network, we apply the network embedding approach to predict U-M-U relations (two users who watch and rate the same set of movies). W-MMP2Vec is trained with two combined meta-paths $P_{a\leadsto b}$ : U-M-A-M-U and U-M-G-M-U to predict the existence of ${\mathcal{P}}_{i}$ : U-M-U relations. Both the U-M-A-M-U and U-M-G-M-U meta-paths are weighted by the topic similarity between the sets of movie descriptions. In this dataset, the HoIN-based embedding models (DeepWalk, LINE, PTE and Node2Vec) are applied after the Hin2HoIN transformation, with the meta-path U-M-U as the main relation between user nodes. For the PME and Metapath2Vec models, we also used the U-M-U meta-path, and for the Metagraph2Vec model we combined two meta-paths, U-M-U and U-M-G-M-U, to form a meta-graph (Fig. 8B).
Similarly, for the BlogCatalog network we apply the network embedding approach to the friendship relation prediction task, in the form of the meta-path ${\mathcal{P}}_{i}$ : U-U, and W-MMP2Vec is trained with a combined meta-path $P_{a\leadsto b}$ : U-G-U, which represents the relation between two users who have joined common groups. As with the W-MMP2Vec model, the meta-path U-U is used as the main relation for all HoIN-based models (DeepWalk, LINE, PTE and Node2Vec) and HIN-based models (Metapath2Vec and PME). For the Metagraph2Vec model, we used a combination of the U-U and U-G-U meta-paths to form a meta-graph (Fig. 8C).
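The pairwise feature construction and classification step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are hypothetical 2-dimensional toy vectors, and the LR classifier is a plain batch gradient-descent logistic regression written with the standard library only (the section does not state which LR implementation or hyper-parameters were used).

```python
import math


def hadamard(u, v):
    """Element-wise (Hadamard) product of two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return [a * b for a, b in zip(u, v)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=300):
    """Batch gradient descent on the logistic loss (no regularisation)."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def link_probability(weights, bias, u, v):
    """Predicted probability that a link exists between the two nodes."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, hadamard(u, v))))


# Hypothetical node embeddings (illustration only, not learned by any model).
emb = {
    "u1": [1.0, 0.9], "u2": [0.9, 1.0], "u5": [0.8, 1.0],
    "u3": [-1.0, -0.8], "u4": [-0.9, -1.0],
}
# (node a, node b, label): 1 = target relation exists, 0 = it does not.
train_pairs = [("u1", "u2", 1), ("u3", "u4", 1), ("u1", "u3", 0), ("u2", "u4", 0)]

f_train = [hadamard(emb[a], emb[b]) for a, b, _ in train_pairs]
y_train = [label for _, _, label in train_pairs]
weights, bias = train_logistic_regression(f_train, y_train)
```

The Hadamard product keeps the pair feature at the same dimension as the node embeddings and is symmetric in the two nodes, which suits undirected relations such as U-M-U and U-U.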

For both datasets, we vary the size of the training set ( ${{D}}_{\textit{train}}$ ) from 10% to 90% and use the remaining network for testing ( ${{D}}_{\textit{test}}$ ). We apply 10-fold cross-validation to the experiments on each dataset and report the average performance in terms of the MAP and F1 metrics. Table 4 shows the experimental outputs of the U-M-U and U-U link prediction tasks on the MoviesLen100K and BlogCatalog networks, respectively.
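The exact MAP computation is not given in this section. As a hedged illustration, the sketch below uses the standard definition of mean average precision over ranked candidate lists, where each list holds binary relevance flags (1 if the candidate link truly exists) ordered by predicted score; the function names and toy rankings are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked list of 0/1 relevance flags."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


# Toy rankings for two query nodes (illustration only).
map_score = mean_average_precision([[1, 0, 1], [0, 1]])
```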

Table 4

Performance evaluation of U-M-U and U-U link predictions on MoviesLen100K and BlogCatalog datasets in terms of MAP and F1 metrics

	MoviesLen100K		BlogCatalog
	MAP	F1	MAP	F1
DeepWalk	0.26724	0.67281	0.16829	0.62172
LINE_1	0.18271	0.58271	0.12762	0.52718
LINE_2	0.16281	0.562827	0.13226	0.54027
PTE	0.29671	0.67281	0.182561	0.58271
Node2Vec	0.32627	0.69275	0.21726	0.61562
Metapath2Vec	0.37682	0.75621	0.35672	0.76571
Metagraph2Vec	0.38774	0.768271	0.34582	0.75645
PME	0.39281	0.77572	0.34829	0.75872
W-MMP2Vec	0.40827	0.78261	0.36271	0.77268

Table 5

Experimental outputs for U-M-U link prediction on MoviesLen100K dataset with different training set size (%) in terms of Micro-F1 and Macro-F1

Training set (%)	10%	20%	30%	40%	50%	60%	70%	80%	90%
Metapath2Vec
Micro-F1	0.45261	0.48271	0.54261	0.55726	0.65718	0.67281	0.75281	0.74627	0.76582
Macro-F1	0.44281	0.47829	0.53271	0.54271	0.64612	0.66728	0.73925	0.75627	0.75921
Metagraph2Vec
Micro-F1	0.45827	0.49281	0.55686	0.58582	0.66826	0.69827	0.75261	0.77261	0.77281
Macro-F1	0.43991	0.48824	0.54281	0.58271	0.65225	0.68291	0.74271	0.76271	0.76281
PME
Micro-F1	0.51482	0.56271	0.62923	0.65217	0.69271	0.74621	0.77216	0.78527	0.79271
Macro-F1	0.50281	0.56028	0.61261	0.64271	0.68271	0.73275	0.76271	0.78271	0.80213
W-MMP2Vec
Micro-F1	0.54829	0.60272	0.64281	0.67821	0.70172	0.75822	0.77286	0.79271	0.82612
Macro-F1	0.53722	0.58971	0.63382	0.66261	0.69281	0.74261	0.76268	0.78271	0.81282

The experimental outputs (as shown in Table 4) show that the W-MMP2Vec model also achieves better performance than recent state-of-the-art network representation models on both the content-based (MoviesLen100K) and non-content-based (BlogCatalog) datasets. On the content-based MoviesLen100K network, W-MMP2Vec substantially outperforms the HoIN-based techniques, by about 25.13% and 12.97% (Node2Vec), 37.59% and 16.31% (PTE), 23.71% and 36.67% (average of LINE_1 and LINE_2), and 52.77% and 16.31% (DeepWalk) in terms of the MAP and F1 metrics, respectively. Compared with the HIN-based approaches, our proposed W-MMP2Vec model also slightly improves link prediction accuracy, by approximately 3.93%, 5.29% and 8.34% over PME, Metagraph2Vec and Metapath2Vec, respectively. In the experiments on the non-content-based BlogCatalog network for predicting the users' friendship (U-U) relation, the W-MMP2Vec model outperforms the HoIN-based models by around 21.9% and 33.79% in terms of the MAP and F1 metrics. Compared with recent HIN-based embedding models, W-MMP2Vec also consistently achieves higher performance, by 3.54% and 1.62% on average in terms of the MAP and F1 metrics, respectively. With these two datasets, we also evaluated the stability of each embedding model by testing it on training sets of different sizes, varying from 10% to 90% (as shown in Fig. 11). Tables 5 and 6 show the experimental outputs for U-M-U and U-U link predictions on the MoviesLen100K and BlogCatalog networks, respectively. As these tables show, our proposed W-MMP2Vec framework works stably across training sets of different sizes, which suggests that it can be applied effectively to networks with different structures.
To sum up, across several link prediction experiments on multiple types of HINs, the results show that our proposed W-MMP2Vec model generates more effective node embeddings in the context of multi-typed nodes and links than well-known, up-to-date network representation learning baselines. W-MMP2Vec improves the accuracy of the link prediction task in HINs by combining the existing relations between pairwise nodes, in the form of meta-paths, with the topic similarity between text-based nodes, used as meta-path weights in the process of network representation learning.
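The gains quoted against the HIN-based models on MoviesLen100K appear consistent with a relative improvement computed on the MAP column of Table 4, i.e. (ours − baseline) / baseline. The short check below is our own reading of the numbers rather than a stated formula from the paper; the helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(ours, baseline):
    """Relative gain of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# MAP values on MoviesLen100K, copied from Table 4.
w_mmp2vec_map = 0.40827
baseline_map = {
    "PME": 0.39281,
    "Metagraph2Vec": 0.38774,
    "Metapath2Vec": 0.37682,
}

gains = {
    name: relative_improvement(w_mmp2vec_map, value)
    for name, value in baseline_map.items()
}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f}%")
```

Under this reading, the computed values agree with the reported 3.93%, 5.29% and 8.34% to within rounding.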

Table 6

Experimental outputs for friendship U-U link prediction on BlogCatalog dataset with different training set size (%) in terms of Micro-F1 and Macro-F1

Training set (%)	10%	20%	30%	40%	50%	60%	70%	80%	90%
Metapath2Vec
Micro-F1	0.39782	0.43728	0.50881	0.55879	0.618729	0.64928	0.68994	0.74086	0.76213
Macro-F1	0.38927	0.42721	0.49271	0.54782	0.60897	0.64721	0.68271	0.73621	0.76082
Metagraph2Vec
Micro-F1	0.40281	0.45876	0.52874	0.58281	0.63487	0.66071	0.69078	0.74612	0.77982
Macro-F1	0.39281	0.44821	0.51821	0.57271	0.62813	0.65821	0.68921	0.73627	0.76921
PME
Micro-F1	0.50281	0.56872	0.58863	0.62894	0.66865	0.68862	0.74821	0.76092	0.78627
Macro-F1	0.49271	0.55271	0.57628	0.61281	0.65821	0.67864	0.73271	0.75271	0.77482
W-MMP2Vec
Micro-F1	0.51927	0.56938	0.58927	0.63281	0.66872	0.69827	0.74608	0.77089	0.79826
Macro-F1	0.51821	0.56721	0.58271	0.62782	0.65721	0.68271	0.73821	0.76267	0.78325

Figure 11.

Experimental outputs for U-M-U and U-U link predictions on MoviesLen100K and BlogCatalog datasets, in terms of Micro-F1 and Macro-F1, respectively.

4.2.2 Node similarity search task

Experimental setup. In this part, we conduct experiments on author and venue node similarity search with the DBLP bibliographic network. Each network embedding technique is used to transform author and venue nodes into embedded vectors, and we then use cosine similarity to determine the top-k relevant nodes for a specific target node. As in the link prediction experiments, we used the HoIN-based network embedding models to learn the node representations of the Hin2HoIN-transformed networks via A-P-V-P-A (for author similarity search) and V-P-A-P-V (for venue similarity search). Likewise, to set up the node similarity search for the Metapath2Vec and PME models, we used A-P-V-P-A and V-P-A-P-V as the main meta-paths for authors and venues, respectively. For the Metagraph2Vec model, we used two meta-graphs, each a combination of two meta-paths: A-P-V-P-A and A-O-A for author similarity search, and V-P-V-P-A and V-P-P-P-V for venue similarity search.
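The top-k retrieval step described in this setup can be sketched as a brute-force cosine-similarity ranking over the learned node vectors. The venue embeddings below are hypothetical 3-dimensional toy values chosen only for illustration, and the function names are our own.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(target, embeddings, k=5):
    """Return the k nodes most similar to `target`, best first."""
    query = embeddings[target]
    scored = [
        (node, cosine_similarity(query, vector))
        for node, vector in embeddings.items()
        if node != target  # the query node itself is excluded
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical venue embeddings (illustration only).
toy_embeddings = {
    "KDD": [0.9, 0.1, 0.0],
    "ICDM": [0.8, 0.2, 0.1],
    "SIGMOD": [0.1, 0.9, 0.2],
    "CVPR": [0.0, 0.1, 0.9],
}
nearest = top_k_similar("KDD", toy_embeddings, k=2)
```

For networks of DBLP scale, an approximate nearest-neighbour index would normally replace this exhaustive scan; the section does not say which strategy was used.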

Figure 12.

Example of subject areas/topics of the author “Quoc V. Le”, as indexed by the ACM Digital Library.

Figure 13.

Example of top “Artificial Intelligence” venues/journals, as indexed by Google Scholar.

Experiments of node similarity search on DBLP network. To evaluate the outputs of author and venue similarity search, we use the nDCG (normalized Discounted Cumulative Gain) metric [33], which measures the accuracy of the ranked outputs for each query. To rank the outputs of each node relevance query, we use the set of topics assigned to each author and venue as the main evaluation criterion. For authors' topics, we used the set of assigned “Subject Areas” in the ACM Digital Library. These labelled author topics are categorized following the ACM Computing Classification System (CCS).⁸ Figure 12 illustrates the subject areas or topics labelled for the author “Quoc V. Le”.⁹ For the topics of venues, similar to the work of Dong et al. on the Metapath2Vec model, we used the topics/subjects indexed by Google Scholar for top venues and journals. Figure 13 shows an example of top venues/journals on the “Artificial Intelligence” topic.¹⁰ These labelled topics of authors and venues, extracted from trusted sources (ACM, Google Scholar), are used as the ground truth for evaluating the similarity between nodes. We used the set of common topics shared between two authors (an author may belong to more than one topic/subject) or two venues (a venue belongs to only one topic/subject) to rank the level of relevancy between them. Four levels of similarity score are used to rank the outputs of a node similarity search query, depending on the percentage of common indexed topics (as shown in Table 7). Table 8 shows examples of top@5 author and venue/conference similarity searches in the DBLP network using the W-MMP2Vec model.

⁸ ACM Computing Classification System: https://www.acm.org/publications/class-2012.

⁹ ACM profile of Quoc V. Le: https://dl.acm.org/author_page.cfm?id=81339511241&srt=publicationDate&role=all&dsp=subj.

¹⁰ Top venues/journals of “Artificial Intelligence” topic of Google Scholar: https://scholar.google.com/citations?view_op=top_venues&hl=en&vq=eng_artificialintelligence.

Table 7

The scores for level of node relevancy with the given node in query

Score	Description	Percentage of common indexed topics between two nodes (%)
0	Non-relevant	${<}$ 20
1	Quite relevant	20 ${\sim}$ 49
2	Closely relevant	50 ${\sim}$ 79
3	Very/highly relevant	80 ${\sim}$ 100

Table 8

Examples of top@5 author and venue similarity search on DBLP with the W-MMP2Vec model

Top@5 author similar search in DBLP
Jiawei Han	AnHai Doan	Quoc V. Le	Wei Wang	Christos Faloutsos
Jian Pei	Renée J. Miller	Alex Smola	Hong Cheng	Rakesh Agrawal
Philip S. Yu	Alon Halevy	Oriol Vinyals	Wei Fan	Philip S. Yu
Christos Faloutsos	Samuel Madden	Richard Socher	Xifeng Yan	Charu C. Aggarwal
Raymond T. Ng	Alon Y. Halevy	Tomas Mikolov	Jian Pei	Jiawei Han
Dong Xin	Jayant Madhavan	Christopher D Manning	Haixun Wang	Jian Pei
Top@5 venue similar search
ICML	KDD	SIGMOD	CVPR	SIGKDD
AISTATS	SDM	VLDB	ECCV	ICDM
NIPS	TKDD	ICDE	ICCV Workshops	PKDD
ECML/PKDD	ICDM	PVLDB	BMVC	PAKDD
Machine Learning	WSDM	VLDBJ	ICPR	DMKD
ECML	CIKM	CIDR	ICCV	KDD

For author and venue node similarity search, in each test case we randomly select 100 authors and venues from the DBLP dataset and then retrieve the top@k relevant authors and venues. The accuracy of the outputs for each query is then calculated with nDCG. These experiments are repeated 10 times, and we report the average performance as the final result. We performed the node similarity task with different network embedding models to find the top@5, top@10, top@20 and top@40 relevant authors and venues.
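As an illustration of this evaluation protocol, the sketch below maps the percentage of common indexed topics to the relevance levels of Table 7 and scores a ranked result list with nDCG. It assumes one common formulation (gain divided by log2(rank + 1), with the ideal ranking obtained by re-sorting the returned items); the exact variant used in the experiments is not stated, and all function names are hypothetical.

```python
import math


def relevance_score(common_topic_pct):
    """Map the percentage of common indexed topics to the levels of Table 7."""
    if common_topic_pct >= 80:
        return 3  # very/highly relevant
    if common_topic_pct >= 50:
        return 2  # closely relevant
    if common_topic_pct >= 20:
        return 1  # quite relevant
    return 0      # non-relevant


def dcg(scores):
    """Discounted cumulative gain of relevance scores listed in rank order."""
    return sum(s / math.log2(rank + 1) for rank, s in enumerate(scores, start=1))


def ndcg(scores):
    """DCG normalised by the DCG of the same scores in ideal (sorted) order."""
    ideal = dcg(sorted(scores, reverse=True))
    return dcg(scores) / ideal if ideal > 0 else 0.0


# Toy query (illustration only): topic overlap of the top-4 returned nodes.
overlaps = [85, 40, 60, 10]
query_ndcg = ndcg([relevance_score(p) for p in overlaps])
```

Averaging `ndcg` over the 100 sampled query nodes and the 10 repetitions would give one cell of Tables 9 and 10 under these assumptions.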

Table 9

Accuracy performance for top@k author node similarity search in DBLP via different network embedding models in terms of nDCG metric

	top@5	top@10	top@20	top@40
DeepWalk	0.78271	0.71283	0.67829	0.58271
LINE_1	0.76758	0.68271	0.65621	0.56271
LINE_2	0.77823	0.69782	0.66702	0.54281
PTE	0.81723	0.74821	0.71092	0.55281
Node2Vec	0.79271	0.76728	0.72612	0.61271
Metapath2Vec	0.95271	0.91278	0.85721	0.73621
Metagraph2Vec	0.95921	0.92719	0.86281	0.74665
PME	0.94782	0.92078	0.87261	0.73872
W-MMP2Vec	0.96728	0.93672	0.89271	0.75721

Table 10

Accuracy performance for top@k venue node similarity search in DBLP via different network embedding models in terms of nDCG metric

	top@5	top@10	top@20	top@40
DeepWalk	0.81723	0.76281	0.73782	0.66975
LINE_1	0.77872	0.73087	0.70281	0.66481
LINE_2	0.78271	0.73821	0.70672	0.67971
PTE	0.84721	0.78275	0.76261	0.70867
Node2Vec	0.83278	0.79281	0.75621	0.71869
Metapath2Vec	0.97261	0.95872	0.90281	0.89562
Metagraph2Vec	0.96271	0.95273	0.92658	0.89633
PME	0.97263	0.95721	0.92723	0.90281
W-MMP2Vec	0.98278	0.96271	0.93271	0.91673

Figure 14.

Accuracy performance for top@k authors and venues similarity search on DBLP network.

As shown in the experimental outputs for the author (see Table 9) and venue (see Table 10) node similarity search tasks in DBLP, our proposed W-MMP2Vec performs best compared with both HoIN-based and HIN-based network embedding techniques. For the author relevance search task in particular, W-MMP2Vec outperforms HoIN-based models such as DeepWalk, LINE, PTE and Node2Vec by 28.39% on average. Compared with HIN-based models, W-MMP2Vec also achieves better performance than Metapath2Vec (2.74%), Metagraph2Vec (1.66%) and PME (2.12%). Our proposed W-MMP2Vec also gains higher accuracy in similar venue search than the HoIN-based and HIN-based embedding baselines, by approximately 26.71% and 1.39%, respectively. In summary, the experimental results demonstrate that the W-MMP2Vec model generates better node representations than recent state-of-the-art network embedding baselines for solving the similarity search task in HINs (Fig. 14).

4.2.3 Parameter sensitivity

In this part, we study the parameter sensitivity of our proposed W-MMP2Vec embedding framework on the venue and author node similarity search tasks in the DBLP network. Taking the nDCG accuracy of author and venue similarity search as a function of each parameter, we vary the three main parameters of W-MMP2Vec: walks per node ( $w$ ) (from 100 to 1300), walk length ( $l$ ) (from 60 to 180) and dimension of the embedding vectors ( $d$ ) (from 40 to 160). As shown in Fig. 15, our W-MMP2Vec model achieves balanced performance with walks per node ( $w$ ) at around 1000 ${\sim}$ 1200, walk length ( $l$ ) at about 100 and dimension of the embedding vector ( $d$ ) at about 110 ${\sim}$ 130.

Figure 15.

Parameter sensitivity of the W-MMP2Vec model.

Figure 16.

Model scalability comparisons between different heterogeneous network embedding models.

4.2.4 Efficiency and scalability evaluations

Most real-world heterogeneous networks are complex and contain a huge number of nodes and links, possibly up to billions, as in common social networks (Facebook, Twitter, etc.). In real-world applications, network mining models must therefore be feasible to apply to complex large-scale networks. In this section, we evaluate the scalability and optimization of our proposed W-MMP2Vec model in comparison with recent state-of-the-art heterogeneous network embedding models, including Metapath2Vec, Metagraph2Vec and PME. We conducted the experiments with the default configurations (see Table 1) on the DBLP dataset. Each network embedding model is applied to train the representations of the DBLP network with a different number of processing threads, from 1 to 24. All heterogeneous network embedding models are implemented and evaluated on the same server with an Intel® Xeon® E7-8890 v4 CPU (24 cores) and 64 GB of memory. Figure 16 shows the speedup rate against the number of threads used for the network representation training processes.

As shown in Fig. 16, the speedup rate of the W-MMP2Vec model is closer to the optimal line than that of the other heterogeneous network embedding models. According to the experimental outputs, our proposed W-MMP2Vec model is faster than the other embedding models by about 2.59% (PME), 10.42% (Metapath2Vec) and 14.3% (Metagraph2Vec). The experiments in this section demonstrate the scalability and good optimization of our proposed model.
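The speedup rate is not defined explicitly in this section. Assuming the usual definition (single-thread training time divided by multi-thread training time, with the optimal line being a speedup equal to the number of threads), it can be computed as in the sketch below. The timings are hypothetical placeholders, not measurements from the experiments.

```python
def speedup(single_thread_seconds, multi_thread_seconds):
    """Speedup of a multi-threaded run relative to the single-thread run."""
    return single_thread_seconds / multi_thread_seconds


def parallel_efficiency(single_thread_seconds, multi_thread_seconds, threads):
    """Fraction of the ideal (linear) speedup that was actually achieved."""
    return speedup(single_thread_seconds, multi_thread_seconds) / threads


# Hypothetical training times in seconds, keyed by thread count.
timings = {1: 240.0, 8: 40.0, 24: 15.0}

for threads, seconds in timings.items():
    s = speedup(timings[1], seconds)
    e = parallel_efficiency(timings[1], seconds, threads)
    print(f"{threads:>2} threads: speedup {s:.1f}x, efficiency {e:.2f}")
```

A model whose curve stays near the optimal line keeps its efficiency close to 1.0 as threads are added.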

5. Conclusions

In conclusion, our work in this paper concentrates on defining a novel information network embedding model which improves node representations by applying multiple meta-paths and the topic similarity between pairwise nodes during the network representation learning process. The W-MMP2Vec model captures richer semantics of both nodes and their inter-connected relations, in the form of meta-paths, in content-based HINs. Extensive experiments demonstrate the effectiveness of our proposed topic-driven multiple meta-path-based framework in comparison with recent state-of-the-art network representation learning baselines. The W-MMP2Vec model is designed to work well on content-based HINs, such as bibliographic networks (DBLP, DBIS, etc.) and social networks (Facebook, Twitter, Baidu, etc.), for solving multiple dependent meta-path-based link prediction tasks. Based on thorough empirical studies on several real-world datasets, namely DBLP, MoviesLen100K and BlogCatalog, we believe that our model can be widely applied to tasks related to heterogeneous network analysis and mining. Our future work includes architectural improvements and model optimizations that allow W-MMP2Vec to work effectively and efficiently in the context of big networked data. Most real-world HINs are very large-scale networks with a huge number of nodes (millions to billions) and an even larger number of relations. Such large-scale networks cannot be handled and processed on a single computer. Therefore, W-MMP2Vec needs to be implemented in a distributed processing environment, such as Apache Spark, in order to work effectively on very large networks.

Acknowledgments

This research is funded by Vietnam National University HoChiMinh City (VNU-HCM) under grant number DS2020-26-01.

References

1. Pandey, Bhanodia P.K., Khamparia and Pandey D.K., A comprehensive survey of edge prediction in social networks: techniques, parameters and challenges, Expert Systems with Applications 124 (2019), 164–181.

2. Shi, Zhang, Sun and Philip S.Y., A survey of heterogeneous information network analysis, IEEE Transactions on Knowledge and Data Engineering 29(1) (2016), 17–37.

3. Chen, Yin, Wang, Nguyen Q.V.H. and Li, PME: projected metric embedding on heterogeneous networks for link prediction, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1177–1186.

4. Wang, Zhang, Hou, Xie, Guo and Liu, Shine: Signed heterogeneous information network embedding for sentiment link prediction, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018, pp. 592–600.

5. Wang and Zhou, Link prediction in social networks: the state-of-the-art, Science China Information Sciences 58(1) (2015), 1–38.

6. Gupta, Kumar and Bhasker, HeteClass: a meta-path based framework for transductive classification of objects in heterogeneous information networks, Expert Systems with Applications 68 (2017), 106–122.

7. Cui, Wang, Pei and Zhu, A survey on network embedding, IEEE Transactions on Knowledge and Data Engineering 31(5) (2018), 833–852.

8. Wang, Mao, Wang and Guo, Knowledge graph embedding: a survey of approaches and applications, IEEE Transactions on Knowledge and Data Engineering 29(12) (2017), 2724–2743.

9. Qiao, Zhao, Huang and Chen, A structure-enriched neural network for network embedding, Expert Systems with Applications 117 (2019), 300–311.

10. Sun, Han, Yan, Philip S.Y. and Wu, Pathsim: meta path-based top-k similarity search in heterogeneous information networks, Proceedings of the VLDB Endowment 4(11) (2011), 992–1003.

11. Zhang and Wang, Top-k similarity search in heterogeneous information networks with x-star network schema, Expert Systems with Applications 42(2) (2015), 699–712.

12. Shi, Kong, Huang, Philip S.Y. and Wu, Hetesim: a general framework for relevance measure in heterogeneous networks, IEEE Transactions on Knowledge and Data Engineering 26(10) (2014), 2479–2492.

13. Sun, Han, Zhao, Yin, Cheng and Wu, Rankclus: integrating clustering with ranking for heterogeneous information network analysis, in: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, 2009, pp. 565–576.

14. Sun and Han, Ranking-based clustering of heterogeneous information networks with star network, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 797–806.

15. Guo, Liu and Yao, Classification by multi-semantic meta path and active weight learning in heterogeneous information networks, Expert Systems with Applications 123 (2019), 227–236.

16. T.Y., Lee W.C. and Lei, Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 1797–1806.

17. Dong, Chawla N.V. and Swami, metapath2vec: Scalable representation learning for heterogeneous networks, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 135–144.

18. Zhang, Yin, Zhu and Zhang, MetaGraph2Vec: Complex Semantic Path Augmented Heterogeneous Network Embedding, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2018, pp. 196–208.

19. Perozzi, Al-Rfou and Skiena, Deepwalk: Online learning of social representations, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 701–710.

20. Tang, Wang, Zhang, Yan and Mei, Line: Large-scale information network embedding, in: Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015, pp. 1067–1077.

21. Tang and Mei, Pte: Predictive text embedding through large-scale heterogeneous text networks, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 1165–1174.

22. Grover and Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 855–864.

23. Pham and Ta D.C., W-PathSim: Novel approach of weighted similarity measure in content-based heterogeneous information networks by applying LDA topic modeling, in: Asian Conference on Intelligent Information and Database Systems, Springer, Cham, 2018, pp. 539–549.

24. and Pham, DW-PathSim: a distributed computing model for topic-driven weighted meta-path-based similarity measure in a large-scale content-based heterogeneous information network, Journal of Information and Telecommunication 3(1) (2018), 19–38.

25. and Pham, W-PathSim++: the novel approach of topic-driven similarity search in large-scaled heterogeneous network with the support of Spark-based DataLog, in: 2018 10th International Conference on Knowledge and Systems Engineering (KSE), IEEE, 2018, pp. 102–106.

26. Pham and Do, W-MetaPath2Vec: the topic-driven meta-path-based model for large-scaled content-based heterogeneous information network representation learning, Expert Systems with Applications 123 (2019), 328–344.

27. Pham and Do, W-Metagraph2Vec: a novel approval of enriched schematic topic-driven heterogeneous information network embedding, International Journal of Machine Learning and Cybernetics 11(8) (2020), 1–20.

28. Blei D.M., A.Y. and Jordan M.I., Latent dirichlet allocation, Journal of Machine Learning Research 3(Jan) (2003), 993–1022.

29. Bordes, Usunier, Garcia-Duran, Weston and Yakhnenko, Translating embeddings for modeling multi-relational data, Advances in Neural Information Processing Systems, 2013, 2787–2795.

30. Wang, Zhang, Feng and Chen, Knowledge graph embedding by translating on hyperplanes, in: Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014, pp. 1112–1119.

31. Lin, Liu, Sun, Liu and Zhu, Learning entity and relation embeddings for knowledge graph completion, in: Twenty-ninth AAAI Conference on Artificial Intelligence, 2015, pp. 2181–2187.

32. Mikolov, Chen, Corrado and Dean, Efficient estimation of word representations in vector space, in: 1st International Conference on Learning Representations (ICLR), 2013.

33. Järvelin and Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems (TOIS), 2002, 422–446.

Embedding vector dimension ( ${d}$ )	Number of walks per node ( ${w}$ )	Negative sampling batch size (neg_size)	The size of walk length ( ${l}$ )
128	1000	5	120