Attributed network representation learning via DeepWalk

Abstract

Network representation learning aims to learn a low-dimensional vector for each node in a network and has attracted increasing research interest recently. However, most existing approaches use only the topology information of each node and ignore its attribute information. In this paper, we propose an Improved Attributed Node Random Walks (IANRW) framework, which constructs the neighborhood of an attributed node and then leverages the skip-gram model to learn node embeddings. The method flexibly incorporates both topology and attribute information. Additionally, it can easily deal with missing data and be applied to large networks. Extensive experiments on six datasets show that IANRW outperforms many state-of-the-art embedding models and can improve various attributed network mining tasks.

Keywords

Attributed network, node embeddings, feature learning, network representations

1. Introduction

Network representation learning aims to learn a low-dimensional distributed representation vector for each node in a network, which has proved extremely useful for a wide variety of tasks [1, 2, 3, 4, 5]. As shown in Fig. 1, network representation learning is a bridge that connects a network’s original data with network application tasks. The main idea of node embedding is to distill the high-dimensional information about a node into a dense vector. The resulting vectors can then be used for tasks such as node classification, link prediction, clustering and social recommendation [2, 3, 4]. Node classification and link prediction both require a vector representation of the nodes. The usual approach is to learn the vectors by defining an objective function and solving the corresponding optimization problem. Such methods must trade off computational efficiency against predictive accuracy. Alternatively, some methods design an objective that focuses only on preserving the local neighborhoods of nodes.

Figure 1.

The flow chart of network representation learning.

Up to now, most previous works, such as DeepWalk [6], LINE [7], Node2Vec [8] and SDNE [9], have been designed for plain networks. They investigate only the topology of a network and ignore the node attributes, which are potentially complementary for learning. They assume that nodes with similar topological context should lie close together in the learned representation space. However, nodes in real-world networks almost always carry attribute information. For example, users in social networks may have profiles containing gender, age and address. Node attributes are important for measuring the similarity between nodes: two users may well share very similar interests without being connected in the social network. As node attributes encode much of the information in a network, integrating them into the embedding process is expected to yield better performance. Moreover, the auxiliary attribute information can alleviate link sparsity. Some recent works try to incorporate node attributes and community structure into network embeddings [10, 11, 12, 13]. TADW [14] was the first attempt to incorporate the text features of nodes into the network embedding process under a matrix factorization framework. However, it is very time- and memory-consuming and cannot be applied to large networks. How do we effectively preserve the concept of “word-context” in attributed networks? Can random walks, as used in DeepWalk and node2vec, be applied to attributed networks? Can we directly apply network-oriented embedding architectures (e.g., skip-gram) to attributed networks?

To address the above challenges, we propose the Improved Attributed Node Random Walks (IANRW) framework, which is based on attributed random walks [35]. Attributed random walks learn an embedding for each type, using functions that map node attributes to types; IANRW instead learns an individual embedding for each node. The goal of IANRW is to maximize the likelihood of preserving both the structure and the attributes of nodes in a network. In IANRW, we define two kinds of neighbors: topology neighbors and attribute neighbors. The topology neighbors of a node $v$ are the nodes connected to $v$ by an edge, while the attribute neighbors of $v$ are the nodes whose attributes are similar to those of $v$ . In the IANRW framework, we first find the attribute neighbors of each node from its attributes. If a node has no attributes, its attribute neighbors are ‘None’. Second, we run Improved Attributed Node Random Walks on the attributed network to generate node sequences that mix topology neighbors and attribute neighbors and thus carry network semantics. Finally, we use the skip-gram model [15] to model structurally and semantically close nodes.

Our main contributions are a flexible notion of a node’s two kinds of neighbors and a biased random walk between topology neighbors and attribute neighbors. Moreover, our method tolerates nodes with missing links or attributes, which makes it suitable for incomplete networks. The method is flexible because tunable parameters control the search space. These parameters are easy to interpret and govern the search strategy; they can be fixed in advance or learned directly from a tiny fraction of labeled data. Our experiments on multi-class classification, link prediction and node clustering demonstrate that IANRW outperforms state-of-the-art methods on these tasks. IANRW is also robust to perturbations in the form of missing edges and missing node attributes.

To summarize, our work makes the following contributions:

We propose IANRW, an efficient and scalable framework for attributed network embedding. The framework flexibly incorporates both network topology and node attribute information, and its search strategy is governed by parameters that are easy to interpret.

To utilize attribute similarity, we define two kinds of neighbors: topology neighbors and attribute neighbors. These two kinds of neighbors make random walks straightforward to define in attributed networks.

We show how IANRW can easily deal with missing data, whether a node’s edges or its attributes are missing.

We extensively evaluate our approach through multi-class classification, link prediction and node clustering on several real-world datasets. Experimental results show the superior performance of IANRW over state-of-the-art embedding methods.

The rest of the paper is structured as follows. In Section 2, we survey related work in network representation learning. Section 3 defines the problem of improved attributed node random walks. Section 4 introduces the details of the proposed IANRW framework. Section 5 presents the experimental results. Finally, we conclude this work and discuss future work in Section 6.

2. Related work

Network embedding aims to learn a distributed representation for each node in a network. Most earlier works rely on matrix factorization techniques: they extract the leading eigenvectors of an affinity matrix as the node representations. The method in [16] provides a computationally efficient approach to nonlinear dimensionality reduction that has locality-preserving properties and a natural connection to clustering. CENI [17] applies clustering in the projected space, which better preserves the structure hidden in the cascades and yields more accurately inferred links. The work in [18] provides new theoretical results on random projections, a randomized approximation technique widely used in machine learning and data mining. LLE [19] maps its inputs into a single global coordinate system of lower dimensionality, and its optimization does not involve local minima. The method in [20] efficiently computes a globally optimal solution and, for an important class of data manifolds, is guaranteed to converge asymptotically to the true structure.

Recently, the Skip-gram model [21, 22], which was originally introduced for learning vector representations of words, has been used for network embedding. DeepWalk [23] was the first to use random walks to obtain a set of node sequences from which network representations are learned with the Skip-gram model. Tang et al. propose LINE [24], which preserves both the local and global structural information of networks by using first-order and second-order proximity. GraRep [1], which can be regarded as an extension of LINE, considers high-order information. SDNE [25] is the first deep model that preserves both first-order and second-order proximity to capture the highly nonlinear structural information of networks. Node2Vec [26] introduces hyperparameters to tune the explored search space, which is a mixture of breadth-first and depth-first search procedures, and is flexible in sampling nodes from a network. Most recently, [27] presents a semi-supervised learning framework based on network embeddings that uses deep learning to enhance the learned embedding.

However, the methods discussed above cannot effectively handle multi-view data such as node attributes. Zhu et al. [28] combined links and node attributes and proposed a matrix factorization model to learn embeddings. TADW [14] incorporates the text features of nodes into the network embedding process. Huang et al. [29] proposed an approach for attributed networks that uses label information to help learn better feature representations. NEEC [30] is a general framework for attributed network embedding. Instead of directly modeling expert cognition, NEEC learns it from the oracle by performing a number of concise but effective queries. MVC-DNE [31] incorporates both network structure and node properties using deep network embedding methods, and is the first to study the problem of network embedding on incomplete networks. GCN [32] is designed for semi-supervised learning and requires the full graph Laplacian during training. Recently, Hamilton et al. proposed an inductive network embedding method, GraphSAGE [33], which can generalize to unseen nodes. SEANO [34] is designed to work in both transductive and inductive settings while explicitly alleviating noise effects from outliers. Nesreen et al. [35, 36, 37] proposed a flexible framework based on the notion of attributed random walks. Instead of learning an individual embedding for each node, it learns an embedding for each type, using functions that map node attributes to types. However, because it learns embeddings per type, nodes with the same attributes receive the same embedding, which is unreasonable for nodes with the same attributes but very different topologies. Furthermore, this method cannot handle missing data well, such as missing edges or missing node attributes. In this paper, we propose an improved attributed node random walks framework based on attributed random walks. It flexibly incorporates both network topology and node attribute information and can easily deal with missing data.

3. Problem definition and preliminaries

In this section, we formally define our research problem. An attributed network $G$ is defined as $G=(V,T,A)$ , where $V=\{v_{1},v_{2},\ldots,v_{n}\}$ denotes the nodes in the network and $n$ denotes the number of nodes. $T\in{\mathbb{R}}^{n\times n}$ is the adjacency matrix of the network. $A\in{\mathbb{R}}^{|V|\times|A|}$ is the attribute matrix of the nodes. Here $|V|$ represents the number of nodes with attributes, and $|A|$ is the dimension of the attributes. We emphasize that our model allows nodes to have missing edges or attributes.

Definition 1. Topology Neighbors and Attribute Neighbors. Given a network $G=(V,T,A)$ , the topology neighbors of a node $v$ are the nodes connected to $v$ by an edge, denoted by $N(v)$ . We cluster the nodes according to their attributes using an appropriate method. The attribute neighbors of a node $v$ are defined as all the nodes belonging to the same cluster as $v$ , denoted by $A(v)$ . Note that different clustering methods may yield different attribute neighbors. As the attributes serve as auxiliary information for network representation learning, classical general-purpose clustering algorithms usually give good results.

Topology neighbors and attribute neighbors describe nodes from two different views. When clustering nodes based on attributes, we can choose any suitable clustering method.
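As an illustration of Definition 1, the following minimal sketch builds $N(v)$ from an edge list and $A(v)$ from a precomputed clustering. The function names, the toy edge list and the cluster labels are hypothetical, and excluding $v$ from its own $A(v)$ is an assumption made here for convenience; the definition itself does not state it.

```python
from collections import defaultdict

def topology_neighbors(edges):
    """Build N(v): the set of nodes that share an edge with v (undirected)."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs

def attribute_neighbors(cluster_of):
    """Build A(v): the other nodes in the same attribute cluster as v.

    Nodes absent from `cluster_of` (no attributes) get no entry at all,
    which corresponds to having no attribute neighbors.
    """
    members = defaultdict(set)
    for node, label in cluster_of.items():
        members[label].add(node)
    # Assumption: a node is not counted as its own attribute neighbor.
    return {node: members[label] - {node} for node, label in cluster_of.items()}

# Toy path graph 0-1-2-3 with a hypothetical clustering result.
edges = [(0, 1), (1, 2), (2, 3)]
cluster_of = {0: "a", 1: "a", 3: "a", 2: "b"}
N = topology_neighbors(edges)
A = attribute_neighbors(cluster_of)
```

Here node 0 has the single topology neighbor 1 but the attribute neighbors 1 and 3, so the two views overlap without coinciding.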

Definition 2. Attributed Network Embedding. Given a network $G=(V,T,A)$ , the task is to learn a matrix $X\in{\mathbb{R}}^{|V|\times d}$ , where $d$ is the number of latent dimensions with $d\ll|V|$ . The $i$ th row of $X$ denotes the embedding vector of node $i$ . We aim to make the learned representation vectors explicitly preserve both the network topology and node attributes information.

The main challenge of our problem is how to use the attributes of nodes. It is hard to directly apply the attributes to the previous network embedding models. Previous works preserve the proximity between a node and its neighborhood. In an attributed network, how do we define and model this ‘node-neighborhood’ concept? Furthermore, how do we optimize the embedding model that effectively maintains the structures and attributes of nodes?

Nesreen et al. proposed a flexible framework based on the notion of attributed random walks [35, 36, 37]. The framework consists of two general steps:

Function mapping nodes to types: The first step is to learn a function $\Phi$ that maps nodes to types based on attributes. Given an (un)directed graph $G=(V,T,A)$ , the first component of the framework maps the $N$ nodes to a set $W=\{w_{1},\ldots,w_{M}\}$ of $M$ types, where $1\leqslant M\leqslant N$ and $M$ is often much smaller than $N$ . The set of types $W$ is defined as $W=\Phi(x_{1})\cup\Phi(x_{2})\cup\ldots\cup\Phi(x_{N})$ , where $\Phi$ is a function that maps the attributes of nodes to $M=|W|$ types.

Attributed random walks: The second step uses the types derived by the function $\Phi$ to generate attributed random walks. An attributed walk $S$ of length $L$ is defined as a sequence of adjacent node types $\Phi(x_{i_{1}}),\ \Phi(x_{i_{2}}),\ldots,\Phi(x_{i_{L+1}})$ associated with a sequence of indices $i_{1},i_{2},\ldots,i_{L+1}$ such that $(v_{i_{t}},v_{i_{t+1}})\in E$ for all $1\leqslant t\leqslant L$ .

The set of attributed random walks can then be given as input into the Skip-Gram model to learn embeddings for the node types instead of the nodes themselves.

Skip-gram model: Skip-Gram is a model used to produce word embeddings. It takes a large corpus of text and produces a corresponding vector for each word in the corpus. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another. As shown in Fig. 2, the input $w_{i}$ to the model is a one-hot vector representing the input word, and the outputs $w_{i-2}$ , $w_{i-1}$ , $w_{i+1}$ and $w_{i+2}$ are also one-hot vectors representing the context words. The task is to predict the context of a given word.

Figure 2.

Skip-gram model.
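The prediction task in Fig. 2 can be made concrete by listing the (input, context) pairs that a sliding window produces. The sketch below is illustrative only: the function name `context_pairs` and the toy sequence are hypothetical, and a window of 2 is chosen to match the four context positions $w_{i-2},\ldots,w_{i+2}$ .

```python
def context_pairs(sequence, window):
    """Return all (center, context) pairs at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)                  # clip the window at the start
        hi = min(len(sequence), i + window + 1)  # and at the end of the sequence
        for j in range(lo, hi):
            if j != i:                           # a token is not its own context
                pairs.append((center, sequence[j]))
    return pairs

# Toy sequence; for a network the tokens would be node identifiers from a walk.
pairs = context_pairs(["a", "b", "c", "d"], window=2)
```

Each pair is one training example for Skip-Gram: given the first element, predict the second.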

4. Improved attributed node random walks framework

In this section, we present the details of the proposed IANRW framework. First, we introduce our model framework. Second, we introduce the objective function, which contains both topology information and attribute information. Finally, we present the specific details of IANRW and discuss some practical issues.

4.1 Framework

Figure 3 shows the framework of IANRW. Each node in the network is associated with attributes, although the model allows some nodes to lack attribute information. The IANRW framework mainly contains three steps. First, the topology matrix and the attribute matrix are constructed from the given network. The topology matrix is the adjacency matrix, and each row of the attribute matrix represents the attributes of a node. Second, nodes are clustered according to their attributes, or by using a function $\Phi$ that maps nodes to types as attributed random walks do [35]. How to cluster the nodes is not the focus of this paper, as any suitable clustering method can be used. The attribute neighbors of the nodes are then obtained according to the definition of attribute neighbors, and we apply our improved attributed node random walks to generate node sequences, which contain both topology information and attribute information and can be viewed as nodes together with their neighborhood (context). Finally, the skip-gram model is used to learn the node representations.

Figure 3.

Framework of IANRW.

4.2 Attributed network embedding

Given a text corpus, the word2vec [15] model learns distributed representations of the words in it. Inspired by it, DeepWalk [23] and node2vec [26] extended the Skip-gram model to networks. Both methods leverage random walks to generate node sequences and aim to map the word-context concept onto the node sequences. Here, we apply the same idea to attributed networks. Given a network $G=(V,T,A)$ , for every node $u\in V$ , $N_{S}(u)$ denotes the network neighborhood of node $u$ generated through a neighborhood sampling strategy $S$ . In this paper, $N_{S}(u)$ contains the topology neighbors and attribute neighbors of node $u$ , and strategy $S$ denotes the improved attributed node random walks strategy. The objective is then to maximize the probability of observing the network neighborhood $N_{S}(u)$ [8, 15, 23], which contains both the topology neighbors and attribute neighbors of node $u$ :

$\displaystyle\arg\max_{\theta}\prod_{u\in V}p(N_{S}(u)|u;\theta)$ (1)

where $p(N_{S}(u)|u;\theta)$ defines the conditional probability of a node neighborhood $N_{S}(u)$ . To make the optimization problem tractable, we assume that the likelihood of observing a neighborhood node is independent of observing any other neighborhood node. Hence, the conditional probability can be rewritten as follows:

$\displaystyle p(N_{S}(u)|u;\theta)=\!\!\!\!\prod_{v_{i}\in N_{S}(u)}p(v_{i}|u;\theta)$ (2)

The objective function Eq. (1) can be rewritten as follows by replacing $p(N_{S}(u)|u;\theta)$ with Eq. (2) and taking the logarithm:

$\displaystyle\arg\max_{\theta}\sum_{u\in V}\sum_{v_{i}\in N_{S}(u)}\log p(v_{i}|u;\theta)$ (3)

where $p(v_{i}|u;\theta)$ is commonly defined as a softmax function [15]:

$\displaystyle p(v_{i}|u;\theta)=\frac{e^{X_{v_{i}}\cdot X_{u}}}{\sum_{n\in V}e^{X_{n}\cdot X_{u}}}$

where $X_{u}$ represents the embedding vector for node $u$ . Given the node sequences, the notion of neighborhood can be naturally defined using a sliding window over consecutive nodes. $N_{S}(u)$ has vastly different structures depending on the sampling strategy $S$ .
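To make the softmax and objective (3) concrete, the sketch below evaluates $p(v_{i}|u;\theta)$ and the log-likelihood for a toy embedding matrix. The function names and the embedding values are hypothetical, and the max-subtraction is a standard numerical-stability device rather than part of the formulation above. The denominator sums over all of $V$ , which is expensive for large networks; skip-gram implementations commonly approximate it, for example with negative sampling.

```python
import math

def softmax_prob(X, u, v):
    """p(v | u) = exp(X[v] . X[u]) / sum over n of exp(X[n] . X[u])."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [dot(X[n], X[u]) for n in range(len(X))]
    m = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    return exps[v] / sum(exps)

def log_likelihood(X, neighborhoods):
    """Objective (3): sum of log p(v | u) over every u and every v in N_S(u)."""
    return sum(math.log(softmax_prob(X, u, v))
               for u, nbrs in neighborhoods.items()
               for v in nbrs)

# Toy 2-dimensional embeddings: nodes 0 and 1 are close, node 2 is far away.
X = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
```

With these values, `softmax_prob(X, 0, 1)` exceeds `softmax_prob(X, 0, 2)`, reflecting that node 1 lies closer to node 0 in the embedding space.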

4.3 Improved attributed node random walks

In random walks, given a source node $u$ , let $v_{i}$ denote the $i$ th node in the walk. The distribution of generating $v_{i}$ is the following:

$\displaystyle\mathbf{P}(v_{i}|v_{i-1})=\begin{cases}\frac{\pi_{v_{i-1}v_{i}}}{Z}&\text{if}\ (v_{i},v_{i-1})\in E\\ 0&\text{otherwise}\end{cases}$

where $E$ is the set of edges in the network and ${\pi}_{v_{i-1}v_{i}}$ is the unnormalized transition probability between nodes $v_{i-1}$ and $v_{i}$ . Usually, ${\pi}_{v_{i-1}v_{i}}=w_{v_{i}v_{i-1}}$ , where $w$ denotes the edge weights. In unweighted networks, $w_{v_{i}v_{i-1}}=1$ . $Z$ is the normalizing constant. Such random walks consider only the topology neighbors of nodes. To include the attribute neighbors as well, we propose the improved attributed node random walks, in which the distribution of generating $v_{i}$ is the following:

$\displaystyle\mathbf{P}(v_{i}|v_{i-1})=\begin{cases}\frac{\pi_{v_{i-1}v_{i}}}{Z}&\text{if}\ v_{i}\in N(v_{i-1})\ \text{or}\ v_{i}\in A(v_{i-1})\\ 0&\text{otherwise}\end{cases}$

However, this alone does not allow us to account for the network structure or to guide the search procedure toward different types of network neighborhoods. Intuitively, nodes that are simultaneously topology neighbors and attribute neighbors of a node should be more similar to it and should therefore be easier to walk to. Additionally, in real-world networks, attributes and topology are not always equally important, and our random walks should accommodate this difference across networks. Based on these observations, we define the unnormalized transition probability as follows:

$\displaystyle\pi_{v_{i-1}v_{i}}=\begin{cases}\frac{1}{p}\cdot\frac{1}{|N(v_{i-1})|-|N(v_{i-1})\cap A(v_{i-1})|}&\text{if}\ v_{i}\in N(v_{i-1})\ \text{and}\ v_{i}\notin A(v_{i-1})\\ \frac{1}{|N(v_{i-1})\cap A(v_{i-1})|}&\text{if}\ v_{i}\in N(v_{i-1})\ \text{and}\ v_{i}\in A(v_{i-1})\\ \frac{1}{q}\cdot\frac{1}{|A(v_{i-1})|-|N(v_{i-1})\cap A(v_{i-1})|}&\text{if}\ v_{i}\in A(v_{i-1})\ \text{and}\ v_{i}\notin N(v_{i-1})\end{cases}$

where $|\cdot|$ denotes the number of nodes in a set. We define two parameters $p$ and $q$ to control the walking process. The details of the walk are shown in Fig. 4. The blue node is the source node $v_{i-1}$ . We divide the neighbors of node $v_{i-1}$ into three parts. The black nodes are its topology neighbors but not attribute neighbors. The green nodes are its attribute neighbors but not topology neighbors. The red nodes belong to both the topology neighbors and the attribute neighbors of node $v_{i-1}$ . Consider a random walk that currently resides at node $v_{i-1}$ and needs to decide on the next node $v_{i}$ . The next node must be chosen from the black, green or red nodes, and we set the unnormalized probabilities of these three groups to $\frac{1}{p}$ , $\frac{1}{q}$ and $1$ , respectively. Intuitively, the parameters $p$ and $q$ control the preference of the random walk between attribute neighbors and topology neighbors. In particular, $p$ and $q$ allow the walking procedure to interpolate between attribute neighbors and topology neighbors, which reflects an affinity for different notions of node equivalence.

Figure 4.

Illustration of the random walk procedure in IANRW. The walk currently resides at node u and is evaluating its next step. Edge labels indicate search biases.
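The three-case definition of ${\pi}_{v_{i-1}v_{i}}$ and its normalization can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `transition_probs` and the dictionary-of-sets inputs are assumptions, and empty groups are simply skipped so that the corresponding denominators never become zero.

```python
def transition_probs(v, N, A, p, q):
    """Normalized next-step distribution for a walk residing at node v.

    Topology-only nodes share total mass 1/p, nodes that are both topology
    and attribute neighbors share mass 1, attribute-only nodes share 1/q.
    """
    topo = N.get(v, set())   # N(v); empty if v has no edges
    attr = A.get(v, set())   # A(v); empty if v has no attributes
    both = topo & attr
    topo_only = topo - attr
    attr_only = attr - topo

    pi = {}
    for x in topo_only:
        pi[x] = (1.0 / p) / len(topo_only)
    for x in both:
        pi[x] = 1.0 / len(both)
    for x in attr_only:
        pi[x] = (1.0 / q) / len(attr_only)

    Z = sum(pi.values())     # normalizing constant
    return {x: w / Z for x, w in pi.items()} if Z > 0 else {}

# Hypothetical node 0: black nodes {1, 2}, red node {3}, green nodes {4, 5, 6}.
N = {0: {1, 2, 3}}
A = {0: {3, 4, 5, 6}}
P = transition_probs(0, N, A, p=2, q=4)
```

In this toy setting the red node 3 receives probability 4/7, each black node 1/7 and each green node 1/21. Passing an empty attribute map, `transition_probs(0, N, {}, 2, 4)`, yields a uniform choice among the three topology neighbors.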

Parameters ${\bm{p}}$ and ${\bm{q}}$ . Parameters $p$ and $q$ control the likelihood of visiting topology neighbors and attribute neighbors in the walk, respectively. As we want nodes that are both attribute neighbors and topology neighbors to be more likely to be sampled, we set $p>1$ and $q>1$ . Returning to Fig. 4, if $p<q$ , the random walk is biased towards the topology neighbors of node $v_{i-1}$ , and the walk captures the topology view of the network. In contrast, if $p>q$ , the walk is more inclined to visit nodes that are similar to node $v_{i-1}$ in terms of attributes, and it captures the attribute view of the network. Note that different clustering methods may result in different attribute neighbors. Our model makes nodes that are both attribute neighbors and topology neighbors more likely to be sampled; such nodes usually have clearly similar attributes and tend to be placed in the same cluster regardless of the clustering method.
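A small worked example may help to see the effect of $p$ and $q$ . The numbers are hypothetical: suppose $v_{i-1}$ has two black (topology-only) nodes, one red (shared) node and three green (attribute-only) nodes, with $p=2$ and $q=4$ .

```latex
% Per-node unnormalized weights from the three-case definition of \pi
\pi_{\text{black}} = \frac{1}{2}\cdot\frac{1}{2} = \frac{1}{4}, \qquad
\pi_{\text{red}}   = \frac{1}{1} = 1, \qquad
\pi_{\text{green}} = \frac{1}{4}\cdot\frac{1}{3} = \frac{1}{12}

% Normalizing constant: the group masses 1/p + 1 + 1/q
Z = 2\cdot\frac{1}{4} + 1 + 3\cdot\frac{1}{12} = \frac{7}{4}

% Per-node transition probabilities
P_{\text{black}} = \frac{1}{7}, \qquad
P_{\text{red}}   = \frac{4}{7}, \qquad
P_{\text{green}} = \frac{1}{21}
```

The black group as a whole receives probability $2/7$ and the green group $1/7$ , so $p<q$ favors topology neighbors. Swapping the values to $p=4$ and $q=2$ reverses the two group totals, while the red node keeps probability $4/7$ in both cases.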

Dealing with missing data. A significant advantage of our model is that it works well with missing data. Consider Fig. 4 again. If node $v_{i-1}$ has no attribute data, it has no attribute neighbors according to the definition of attribute neighbors, so only the black nodes of Fig. 4 remain. The next step from node $v_{i-1}$ then selects a node at random from its topology neighbors. Similarly, when node $v_{i-1}$ has no edges, the next node is selected from the green nodes. In the extreme case in which no node has attributes, our model degenerates into the DeepWalk model. Overall, our model handles missing data flexibly and approximates the observed network as closely as the available data allow.

Comparison with attributed node random walks. First, attributed node random walks learn embeddings for the node types instead of the nodes themselves. Nodes with the same attributes therefore receive the same embedding, which is unreasonable for nodes with the same attributes but very different topologies, whereas IANRW learns an individual embedding for each node. Second, IANRW uses parameters $p$ and $q$ to control the likelihood of visiting topology neighbors and attribute neighbors in the walk, respectively. These parameters are easy to interpret and govern the search strategy. Attributed node random walks cannot control the sampling weights between topology neighbors and attribute neighbors. Finally, IANRW can easily deal with missing data, whether a node’s edges or its attributes are missing, which attributed node random walks cannot do well.

Algorithm 1: Improved Attributed Node Random Walks Framework
procedure LearningFeatures ( $G=(V,T,A)$ , embedding dimensions $d$ , walks per node $r$ , walk length $L$ , context size $w$ , parameters $p$ and $q$ )
    Cluster the nodes based on attributes $A$ and obtain the result $C$
    Precompute transition probabilities $\pi$ using $G$ , $p$ , $q$ , and $C$
    $G^{\prime}=(V,E,\pi)$
    Initialize walks $S$ to $\mathrm{\emptyset}$
    for iter $=$ 1 to $r$ do
        for node $u$ in $V$ do
            walk $=$ ImprovedAttributedNodeRandomWalk ( $G^{\prime}$ , $u$ , $L$ )
            append walk to walks
        end for
    end for
    ${\bm{X}}=$ StochasticGradientDescent ( $w$ , $d$ , walks)
    return the learned node embeddings ${\bm{X}}$
procedure ImprovedAttributedNodeRandomWalk ( $G^{\prime}$ , start node $u$ , walk length $L$ )
    Initialize walk to $[u]$
    for $l=$ 1 to $L$ do
        $\Gamma_{l}=$ set of the topology neighbors and attribute neighbors of node walk $[-1]$
        $s=$ AliasSample ( $\Gamma_{l}$ , $\pi$ )
        append $s$ to walk
    end for
    return walk

The pseudocode for IANRW is given in Algorithm 1. Algorithm 1 assumes that the node clusters are learned a priori. At every step of the walk, sampling is done according to the transition probabilities $\pi_{v_{i-1}v_{i}}$ , which are defined by parameters $p$ and $q$ . Note that the walk can still proceed normally if nodes are missing edge data or attribute data.
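The walk-generation part of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: it computes the step weights on the fly instead of precomputing $\pi$ , it replaces alias sampling with the standard-library `random.choices` (which draws from the same distribution, only less efficiently), and all function names and the toy network are hypothetical. The final skip-gram training step is omitted.

```python
import random

def next_step_weights(v, N, A, p, q):
    """Unnormalized weights pi for every candidate next node of v."""
    topo, attr = N.get(v, set()), A.get(v, set())
    groups = [(topo - attr, 1.0 / p),   # topology-only neighbors
              (topo & attr, 1.0),       # both topology and attribute neighbors
              (attr - topo, 1.0 / q)]   # attribute-only neighbors
    weights = {}
    for nodes, mass in groups:
        for x in nodes:
            weights[x] = mass / len(nodes)
    return weights

def ianrw_walk(start, L, N, A, p, q, rng):
    """One walk of up to L steps starting at `start`."""
    walk = [start]
    for _ in range(L):
        weights = next_step_weights(walk[-1], N, A, p, q)
        if not weights:                 # no edges and no attributes: stop early
            break
        nodes = sorted(weights)         # fixed order for reproducibility
        probs = [weights[x] for x in nodes]
        walk.append(rng.choices(nodes, weights=probs, k=1)[0])
    return walk

def generate_walks(nodes, r, L, N, A, p, q, seed=0):
    """r walks per node, mirroring the two nested loops of Algorithm 1."""
    rng = random.Random(seed)
    walks = []
    for _ in range(r):
        for u in nodes:
            walks.append(ianrw_walk(u, L, N, A, p, q, rng))
    return walks

# Toy network: node 3 has no edges, nodes 1 and 2 have no attributes.
N = {0: {1}, 1: {0, 2}, 2: {1}}
A = {0: {3}, 3: {0}}
walks = generate_walks([0, 1, 2, 3], r=2, L=5, N=N, A=A, p=2, q=4)
```

Node 3 is isolated in the topology yet still appears in walks through its attribute neighbor 0, which is the behavior described for missing edge data. The resulting `walks` play the role of sentences for a skip-gram implementation.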

5. Experiments

In this section, we demonstrate the efficacy of the presented IANRW framework for attributed network representation learning.

Datasets. We conduct experiments on two social networks and four paper citation networks. The two social networks are Flickr and Google+. In social networks, nodes refer to users and links represent friend relationships among users. In the Google+ network, each node has a 5476-dimensional feature vector covering attributes such as institution, place and university. We use the institution of each user as its category and select the 20 most popular institutions as the final categories. Flickr is an online community in which people share photos. Photographers can follow each other and thereby form a network. The groups that photographers have joined are the labels, and the tags specified on the photos are the attribute information. The four paper citation networks are Cornell, Cora, Texas and Citeseer. Nodes in these networks refer to papers and links refer to the citation relationships among papers. Papers are classified according to the domains they belong to. The features of a paper are binary values indicating whether each word in the vocabulary is present (1) or absent (0) in the paper. Table 1 shows the detailed information of the six datasets.

Baseline Methods. IANRW is measured against the following baseline methods:

1. DeepWalk [6]: a topology-only network embedding method that combines random walks and skip-gram to learn network representations.
2. Node2vec [8]: an algorithmic framework for learning feature representations for nodes in networks, which defines a flexible notion of a node’s network neighborhood and designs a biased random walk procedure.
3. LINE [7]: a network embedding model that preserves first-order and second-order proximity.
4. TADW [14]: the first shallow model that incorporates the rich text information of the network.
5. GraRep [1]: a model that can be regarded as an extension of LINE and considers high-order information.
6. Attributed node random walks [35]: an approach that learns embeddings for each type based on functions that map feature vectors to types. In the experiments, we abbreviate attributed node random walks as ANRW.

For all baselines, we set the parameters following the suggestions of the original papers. In IANRW, we use spectral clustering to divide the nodes into communities, and the number of communities equals the actual number of groups in the dataset.

Table 1
Detailed information of the six datasets

Data	Nodes	Links	Attributes	Categories
Google+	976	4275	5476	20
Flickr	7575	239738	12047	9
Cornell	195	286	1703	5
Cora	2708	10858	2708	7
Texas	187	328	1703	4
Citeseer	3312	4372	3703	6

5.1 Multi-class classification

In real-world networks, only a subset of nodes in networks are labeled. Thus, node classification can be used to predict the labels of unlabeled nodes. In this section, the learned node representations by various network embedding methods are used as the input features of multi-class node classification model. We randomly sample a portion of labeled nodes as the training data, and the rest are the test data. The portion $P_{r}$ of training data varies from 10% to 50%. The LinearSVM package [38] is adopted to train the classifier. We run the experiment 10 times and average the performance of both the Micro-F1 and Macro-F1 as the results.
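Since both metrics are reported throughout, the following sketch shows how Micro-F1 and Macro-F1 differ for single-label multi-class predictions. It is an illustration in plain Python and not the evaluation code used in the experiments; the function name `f1_scores` and the toy labels are hypothetical.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, y in zip(y_true, y_pred):
        if t == y:
            tp[t] += 1
        else:
            fp[y] += 1   # predicted class y gains a false positive
            fn[t] += 1   # true class t gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro-F1 pools the counts over classes; Macro-F1 averages per-class F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Toy imbalanced example: the single node of class 1 is misclassified.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 2]
micro, macro = f1_scores(y_true, y_pred)
```

Here Micro-F1 is 5/6 while Macro-F1 is 17/27, because the rare class that is entirely missed counts as much as the frequent one in the macro average. This is one reason Macro-F1 can be noticeably lower than Micro-F1 on datasets with unbalanced categories.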

Tables 2 and 3 show the multi-class classification results on the six datasets. The results show that IANRW matches or outperforms the baseline methods in almost all settings. On Cora, all methods except LINE and ANRW perform well, with F1 scores above 0.7. This is mainly because the nodes in Cora are densely connected, so methods based on edge connections can achieve good results. On Citeseer, Flickr and Google+, IANRW is generally the best method. These networks are sparsely connected, and attribute information can be very important for network representation learning. TADW, which is also designed for attributed networks, performs well too. On small networks such as Cornell and Texas, IANRW beats all the other methods. Overall, IANRW has the best performance and TADW performs quite well. However, TADW cannot be applied to large-scale networks. On the other hand, methods based only on topology information do not achieve stable performance.

Parameter sensitivity. IANRW has a number of parameters like the node2vec algorithm. We conduct a sensitivity analysis of IANRW to these parameters on the Cornell and Google+ datasets. When a parameter is tested, all the other parameters assume default values. We fix the training proportion to 50% and measure the Micro-F1 score.

Figure 5 shows the classification performance under different parameter settings. The classification performance improves as $p$ increases and $q$ decreases, which indicates that attribute information is more important in both networks. As $p$ increases, the walk is more inclined to visit nodes that are similar in terms of attributes, while a low $p$ encourages topology exploration. IANRW can thus flexibly explore topology neighbors and attribute neighbors in networks.

Table 2
Classification performance on Cornell, Citeseer and Cora datasets

Methods	Micro (Cornell)			Macro (Cornell)			Micro (Citeseer)			Macro (Citeseer)			Micro (Cora)			Macro (Cora)
$P_{r}$	10%	30%	50%	10%	30%	50%	10%	30%	50%	10%	30%	50%	10%	30%	50%	10%	30%	50%
DeepWalk	0.28	0.36	0.38	0.24	0.24	0.23	0.49	0.56	0.58	0.46	0.52	0.54	0.75	0.79	0.81	0.73	0.78	0.79
Line	0.27	0.36	0.38	0.15	0.15	0.15	0.25	0.29	0.31	0.23	0.26	0.27	0.38	0.44	0.47	0.28	0.40	0.42
Node2vec	0.27	0.41	0.41	0.20	0.23	0.24	0.48	0.57	0.58	0.45	0.53	0.53	0.76	0.79	0.81	0.74	0.77	0.79
TADW	0.36	0.54	0.58	0.23	0.26	0.31	0.56	0.61	0.63	0.53	0.58	0.61	0.76	0.80	0.81	0.73	0.77	0.79
Grarep	0.32	0.45	0.51	0.23	0.27	0.34	0.50	0.54	0.55	0.46	0.48	0.48	0.74	0.78	0.78	0.73	0.76	0.76
ANRW	0.57	0.62	0.63	0.38	0.41	0.43	0.37	0.37	0.38	0.16	0.47	0.47	0.37	0.38	0.38	0.23	0.26	0.27
IANRW	0.60	0.64	0.66	0.41	0.44	0.47	0.56	0.62	0.65	0.53	0.58	0.60	0.76	0.82	0.83	0.75	0.80	0.82

Table 3
Classification performance on Texas, Flickr and Google+ datasets

Methods	Micro (Texas)			Macro (Texas)			Micro (Flickr)			Macro (Flickr)			Micro (Google+)			Macro (Google+)
$P_{r}$	10%	30%	50%	10%	30%	50%	10%	30%	50%	10%	30%	50%	10%	30%	50%	10%	30%	50%
DeepWalk	0.45	0.50	0.51	0.21	0.24	0.25	0.38	0.48	0.51	0.38	0.47	0.49	0.52	0.60	0.63	0.30	0.34	0.36
Line	0.41	0.44	0.47	0.16	0.20	0.21	0.50	0.53	0.54	0.49	0.52	0.53	0.38	0.41	0.45	0.13	0.19	0.22
Node2vec	0.45	0.45	0.47	0.20	0.22	0.24	0.37	0.46	0.49	0.36	0.45	0.47	0.55	0.59	0.62	0.29	0.34	0.36
TADW	0.51	0.51	0.52	0.19	0.25	0.27	0.49	0.59	0.63	0.48	0.59	0.61	0.75	0.80	0.83	0.49	0.52	0.58
Grarep	0.53	0.59	0.64	0.26	0.35	0.41	0.49	0.54	0.55	0.48	0.53	0.54	0.55	0.62	0.64	0.30	0.33	0.35
ANRW	0.55	0.56	0.6	0.23	0.24	0.32	0.14	0.15	0.15	0.095	0.10	0.10	0.79	0.81	0.82	0.46	0.47	0.49
IANRW	0.67	0.67	0.68	0.37	0.43	0.46	0.50	0.62	0.65	0.50	0.59	0.62	0.79	0.82	0.85	0.49	0.55	0.59

Figure 5. Parameter sensitivity on Cornell and Google+.

Figure 5 also shows how the number of features $d$, the number of communities $k$ into which the nodes are divided based on attributes, the walk length $L$, the number of walks $r$, and the context size $w$ affect the performance. The number of communities $k$ has a noticeable effect on the result: on Cornell, the performance gets worse as $k$ increases, while on Google+ it improves as $k$ increases. In our experience, choosing $k$ so that each community contains about 40 nodes on average usually gives the best results. The other parameters have relatively little influence on the classification task in our experiments. Overall, the analysis shows that IANRW is not sensitive to these parameters, apart from $p$ and $q$, and can reach high performance under a cost-effective parameter choice.

5.2 Link prediction

In link prediction, the task is to predict the edges that will be added to the network in the future. We randomly remove 50% of the edges in the network and treat them as positive samples. To obtain negative samples, we sample an equal number of node pairs that are not connected in the network. We then learn the node representations on the residual network with each method and calculate the cosine similarity of each node pair. AUC [39] is used to evaluate the methods because it measures how well the similarity scores rank positive links above negative ones without requiring a manually chosen threshold.
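The scoring step of this protocol can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: embeddings are given as a dictionary from node to vector, the helper names are ours, and AUC is computed by direct pairwise comparison, which is quadratic in the number of samples and only suitable for small examples.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half, matching the rank-based definition of AUC.
    """
    wins = 0.0
    for ps in pos_scores:
        for ns in neg_scores:
            if ps > ns:
                wins += 1.0
            elif ps == ns:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def link_prediction_auc(emb, pos_edges, neg_edges):
    """Score removed edges (positives) against sampled non-edges (negatives)."""
    pos = [cosine(emb[u], emb[v]) for u, v in pos_edges]
    neg = [cosine(emb[u], emb[v]) for u, v in neg_edges]
    return auc(pos, neg)
```

An AUC of 0.5 corresponds to random ranking, so values close to 0.5 in the results indicate that the embedding similarity carries little information about the removed edges.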

We summarize the link-prediction results in Table 4. IANRW matches or outperforms all the baselines on every dataset except Google+, where DeepWalk, LINE, Node2vec and Grarep slightly outperform it. On Flickr, DeepWalk and Node2vec perform as well as IANRW. On Cornell, Citeseer, Cora and Texas, our method beats all the baselines, which demonstrates that IANRW is effective in learning good node vectors for the task of link prediction.

Table 4
Link prediction results (AUC) on the six datasets

Methods	Cornell	Citeseer	Cora	Texas	Flickr	Google+
DeepWalk	0.47	0.61	0.64	0.57	0.92	0.98
Line	0.52	0.54	0.54	0.46	0.71	0.98
Node2vec	0.47	0.60	0.65	0.56	0.92	0.98
TADW	0.59	0.80	0.74	0.59	0.63	0.94
Grarep	0.63	0.73	0.75	0.61	0.55	0.97
ANRW	0.50	0.66	0.62	0.46	0.49	0.73
IANRW	0.66	0.81	0.80	0.68	0.92	0.95

5.3 Node clustering

In this section, we conduct node clustering to further measure the performance of the methods. We apply the $k$-means algorithm to the learned embeddings to cluster the nodes and evaluate the clustering results in terms of normalized mutual information (NMI) [40]. Because $k$-means is sensitive to its initial values, we run the experiment 10 times and report the average NMI.
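For reference, NMI between ground-truth labels and cluster assignments can be computed as in the sketch below. The text does not state which normalization is used, so this example assumes normalization by the arithmetic mean of the two entropies; other variants (for example the geometric mean) give slightly different values. The function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments of the same nodes."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in joint.items():
        mi += (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
    return mi


def nmi(true_labels, cluster_labels):
    """NMI normalized by the arithmetic mean of the entropies (an assumption)."""
    h_true, h_cluster = entropy(true_labels), entropy(cluster_labels)
    if h_true == 0.0 and h_cluster == 0.0:
        return 1.0  # both assignments are a single group
    return 2.0 * mutual_information(true_labels, cluster_labels) / (h_true + h_cluster)
```

NMI is invariant to how the cluster identifiers are named, so no matching between $k$-means cluster indices and class labels is needed before averaging over the 10 runs.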

Table 5 shows the results. IANRW achieves the best or tied-best NMI on every network except Google+, where ANRW scores higher (0.64 vs. 0.57). The gain is largest on sparse networks such as Cornell and Texas, where IANRW improves NMI by about 0.27 and 0.30, respectively, over the best of the DeepWalk, LINE, Node2vec, TADW and Grarep baselines. This is mainly because attribute information is also important in these networks, and a flexible combination of links and attributes provides the best performance.

Table 5
Node clustering results (NMI) on the six datasets

Methods	Cornell	Citeseer	Cora	Texas	Flickr	Google+
DeepWalk	0.080	0.23	0.44	0.055	0.18	0.37
Line	0.036	0.02	0.07	0.022	0.18	0.22
Node2vec	0.062	0.20	0.44	0.056	0.16	0.36
TADW	0.092	0.18	0.46	0.053	0.11	0.54
Grarep	0.069	0.19	0.40	0.099	0.14	0.37
ANRW	0.36	0.21	0.14	0.23	0.017	0.64
IANRW	0.36	0.24	0.47	0.40	0.25	0.57

5.4 Experiment results on missing data

In this subsection, we conduct experiments to evaluate the performance of our proposed method on missing data, that is, on networks in which some nodes have no edges or no attributes. To construct an incomplete network with missing links, we randomly sample a portion of the nodes and remove all of their edges. The portion $P_{r}$ varies from 10% to 50%. We then run the multi-class classification experiment by randomly sampling 10% of the labeled nodes as the training data and using the rest as the test data. We construct an incomplete network with missing attributes in the same way and repeat the same multi-class classification experiment.
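The construction of the two incomplete networks can be sketched as follows. The data layout (an edge list of node pairs and a dictionary from node to attribute vector) and the use of `None` to mark missing attributes are assumptions made for illustration; the function names are hypothetical.

```python
import random


def remove_edges_of_sampled_nodes(nodes, edges, portion, seed=0):
    """Sample `portion` of the nodes and drop every edge touching them."""
    rng = random.Random(seed)
    n_remove = round(portion * len(nodes))
    isolated = set(rng.sample(sorted(nodes), n_remove))
    kept = [(u, v) for u, v in edges
            if u not in isolated and v not in isolated]
    return isolated, kept


def remove_attributes_of_sampled_nodes(attributes, portion, seed=0):
    """Sample `portion` of the nodes and mark their attributes as missing."""
    rng = random.Random(seed)
    n_remove = round(portion * len(attributes))
    stripped = set(rng.sample(sorted(attributes), n_remove))
    # None marks a node whose attribute vector was removed (our convention).
    return {node: (None if node in stripped else attrs)
            for node, attrs in attributes.items()}
```

Fixing the seed keeps the same set of affected nodes across methods, so that all baselines are compared on an identical incomplete network.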

Table 6 shows the results on the incomplete network with missing links. As the proportion of nodes with removed links increases, the classification scores decrease. However, the decline of IANRW is smaller than that of the other methods, and its scores tend to stabilize. This demonstrates that our method can learn node representations well in an incomplete network with missing links. Table 7 shows the results on the incomplete network with missing attributes, where IANRW consistently outperforms the baseline methods. Together, the two experiments demonstrate that IANRW copes well with missing data.

Table 6
Classification performance on Cora with missing links

	Micro			Macro
Methods	10%	30%	50%	10%	30%	50%
DeepWalk	0.58	0.52	0.40	0.55	0.48	0.29
Line	0.27	0.25	0.20	0.15	0.12	0.11
Node2vec	0.57	0.53	0.41	0.54	0.48	0.30
TADW	0.50	0.45	0.40	0.38	0.32	0.27
Grarep	0.55	0.50	0.38	0.43	0.40	0.32
ANRW	0.22	0.18	0.12	0.15	0.12	0.09
IANRW	0.68	0.65	0.63	0.56	0.53	0.50

Table 7
Classification performance on Cora with missing attributes

	Micro			Macro
Methods	10%	30%	50%	10%	30%	50%
DeepWalk	0.59	0.51	0.42	0.56	0.47	0.30
Line	0.28	0.25	0.19	0.16	0.13	0.10
Node2vec	0.57	0.51	0.42	0.53	0.46	0.30
TADW	0.61	0.58	0.51	0.55	0.51	0.48
Grarep	0.53	0.49	0.38	0.43	0.42	0.31
ANRW	0.57	0.51	0.39	0.55	0.46	0.28
IANRW	0.72	0.69	0.65	0.65	0.60	0.57

6. Conclusion

In this paper, we define the representation learning problem in attributed networks and propose the IANRW framework to address it. The framework is built on attributed random walks and captures both topology and attribute information. We formalize the improved attributed node random walks, which enable Skip-gram-based maximization of the network probability in the context of attributed nodes. The search strategy in IANRW explores network neighborhoods flexibly through the parameters $p$ and $q$. Furthermore, it can easily deal with missing data, whether a node is missing its edges or its attributes. Experimental results on multi-class classification, link prediction and node clustering demonstrate the efficacy and efficiency of the proposed method. Our future work mainly includes optimization and extending the model to dynamic attributed networks.

Acknowledgments

Our work is supported by the National Key Research and Development Program of China (No. 2017YFB0802800). The authors would like to thank the Editor-in-Chief and the anonymous reviewers for their insightful and constructive comments, which have led to an improved version of this paper.

References

1. Cao and Xu, Grarep: Learning graph representations with global structural information, In KDD, 2015.

2. Gori, Monfardini and Scarselli, A new model for learning in graph domains, In IEEE International Joint Conference on Neural Networks, volume 2, 2005, pp. 729–734.

3. Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot and Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011), 2825–2830.

4. Shervashidze, Schweitzer, Leeuwen E.J.V., Mehlhorn and Borgwardt K.M., Weisfeiler-Lehman graph kernels, Journal of Machine Learning Research 12 (2011), 2539–2561.

5. Siegal, Nonparametric statistics for the behavioral sciences, McGraw-Hill, 1956.

6. Perozzi, Al-Rfou and Skiena, Deepwalk: Online learning of social representations, In: The ACM SIGKDD International Conference, ACM, 2014, pp. 701–710.

7. Tang, Wang, Zhang, Yan and Mei, Line: large-scale information network embedding, In: International Conference on World Wide Web, 2015, pp. 1067–1077.

8. Grover and Leskovec, node2vec: scalable feature learning for networks, In: The ACM SIGKDD International Conference, 2016.

9. Wang, Cui and Zhu, Structural deep network embedding, In: The ACM SIGKDD International Conference, 2016.

10. Wang, Yang and Zhang, PPNE: Property Preserving Network Embedding, In DASFAA. Springer, 2017, pp. 163–179.

11. Wang, Yang and Zhang, Semi-Supervised Network Embedding, In DASFAA. Springer, 2017, pp. 131–147.

12. Wang, Cui, Wang, Pei, Zhu and Yang, Community Preserving Network Embedding, In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4–9, 2017, San Francisco, California, USA, http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14589, 2017, pp. 203–209.

13. Yang, Liu, Zhao, Sun and Chang E.Y., Network Representation Learning with Rich Text Information, In: Proceedings of the 24th International Joint Conference on Artificial Intelligence, 2015, pp. 2111–2117.

14. Yang, Liu, Zhao, Sun and Chang E.Y., Network representation learning with rich text information, In: International Conference on Artificial Intelligence, 2015, pp. 2111–2117.

15. Mikolov, Sutskever, Chen, Corrado G.S. and Dean, Distributed representations of words and phrases and their compositionality, In NIPS’13, 2013, pp. 3111–3119.

16. Belkin and Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15(6) (2003), 1373–1396.

17. Xie, Lin, Wang and Yu Philip, Clustering embedded approaches for efficient information network inference, Data Science and Engineering 1(1) (2016), 29–40.

18. Hastie T.J. and Church K.W., Very sparse random projections, In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2006, pp. 287–296.

19. Roweis S.T. and Saul L.K., Nonlinear dimensionality reduction by locally linear embedding, Science 290(5500) (2000), 2323–2326.

20. Tenenbaum J.B., De Silva and Langford J.C., A global geometric framework for nonlinear dimensionality reduction, Science 290(5500) (2000), 2319–2323.

21. Cheng, Greaves and Warren, From n-gram to skip-gram to concgram, Int J of Corp Linguistics 11(4) (2006), 411–433.

22. Lovász, Random walks on graphs, Combinatorics 2 (1993), 1–46.

23. Perozzi, Al-Rfou and Skiena, Deepwalk: Online learning of social representations, In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2014, pp. 701–710.

24. Tang, Wang, Zhang, Yan and Mei, Line: large-scale information network embedding, In: International Conference on World Wide Web, 2015, pp. 1067–1077.

25. Wang, Cui and Zhu, Structural deep network embedding, In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 1225–1234.

26. Grover and Leskovec, node2vec: Scalable feature learning for networks, In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 855–864.

27. Yang, Cohen and Salakhudinov, Revisiting Semi-Supervised Learning with Graph Embeddings, In ICML, 2016, pp. 40–48.

28. Zhu, Chi and Gong, Combining content and link for classification using matrix factorization, In SIGIR, 2007, pp. 487–494.

29. Huang and Hu, Label Informed Attributed Network Embedding, In WSDM, ACM, 2017, pp. 731–739.

30. Xiao et al., Exploring Expert Cognition for Attributed Network Embedding, Eleventh ACM International Conference on Web Search and Data Mining, ACM, 2018, pp. 270–278.

31. Yang et al., From Properties to Links: Deep Network Embedding on Incomplete Graphs, ACM, 2017, pp. 367–376.

32. Kipf T.N. and Welling, Variational graph auto-encoders, In NIPS Workshop on Bayesian Deep Learning, 2016.

33. Hamilton W.L., Ying and Leskovec, Inductive representation learning on large graphs, In NIPS, 2017.

34. Liang, Jacobs, Sun et al., Semi-supervised Embedding in Attributed Network with Outliers, 2017.

35. Ahmed N.K., Rossi R.A., Zhou et al., A Framework for Generalizing Graph-based Representation Learning Methods, 2017.

36. Ahmed N.K., Rossi R.A., Zhou et al., Inductive Representation Learning in Large Attributed Graphs, 2017.

37. Ahmed N.K., Rossi, Lee J.B. et al., Learning Role-based Graph Embeddings, 2018.

38. Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot and Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011), 2825–2830.

39. Fawcett, An introduction to ROC analysis, Pattern Recogn Lett 27(8) (2006), 861–874.

40. Sun, Han, Yan P.S. and Wu, Pathsim: Meta path-based top-k similarity search in heterogeneous information networks, In VLDB’11, 2011, pp. 992–1003.
