Feature learning for representing sparse networks based on random walks

Abstract

Learning features to represent graphs such as social networks and protein graphs is increasingly common in both research and business communities, because data has grown not only in quantity but also in complexity. As a result, graphs become sparser, since most nodes are connected to only a small fraction of the others. In addition, if the whole graph is used as input to a learning algorithm such as a neural network, a great deal of training time is required. Substantial effort has therefore gone into converting graphs into better yet more compact representations, one of which is graph embedding. Traditional methods for mapping the original graph to its embedding did not yield significant results until deep learning was applied. Good examples of approaches in this direction are DeepWalk and node2vec. However, their general weakness is that many important connections in the original graph can be lost. In this paper, we propose another approach that retains more edge information while keeping the embedding sufficiently small compared to the original graph. Our experimental results show that the method also increases the accuracy of subsequent learning models.

Keywords

Feature learning, embedding graph, sparse network, deep learning

1. Introduction

Social networks and other types of networks are increasingly used to store vast amounts of users’ personal information. Over the last few decades, (social) network analysis has attracted great research interest. One challenge is that the size and complexity of networks keep increasing rapidly, so algorithms need more time and resources to process them. Our goal is therefore to reduce the size of network data while retaining important information, especially the structure of the network. Almost all machine learning algorithms require input data in the form of features [16], so reducing network data size comes down to extracting good features. Related terms for this work are feature engineering, feature learning, representation learning, and graph embedding [12, 26, 34]; among these, network embedding is the most widely used. The features should be compact, informative, and representative so that they can serve as good inputs for subsequent tasks such as classification, clustering, and link prediction [25]. Many studies have shown that network embedding is a promising tool. For instance, it can be used in node classification [6, 22] to assign labels to the nodes in a network. In social networks, it improves algorithms that predict users’ interests and friendships [18]. In protein-protein interaction networks, it makes it possible to predict the function of proteins in the interactome [27, 33]. The method has also been applied successfully in many other fields [26].

In machine learning, input usually has to be in a mathematical representation (vector, matrix, graph, etc.) in order to be processed properly. Feature learning is a set of techniques for transforming raw data into the required representation. Traditional feature learning methods such as topological relations [15] and subgraph isomorphism [4] not only incur high training costs but also depend on domain knowledge. Moreover, the resulting features are often not general. We therefore need a more effective approach that learns features automatically and produces more general results.

When features are extracted from the information in nodes and edges (links), one notable challenge is that the feature space should be small, yet the information it carries must still be good enough for subsequent network mining algorithms. Another difficulty in converting the original network into this feature space is that, as networks grow, the number of nodes tends to increase, but the new nodes do not necessarily connect to all or most of the other nodes. This results in sparse graphs [3]. One way to mitigate the issue is to make the feature vectors larger, but this is not helpful for training models. Therefore, in this paper, we concentrate on improving the feature extraction model for graph data, especially sparse graphs.

2. Related works

Graph embedding methods focus on dimensionality reduction techniques. The graph is represented in a d-dimensional vector space (with d much smaller than the size of the graph). Each node of the graph is embedded in this compact space so that pairwise node proximity is maintained. That is, two nodes that belong to the same cluster in the original graph should still belong to the same cluster in the embedding. These methods are generally referred to as factorization-based methods. Typical examples are Laplacian Eigenmaps [20] and Locally Linear Embedding [31]. However, scalability and complexity are the main issues with this approach.

From 2014 onwards, studies on graph embedding have been extended to adapt better to the sparse nature of real-world networks. The idea is to apply deep learning to the graph to extract good features automatically. New methods with substantially better time complexity than earlier ones include HOPE [23], DeepWalk [5], node2vec [2], and DNGR [29].

In general, those methods can be divided into two groups with some important branches as shown in Fig. 1. We will review the main characteristics, strengths, and weaknesses of each group in the next subsections.

Figure 1.

The main approaches in graph embedding.

2.1 Factorization based methods

The main idea of factorization-based algorithms is to represent the associations between vertices as a matrix and then factorize this matrix to obtain the embedding. Matrices commonly chosen to represent the relations between vertices include the adjacency matrix [9], the Laplacian matrix [10], the stochastic matrix [7], and the Katz similarity matrix. The factorization technique depends on the matrix properties. If the matrix is positive semidefinite (e.g. the Laplacian matrix), eigenvalue decomposition can be used. For unstructured matrices, gradient descent can be used to embed the graph in linear time.

Locally Linear Embedding (LLE) [31] assumes that each vertex is a linear combination of its neighbors in the embedded space. The authors first select the direct neighbors of a specific vertex and then map this vertex to a new lower-dimensional space so that the embedding cost function is minimized, by finding the smallest eigenmodes of a sparse symmetric matrix. Other algorithms, e.g. the adjacency-based method [1], multi-hop GraRep [28], and multi-hop HOPE [23], propose improvements in which the graph factorization considers farther neighbors in addition to direct neighbors, and thus retain more information about the original graph. However, most of the above algorithms have high time complexity on large graphs with many neighbors.

2.2 Deep learning based methods

Deep learning based methods have shown outstanding performance in many research fields, such as computer vision and natural language processing. Applying deep learning to identify features in network data is therefore promising, and it opens new opportunities to overcome the problems of factorization-based methods. The input data can be either the entire graph or a part of it (e.g. paths extracted from the graph).

Wang et al. [6] propose structural deep network embedding (SDNE), which applies a deep autoencoder to preserve the first-order and second-order proximity of the vertices in the embedding. Both types of proximity are optimized at the same time. A composition of multiple layers of highly non-linear functions in a deep structure maps the data into the embedding space.

In [29], Cao et al. combine random surfing with a stacked denoising autoencoder to extract complex features and model non-linearities. This method is named DNGR (deep neural networks for graph representations). It has three components: random surfing, calculation of the PPMI matrix, and feature reduction. The positive pointwise mutual information (PPMI) matrix is used to ensure that the autoencoder can capture the graph’s higher-order proximity information. Moreover, the stacked autoencoder can effectively reduce noise and enhance robustness when generating informative representations. Although DNGR reduces the dimension of the input data more than SDNE does, it also encounters issues such as high time complexity and suboptimal capacity in sparse spaces.

DeepWalk [5] builds a multi-label network classifier on learned embedding features. To retain high-order proximity, Perozzi et al. maximize the probability of observing the previous $k$ nodes and the next $k$ nodes in a random walk [8, 21] centered at a specific node $v$ . The SkipGram algorithm [5, 32] from natural language processing (NLP) is then applied to maximize the co-occurrence probability among the nodes that appear within a window in the path. DeepWalk paved the way for further studies on applying deep learning to feature learning, as well as on sampling paths to create smaller input data in place of the whole graph.

Similar to DeepWalk, node2vec [2] preserves high-order proximity by maximizing the log-probability of observing a chain of nodes on a random walk of fixed length. The important difference from DeepWalk is that node2vec uses a biased random walk procedure that balances breadth-first and depth-first sampling. In other words, node2vec balances the observation of local and global information. For this reason, node2vec can create higher-quality and more informative embedded features than DeepWalk.

However, these works still leave some problems unsolved, including:

•
Real data usually forms a sparse network, which greatly affects the efficiency of the model.
•
Training good features is usually time-consuming, especially for large graphs. Overcoming this difficulty is challenging in terms of practical usability and network scalability.

These problems motivated us to develop a method that can run on sparse network data while also reducing training time. In the next section, we present our method.

3. Feature learning model

Our approach builds on the DeepWalk algorithm proposed in 2014 and later improved by node2vec in 2016. These were among the early algorithms to adapt the SkipGram model from NLP to feature learning. Our model has three phases, shown in Fig. 2. In the first phase, we sample nodes from the original graph. Next, we apply deep learning to extract features. Finally, we use the extracted features as input data for multi-label classification [11].

Figure 2.

Multi-label classification system with feature learning.

Our improvements are mainly in the first two phases. Specifically, we focus on creating better graph representation and speeding up the feature learning process. In the third phase, we simply use one-vs-rest logistic regression implemented by LibLinear [28] to evaluate the effectiveness of our improvements. We will describe these phases in detail in the following subsections.

3.1 Sampling strategy

Like the DeepWalk and node2vec algorithms, we use random walks to sample nodes in the graph. In this step, the graph is converted into smaller samples that still retain its structural information.

In detail, given a node $u$ , we simulate a random walk of fixed length $l$ , where $c_{i}$ is the $i$ th node in the path. Node $c_{i}$ is then chosen with probability:

$\displaystyle P\left(c_{i}=x\mid c_{i-1}=v\right)=\begin{cases}\dfrac{\pi_{vx}}{Z}&\text{if }\left(v,x\right)\in E\\ 0&\text{otherwise}\end{cases}$ (1)

where $\pi_{vx}$ is the transition probability between nodes $v$ and $x$ , and $Z$ is the normalizing constant.
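As an illustration of Eq. (1), the following minimal Python sketch normalizes the transition weights of a node and uses them to simulate a walk. It assumes an undirected, unweighted graph stored as a dictionary of neighbor lists, so every $\pi_{vx}$ is taken to be 1; the function names and the toy graph are hypothetical and not part of the original method.

```python
import random


def transition_probabilities(graph, v, weights=None):
    """Return {x: P(c_i = x | c_{i-1} = v)} as in Eq. (1).

    `graph` maps each node to a list of its neighbors. `weights` optionally
    maps (v, x) to pi_vx; when omitted, pi_vx = 1 for every edge (assumption).
    Nodes that are not neighbors of v are absent, i.e. have probability 0.
    """
    pi = {x: (weights[(v, x)] if weights else 1.0) for x in graph[v]}
    z = sum(pi.values())  # normalizing constant Z
    return {x: p / z for x, p in pi.items()}


def random_walk(graph, start, length, rng):
    """Simulate a walk with `length` nodes, stopping early at a dead end."""
    walk = [start]
    while len(walk) < length:
        probs = transition_probabilities(graph, walk[-1])
        if not probs:
            break
        nodes = list(probs)
        step = rng.choices(nodes, weights=[probs[x] for x in nodes], k=1)[0]
        walk.append(step)
    return walk


# Hypothetical toy graph: a star with center "a".
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
probs = transition_probabilities(graph, "a")
walk = random_walk(graph, "a", 5, random.Random(0))
```

With unit weights the walk is uniform over neighbors, which corresponds to the unbiased walk of DeepWalk; a biased walk such as that of node2vec would supply different values of $\pi_{vx}$.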

In the DeepWalk and node2vec algorithms, to ensure randomness without missing too many edges of the original graph, the authors choose a sampling strategy in which several random walks ( $r$ walks) are started from each node in the network. However, a proportion of the edges can still be left out. This is easy to see for any node $u$ whose degree is greater than the number of random walks (i.e. $\textit{deg}\left(u\right)>r$ ): the $r$ walks started at $u$ can leave it through at most $r$ of its edges. If the ignored edges carry important information, the learned features will be affected. If the sampling strategy is adjusted so that the number of random walks is at least the largest node degree in the graph ( $r\geqslant\max\left({\textit{deg}\left(u\right)}\right)$ ), there will be too many paths, and because of randomness there is still no guarantee that all edges are selected. This also defeats the purpose of sampling.

As mentioned in the previous sections, an important property is that the network is sparse. We rely on this property to propose a new sampling strategy. Instead of focusing on the nodes of the network, we sample based on the edge set so that each edge is selected at least once. To do so, we take each edge in turn and run the random walk sampling procedure around it. In this way, all nodes are guaranteed to be selected. Our sampling strategy is described in Algorithm 1.

Algorithm 1 EdgeRandomWalk ( $G$ , $x$ , $y$ , $l$ )
Input: graph $G\left({V,E}\right)$
$x$ : first point of center edge
$y$ : second point of center edge
$l$ : walk length
Output: random walk $W$
1: halfWalk1 $=$ RandomWalk ( $x$ , $l/2$ )
2: Mark all edges in halfWalk1 so that halfWalk2 is more likely to select other edges
3: halfWalk2 $=$ RandomWalk ( $y$ , $l-l/2-1$ )
4: $W$ $=$ halfWalk1 $+$ ( $x$ , $y$ ) $+$ halfWalk2
5: return $W$
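Algorithm 1 can be sketched in Python as follows. This is only one possible reading of the pseudocode, under several assumptions that the text does not state: the graph is undirected and stored as a dictionary of neighbor lists, transitions are uniform, "marking" means preferring edges that have not been used yet, the first half is reversed so that it ends at $x$ , and the half lengths are chosen so that the final walk contains exactly $l$ nodes. All names are hypothetical.

```python
import random


def _half_walk(graph, start, steps, marked, rng):
    """Take up to `steps` uniform random steps from `start`.

    Edges already in `marked` are avoided whenever an unmarked edge is
    available, and every traversed edge is added to `marked`.
    Returns the visited nodes, excluding `start`.
    """
    path, current = [], start
    for _ in range(steps):
        neighbors = graph[current]
        if not neighbors:
            break
        fresh = [n for n in neighbors if frozenset((current, n)) not in marked]
        nxt = rng.choice(fresh or neighbors)
        marked.add(frozenset((current, nxt)))
        path.append(nxt)
        current = nxt
    return path


def edge_random_walk(graph, x, y, length, rng):
    """Build one walk with `length` nodes that passes through edge (x, y)."""
    marked = {frozenset((x, y))}  # the center edge is already covered
    before = _half_walk(graph, x, length // 2 - 1, marked, rng)
    after = _half_walk(graph, y, length - length // 2 - 1, marked, rng)
    # Reverse the first half so that the walk reads ... -> x -> y -> ...
    return before[::-1] + [x, y] + after


# Hypothetical toy graph: a 5-cycle with one chord.
graph = {
    0: [1, 4, 2],
    1: [0, 2],
    2: [1, 3, 0],
    3: [2, 4],
    4: [3, 0],
}
walk = edge_random_walk(graph, 1, 2, 8, random.Random(42))
```

Because the center edge is placed in the middle of the walk, both endpoints see context on either side, which is what lets a single walk per edge cover the edge set.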

Another difference in our method is that we sample each edge only once, instead of many times as in the previous methods. This does not affect the quality of the sample set, because what matters is that multiple paths pass through each node. These paths are then the input to the feature learning model, whose output is the embedded feature of each node for later machine learning problems.

At first glance, our sampling strategy seems to increase the size of the sample set. This may be true for dense graphs, but in sparse graphs the sample set stays at a reasonable size. To clarify this point, we compare the size of the sample set between the node sampling method and the edge sampling method in a sparse network.

Let $G\left({V,E}\right)$ be the given sparse network, where $V$ is the set of nodes and $E$ is the set of edges. Let $r_{v}$ and $r_{e}$ be the number of samples selected for each node and each edge, respectively. Simulating random walks of length $l$ , we calculate the total number of edges selected into the feature space in Eqs (2) and (3).

With the node sampling method, the number of edges is:

$\displaystyle S_{ev}=r_{v}\cdot\left({l-1}\right)\cdot\left|V\right|$ (2)

With the edge sampling method, the number of edges is:

$\displaystyle S_{ee}=r_{e}\cdot\left({l-1}\right)\cdot\left|E\right|$ (3)

In the recommended configurations of DeepWalk and node2vec, $r_{v}$ is often set to 10 for each node. In our sampling approach, we take only one random walk per edge, so $r_{e}$ is set to 1. The ratio of the total number of edges between the node sampling method and the edge sampling method is then:

$\displaystyle\frac{S_{ev}}{S_{ee}}=\frac{r_{v}\cdot\left({l-1}\right)\cdot\left|V\right|}{r_{e}\cdot\left({l-1}\right)\cdot\left|E\right|}=\frac{10\left|V\right|}{\left|E\right|}$ (4)

In most of the datasets we surveyed, the numbers of edges and nodes of a sparse network satisfy roughly $\left|E\right|<6\left|V\right|$ . Therefore, along with Eq. (4), we can deduce that:

$\displaystyle\frac{S_{ev}}{S_{ee}}=\frac{10\left|V\right|}{\left|E\right|}>\frac{6\left|V\right|}{\left|E\right|}>1$ (5)

This ratio shows that, under this condition, the sample set in our approach is generally smaller. In other words, our sampling strategy is more efficient than the other sampling strategies mentioned above in sparse networks.
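The comparison in Eqs (2)–(5) can be checked numerically with a short Python sketch. It assumes, as Eq. (4) does, the same walk length for both strategies with $r_{v}=10$ and $r_{e}=1$ ; the node and edge counts are those reported for the three datasets in Table 1, and the function names are hypothetical.

```python
def edges_sampled(walks_per_unit, walk_length, units):
    """Eqs (2) and (3): walks per unit x edges per walk x number of units."""
    return walks_per_unit * (walk_length - 1) * units


def node_to_edge_ratio(num_nodes, num_edges, walk_length, r_v=10, r_e=1):
    """Eq. (4): S_ev / S_ee for the same walk length in both strategies."""
    s_ev = edges_sampled(r_v, walk_length, num_nodes)
    s_ee = edges_sampled(r_e, walk_length, num_edges)
    return s_ev / s_ee


# (|V|, |E|) as listed in Table 1.
datasets = {
    "BlogCatalog": (10312, 333983),
    "PPI": (3890, 76584),
    "CiteSeer": (3312, 4732),
}
ratios = {
    name: node_to_edge_ratio(v, e, walk_length=80)
    for name, (v, e) in datasets.items()
}
```

Under these assumptions the ratio exceeds 1 only when $\left|E\right|<10\left|V\right|$ : it is above 1 for CiteSeer and below 1 for the two denser networks, which matches the remark that edge sampling can produce a larger sample set on dense graphs.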

After sampling, we feed these samples into the feature learning procedure to discover features or representations of each node in the network. The random walk at node $v_{i}$ is denoted $w_{v_{i}}$ . Gathering all the walks of a node gives an embedded matrix $W_{v_{i}}$ . Deep learning architectures are used for this feature learning.

3.2 Feature learning

We modify SkipGram [5] to learn the features more quickly. Input data for SkipGram model is a sample set which includes random walks. Because SkipGram comes from NLP in which input data are sentences in documents, we treat each random walk as a sentence so that they can be fed into the algorithm. Each node on the walk corresponds to a word and the order of the nodes is also the order of the words in the sentence.

Given graph $G\left({V,E}\right)$ , a walk $w$ with length $l$ is denoted as $w=\left({v_{1},v_{2},\ldots,v_{l}}\right)$ where $v_{i}$ is a node in walk $w$ . In SkipGram model, the task is to maximize the co-occurrence probability among the nodes within a window of size $c$ in walk $w$ . The two nodes are considered co-occurring in the same window of size $c$ when the distance between them on the walk is not larger than $c$ . The co-occurrence probability is as follows:

$\displaystyle P\left(\left\{v_{i-c},\ldots,v_{i+c}\right\}\backslash v_{i}\mid\Phi\left(v_{i}\right)\right)=\prod_{\substack{j=i-c\\ j\neq i}}^{i+c}P\left(v_{j}\mid\Phi\left(v_{i}\right)\right)$ (6)

where $\Phi\left({v_{i}}\right)$ is a mapping function that represents the latent social representation associated with each vertex $v_{i}$ in the graph.
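To make the window in Eq. (6) concrete, the sketch below enumerates the (center, context) node pairs whose conditional probabilities appear in the product, treating a walk as a sentence. The function name and the example walk are hypothetical; positions that fall outside the walk are simply skipped, which is an assumption about boundary handling.

```python
def context_pairs(walk, window):
    """Return all (v_i, v_j) pairs with 0 < |i - j| <= window."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the product in Eq. (6) excludes j = i
                pairs.append((center, walk[j]))
    return pairs


walk = ["v1", "v2", "v3", "v4", "v5"]
pairs = context_pairs(walk, window=1)
```

With a window of 1 each interior node contributes two pairs and each end node one, so this five-node walk yields eight training pairs.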

Maximizing the co-occurrence probability means maximizing the probability of a node’s neighbors in the walk. However, doing this on a large graph has a high computational cost. Hierarchical Softmax [5] is therefore used to approximate the probability distribution. We place the softmax at the output layer to estimate the co-occurrence probability of a node with all its neighbors in the network.

In a nutshell, the SkipGram neural network (Fig. 3) has three layers: the input layer, the hidden layer, and the output layer. At the input layer, each node $v_{i}$ in the network is represented by a binary vector of length $\left|V\right|$ with a value of 1 at the $i$ th position and 0 elsewhere. This vector is then multiplied by the embedded matrix $W_{v_{i}}$ obtained from the sampling step. The result of this multiplication is a vector $h$ in the hidden layer. Next, we pass the vector $h$ into Hierarchical Softmax to approximate the probability distribution. After training, the vector $h$ in the hidden layer is the feature vector we want. Later classifiers can use these features to learn a model for a specific problem.
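The forward pass through the three layers can be sketched in plain Python. This is a simplified illustration, not the model used in the experiments: it treats the embedding as a single $\left|V\right|\times d$ weight matrix, and it uses a full softmax at the output layer in place of Hierarchical Softmax to keep the example short. The toy sizes and weight values are hypothetical.

```python
import math


def one_hot(index, size):
    """Binary input vector with a 1 at position `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def vec_mat(vec, mat):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [
        sum(vec[r] * mat[r][c] for r in range(len(mat)))
        for c in range(len(mat[0]))
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy sizes: |V| = 4 nodes, d = 2 embedding dimensions.
embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]       # |V| x d
output_weights = [[0.1, 0.0, -0.1, 0.2], [0.0, 0.1, 0.2, -0.1]]    # d x |V|

h = vec_mat(one_hot(2, 4), embedding)         # hidden layer for node index 2
probs = softmax(vec_mat(h, output_weights))   # distribution over all nodes
```

Multiplying a one-hot vector by the matrix simply selects one row, so the hidden vector $h$ of a node is its row of the embedding matrix; this is why the learned matrix can be read off directly as the node features.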

Figure 3.

SkipGram model for feature learning.

Algorithm 2 EdgeSliding ( $G$ , $c$ , $d$ , $r$ , $l$ )
Input: graph $G\left({V,E}\right)$
$c$ : window size
$d$ : embedding size
$r$ : walks per edge $e\in E$
$l$ : walk length
Output: matrix of vertex representations $\Phi$
1: Initialize $\Phi$
2: Build a binary tree $T$ from $V$
3: for $i=$ 1 to $r$ do
4:   edges $=$ Shuffle ( $E$ )
5:   for $e\in$ edges do
6:     $W_{e}=$ EdgeRandomWalk ( $G$ , $e . x$ , $e . y$ , $l$ )
7:     SkipGramWithMGGD ( $T$ , $\Phi$ , $W_{e}$ , $c$ )
8:   end for
9: end for
10: return $\Phi$

While implementing the softmax, we found an opportunity for improvement. The optimization is done by stochastic gradient descent (SGD) [14], which is not well suited to speeding up the algorithm on parallel systems. The authors of DeepWalk and node2vec replaced SGD with an asynchronous version (ASGD) so that multiple threads can be used. However, this does not reach a high level of parallelism. We replaced SGD with mini-batch gradient descent (MGGD) [30] and observed that the algorithm ran faster than with the old method. Our experiments confirm this.
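The difference between the two update rules can be illustrated on a toy problem. The sketch below is not the SkipGram training loop; it fits a single weight $w$ in $y=wx$ by least squares, once with per-example SGD updates and once with mini-batch updates. The objective, data, learning rate, and batch size are all hypothetical choices for illustration.

```python
def gradient(w, x, y):
    """Derivative of 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def sgd_epoch(w, data, lr):
    """One pass with an update after every single example."""
    for x, y in data:
        w -= lr * gradient(w, x, y)
    return w


def minibatch_epoch(w, data, lr, batch_size):
    """One pass with one update per batch, using the averaged gradient."""
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # The gradients inside a batch all use the same w and do not depend
        # on each other, so they could be evaluated in parallel.
        grad = sum(gradient(w, x, y) for x, y in batch) / len(batch)
        w -= lr * grad
    return w


# Toy data generated from y = 3x.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
w_sgd = w_mb = 0.0
for _ in range(200):
    w_sgd = sgd_epoch(w_sgd, data, lr=0.01)
    w_mb = minibatch_epoch(w_mb, data, lr=0.01, batch_size=2)
```

In SGD each update depends on the weight produced by the previous example, which forces sequential execution; in the mini-batch variant the work inside a batch is independent, which is the property that makes it easier to parallelize.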

We summarize all our improvements in an algorithm named EdgeSliding (Algorithm 2). It is like DeepWalk and node2vec but there are some changes in the sampling step and feature learning step as described above.

4. Experiments and results

Because our approach is based on two well-known approaches, DeepWalk and node2vec, we compare our work with theirs to show the improvements. We mostly use the datasets, evaluation methods, and parameters proposed by their authors to ensure objectivity, generalizability, and reliability in the comparison.

4.1 Datasets

Three published datasets are used for evaluation: BlogCatalog [2], Protein-Protein Interactions [2], and CiteSeer [13]. Their attributes are presented in Table 1.

Table 1
Description of the datasets, where $|V|$ is the number of vertices, $|E|$ is the number of edges, $|L|$ is the number of labels, and $|E|/|V|$ is the ratio between $|E|$ and $|V|$

	BlogCatalog	PPI	CiteSeer
$|V|$	10,312	3,890	3,312
$|E|$	333,983	76,584	4,732
$|L|$	39	50	6
$|E|/|V|$	32.38	19.68	1.42

Briefly, the main characteristics are as follows:

•

BlogCatalog: a social network that shows bloggers’ relationships on the BlogCatalog website. Labels represent bloggers’ interests. The network has 10,312 nodes, 333,983 edges, and 39 different labels.

•

Protein-Protein Interactions (PPI): a subgraph of the PPI network for Homo sapiens in which nodes represent the interacting proteins and edges denote all the detected pairwise interactions between proteins. The network has 3,890 nodes, 76,584 edges, and 50 different labels.

•

CiteSeer: a citation network between scientific publications in computer science. The labels in the dataset represent the research areas of the papers. The papers are separated into 6 categories: Agents, Artificial intelligence (AI), Database (DB), Information retrieval (IR), Machine learning (ML), and Human-computer interaction (HCI). The network consists of 3,312 nodes, 4,732 edges, and 6 different labels. This dataset is a sparse network.

For each dataset, we use the same data organization as DeepWalk and node2vec. That is, we randomly select some labeled nodes for the training set, called $T_{R}$ , and use the rest as the test set. This process is repeated 10 times, with the training set $T_{R}$ accounting for 10% to 90% of the labeled data.

4.2 Evaluation metrics

In our experiments, we use Micro-F1 and Macro-F1 to evaluate the performance [24]. They are defined as follows:

•
The Micro-F1 score is the harmonic mean of the precision $P$ and the recall $R$ , which are calculated from the true positives $TP$ , false positives $FP$ , and false negatives $FN$ according to the formula:

$\displaystyle\textit{Micro-F1}=\frac{2\ast P\ast R}{P+R}$ (7)

where $P=\frac{\sum_{l\in{\cal L}}TP\left(l\right)}{\sum_{l\in{\cal L}}\left({TP\left(l\right)+FP\left(l\right)}\right)}$ and $R=\frac{\sum_{l\in{\cal L}}TP\left(l\right)}{\sum_{l\in{\cal L}}\left({TP\left(l\right)+FN\left(l\right)}\right)}$ , with $TP\left(l\right)$ , $FP\left(l\right)$ , and $FN\left(l\right)$ being, respectively, the number of correct positive samples, the number of incorrect positive samples, and the number of incorrect negative samples of label $l$ .
•
The Macro-F1 score is the average of the $F1$ measure over all labels. It is defined as follows:

$\displaystyle\textit{Macro-F1}=\frac{\sum_{l\in{\cal L}}F1\left(l\right)}{\left|{\cal L}\right|}$ (8)

where $F1\left(l\right)$ is the $F1$ measure for the label $l$ and $F1=\left({2\ast P\ast R}\right)/\left({P+R}\right)$ , with $P$ the precision and $R$ the recall of the label $l$ .
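Both scores in Eqs (7) and (8) can be computed from per-label counts, as in the following minimal sketch. It assumes each node's true and predicted labels are given as Python sets, and it defines precision, recall, and F1 as 0 when their denominator is 0; the function names and the three-node example are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from counts; returns 0.0 when a denominator is zero (assumption)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(true_labels, predicted_labels, labels):
    """Return (Micro-F1, Macro-F1) for multi-label predictions."""
    counts = {label: [0, 0, 0] for label in labels}  # [TP, FP, FN] per label
    for truth, pred in zip(true_labels, predicted_labels):
        for label in labels:
            if label in pred and label in truth:
                counts[label][0] += 1
            elif label in pred:
                counts[label][1] += 1
            elif label in truth:
                counts[label][2] += 1
    # Micro: pool the counts over all labels, then compute one F1 (Eq. 7).
    totals = [sum(c[k] for c in counts.values()) for k in range(3)]
    micro = f1(*totals)
    # Macro: compute F1 per label, then average (Eq. 8).
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro


truth = [{"a"}, {"a", "b"}, {"b"}]
pred = [{"a"}, {"a"}, {"a", "b"}]
micro, macro = micro_macro_f1(truth, pred, labels=["a", "b"])
```

Micro-F1 weights every label decision equally and is therefore dominated by frequent labels, whereas Macro-F1 weights every label equally regardless of how often it occurs.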

4.3 Parameter settings

Before the experiments, we need to set the parameters for each dataset. For DeepWalk and node2vec, we reuse almost all of the parameters that the authors chose as the most suitable in their papers. Details of the parameters are as follows:

•
In DeepWalk, we use the same parameters for all datasets: $d=$ 128, $w=$ 10, $l=$ 80, $r_{v}=$ 10.
•
In node2vec, we use different parameters for each dataset, as presented in Table 2, because they are the best parameters proposed by Grover et al. [2].

Table 2
node2vec’s parameters for datasets

	$d$	$w$	$l$	$r_{v}$	$p$	$q$
BlogCatalog	128	10	80	10	0.25	0.25
PPI	128	10	80	10	4	1
CiteSeer	128	10	80	10	0.5	1

•
In our approach, we choose parameters as presented in Table 3.

Table 3
Our parameters for datasets

	$d$	$w$	$l$	$r_{e}$
BlogCatalog	128	8	55	1
PPI	128	8	55	1
CiteSeer	128	8	55	3

The meanings of the parameters in the above models are:

•
$d$ : the size of the embedding vector.
•
$w$ : window size (denoted $c$ in Section 3.2).
•
$l$ : walk length.
•
$r_{v}$ : the number of walks on each vertex (in the DeepWalk and node2vec models). This parameter should be higher than one because these models must limit the loss of structural information in the network.
•
$r_{e}$ : the number of walks on each edge (in our approach). This parameter often equals one because almost all edges and vertices in the network are on some walk. For the CiteSeer dataset, we choose $r_{e}$ equal to 3 because the network is very sparse and we want to increase the number of training samples.
•
$p$ and $q$ : used to control the walk sampling process in node2vec.

We first applied the parameters of the original algorithms and obtained some results. We then lowered parameters such as window size and walk length in order to improve the running time. Because our sampling strategy is based on edges, the window size and walk length do not need to be very high to capture the full structural information of the network. We applied k-fold cross validation and the elbow method to identify the optimal parameters for our model.

4.4 Experiment results

We measured the speed of a learning process including the sum of the sampling time and the feature training time (SkipGram). The results are shown in detail in Fig. 4.

Figure 4.

Run time of DeepWalk, node2vec, and our EdgeSliding on three datasets.

On the PPI and CiteSeer datasets, our approach has a better run time. On the BlogCatalog dataset, our approach is slower than DeepWalk but still faster than node2vec. Because the BlogCatalog network is dense, our algorithm takes too many samples and therefore needs a long time to learn features.

In general, it can be said that in the sparse networks, our approach has a substantially lower sum of the sampling time and the feature learning time than the other approaches.

Table 4

Results of multi-label classification for DeepWalk, node2vec, and our EdgeSliding on three sets of data according to Macro-F1 measurement with the ratio of labeled nodes $T_{R}$ from 10% to 90%

	Blogcatalog			PPI			CiteSeer
$T_{R}$	DeepWalk	node2vec	EdgeSliding	DeepWalk	node2vec	EdgeSliding	DeepWalk	node2vec	EdgeSliding
10%	0.1903	0.2109	0.2064	0.1293	0.1305	0.1368	0.4746	0.4833	0.5031
20%	0.2155	0.2331	0.2337	0.1538	0.1545	0.1619	0.5090	0.5154	0.5338
30%	0.2310	0.2448	0.2480	0.1664	0.1709	0.1733	0.5295	0.5269	0.5479
40%	0.2433	0.2562	0.2592	0.1726	0.1777	0.1832	0.5326	0.5410	0.5599
50%	0.2506	0.2648	0.2684	0.1803	0.1841	0.1924	0.5417	0.5427	0.5683
60%	0.2570	0.2694	0.2716	0.1896	0.1907	0.1952	0.5516	0.5525	0.5777
70%	0.2612	0.2727	0.2777	0.1957	0.1947	0.1996	0.5598	0.5531	0.5826
80%	0.2598	0.2757	0.2813	0.1920	0.1907	0.1986	0.5591	0.5605	0.5854
90%	0.2610	0.2718	0.2756	0.1931	0.1826	0.1975	0.5635	0.5711	0.5937

Table 5

Results of multi-label classification for DeepWalk, node2vec, and our EdgeSliding on three sets of data according to Micro-F1 measurement with the ratio of labeled nodes $T_{R}$ from 10% to 90%

	Blogcatalog			PPI			CiteSeer
$T_{R}$	DeepWalk	node2vec	EdgeSliding	DeepWalk	node2vec	EdgeSliding	DeepWalk	node2vec	EdgeSliding
10%	0.3366	0.3631	0.3606	0.1582	0.1651	0.1723	0.5142	0.5332	0.5540
20%	0.3618	0.3824	0.3825	0.1825	0.1858	0.1951	0.5508	0.5639	0.5842
30%	0.3754	0.3929	0.3937	0.1957	0.2016	0.2075	0.5746	0.5762	0.6013
40%	0.3872	0.4006	0.4026	0.2018	0.2092	0.2175	0.5789	0.5881	0.6089
50%	0.3944	0.4067	0.4082	0.2110	0.2155	0.2262	0.5892	0.5905	0.6186
60%	0.3988	0.4095	0.4109	0.2197	0.2236	0.2290	0.5991	0.6019	0.6254
70%	0.4036	0.4135	0.4143	0.2254	0.2303	0.2331	0.6073	0.6014	0.6289
80%	0.4082	0.4148	0.4185	0.2254	0.2288	0.2309	0.6104	0.6081	0.6365
90%	0.4057	0.4159	0.4210	0.2266	0.2290	0.2386	0.6145	0.6148	0.6377

Figure 5.

Macro-F1 score chart of DeepWalk, node2vec, and our EdgeSliding.

Figure 6.

Micro-F1 score chart of DeepWalk, node2vec, and our EdgeSliding.

Table 4 and Fig. 5 show the Macro-F1 results for multi-label classification. Overall, our approach has a higher Macro-F1 score than DeepWalk and node2vec on all three datasets across training ratios; the only exception in Table 4 is BlogCatalog at $T_{R}$ = 10%, where node2vec is slightly higher (0.2109 vs. 0.2064). According to Table 5 and Fig. 6, the proposed method also has a higher Micro-F1 score on all datasets, again except for BlogCatalog at $T_{R}$ = 10% (0.3631 vs. 0.3606).

In conclusion, our experiments show that our approach has some encouraging results, as follows:

•

The sampling strategy achieves good coverage of the dataset. The feature vectors therefore contain more information from the original data, which helps improve the quality of subsequent machine learning.

•

Feature learning is faster on networks that have a low ratio between the number of edges and the number of vertices (i.e. sparse graphs). Specifically, we experimented on three datasets whose ratios decrease in the order: BlogCatalog $>$ PPI $>$ CiteSeer.

5. Conclusion and future work

In this paper, we proposed an approach that reduces the feature learning time and improves the quality of feature vectors compared to the DeepWalk and node2vec algorithms, especially on sparse networks. The overall idea is that instead of taking samples based on nodes, we take samples based on the edges of the network. This covers almost all edges in the network, so more information can be retained. Although all edges are chosen, we do not lose the important characteristics of the random walk method. We also showed that our approach scales to large sparse networks. In addition to improving the quality of feature vectors, we applied a different optimization method to take advantage of parallel computing, which improved the run time. Experiments on several datasets illustrated the effectiveness of the proposed approach for multi-label classification.

In addition to scalability, our model is designed into modules independent of each other. Therefore, for future work, we could apply parallel computing to increase even more the speed of the learning process. Additionally, our approach is built on a linguistic model so any future improvement in natural language processing will also enable us to improve our algorithm.

Acknowledgments

This paper is supported by the Research and Development Support Program of Faculty of Information Technology, University of Science, VNU-HCMC.

References

Ahmed

Shervashidze

Narayanamurthy

Josifovski

and Smola

A.J.

, Distributed large-scale natural graph factorization, in: Proceedings of the 22nd International Conference on WWW, 2013.

2. Grover and Leskovec, node2vec: scalable feature learning for networks, in: Proceedings of the 22nd International Conference on Knowledge Discovery and Data Mining, 2016.

3. Barabási A.L., Network science, Cambridge University Press, 2016.

4. Adhikari, Zhang, Ramakrishnan and Prakash B.A., Sub2vec: Feature learning for subgraphs, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Cham, 2018, pp. 170–182.

5. Perozzi, Al-Rfou and Skiena, DeepWalk: online learning of social representations, in: Proceedings of the 20th International Conference on Knowledge Discovery and Data Mining, 2014.

6. Wang, Cui and Zhu, Structural deep network embedding, in: the 22nd International Conference on Knowledge Discovery, 2016.

7. Poole D.G., The stochastic group, Amer. Math. Monthly 102 (1995), 798–801.

8. Fouss, Pirotte, Renders J.-M. and Saerens, Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation, in: IEEE Transactions on Knowledge and Data, 2006.

9. Szabo, The Linear Algebra Survival Guide: Illustrated with Mathematica, Elsevier Science, 2015.

10. Chung F.R. and Graham F.C., Spectral graph theory, No. 92, American Mathematical Soc., 1997.

11. Tsoumakas and Katakis, Multi-label classification: an overview, International Journal of Data Warehousing and Mining (IJDWM) 3(3) (2007), 1–13.

12. Cai, Zheng V.W. and Chang K.C., A comprehensive survey of graph embedding: problems, techniques and applications, IEEE Transactions on Knowledge and Data Engineering 30(9) (2018), 1616–1637.

13. Chen, Perozzi and Skiena, Harp: hierarchical representation learning for networks, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

14. Robbins and Monro, A stochastic approximation method, The Annals of Mathematical Statistics (1951), 400–407.

15. Xue, Peng and Shang, Deep feature learning of multi-network topology for node classification, arXiv preprint arXiv:1809.02394, 2018.

16. Guyon, Gunn, Nikravesh and Zadeh L.A., Feature extraction: foundations and applications, Springer 207 (2008).

17. Leskovec, Representation Learning on Networks, snap.stanford.edu/proj/embeddings-www, retrieved 27 March 2019.

18. Backstrom and Leskovec, Supervised random walks: predicting and recommending links in social networks, in: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, ACM, 2011, pp. 635–644.

19. Katz, A new status index derived from sociometric analysis, Psychometrika 18(1) (1953), 39–43.

20. Belkin and Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in: Advances in Neural Information Processing Systems, 2002, pp. 585–591.

21. Newman M.E., A measure of betweenness centrality based on random walks, Social Networks 27(1) (2005), 39–54.

22. Jadon and Hiteshri, Multi-label classification methods: a comparative study, International Research Journal of Engineering and Technology (IRJET) 4(12) (2017), 263–270.

23. Cui, Pei, Zhang and Zhu, Asymmetric transitivity preserving graph embedding, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 1105–1114.

24. Sokolova and Lapalme, A systematic analysis of performance measures for classification tasks, Information Processing & Management 45(4) (2009), 427–437.

25. N.T., Ichise and Le H.B., Detecting hidden relations in geographic data, in: Proceedings of the 4th International Conference on Advances in Semantic Processing, 2010, pp. 61–68.

26. Goyal and Ferrara, Graph embedding techniques, applications, and performance: a survey, Knowledge-Based Systems 151 (2018), 78–94.

27. Radivojac, Clark W.T., Oron T.R., Schnoes A.M., Wittkop, Sokolov, Graim, Funk, Verspoor, et al., A large-scale evaluation of computational protein function prediction, Nature Methods 10(3) (2013), 221–227.

28. Cao and Xu, Grarep: learning graph representations with global structural information, in: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, ACM, 2015, pp. 891–900.

29. Cao and Xu, Deep neural networks for learning graph representations, in: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI Press, 2016.

30. Ruder, An overview of gradient descent optimization algorithms, arXiv preprint arXiv:1609.04747, 2016.

31. Roweis S.T. and Saul L.K., Nonlinear dimensionality reduction by locally linear embedding, Science 5500 (2000), 2323–2326.

32. Mikolov, Chen, Corrado and Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv, 2013.

33. Liu, Yang, Sang, Zhou, Wang, Zhang, Lin, Wang and Xu, Identifying protein complexes based on node embeddings obtained from protein-protein interaction networks, BMC Bioinformatics 19(1) (2018), 332.

34. Bengio, Courville and Vincent, Representation learning: a review and new perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8) (2013), 1798–1828.

Table 1
Description of the datasets, where $|V|$ is the number of vertices, $|E|$ is the number of edges, $|L|$ is the number of labels, and $|E|/|V|$ is the ratio of edges to vertices