Abstract
Identifying features to represent graphs such as social networks, protein graphs is increasingly common in both research and business communities, thanks to the fact that data has increased not only in quantity but also in complexity. This results in the graphs to be sparser because not all nodes are fully connected. In addition, if this whole graph is used as input data for learning algorithms e.g. neural network, a lot of training time will be required. Substantial efforts have been made to convert the graphs to better yet compact representations, among of which is graph embedding. The traditional methods used to map the original graph to its embedding representation had not yielded significant results until deep learning was invented. Many good approaches in this direction, as examples, are DeepWalk, node2vec. However, their general weakness is many important connections in the original graph could be lost. In this paper, we propose another approach to retain more edge information while ensuring the embedding graph is still sufficiently small, compared to the original one. Our experiment results show that the method also increases the accuracy of latter learning models.
Introduction
Social network and other types of network are increasingly popular in serving to store vast amounts of users’ personal information. In the last few decades, (social) network analysis has gained great research interests. One of the challenges is the size and complexity of the network keep increasing rapidly, requiring algorithms to consume more time and resources to process. Therefore, it is our goal that we must reduce network data size but retain important information, especially the structure of network. Almost all algorithms in machine learning require input data as a form of features [16]. Reducing network data size, thus, focuses on extracting good features. Some related terms in this work are feature engineering, feature learning, representation learning or graph embedding [12, 26, 34]. Among those, the term network embedding is most used. The features should be highly compact, informative and representative so that they will be used as good inputs for latter tasks such as classification, clustering, link prediction [25] and so on. Many studies have shown that network embedding is a promising tool. For instance, it can be used in node classification [6, 22] to assign labels for the nodes in the network. In social networks, this method improves algorithms predicting users’ interest and their friendships [18]. In protein-protein interaction networks, predicting the function of proteins in the interactome is made possible due to the use of network embedding [27, 33]. There are also many other fields in which this method has been successfully applied [26].
In machine learning, it is usually required that input is of mathematical representation (vector, matrix, graph, etc.) in order to be properly processed. Feature learning is a set of techniques transforming raw data into the needed mathematical representations. The traditional feature learning methods such as topological relation [15], subgraph isomorphic [4] does not only take high costs for training but also depend on domain knowledge. Moreover, these features are often not general. Thus, we need a more effective approach to learn features, specifically learning automatically and the result is more general.
Regarding the dependence on information from nodes and edges/links in extracting features, one challenge worth noting is that feature space size should be small enough, but the quality of representative information should be good enough for latter network mining algorithms. One other difficulty in the conversion of the original network into this feature space is, while the networks grow, they often tend to increase the number of nodes, but those nodes do not necessarily connect to all/almost nodes in the network. This issue results in sparse graphs [3]. One way to mitigate such issue is to make the feature larger but this is not useful for training models. Therefore, in this paper, we concentrate on improving the feature extraction model for graph data, especially sparse graphs.
Related works
The methods converting to graph embedding focus on dimensionality reduction techniques. The graph is rendered in a d-dimensional vector space (with d much smaller than the size of the graph). The node of the graph is embedded in this compact space so that the pairwise node proximity is still maintained in the new space. It means two nodes will be still able to belong to the same cluster as the embedded one. These methods are generally referred to as a factorization-based method. Some typical methods of the factorization are Laplacian Eigenmaps [20] and Locally Linear Embedding [31]. However, the scalability, as well as the complexity, are the main issues in this approach.
From 2014 onwards, the studies on embedding graphs which originally serve for the graph can be extended to well adapt to the sparse nature of networks in practice. The idea is to apply deep learning to the graph to automatically extract good features. Many new studies with substantially improved time-complexity, compared to previous methods, are such as HOPE [23], DeepWalk [5], node2vec [2], DNGR [29], etc.
In general, those methods can be divided into two groups with some important branches as shown in Fig. 1. We will review the main characteristics, strengths, and weaknesses of each group in the next subsections.
The main approaches in graph embedding.
The main idea of factorization-based algorithms is to find a way to represent the association between vertices in the form of a matrix and then factorize this matrix into the embedded space. The matrix form is often chosen to represent the relation of vertices in graphs such as adjacency matrix [9], Laplacian matrix [10], stochastic matrix [7], or KATZ similarity matrix. Matrix factorization depends on matrix properties. If the matrix is positive semidefinite (e.g. Laplacian matrix), it is possible to use eigenvalue analysis to process. For unstructured matrices, the gradient descent method can be used to embed the graph in linear time.
Locally Linear Embedding (LLE) [31] assumes that each vertex is a linear combination of its neighbors in embedded space. Initially, the authors select the direct neighbors of a specific vertex; then, they map this vertex to a new lower dimension space so that the embedding cost function can be minimized by finding the smallest eigenmodes of the sparse symmetric matrix. Other algorithms, e.g. Adjacency-based [1], Multi hop – GrapRe [28], Multi hop – HOPE [23] proposed new improvements in which graph factorization considered father neighbors in addition to direct neighbors. Thus, they retain more information on the original graph. However, most above algorithms have high time-complexity in case of large graphs and many neighbors.
Deep learning based methods
Nowadays, deep learning based methods have shown outstanding performance in much different research fields such as computer vision, natural language processing, so on. The idea of applying deep learning to identify features in network data is promising. It opens new opportunities to overcome the problems of factorization-based methods. Input data can either the entire graph or a part of it (e.g. paths are extracted from the graph).
Wang et al. [6] with structural deep network embedding method (SDNE) applies a deep autoencoder to preserve the first-order proximity and second-order proximity of the vertices when embedded. The authors optimize at the same time both types of order proximity. The composition of multiple layers of highly non-linear function in deep structure is used to map the data into an embedding space.
In the paper [29], Cao et al. adapt a random surfing along with stacked denoising autoencoder to extract complex features and model non-linearities. This method is named DNGR (deep neural networks for graph representations). The method includes three components: random surfing, calculation of PPMI matrix and feature reduction. The positive pointwise mutual information (PMI) is used to ensure an autoencoder can capture the graph’s information in higher order proximity. Moreover, autoencoder in stacked structure explored via deep learning can effectively reduce noise and enhance robustness in generating informative representations. Although there is an improvement in reducing the dimension of input data compared to SDNE, the DNGR method also encounters issues such as high time complexity as well as optimal capacity in sparse spaces.
DeepWalk [5] builds a multi-label network classification based on learning embedded features. To retain high order proximity, Perozzi et al. maximize the probability of observing the last node and the next
Similar to DeepWalk approach, node2vec [2] preserves high order proximity by maximizing the log-probability of observing a node chain on a random walk of fixed length. The important difference, compared to DeepWalk, is that node2vec designed with a biased random walk procedure has balanced the cost of time complexity between breadth-first sampling and depth-first sampling. In other words, node2vec achieves a balance in observing between local and global information. For this reason, node2vec can create higher quality and more information embedded features than DeepWalk.
However, in these works, there are still problems having not been solved including:
The real data is usually a sparse network, greatly affecting the efficiency of the model. The training phase in order to have good features usually is time-demanding, especially for large graphs. Overcoming this difficulty is challenging in terms of practical usability and network scalability.
In summary, this has motivated us to develop a method that can run on sparse network data while also reducing training time. In the next section, we will present our method for these problems.
Our approach is built on DeepWalk algorithm proposed in 2014 and then improved by node2vec in 2016. These were the early algorithms to adapt the SkipGram model from NLP algorithm in feature learning. Our model has three phases shown in Fig. 2. In the first phase, we do a sampling of nodes from the original graph. Next, we apply deep learning to extract features. After that, we use the extracted features as input data for multi-label classification [11].
Multi-label classification system with feature learning.
Our improvements are mainly in the first two phases. Specifically, we focus on creating better graph representation and speeding up the feature learning process. In the third phase, we simply use one-vs-rest logistic regression implemented by LibLinear [28] to evaluate the effectiveness of our improvements. We will describe these phases in detail in the following subsections.
Like the DeepWalk and node2vec algorithms, we also use the random walk for sampling nodes in the graph. In this step, input data in the graph will be converted into smaller samples but still retain the structure information.
In details, with the given node
where
In DeepWalk and node2vec algorithms, to ensure randomness without missing too many edges from the original graph, the authors choose a sampling strategy in which several random walks (
As we mentioned in the previous sections, an important property is that the network is sparse. We rely on this property to propose a new sampling strategy. Instead of focusing on all the nodes in the network, we will take the sample based on the edge set so that each edge is selected at least one time. To do that, we take turns in selecting each edge and make the sampling procedure on this edge by random walk approach. In this way, all nodes are guaranteed to be selected. Our sampling strategy is described in Algorithm 1.
Another variation in our method is that for each edge we only perform sampling once instead of many times in the previous methods. That does not affect the quality of the sample set taken. Noticeably, our main goal is to need multiple paths to pass through one node. These paths next will be input data for the feature learning model and the output then is the embedded feature of each node for later machine learning problems.
At first glance, our sampling strategy will increase the size of the sample set. This may be true for dense graphs but in sparse graphs, our method is still at a reasonable size level. To clarify this point, we compare the magnitude of the sample set between node sampling method and edge sampling method in a sparse network.
Let
With node sampling method, the number of edges is:
With the edge sampling method, the number of edges is:
For the recommended configuration, in DeepWalk and node2vec approaches,
In almost datasets we surveyed, the ratio of the number of edges and number of nodes is near
This ratio function proves that the size of the sample set in our approach is generally of smaller value. In other words, our sampling strategy will be more effective than the other mentioned sampling strategies in sparse networks.
After completing the sampling, we will bring these samples into feature learning procedure to discover features or representations of each node in the network. The random walk at node
We modify SkipGram [5] to learn the features more quickly. Input data for SkipGram model is a sample set which includes random walks. Because SkipGram comes from NLP in which input data are sentences in documents, we treat each random walk as a sentence so that they can be fed into the algorithm. Each node on the walk corresponds to a word and the order of the nodes is also the order of the words in the sentence.
Given graph
where
Maximizing the co-occurrence probability is maximizing the probability of its neighbors in the walk. However, this work in a large graph requires high computational cost. Alternatively, Hierarchical Softmax [5] is used to approximate the probability distribution. We put Softmax at the output layer to estimate the co-occurrence probability of a node with all its neighbors in the network.
In a nutshell, the Skipgram (Fig. 3) in a neural network includes three layers: the input layer, hidden layer, and output layer. At the input layer, each node
SkipGram model for feature learning.
While implementing Softmax, we discovered a problem in which we could improve it. The optimization process is done by SGD (Stochastic Gradient Descent) [14]. SGD is not suitable for speeding up the algorithm by the parallel system. Authors in DeepWalk and node2vec converted SGD to asynchronous version (ASGD), so parallel can be applied by using multiple threads. However, it has not reached a high parallelism level. We try to change SGD by MGGD (Mini-Batch Gradient Descent) [30]. And we realize that the algorithm’s speed increased, compared to the old method. This will be proven in our experiments.
We summarize all our improvements in an algorithm named EdgeSliding (Algorithm 2). It is like DeepWalk and node2vec but there are some changes in the sampling step and feature learning step as described above.
Because our approach is based on two well-known approaches: DeepWalk and node2vec, we will compare our work with theirs to show the improvements. The data sets, evaluation methods and parameters proposed by the authors are mostly used to ensure objectivity, generalizability, and reliability in the comparison process.
Datasets
Three published datasets are used for evaluation, including Blogcatalog [2], Protein-Protein Interactions [2], and CiteSeer [13]. These dataset’s attributes are presented in Table 1.
Description about datasets where
is number of vertices,
is number of edges,
is number of labels, and
/
is ratio between
and
Description about datasets where
Briefly, the main characteristics are as followed:
BlogCatalog: a social network that shows bloggers’ relationships on BlogCatalog website. Labels show bloggers’ concerns through links between bloggers. The network has 10,312 nodes, 333,983 edges, and 39 different labels. Protein-Protein Interactions (PPI): a subgraph of the PPI network for Homo Sapiens in which nodes represent the interacting proteins and edges denotes all the detected pairwise interactions between proteins. The network has 3,890 nodes, 76,584 edges, and 50 different labels. CiteSeer: a citation network between scientific publications in computer science. The labels in the dataset represent the research areas of the paper. The papers are separated into 6 categories: Agents, Artificial intelligence (AI), Database (DB), Information retrieval (IR), Machine learning (ML), and Human-computer interaction (HCI). The network consists of 3,312 nodes, 4,732 edges and 6 different labels. This dataset is a sparse network.
For each dataset, we use the same data organization as DeepWalk and node2vec. It means that we randomly select some labeled nodes for the training set, called
In our experiment, we use Micro-F1 and Macro-F1 to evaluate the performance [24]. Their definitions are listed as follows:
Micro-F1 score is the harmonic average of the precision
where Macro-F1 score is the average measure of
where
Before the experiments, we need to set up parameters for each dataset. In DeepWalk and node2vec algorithms, we reuse almost parameters which authors chose as a best suitable parameter in their papers. Details of the parameters are listed as follows:
In our approach, we choose parameters as presented in Table 3.
Our parameters for datasets
The meaning of parameters in the above models are:
d: the size of the embedding vector. W: window size. L: walk length.
p and q: used to control the walk sampling process in node2vec.
Firstly, we tried applying parameters in the original algorithms and obtained some results. After that, we changed parameters such as window size, walk length to lower values in order to improve running time. Because our sampling strategy is on edge, we don’t need window size and walk length to be too high to get the full structure information of the network. We applied a k-fold cross validation and elbow method to identify the optimal parameters for our model.
We measured the speed of a learning process including the sum of the sampling time and the feature training time (SkipGram). The results are shown in detail in Fig. 4.
Run time of DeepWalk, node2vec, and our EdgeSliding on three datasets.
In PPI and CiteSeer dataset, our approach has better run time. In BlogCatalog dataset, although our approach is slower than DeepWalk but still better than node2vec. Because the network in BlogCatalog is dense, our algorithm took too many samples. As a result, this needed a long time to learn features.
In general, it can be said that in the sparse networks, our approach has a substantially lower sum of the sampling time and the feature learning time than the other approaches.
Results of multi-label classification for DeepWalk, node2vec, and our EdgeSliding on three sets of data according to Macro-F1 measurement with the ratio of labeled nodes
Results of multi-label classification for DeepWalk, node2vec, and our EdgeSliding on three sets of data according to Micro-F1 measurement with the ratio of labeled nodes
Macro-F1 score chart of DeepWalk, node2vec, and our EdgeSliding.
Micro-F1 score chart of DeepWalk, node2vec, and our EdgeSliding.
Table 4 and Fig. 5 shows the results of Macro-F1 measurement for multi-label classification. Overall, our approach has a higher Macro-F1 score than DeepWalk and node2vec on all three datasets with different training ratios. According to results from Table 5, the proposed method also has a higher Micro-F1 score on all datasets (Fig. 6).
In conclusion, through experiments, we proved our approach has some encouraging results as followed:
Sampling strategy achieves good coverage on the dataset. Therefore, the feature vectors contain more information from the original data, helping improve the quality of latter machine learning. Time to learn features is faster in the networks which have a low ratio between the number of edges and the number of vertices (i.e. the sparse graph). Specifically, we experimented on three datasets whose ratios decrease in the order: Blogcatalog
In this paper, we proposed an approach to reduce the feature learning time and improve the quality of feature vectors, compared to DeepWalk and node2vec algorithms, especially on sparse networks. The overall idea is that instead of taking samples based on nodes, we take samples based on the edges of the network. This helps get almost all edges in the network. It means that more information can be retained. Although all edges are chosen, we do not lose the important characteristics of the random walk method. We also proved that our approach is scalable in large sparse networks. In addition to the quality of feature vectors, we applied other optimization functions to take advantage of parallel computing. The result is the runtime has improved. With many different datasets, experiments illustrated the effectiveness of the proposed approach for the multi-label classification.
In addition to scalability, our model is designed into modules independent of each other. Therefore, for future work, we could apply parallel computing to increase even more the speed of the learning process. Additionally, our approach is built on a linguistic model so any future improvement in natural language processing will also enable us to improve our algorithm.
Footnotes
Acknowledgments
This paper is supported by the Research and Development Support Program of Faculty of Information Technology, University of Science, VNU-HCMC.
