Node classifications with DjCaNE: Disjoint content and network embedding

Abstract

Machine learning approaches have become a crucial tool in graph analysis. Despite the accurate results of the existing approaches, most of them are not scalable enough to be used in real-world problems. Networks provide two different kinds of information: node contents and node relations (network structure). Training deep graph neural networks (GNN) over large-scale graphs is challenging due to the limitations of the message passing framework. Graph Convolutional Networks (GCN) work on all node neighbours at once. Furthermore, it is usual to transform node features with a deep neural network before the graph convolution (GC) operation. Therefore, the deep transform operation may be applied up to hundreds of times for each target node, which is computationally heavy and hard to batch. This paper presents an abstract framework with two embedding components: the first component embeds node relations, and the second one embeds node contents. The model makes predictions by aggregating these embeddings through a combination component. The presented approach limits the deep transform to the target node only and uses random walk-based embedding instead of the GC operator to reduce the cost. The main goal of the proposed approach is to provide a light framework for the task. To this aim, node relations are embedded based on node neighbourhood structure by a biased variant of the DeepWalk model, called GuidedWalk, and an autoencoder embeds node contents. The experimental results on three well-known datasets show the superiority of the proposed model compared to the state-of-the-art GraphSAGE and TADW models with less computational complexity. On the Citeseer, Cora, and PubMed datasets, the model has achieved 3.23%, 0.88%, and 7.63% improvement in Macro-F1 and 3.25%, 0.7%, and 6.34% improvement in Micro-F1, respectively. Although GNNs are state-of-the-art models, considering node content is their main advantage. This paper shows that even a simple integration of node content into available random walk-based methods raises their performance to the level of GCNs without increasing the complexity.

Keywords

Citation networks; deep-learning; network embedding; text embedding

1. Introduction

Graph representation learning is the process of finding latent vector representations of a part of a graph, including nodes, edges, and subgraphs [1]. Many previous applications encode nodes as low-dimensional vectors that summarise their graph position, the structure of their local graph neighbourhood [2], and their content. Graph representation learning is applied to many real-world problems such as molecular fingerprints [3], recommender systems in online markets [4], citation network classification, machine translation [5], online review analysis, community question answering, and social networks.

Graph representation learning methods for node embedding can be divided into three categories based on the information they process. The first category does not consider node relations and performs analysis only based on node contents. The second one only uses entity relations and graph theory. The third category combines content and relation data to reach state-of-the-art performance. Top-performing methods are based on convolutional neural networks operating on the graph Laplacian matrix (the Laplacian matrix size is $| V |^{2}$ where $| V |$ is the number of nodes in the graph). Some methods use other matrices of the same size. The problem with all these methods is that they are not scalable because they deal with big sparse Laplacian matrices. This paper presents a new framework that merges the first two embedding methods to achieve better performance and remain scalable. Therefore, the proposed modular model not only improves the previous random walk-based model by a large margin but also performs as well as complex deep learning models like GraphSAGE [6] while using a much simpler architecture with similar or lower complexity, which brings scalability and generalisation.

The importance, size, and growth of publication networks have motivated researchers in recent years to focus on this kind of dataset [7,8]. Document classification [9] is one of the basic tasks in text mining and information retrieval, which has also been used to address the problem. Many of the previous works focused on either textual data [10–12] (the first category) or graph-based models [13–15] (the second category) to solve the classification problems. Applying joint models (the third category) has received researchers’ attention in recent years [16,17], where publications’ contents and their citations are used jointly to improve the classification accuracy. While most of the joint models depend on GC [18], the proposed framework separates node content data from graph structure to make it simpler. The proposed framework is evaluated over three well-known citation networks, namely Citeseer, Cora, and PubMed, in the task of publication classification. Experimental results show that aggregation of components can result in a competitive performance while the model has fewer parameters to train and can be generalised better. This resulted in a new framework, Disjoint Content and Network Embedding (DjCaNE), which has three components: a Network Embedding (NE) component, a Content Embedding (CE) component, and a combination component to do the final prediction.

The contributions of DjCaNE are as follows:

A new lightweight, scalable graph embedding framework is proposed that can be trained component by component with final superior performance. While previous models are based on manual feature engineering or end-to-end deep learning architecture, the proposed model consists of aspect-oriented modules that can be trained component by component. Therefore, the proposed framework achieves superior performance without facing either manual feature engineering or the complex training process of deep models.

The framework can extend random walk-based methods at a very low cost to achieve results comparable to those of scalable deep learning models.

An unsupervised content embedding is presented that improves the classification performance even without any advanced knowledge about data. Furthermore, the proposed framework is general and can be used with any content data, such as images, videos, and even a media sequence (e.g. Instagram pages).

The rest of the paper is organised as follows: Section 2 reviews related works on the node classification problem, with a focus on publication classification in citation networks. The proposed DjCaNE is presented in Section 3 and evaluated in Section 4. Section 5 summarises the DjCaNE model and outlines future work.

2. Related works

This section reviews the main studies in network embedding. Traditional network embedding algorithms required manual feature engineering by system experts. The representation learning methods, however, automatically extract features. Representation learning methods can be categorised in different ways. This section focuses on machine learning-based methods and divides them into two categories based on whether they use a random walk strategy. Goyal and Ferrara [19] and Cai et al. [20] presented detailed surveys on the network embedding problems and solutions.

2.1. Random walk-based methods

DeepWalk [21] is the pioneer in machine learning-based network embedding. Inspired by word embedding in natural language processing, DeepWalk sets up a limited number of random surfers from each node and considers the trace of each random surfer as a sentence. Then, it uses a language model, called Skip-gram [22,23], to embed nodes. In random walk-based methods, random surfing starts at a node chosen with uniform probability. Then, the random surfer decides to move to a neighbour node or teleport to other nodes based on a transition strategy, $S$, which is a distribution function (more often, the teleportation probability is equal to zero). Consider a path traced by this random surfer, denoted by $υ_{0}, υ_{1}, υ_{2}, \dots, υ_{t}$, where the subscript gives the position in the sequence. Skip-gram is an approximation method for the optimization of equation (1)

$$\min_{y} - \sum_{i = 1}^{| V |} \log \sum_{υ_{j} \in N_{i}^{S}} P (υ_{j} | y_{i}) \tag{1}$$

where $P (.)$ is the soft-max function ( $P (υ_{j} | y_{i}) = \frac{\exp (y_{j}^{T} y_{i})}{\sum_{k = 1}^{| V |} \exp (y_{k}^{T} y_{i})}$ ), $y_{i}$ is the vector representation of $υ_{i}$, and $υ_{j} \in N_{i}^{S}$ if there is a trace in which the distance between $υ_{i}$ and $υ_{j}$ is less than $w$ ( $w$ is a hyperparameter called window size). The superscript $S$ emphasises the dependence on the transition strategy.
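As an illustration of equation (1), the following minimal sketch builds the context sets $N_{i}^{S}$ from uniform random walks and evaluates the objective exactly. It is only a sketch under stated assumptions: the function names, the toy graph, and the random embeddings are hypothetical, and Skip-gram in practice approximates this objective rather than evaluating the full soft-max.

```python
import math
import random


def uniform_walk(adj, start, length, rng):
    """Trace one random surfer that moves to a uniformly chosen neighbour."""
    walk = [start]
    while len(walk) < length and adj[walk[-1]]:
        walk.append(rng.choice(adj[walk[-1]]))
    return walk


def context_sets(walks, window):
    """Build N_i^S: nodes at distance < window from node i in some trace."""
    ctx = {}
    for walk in walks:
        for pos, node in enumerate(walk):
            lo = max(0, pos - window + 1)
            hi = min(len(walk), pos + window)
            for other in walk[lo:pos] + walk[pos + 1:hi]:
                if other != node:
                    ctx.setdefault(node, set()).add(other)
    return ctx


def softmax_prob(emb, j, i):
    """P(v_j | y_i) = exp(y_j . y_i) / sum_k exp(y_k . y_i)."""
    scores = [sum(a * b for a, b in zip(y_k, emb[i])) for y_k in emb]
    top = max(scores)  # subtracting the maximum keeps exp() stable
    exps = [math.exp(s - top) for s in scores]
    return exps[j] / sum(exps)


def objective(emb, ctx):
    """Evaluate equation (1) exactly (no Skip-gram approximation)."""
    total = 0.0
    for i, context in ctx.items():
        total += math.log(sum(softmax_prob(emb, j, i) for j in context))
    return -total


# Toy undirected graph with nodes 0..3, given as adjacency lists (assumed data).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(0)
walks = [uniform_walk(adj, v, 5, rng) for v in adj for _ in range(3)]
ctx = context_sets(walks, window=2)
emb = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in adj]
print(objective(emb, ctx))
```

With a window of 2, each context set contains only direct neighbours that a surfer actually traversed; larger windows add nodes that are further apart along the traces.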

Tang et al. [24] proposed the Large-scale Information Network Embedding (LINE), which embeds adjacent nodes and second-order proximity of each node separately. The final embedding in LINE is the concatenation of both embeddings. While DeepWalk considers uniform transition probability distribution between two adjacent nodes, Grover and Leskovec [13] defined Node2Vec with a different sampling strategy which is based on breadth-first search and depth-first search algorithms. Many researchers followed DeepWalk and defined other strategies based on this technique. Ribeiro et al. [25] proposed Struct2Vec, a model that uses random walks to embed nodes such that structurally similar nodes are placed close to each other in the embedding space. Dong et al. [26] used meta paths to ease heterogeneous network embedding. Random walk-based methods show a good performance in network embedding, and they can scale up to large graphs. These algorithms’ complexity is equal to the sum of random surfers complexity and Skip-gram complexity which is bounded by $O (V \log (V))$ [27] where $V$ is the number of nodes.

2.2. Non-random walk-based methods

The first Non-Negative Matrix Factorization (NMF)-based methods were presented years before random walk-based approaches. Although they achieved good performance, they are not as scalable as random walk-based methods. Therefore, recent advances in this category integrate concepts from random walk-based methods. Equation (2) shows the basic concept of NMF-based graph embedding

$$\min \left\| W - Y {Y^{c}}^{T} \right\| \tag{2}$$

where $W$ is a node-node similarity matrix, and $Y, Y^{c} \in R^{| V | \times d}$ are the node embedding and context embedding matrices. The difference in methods is in the definition of $W$ . For example, Cao et al. [28] embedded networks by factorization of higher-order proximity matrices. TADW [16] used the relationship between DeepWalk and NMF methods and presented an NMF-based method that works like DeepWalk while using nodes content at the same time. Allab et al. [29] presented Semi-NMF-PCA to take advantage of considering the clustering and dimensionality reduction simultaneously. To this end, they defined an objective function based on three items: Principal Component Analysis (PCA), K-means, and node embedding. Qiu et al. [30] defined a unified NMF framework that simulates DeepWalk, LINE, Node2Vec, and PTE [31]. The problem with the NMF-based method is working with large node-node similarity matrices. Qiu et al. [30] showed that each random walk-based method is equal to factorising matrix $W$ implicitly.

A graph is connected with matrices which are $O (V^{2})$ , such as node-node similarity, adjacency, and Laplacian matrices. In successful applications of deep learning such as image processing, instances are not linked explicitly, and the size and the format of instances are limited, which enables advanced techniques, including transfer learning and pre-training. In contrast, links are very important in graphs, and nodes must be embedded by considering their relations. Therefore, many state-of-the-art graph embedding methods cannot be applied to different graphs as they have different Laplacian matrix sizes, which affect the architecture’s hyper-parameters. They encode different relations, which makes the trained neural network invalid.

Kipf and Welling [32] designed convolutional neural networks to operate directly on graphs. Their work provided the basis for further research studies. GraphSAGE [6] embeds nodes in a multi-stage deep structure in which, at each stage, node embedding is concatenated with an aggregated embedding of node neighbourhood and is passed forward to the next stage. GraphSAGE is one of the few works that use inductive learning. Veličković et al. [33] used a multi attention mechanism for neighbourhood aggregation. Xu et al. [34] generalised the aggregation process in convolutional neural networks and showed that the sum operator is the most powerful aggregator.

PyG [35] and DGL [36] are well-known libraries that implement graph neural networks based on the message passing framework. Each stage of the message passing framework is shown in equation (3)

$$h_{u}^{(k + 1)} = UPDATE^{(k)} \left( h_{u}^{(k)}, AGGREGATE^{(k)} \left( \{ h_{υ}^{(k)}, \forall υ \in N (u) \} \right) \right) \tag{3}$$

where $u, υ \in V$ are nodes, $h_{u}^{(k)}$ is the representation of node $u$ at layer $k$, $N (u)$ is $u$’s graph neighbourhood, and $AGGREGATE$ and $UPDATE$ are arbitrary differentiable functions. Algorithm 1 shows the overview of the message passing framework.

Algorithm 1. Message Passing Algorithm.

Require: graph $G (V, E)$; feature matrix $X$, where $X_{i}$ is the feature of node $υ_{i}$; number of layers $K$; neural functions $AGGREGATE^{(k)}$ and $UPDATE^{(k)}, \forall k \in \{0, \dots, K - 1\}$; neighbourhood function $N : V \to 2^{V}$ such that $N (i)$ is $i$’s graph neighbourhood
Ensure: matrix of vertex representations $Y \in R^{| V | \times d}$
1: Initialization: $h_{i}^{(0)} = X_{i}, \forall υ_{i} \in V$
2: for $k \in \{0, \dots, K - 1\}$ do
3:     for $υ_{i} \in V$ do
4:         $m_{i}^{(k)} = AGGREGATE^{(k)} (\{h_{j}^{(k)}, \forall υ_{j} \in N (υ_{i})\})$
5:         $h_{i}^{(k + 1)} = UPDATE^{(k)} (h_{i}^{(k)}, m_{i}^{(k)})$
6:     end for
7: end for
8: $Y_{i} = h_{i}^{(K)}, \forall υ_{i} \in V$
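The loop structure of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than a GNN implementation: the mean aggregator and the averaging update are hypothetical stand-ins without learnable parameters, and the same pair of functions is reused at every layer, whereas Algorithm 1 allows a different pair per layer.

```python
def message_passing(features, neighbours, num_layers, aggregate, update):
    """Run K rounds of the scheme in Algorithm 1 and return h^(K) per node."""
    h = {v: list(x) for v, x in features.items()}  # h^(0) = X
    for _ in range(num_layers):
        # All messages are computed from layer k before any node is updated.
        messages = {v: aggregate([h[u] for u in neighbours[v]]) for v in h}
        h = {v: update(h[v], messages[v]) for v in h}
    return h


def mean_aggregate(vectors):
    """Element-wise mean of neighbour vectors (illustrative AGGREGATE)."""
    if not vectors:
        return None  # isolated node: there is no message
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def average_update(h_self, message):
    """Average a node's own vector with its message (illustrative UPDATE)."""
    if message is None:
        return list(h_self)
    return [0.5 * a + 0.5 * b for a, b in zip(h_self, message)]


# Toy path graph 0 - 1 - 2 with two features per node (assumed data).
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbours = {0: [1], 1: [0, 2], 2: [1]}
Y = message_passing(features, neighbours, 1, mean_aggregate, average_update)
print(Y)
```

Each additional layer lets information travel one hop further, which is why the cost of the scheme grows with the neighbourhood size at every layer.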

The main drawback of all the mentioned methods is their complexity or incompleteness. On the one hand, complex methods cannot be run on conventional processing machines. On the other hand, lightweight, scalable models, such as DeepWalk, do not consider rich node content information for embedding. Therefore, a scalable machine learning framework is still an open research question [37]. Random walk-based methods are scalable and fast, but they cannot use node content in the embedding process. A limited number of research studies, such as NavWalker [38] and SaC2Vec [39], offer tricks to merge node content into random walk-based embedding. Nevertheless, these tricks make the model as complex as deep learning models without improving performance significantly.

This study separates network structure from node content embedding, resulting in a simpler architecture than joint frameworks. Any of the algorithms mentioned above can be used in the presented framework to embed the network. Since random walk-based models cannot use node data and deep learning methods are computationally expensive, the proposed DjCaNE framework uses two lightweight components in parallel to embed graphs with high quality and low complexity. Experimental results show that the proposed framework extends random walk-based methods efficiently to achieve accuracy close to that of deep methods.

3. DjCaNE

In designing machine learning models, a general rule of thumb is that joint optimization and end-to-end models achieve better performance. Joint models, however, are computationally expensive, especially on large-scale graphs. Furthermore, it is hard to train many parameters; that is, the training process needs more training examples to overcome the over-fitting problem. In graphs, both the node’s content and its neighbourhood contain information, and a model can take advantage of one or both of them to predict the label. The prediction should be more accurate if the model uses both sources of information. Joint matrix factorization [16] and GCN [32] are two possible solutions for integrating them. However, their performance improvement is always measured against methods of the same category or methods that cannot process one of the information sources. For example, consider the DeepWalk method; it cannot process node content and is outperformed by GNNs, which is not a fair comparison. This paper provides a way to add node information to any method like DeepWalk and measures how it improves the results. Interestingly, this fills the performance gap.

DjCaNE is based on components that can be trained separately. It comprises two embedding components called NE and CE. NE embeds the relation information of nodes, for example, the citation information of a publication, and CE embeds node content, for example, the textual information of a publication. A third component aggregates results and makes final predictions based on the output of NE and CE. DjCaNE is modular in architecture and training. It is trained component by component to avoid complexity in the model. As a result, DjCaNE is very light and does not require any strategy to handle over-fitting. The component-based architecture is more interpretable and makes the model flexible for other node content types such as media. Figure 1 shows the structure of DjCaNE. The NE module contains a random walk-based network embedding method (e.g. DeepWalk or Node2Vec). These models run many random surfers on the graph and embed the traces with the Skip-gram language model. Therefore, the NE component embeds proximity-based similarities in a $R^{| V | \times d}$ matrix. In the CE module, node features are embedded into another $R^{| V | \times d}$ matrix using dimensionality reduction techniques (e.g. PCA or Auto-encoders). In the combination module, these matrices are concatenated in a $R^{| V | \times 2 d}$ matrix and a classifier (e.g. MLP or SVM) is trained on this matrix.

Figure 1.

The architecture of the DjCaNE model.

The following subsections describe the detailed structure of each module. It has to be mentioned that each module can be replaced with any other network/content embedding algorithms. This work uses simple neural models and keeps the model as simple as possible to show the capability of the proposed framework in node classification tasks without complex processing. The proposed implementation can be used as an extension of all random walk-based network embedding algorithms with shallow neural networks that improve them without increasing the complexity.

3.1. NE component

The architecture does not restrict the choice of modules, and NE can be any embedding algorithm that can learn a low-dimensional representation of nodes in a network. Random walk-based methods, such as DeepWalk and Node2Vec, which learn network topology, fit the NE module as they do not consider node attributes, which are handled by a separate module in DjCaNE. Since evaluation is done on classification tasks, this work uses an improved version of DeepWalk, named GuidedWalk. This random walk-based algorithm only uses link information and is optimised for node classification tasks. Random walk-based algorithms use Skip-gram to embed nodes using random surfers’ traces. Moreover, Skip-gram is designed to assign similar vectors to nodes that have been seen near each other [22,23,30,40]. Consequently, the random surfer propagates latent features through the surfing process. Therefore, the base idea of GuidedWalk is defined as follows:

Lemma 1 (Random surfer feature propagation). Random surfers propagate latent features; that is, nodes that random surfers visit together more often within the window $w$ have more similar features.

Proof. In equation (1), $N_{i}^{S}$ is the set of context nodes of $υ_{i}$ and is determined by the random surfer. If the random surfer visits nodes $υ_{i}$ and $υ_{j}$ together more often, $P (υ_{j} | y_{i})$ increases. Therefore, according to $P (υ_{j} | y_{i}) = \frac{\exp (y_{j}^{T} y_{i})}{\sum_{k = 1}^{| V |} \exp (y_{k}^{T} y_{i})}$, the relative similarity of $y_{j}$ and $y_{i}$ must increase, as $\exp$ is monotonically increasing.

Given a partial node labelling function $l : V \to L$, where $L$ is the set of possible labels together with $unk$ for unknown labels, the goal of GuidedWalk is to embed nodes such that the classification accuracy for unlabelled nodes $U = \{ u \mid u \in V \text{ and } l (u) = unk \}$ is maximised. GuidedWalk is biased towards paths that contain nodes of the same class; that is, GuidedWalk avoids paths that contain nodes from different classes consecutively, which has the following advantage: for every node $υ_{i}$, $N_{i}^{S}$ has more chance to contain nodes of the same class. This bias may not be valid for nodes close to more than one class. Therefore, nodes that appear in a random surfer trace generated by GuidedWalk have more chance of having the same label, and according to Lemma 1, these nodes have similar latent features, which eases the classification. To reach this goal, GuidedWalk can use any strategy that prefers to move through edges that connect nodes of the same class. However, as a portion of node labels is unknown, many edges of a graph are connected to nodes with unknown labels, and it is hard to find a strategy $S$ that makes $N_{i}^{S}$ satisfy the conditions mentioned above. Random surfers therefore use an online label approximation, which is inspired by the label propagation algorithm [41] and crowd knowledge distribution [42]. According to the homophily of social networks, nodes of the same type are naturally close together, so each random surfer can consider the label of the current node equal to the last seen label. Each random surfer thus chooses the next node, $υ_{j}$, independently based on the last visited known label, $l_{k} \neq unk$, and the number of consecutive $unk$ labels visited, $c_{u}$, just before the current position, $υ_{i}$. Although a single random surfer’s prediction may not be accurate, with many random surfers a wisdom-of-the-crowd effect emerges, and the label most often propagated to $υ_{j}$ is a good prediction of its label.

The main differences between this random walk strategy and basic label propagation or K-nearest neighbour are:

Random walk strategy does not use label counts directly. Instead, it just uses predicted labels to generate paths and embed the nodes,

Random walk strategy does not only use adjacent nodes. It uses $c_{u}$ as an aging factor to integrate labels information from more distant nodes,

Random walk strategy uses many random walks, which removes the effect of the initial state and processing order.

Considering the above assumptions, equation (4) sums up transition probabilities of random walk strategy in GuidedWalk

$$pt_{GW} (υ_{i}, υ_{j}, c_{u}, l_{k}) = \begin{cases} \frac{e}{z e^{\left( \frac{c_{u} + c_{0}}{s_{same}} \right)}} + α / z & \text{if } l_{k} = l (υ_{j}) \\ \frac{e}{z e^{\left( \frac{c_{u} + c_{0}}{s_{white}} \right)}} + α / z & \text{if } l_{k} = unk \text{ or } l (υ_{j}) = unk \\ \frac{e}{z e^{\left( \frac{c_{u} + c_{0}}{s_{oppo}} \right)}} + α / z & \text{if } l_{k} \neq l (υ_{j}) \end{cases} \tag{4}$$

where $υ_{i}$ represents the current node, $υ_{j}$ is the neighbour node for which GuidedWalk calculates the probability of transition, $l_{k} = l (υ_{k})$ is the last known label visited before the current node ( $υ_{i}$ ), which is $unk$ at the initialization of the random surfer where no previous node exists, $c_{u}$ is the number of steps over $unk$ labels just before reaching node $υ_{i}$ (implemented as $c_{u} = \min (c_{u}, 15)$ to limit complexity and memory consumption), $e$ is Euler’s number, $c_{0}$ and $α$ are scaling parameters, $s_{same}$, $s_{white}$ and $s_{oppo}$ are the scaling factors (note that the condition $s_{same} \geq s_{white} \geq s_{oppo}$ must hold to ensure that the probability of transition between nodes of the same class is larger than the probability of transition between nodes of different classes), and $z$ is a normalisation factor that ensures the sum of transition probabilities is 1 at any step.
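The transition rule of equation (4) can be sketched as follows. This is a minimal illustration under several assumptions: the default parameter values are placeholders chosen only to satisfy $s_{same} \geq s_{white} \geq s_{oppo}$, the $unk$ case is checked first so that two unknown labels are not treated as the same class, and `update_state` encodes one possible reading of how $l_{k}$ and $c_{u}$ evolve along a walk. None of these names or values come from the GuidedWalk implementation.

```python
import math

UNK = None  # stands for the `unk` label


def scale_for(last_label, neighbour_label, s_same, s_white, s_oppo):
    """Pick the scaling factor of equation (4) for one candidate neighbour."""
    if last_label is UNK or neighbour_label is UNK:
        return s_white
    return s_same if last_label == neighbour_label else s_oppo


def transition_probs(neighbours, labels, last_label, c_u,
                     s_same=4.0, s_white=2.0, s_oppo=1.0, c0=1.0, alpha=0.01):
    """Normalised transition probabilities over the current node's neighbours.

    The unnormalised weight is e / e^((c_u + c0) / s) + alpha, and z is the
    sum of the weights, so the returned values add up to 1.
    """
    c_u = min(c_u, 15)  # cap stated in the text
    weights = []
    for v in neighbours:
        s = scale_for(last_label, labels.get(v, UNK), s_same, s_white, s_oppo)
        weights.append(math.exp(1.0 - (c_u + c0) / s) + alpha)
    z = sum(weights)
    return [w / z for w in weights]


def update_state(last_label, c_u, node_label):
    """Surfer state after arriving at a node (assumed bookkeeping)."""
    if node_label is UNK:
        return last_label, min(c_u + 1, 15)  # one more unk step
    return node_label, 0  # a known label resets the counter


# Hypothetical neighbourhood: one same-class, one other-class, one unlabelled.
labels = {"a": "ML", "b": "DB", "c": UNK}
probs = transition_probs(["a", "b", "c"], labels, last_label="ML", c_u=0)
print(probs)
```

Under this reading, a larger $c_{u}$ shrinks the exponential term for every neighbour, so the weights move towards $α$ and the walk becomes closer to uniform, which matches the role of $c_{u}$ as an aging factor.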

It is important to mention that although the calculations in equation (4) are very time-consuming, they are repetitive and can be preprocessed. GuidedWalk uses memoization techniques instead, as many of the cases will not happen. The GuidedWalk memory complexity is $O (15 | C | | E |)$ where $| C |$ is the number of classes and $| E |$ is the number of edges. The sampling complexity is $O (| V | γ t)$ and the embedding step complexity is $O (| V | γ t d_{NE} Qw)$ where $t$ is the number of random walks per node, $γ$ is the length of random walk, $d_{NE}$ is the embedding dimension, $Q$ ¹ is the negative sampling factor ( $Q = 5$ ), and $w$ is the window size ( $w = 10$ ). The complexity details are presented in Section 4.5.

3.2. CE component

CE is the content embedding component in the DjCaNE architecture. Generally, node content has high dimensions. CE aims to learn a low-dimensional representation to be processed in an aggregator module. Auto-encoders (AE) are automated nonlinear dimensionality reduction tools that DjCaNE utilises for the CE component. One of the advantages of AEs is that they can encode any metadata available in addition to media content.

The proposed GuidedWalk explores the graph using paths consistent with the partially observed labelling function. Many successful studies use node content to improve the results as well, and topology-based node embeddings lag behind the models that also benefit from node contents. In the CE module, DjCaNE takes advantage of content, which is textual data in the present paper. Since the main goal is to keep DjCaNE lightweight and practical, a two-layer AE is used.

Consider the content matrix $X \in R^{N \times F}$, where $R$ is the set of real numbers, $N = | V |$ is the number of nodes, and $F$ is the number of features. The content embedding for node $υ_{i}$ with feature vector $X_{i}$ is $Relu (X_{i} B)$, where $B$ is an $F \times d_{CE}$ matrix that must be learned in a way that minimises the loss function defined by equation (5)

$$loss_{CE} = \sum_{i = 1}^{N} \left( σ (B^{T} \cdot Relu (X_{i} B)) - X_{i} \right)^{2} \tag{5}$$

where $σ$ is the sigmoid function, $Relu (x) = \max (0, x)$, and $B^{T}$ is the transpose of $B$. When there are many different kinds of features, from scalar features to video information, preprocessing becomes more important. In this case, pre-trained neural networks, such as LSTMs, convert video, audio, and text to vector representations, and the core CE component then reduces the dimension. The training cost of equation (5) is $O (| V | d_{CE} F)$.
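The loss of equation (5) can be evaluated with a short sketch. The code below is illustrative only: it reads the tied-weight decoder as mapping the $d_{CE}$-dimensional code back to $F$ dimensions through $B^{T}$ (an interpretation of the notation), represents matrices as plain lists of rows, and shows the loss evaluation without the training loop. All function names are hypothetical.

```python
import math


def relu(vec):
    return [max(0.0, x) for x in vec]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def encode(x, B):
    """Content embedding Relu(X_i B); x has F entries, B has F rows of d_CE."""
    d_ce = len(B[0])
    return relu([sum(x[f] * B[f][k] for f in range(len(x))) for k in range(d_ce)])


def decode(code, B):
    """Tied-weight reconstruction: sigmoid of the code mapped back through B^T."""
    return [sigmoid(sum(B[f][k] * code[k] for k in range(len(code))))
            for f in range(len(B))]


def ce_loss(X, B):
    """Sum of squared reconstruction errors over all nodes, as in equation (5)."""
    total = 0.0
    for x in X:
        recon = decode(encode(x, B), B)
        total += sum((r - t) ** 2 for r, t in zip(recon, x))
    return total


# Two nodes with three bag-of-words features each (assumed toy data).
X = [[1, 0, 1], [0, 1, 0]]
B_zero = [[0.0, 0.0] for _ in range(3)]  # F = 3, d_CE = 2
print(ce_loss(X, B_zero))
```

With an all-zero $B$, every reconstructed entry is $σ(0) = 0.5$, so each of the six entries contributes $0.25$ to the loss; training would adjust $B$ to reduce this value.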

3.3. Combination and classification

The Combination module concatenates the outputs of the previous modules and forwards them to its internal classifier. Given a $d_{NE}$-dimensional vector from NE and a $d_{CE}$-dimensional vector from CE, the resultant vector with $d_{NE} + d_{CE}$ dimensions is fed to a classifier for final prediction.
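The concatenation step is simple enough to state directly. The sketch below assumes that both components return a mapping from node identifier to vector; the function name and the toy vectors are hypothetical.

```python
def combine(ne_emb, ce_emb):
    """Concatenate per-node NE and CE vectors into d_NE + d_CE features."""
    if ne_emb.keys() != ce_emb.keys():
        raise ValueError("NE and CE must embed the same set of nodes")
    return {v: list(ne_emb[v]) + list(ce_emb[v]) for v in ne_emb}


# Toy embeddings with d_NE = 2 and d_CE = 3 (assumed data).
ne_emb = {"p1": [0.1, 0.2], "p2": [0.3, 0.4]}
ce_emb = {"p1": [1.0, 0.0, 0.5], "p2": [0.0, 1.0, 0.5]}
features = combine(ne_emb, ce_emb)
print(features["p1"])
```

Because the two blocks of each vector come from independently trained components, either block can be replaced by another embedding method without retraining the other.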

For classification, two different classifiers are used: (1) Support Vector Machine (SVM) as a conventional supervised learning algorithm, and (2) Multi-Layer Perceptron (MLP) as a deep neural network model. The two alternatives show the flexibility of the DjCaNE framework in using different algorithms in each module. The complexity of SVM is $O (V d^{2})$ where $d$ is the embedding dimension ( $d = d_{NE} + d_{CE}$ ). The complexity of MLP is also $O (V d^{2})$ considering the hyper-parameters used in the DjCaNE implementation.

4. Experimental results

This section evaluates the proposed DjCaNE framework against state-of-the-art algorithms over three datasets.

4.1. Datasets

To evaluate the performance of the DjCaNE model, three well-known citation datasets that were previously used by TADW [16], namely Citeseer, Cora, and PubMed,² are used. The two former datasets were presented by Sen et al. [43], and the latter was presented by Namata et al. [44]; all are well studied in the literature. The goal of the experiments is to determine the category of each publication, having a portion of them labelled as the train set [45]. All datasets are citation networks where publications are represented by nodes and citations are represented by links. The bag-of-words model describes the content of each publication (node) in the datasets. The sizes of the dictionary and the network vary from one dataset to another. Table 1 shows general attributes of these datasets. The Citeseer dataset, which includes publications in computer and information science, has six classes: Human Computer Interaction, Machine Learning, Multi-Agent Systems, Artificial Intelligence, Information Retrieval, and DataBase. The Cora dataset, which includes publications in machine learning, consists of seven classes: Neural Networks, Rule Learning, Reinforcement Learning, Probabilistic Methods, Theory, Case-Based, and Genetic Algorithms. The PubMed dataset, which includes publications about diabetes, has three classes: Diabetes Mellitus Experimental, Diabetes Mellitus Type 1, and Diabetes Mellitus Type 2. Table 1 also reports the Sparsity metric, which is the number of observed links against the number of missing links; that is, the ratio of nonzero elements to zeros in the adjacency matrix of the dataset.

Table 1.

Statistics of the Cora, Citeseer, and PubMed datasets. The number of nodes in the datasets represents the number of documents, and the number of edges is the number of citations.

Dataset	#Nodes	#Edges	#Features	Ave Deg	Sparsity*	#Classes
Citeseer	3327	4732	3703	2.8	1:1198	6
Cora	2708	5429	1433	3.9	1:694	7
PubMed	19717	44338	500	4.5	1:4384	3

4.2. Evaluation metrics and baselines

Following the previous studies in the field, F1 measures (micro and macro) are reported for each experiment. DjCaNE performance is evaluated against the following state-of-the-art approaches:

DeepWalk [21]: DeepWalk uses short uniform random walks to learn representations for vertices in graphs.

TADW [16]: TADW uses a matrix factorization equivalent of DeepWalk and associates it with text features. TADW is one of the examples that use joint optimization to embed nodes.

GraphSAGE [6]: GraphSAGE is a decent algorithm for network classification and link prediction. GraphSAGE is one of few inductive network embedding algorithms. It incorporates node neighbourhoods to improve results. A non-inductive version of GraphSAGE is used, in which the adjacency matrix is fed to the model to have a fair comparison. The non-inductive version of GraphSAGE is a very heavy algorithm but increases the accuracy.

SaC2Vec [39]: SaC2Vec presents a node embedding technique by considering network structure and node content. SaC2Vec builds a multi-edge graph and embeds this graph. The problem with this model is the complexity of building the second layer of the graph, which is $O (V^{2} F)$ . SaC2Vec implementation is not released for public usage, and this work compares it to the DjCaNE method on Citeseer dataset based on the available results in their paper.
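The micro- and macro-averaged F1 measures used throughout this section can be sketched as follows. The sketch assumes the common single-label, multi-class definitions, in which macro-F1 is the unweighted mean of per-class F1 scores and micro-F1 pools the counts over all classes; the exact averaging convention of the evaluation code is not stated here, and the labels in the example are made up.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2 TP / (2 TP + FP + FN), defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(y_true, y_pred):
    """Return (micro-F1, macro-F1) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    macro = sum(f1_from_counts(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    micro = f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Made-up labels for six test nodes.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "A"]
micro, macro = f1_scores(y_true, y_pred)
print(micro, macro)
```

In this single-label setting micro-F1 coincides with accuracy, while macro-F1 penalises the model for the class it never predicts correctly, which is why the two measures can differ noticeably on imbalanced datasets.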

4.3. Experimental setups

In DjCaNE, each of the NE and CE components creates 128-dimensional embeddings. These vectors are concatenated to produce 256-dimensional vectors that are fed to a classifier for prediction. SVM and MLP are used as classifiers. This setting allows comparing a conventional machine learning model with a deep neural model. SVM uses radial basis function kernels with automatic coefficients and no hard limit on iterations, as defined by Python Sklearn. The MLP has three hidden layers with 256, 153, and 76 neurons, which are activated by $Relu (x) = \max (0, x)$. The MLP is optimised by the Adam optimizer with adaptive learning rate and regularisation parameter $α = 0.00001$ for 1000 iterations.

The baseline methods have been tested based on their reported parameters in the corresponding literature. DeepWalk and TADW methods are implemented by openNE,³ and GraphSAGE PyTorch implementation is available at the Github repository.⁴

In evaluation, the training set size varies from $10 %$ to $90 %$ of nodes and the rest of the nodes are test nodes; for example, in $10 %$ training in Table 2 for the Citeseer dataset, $332 = 3327 * 0.1$ nodes are train nodes and $2995 = 3327 - 332$ nodes are test nodes. Both macro-F1 and micro-F1 metrics are used to report the results on each dataset.
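The split sizes quoted above follow from rounding the training fraction down. A minimal sketch, assuming a uniformly random (non-stratified) split with a hypothetical helper name and seed, is:

```python
import random


def split_nodes(num_nodes, train_fraction, seed=0):
    """Randomly split node indices into train and test sets."""
    nodes = list(range(num_nodes))
    random.Random(seed).shuffle(nodes)
    n_train = int(num_nodes * train_fraction)  # rounds down, e.g. 332 for Citeseer
    return nodes[:n_train], nodes[n_train:]


# Citeseer has 3327 nodes; a 10% training split leaves the rest for testing.
train, test = split_nodes(3327, 0.1)
print(len(train), len(test))
```

Whether the original experiments stratify the split by class or repeat it over several seeds is not stated in this section, so both choices are left out of the sketch.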

Table 2.

Evaluation of DjCaNE on Citeseer dataset.

Metric	Algorithm	Train size (%)
		10	20	30	40	50	60	70	80	90
Macro-F1	DeepWalk [21]	50.28	50.85	52.49	53.89	51.69	53.35	52.80	51.92	53.06
	TADW [16]	65.64	64.57	67.97	67.67	67.66	68.45	67.32	66.30	63.13
	GraphSAGE [6]	64.50	66.44	66.76	68.09	66.96	69.17	68.59	68.07	72.11
	SaC2Vec [39]	62.14	64.19	65.89	67.20	68.18	–	–	–	–
	DjCaNE-SVM	66.24	69.56	71.73	72.89	73.34	74.52	74.86	75.17	75.34
	DjCaNE-MLP	60.09	62.34	65.12	65.44	66.80	67.80	68.15	69.82	70.80
Micro-F1	DeepWalk [21]	54.41	55.32	57.35	58.55	56.52	58.11	58.35	58.06	58.13
	TADW [16]	71.02	70.65	73.33	73.19	73.01	73.89	72.84	71.80	69.58
	GraphSAGE [6]	68.50	70.06	70.72	71.68	70.23	72.75	72.93	70.73	75.30
	SaC2Vec [39]	69.10	71.84	73.07	74.01	74.61	–	–	–	–
	DjCaNE-SVM	70.60	73.10	75.07	76.65	76.65	78.39	78.44	78.54	78.55
	DjCaNE-MLP	66.70	68.77	68.86	70.06	70.91	71.50	71.50	72.75	74.50

4.4. Results

4.4.1. DjCaNE performance

Evaluation results on Citeseer are presented in Table 2. DjCaNE outperforms all baseline models except for micro-F1 with a $10\%$ training set, where TADW achieves a better result. One reason is that the NE component of DjCaNE uses training data and needs more labelled data to perform well, whereas TADW is an unsupervised embedding model that needs labelled data only in the classification step. These results show the superiority of DjCaNE as a lightweight framework, which achieved the best result in 17 of the 18 settings in Table 2 (nine training sizes and two metrics).

Evaluation results on Cora are presented in Table 3. While there is no clear winner when the training set is small, DjCaNE is the clear winner with larger training sets. The limited performance of DjCaNE on small training sets mainly stems from the size of Cora: the network is small and has a limited textual feature vector, so the NE component and the combination layer are difficult to train when only a small portion of the network is used as training data. In Citeseer, the network is larger and the textual features are more informative, which can compensate for the small size of the network in most cases. By the same reasoning, superior performance on PubMed is expected.

Table 3.

Evaluation of DjCaNE on Cora dataset.

Metric	Algorithm	Train size (%)
		10	20	30	40	50	60	70	80	90
Macro-F1	DeepWalk [21]	75.81	79.40	81.17	82.34	82.44	83.44	84.77	84.58	81.93
	TADW [16]	80.18	82.87	82.53	82.82	83.82	85.13	86.35	85.83	85.91
	GraphSAGE [6]	78.28	82.50	82.06	84.88	84.07	83.25	84.89	81.28	86.56
	DjCaNE-SVM	74.52	80.86	83.04	84.40	85.61	86.38	88.86	86.41	87.44
	DjCaNE-MLP	75.52	80.50	80.70	81.89	84.38	82.36	85.32	85.46	87.29
Micro-F1	DeepWalk [21]	76.53	80.20	81.90	83.13	83.83	84.31	85.23	85.05	82.28
	TADW [16]	80.96	84.20	83.96	84.18	85.08	85.88	87.20	86.53	86.34
	GraphSAGE [6]	79.24	83.26	82.96	85.52	85.52	84.87	85.60	82.47	87.82
	DjCaNE-SVM	78.31	83.49	84.85	85.16	87.28	88.71	90.03	88.17	88.52
	DjCaNE-MLP	78.33	82.27	82.50	83.27	85.30	84.00	86.27	86.60	87.79

The experimental results on PubMed (Table 4) show that DjCaNE performs particularly well when the graph is large, outperforming the other methods by a large margin. It is worth mentioning that simple methods work better on this dataset; that is, DeepWalk outperforms TADW and GraphSAGE without considering textual data. The DjCaNE framework is as simple as DeepWalk and additionally considers nodes’ content, which helps it outperform all other methods. DjCaNE’s NE component and DeepWalk consider higher-order graph proximities, while TADW and GraphSAGE are limited to first- and second-order proximities. In Citeseer and Cora, the dimension of the textual data is close to the graph size, while PubMed consists of 19717 publications with only 500 features. Because of the TADW factorization formula, it is hard for the model to extract features from PubMed.
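The restriction to first- and second-order proximities can be illustrated with the matrix $(A + A^{2}) / 2$ that TADW approximates (see Section 4.5). The sketch below uses a hypothetical four-node path graph and the unnormalised adjacency matrix; row normalisation would rescale the entries but not change which of them are zero.

```python
import numpy as np


def second_order_proximity(adj) -> np.ndarray:
    """Return (A + A^2) / 2 for an adjacency matrix A."""
    adj = np.asarray(adj, dtype=float)
    return (adj + adj @ adj) / 2.0


# Hypothetical path graph 0 - 1 - 2 - 3.
path = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
)
m = second_order_proximity(path)

print(m[0, 1])  # direct neighbours: non-zero
print(m[0, 2])  # two hops apart: non-zero through A^2
print(m[0, 3])  # three hops apart: zero, invisible to this matrix
```

A random walk longer than two steps can place nodes 0 and 3 in the same context window, which is how walk-based components reach higher-order proximities.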

Table 4.

Evaluation of DjCaNE on PubMed dataset.

Metric	Algorithm	Train size (%)
		10	20	30	40	50	60	70	80	90
Macro-F1	DeepWalk [21]	78.11	79.43	79.60	79.69	79.33	79.70	79.29	78.89	79.27
	TADW [16]	30.78	30.56	30.23	30.26	30.13	30.07	30.32	29.99	29.39
	GraphSAGE [6]	47.81	48.44	48.29	48.12	48.12	48.52	48.38	48.28	47.24
	DjCaNE-SVM	83.47	85.04	85.71	86.07	86.02	86.12	86.79	87.00	86.67
	DjCaNE-MLP	85.04	85.67	85.81	86.87	86.15	86.67	87.13	87.12	86.90
Micro-F1	DeepWalk [21]	79.53	80.74	80.93	81.12	80.86	81.13	80.86	80.27	80.73
	TADW [16]	40.31	40.62	40.57	40.74	40.57	40.56	40.99	40.46	39.90
	GraphSAGE [6]	63.23	65.06	64.78	64.68	64.83	65.18	64.68	64.83	63.69
	DjCaNE-SVM	85.48	85.50	86.05	86.50	86.63	86.53	87.02	87.09	87.07
	DjCaNE-MLP	85.32	85.88	86.05	87.10	86.71	86.94	87.27	87.21	87.07

Comparing DjCaNE-SVM with DjCaNE-MLP on the three datasets, the SVM classifier performs better on Citeseer and Cora, while the MLP generally performs best on PubMed. This difference is also due to the different network sizes. The MLP configuration has many more parameters to learn and works better with large training data. Citeseer and Cora are approximately seven times smaller than PubMed, and their training data is not large enough to tune the parameters of a deep neural network. Having more data in PubMed helps the model show its capability when classifying publications with advanced classifiers.

DjCaNE’s performance against state-of-the-art models reaffirms two points: (1) despite the trend towards deep learning models, there are domains in which well-designed machine learning models can achieve equal or better performance, and (2) joint analysis is not necessarily better than separate components. It is possible to achieve state-of-the-art results by combining accurate disjoint components.

4.4.2. Performance of DjCaNE components

In this experiment, the effectiveness of DjCaNE is compared with that of its two base components. Table 5 compares the results of the DjCaNE components on different datasets and metrics using 80% of nodes as training data and SVM as the classifier. The tabulated results indicate the following points:

The CE and NE components can achieve promising results independently, but DjCaNE always shows superior results.

In the classification task, it is not apparent which component is more informative. In the Cora dataset, comparing the NE and CE components shows that network topology is much more informative, while in the other datasets the results of NE and CE are almost equal.

The simple CE and NE components can be trained on PubMed, while TADW and GraphSAGE had problems fitting the data due to their architectural complexity.

Table 5.

Effectiveness of DjCaNE components on different datasets and metrics using 80% of nodes as train data using SVM.

Component/metric	Cora		Citeseer		PubMed
	Macro-F1	Micro-F1	Macro-F1	Micro-F1	Macro-F1	Micro-F1
NE	86.27	86.20	67.87	70.29	82.82	83.84
CE	66.14	71.71	67.33	71.90	81.70	82.21
DjCaNE	86.41	88.17	75.17	78.24	87.00	87.09

NE: Network Embedding; CE: Content Embedding.

4.4.3. Performance of DjCaNE with alternative modules

In this section, experiments are conducted on different variations of the NE and CE modules. GuidedWalk, DeepWalk, and three variants of Node2Vec are used in the NE module, and PCA and AE are used in the CE module. The results are reported for the SVM classifier with an 80% training rate. Table 6 shows the results of the experiments. The main observations are as follows:

The proposed framework is general enough to use any method in the NE and CE components. What matters is combining a random walk-based model with a content representation algorithm.

The choice of DjCaNE components matters for performance. However, most combinations of components are still better than the competitor methods in Tables 2–4. This comparison indicates that the main performance gap between random walk-based methods and complex methods such as GraphSAGE originates from neglecting node features. Therefore, even a simple concatenation of these data can improve them significantly.
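The modularity noted in these observations can be sketched as follows, with PCA standing in for the CE module. This is an illustrative sketch under stated assumptions: `pca_embed` and `combine` are hypothetical helpers, the data are random placeholders, and the small dimensions are chosen for readability rather than taken from the experiments.

```python
import numpy as np


def pca_embed(features: np.ndarray, dim: int) -> np.ndarray:
    """Project the rows of `features` onto their top `dim` principal components."""
    centred = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:dim].T


def combine(ne_embedding: np.ndarray, ce_embedding: np.ndarray) -> np.ndarray:
    """Concatenate network and content embeddings node by node."""
    return np.concatenate([ne_embedding, ce_embedding], axis=1)


rng = np.random.default_rng(0)
features = rng.random((50, 20))          # hypothetical node-content matrix
ne_embedding = rng.normal(size=(50, 8))  # placeholder for any random-walk embedding

node_repr = combine(ne_embedding, pca_embed(features, dim=8))
print(node_repr.shape)
```

Replacing `pca_embed` with an autoencoder, or the placeholder `ne_embedding` with the output of DeepWalk, Node2Vec, or GuidedWalk, leaves `combine` and the downstream classifier unchanged.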

Table 6.

Performance of DjCaNE with Alternative Modules.

Component/metric		Cora		Citeseer		PubMed
		Macro-F1	Micro-F1	Macro-F1	Micro-F1	Macro-F1	Micro-F1
GuidedWalk	AE	86.41	88.17	75.17	78.24	87.00	87.09
GuidedWalk	PCA	86.59	88.61	75.32	78.63	87.68	87.75
DeepWalk	AE	86.00	85.78	67.53	71.65	85.86	86.12
DeepWalk	PCA	84.98	85.90	68.09	72.32	85.87	86.13
Node2Vec(0.25,0.75)	AE	85.08	84.60	67.94	72.04	85.11	85.46
Node2Vec(0.25,0.75)	PCA	84.86	84.47	68.78	72.85	85.11	85.46
Node2Vec(0.50,0.50)	AE	84.18	84.92	67.97	72.21	85.06	85.44
Node2Vec(0.50,0.50)	PCA	84.23	84.04	69.00	72.93	85.04	85.43
Node2Vec(0.75,0.25)	AE	83.92	84.67	68.82	72.75	85.16	85.56
Node2Vec(0.75,0.25)	PCA	84.64	84.60	68.73	72.80	85.14	85.55

AE: Auto-encoders; PCA: Principal Component Analysis.

4.5. Time complexity

Time complexity is another aspect that must be considered in representation learning methods. Table 7 summarises the complexity of each algorithm. In Table 7, $V$ is the set of nodes, $E$ is the set of edges, $d$ is the embedding dimension, $F$ is the feature vector dimension, and $C$ is the set of classes. In TADW, $M = W^{T} H$ approximates $(A + A^{2}) / 2$, where $A$ is the adjacency matrix, by decomposing it into $W \in \mathbb{R}^{d \times |V|}$ and $H \in \mathbb{R}^{d \times |V|}$, which costs $O(|V|^{2})$ [16], and $nnz(\cdot)$ is the number of non-zero elements in a matrix. For GraphSAGE, transductive learning with a pooling layer is selected, which significantly improves its accuracy but also affects its complexity. DjCaNE is a framework that consists of three components, and its complexity changes with the choice of components. The general complexity formula is the sum of the NE component complexity (e.g. DeepWalk, Node2Vec, or GuidedWalk), the CE component complexity (e.g. PCA or AE), and the classification complexity $O(2 d |V| |C|)$. Table 7 considers the complexity of a DjCaNE based on GuidedWalk, PCA, and SVM, as this combination shows the best performance in Table 6. In DeepWalk, Node2Vec, and GuidedWalk, $t$ is the number of random walks from each node, $γ$ is the random walk length, $w$ is the window size, and $Q$ is the number of negative samples in the Skip-gram algorithm. Skip-gram’s complexity depends on whether it uses hierarchical softmax or negative sampling. For content embedding, the complexities of PCA and AE are calculated. Considering that $|V|$ and $|E|$ are substantial compared to the other parameters, Table 7 demonstrates that TADW and GraphSAGE are $O(|V|^{2})$, while the complexity of DjCaNE is linear in the size of the network.

Table 7.

Complexity of baselines and DjCaNE sub-modules.

Method	Complexity
TADW	$O(|V|^{2} + nnz(M) d + |V| F d + |V| d^{2})$
GraphSAGE (transductive)	$O(Q |E|^{2} / |V| d (|V| + F + |C|))$
DjCaNE	$O(γ |C| |E| + |V| γ t + Q |V| γ t w d + |F|^{2} |V| + |F|^{3} + 2 d |V| |C|)$
DeepWalk	$O(|E|) +$ Skip-gram
Node2Vec	$O(|E|^{2} / |V|) +$ Skip-gram
GuidedWalk	$O(γ |C| |E| + |V| γ t) +$ Skip-gram
PCA	$O(|F|^{2} |V| + |F|^{3})$
AE	$O(d |F| |V|)$
Skip-gram (hierarchical softmax)	$O(\log(|V|) |V| γ t w d)$
Skip-gram (negative sampling)	$O(Q |V| γ t w d)$

AE: Auto-encoders; PCA: Principal Component Analysis.
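The DjCaNE row of Table 7 can be read as the sum of the component rows. As a sketch, assuming the GuidedWalk, negative-sampling Skip-gram, PCA, and classification terms listed in the table are simply added (the labels under the braces are annotations, not notation from the paper):

```latex
\begin{align*}
O(\text{DjCaNE})
  &= \underbrace{O(\gamma |C| |E| + |V| \gamma t)}_{\text{GuidedWalk walks}}
   + \underbrace{O(Q |V| \gamma t w d)}_{\text{Skip-gram (negative sampling)}} \\
  &\quad + \underbrace{O(|F|^{2} |V| + |F|^{3})}_{\text{PCA}}
   + \underbrace{O(2 d |V| |C|)}_{\text{classification}} \\
  &= O(\gamma |C| |E| + |V| \gamma t + Q |V| \gamma t w d
       + |F|^{2} |V| + |F|^{3} + 2 d |V| |C|)
\end{align*}
```

When the other parameters are treated as constants, every term is at most linear in $|V|$ or $|E|$, and the $|F|^{3}$ term does not depend on the graph size, which is consistent with the statement that the complexity is linear in the size of the network.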

In the graph domain, the scalability of convolution-based methods is limited because they work on the normalised Laplacian matrix. The adjacency matrix has $O(|V|^{2})$ entries, which cannot be handled for large networks. The message passing framework works only on the non-zero elements of the adjacency matrix and thus lowers the complexity, and GraphSAGE samples the neighbourhood to limit the complexity even further. Note that inductive GraphSAGE is $O(Q |V| \sum_{i = 1}^{K} S_{i} |d_{i}| |d_{i - 1}|)$, where $K$ is the number of layers, $S_{i}$ is the sample size of layer $i$, $d_{i}$ is the embedding dimension of layer $i$, $d_{0} = F$, and $d_{K} = |C|$. This inductive variant of GraphSAGE is comparable to DjCaNE in complexity but has much lower accuracy. It should be mentioned that even some random walk-based methods, such as SaC2Vec, are computationally expensive, with complexity $O(|V|^{2} F)$.
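To make the inductive GraphSAGE expression concrete, the sketch below evaluates $Q |V| \sum_{i} S_{i} d_{i} d_{i-1}$ with constant factors dropped. The function name is hypothetical, and the sample sizes, hidden dimension, class count, and $Q$ in the usage example are illustrative assumptions rather than settings from the experiments; only the PubMed node and feature counts come from the text.

```python
def graphsage_inductive_cost(q: int, n_nodes: int, sample_sizes: list[int], dims: list[int]) -> int:
    """Evaluate Q * |V| * sum_{i=1..K} S_i * d_i * d_{i-1}.

    `dims` holds d_0 .. d_K (d_0 = F, d_K = |C|) and `sample_sizes`
    holds S_1 .. S_K, so `dims` must be one element longer.
    """
    if len(dims) != len(sample_sizes) + 1:
        raise ValueError("dims must contain one more entry than sample_sizes")
    layer_sum = sum(
        s * dims[i] * dims[i - 1] for i, s in enumerate(sample_sizes, start=1)
    )
    return q * n_nodes * layer_sum


# PubMed-sized graph: 19717 nodes and 500 features (from the text);
# Q, the sample sizes, the hidden dimension, and 3 classes are assumptions.
cost = graphsage_inductive_cost(
    q=5, n_nodes=19717, sample_sizes=[25, 10], dims=[500, 128, 3]
)
print(f"{cost:.3e} operations (order of magnitude only)")
```

The count grows linearly with the number of nodes because each node aggregates a fixed-size sample rather than its full neighbourhood.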

5. Conclusion and future works

GNNs have outperformed random walk-based methods. The main advantage of GNNs over random walk-based methods is their feature smoothing, whereas state-of-the-art random walk-based methods cannot use node features. Therefore, the fundamental question is how much of the performance gap comes from node features. This work presents DjCaNE, a new framework for node classification that combines topology and content embeddings using separate modules. In contrast to GNNs, which propagate and combine features across the graph topology, DjCaNE uses a deep model to embed node features and embeds the topology independently. Finally, concatenating these two representations yields the final representation of each node.

Although there is no limitation on the algorithms used in DjCaNE modules, DjCaNE used GuidedWalk, an extension of DeepWalk that embeds nodes by considering labelled nodes, and a two-layer AE to embed nodes’ content. Experiments on three real-world citation networks show that DjCaNE outperforms state-of-the-art models while operating with lower complexity. It is also important to mention that DjCaNE’s complexity is closer to that of DeepWalk than to that of deep learning models. It can therefore be considered a successor of random walk-based methods. The contributions and findings of the present research are summarised as follows:

The superior performance of GNNs and MFs over random walk-based models is generally due to their use of node features. DjCaNE is a framework that concatenates node context and feature embeddings to close this gap with lower complexity.

DjCaNE is not component dependent. Experiments on different components and datasets show that it significantly improves all random walk-based methods and outperforms GNN and MF models.

DjCaNE uses unsupervised feature embedding components and component-level training rather than end-to-end training, which helps control over-fitting and over-smoothing. As a result, individual components can be studied in more detail, which makes the final model more understandable.

Despite the superior performance of GNNs and the focus of studies on them, the proposed framework shows that other methods can perform as well as GNNs in graph machine learning. Therefore, studies on alternative methods are a promising future direction. Furthermore, the author’s future work is to combine scalable graph algorithms in GNNs to improve their performance and efficiency. Although DjCaNE outperforms GraphSAGE, the fact that DjCaNE uses no neighbourhood features (it considers neighbourhood nodes without their features) may limit its performance on other datasets. DjCaNE can be extended by considering the final embeddings of neighbourhood nodes. This extension results in shallow neural networks that can capture long-range relations like DeepWalk. To this end, DjCaNE must use a single convolution layer in the combination layer to aggregate nodes’ content and context efficiently. This convolution must take in the compact representation of features to keep the model scalable. Finally, short-range feature smoothing with lightweight sparsified random walk-based kernels and long-range contrastive learning are keys to future light and efficient models in the graph domain.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Saeedeh Momtazi

References

1. Tang. Deep learning on graphs. Cambridge: Cambridge University Press, 2021.

2. Hamilton. Graph representation learning. Synthesis lectures on artificial intelligence and machine learning series. Center Moriches, NY: Morgan & Claypool Publishers, 2020.

3. Duvenaud, Maclaurin, Iparraguirre, et al. Convolutional networks on graphs for learning molecular fingerprints. In: Cortes, Lawrence, Lee, et al. (eds) Advances in neural information processing systems (NIPS). Red Hook, NY: Curran Associates, Inc., 2015, pp. 2224–2232.

4. Wang, Zhang, et al. Knowledge-aware graph neural networks with label smoothness regularization for recommender systems. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, Anchorage, AK, 4–8 August 2019, pp. 968–977. New York: ACM.

5. Seki. Cross-lingual text similarity exploiting neural machine translation models. J Inf Sci 2020; 47: 404–418.

6. Hamilton, Ying, Leskovec. Inductive representation learning on large graphs. In: Guyon, Luxburg, Bengio, et al. (eds) Advances in neural information processing systems (NIPS). Red Hook, NY: Curran Associates, Inc., 2017, pp. 1024–1034.

7. Huang. Loops in publication citation networks. J Inf Sci 2019; 46: 837–848.

8. Kleminski, Kazienko, Kajdanowicz. Analysis of direct citation, co-citation and bibliographic coupling in scientific topic identification. J Inf Sci 2020; 48: 349–373.

9. Lai, Liu, et al. Recurrent convolutional neural networks for text classification. In: Proceedings of the association for the advancement of artificial intelligence (AAAI’15), Austin, TX, 25–30 January 2015, pp. 2267–2273. New York: ACM.

10. Gong, Shi, Niu. Hierarchical text-label integrated attention network for document classification. In: Proceedings of the 3rd high performance computing and cluster technologies conference, Guangzhou, China, 22–24 June 2019, pp. 254–260. New York: ACM.

11. Kim, Seo, Cho, et al. Multi-co-training for document classification using various document representations: Tf–idf, lda, and doc2vec. Inf Sci 2019; 477: 15–29.

12. Zhang, Zhao, LeCun. Character-level convolutional networks for text classification. In: Cortes, Lawrence, Lee, et al. (eds) Advances in neural information processing systems (NIPS). Red Hook, NY: Curran Associates, Inc., 2015, pp. 649–657.

13. Grover, Leskovec. node2vec: scalable feature learning for networks. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, San Francisco, CA, 13–17 August 2016, pp. 855–864. New York: ACM.

14. Kajdanowicz, Tagowski, Falkiewicz, et al. Incremental learning in dynamic networks for node classification. In: European network intelligence conference, Duisburg, 11–12 September 2017, pp. 133–142. Cham: Springer.

15. Learning deep neural networks for node classification. Exp Syst Appl 2019; 137: 324–334.

16. Yang, Liu, Zhao, et al. Network representation learning with rich text information. In: Proceedings of the 24th international conference on artificial intelligence (IJCAI’15), Buenos Aires, Argentina, 25–31 July 2015, pp. 2111–2117. Palo Alto, CA: AAAI Press.

17. Yao, Mao, Luo. Graph convolutional networks for text classification. In: Proceedings of the association for the advancement of artificial intelligence (AAAI), vol. 33, Honolulu, HI, 27 January–1 February 2019, pp. 7370–7377. Palo Alto, CA: AAAI Press.

18. Pan, Chen, et al. A comprehensive survey on graph neural networks. IEEE T Neur Netw Learn Syst 2020; 32(1): 4–24.

19. Goyal, Ferrara. Graph embedding techniques, applications, and performance: a survey. Knowl-Bas Syst 2018; 51: 78–94.

20. Cai, Zheng, Chang KC-C. A comprehensive survey of graph embedding: problems, techniques, and applications. IEEE T Knowl Data Eng 2018; 30(9): 1616–1637.

21. Perozzi, Al-Rfou, Skiena. Deepwalk: online learning of social representations. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, New York, 24–27 August 2014, pp. 701–710. New York: ACM.

22. Mikolov, Chen, Corrado, et al. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013, https://arxiv.org/abs/1301.3781

23. Mikolov, Sutskever, Chen, et al. Distributed representations of words and phrases and their compositionality. In: Burges CJC, Bottou, Welling, et al. (eds) Advances in neural information processing systems (NIPS). Red Hook, NY: Curran Associates, Inc., 2013, pp. 3111–3119.

24. Tang, Wang, et al. LINE: large-scale information network embedding. In: Proceedings of the 24th international conference on world wide web (WWW), Florence, 18–22 May 2015, pp. 1067–1077. New York: ACM.

25. Ribeiro, Saverese, Figueiredo. struc2vec: learning node representations from structural identity. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, Halifax, NS, Canada, 13–14 August 2017, pp. 385–394. New York: ACM.

26. Dong, Chawla, Swami. metapath2vec: scalable representation learning for heterogeneous networks. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, Halifax, NS, Canada, 13–14 August 2017, pp. 135–144. New York: ACM.

27. Pimentel, Veloso, Ziviani. Fast node embeddings: learning ego-centric representations. In: ICLR, Vancouver, BC, Canada, 30 April–3 May 2018.

28. Cao. Grarep: learning graph representations with global structural information. In: Proceedings of the ACM international on conference on information and knowledge management (CIKM), Melbourne, VIC, Australia, 19–23 October 2015, pp. 891–900.

29. Allab, Labiod, Nadif. A semi-nmf-pca unified framework for data clustering. IEEE T Knowl Data Eng 2017; 29: 2–16.

30. Qiu, Dong, et al. Network embedding as matrix factorization: unifying deepwalk, LINE, PTE, and Node2vec. In: Proceedings of the eleventh ACM international conference on web search and data mining, Marina Del Rey, CA, 5–9 February 2018, pp. 459–467. New York: ACM.

31. Tang, Mei. Pte: predictive text embedding through large-scale heterogeneous text networks. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’15), Sydney, NSW, Australia, 10–13 August 2015, pp. 1165–1174. New York: ACM.

32. Kipf, Welling. Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations (ICLR), Toulon, 24–26 April 2017.

33. Veličković, Cucurull, Casanova, et al. Graph attention networks. In: International conference on learning representations, Vancouver, BC, Canada, 30 April–3 May 2018.

34. Leskovec, et al. How powerful are graph neural networks? In: International conference on learning representations, New Orleans, LA, 6–9 May 2019.

35. Fey, Lenssen. Fast graph representation learning with PyTorch Geometric. In: ICLR workshop on representation learning on graphs and manifolds, 2019, https://rlgm.github.io/papers/2.pdf

36. Wang, Zheng, et al. Deep graph library: a graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv: 1909.01315, 2019.

37. Keikha, Rahgozar, Asadpour. Deeplink: a novel link prediction framework based on deep learning. J Inf Sci 2018; 47: 642–657.

38. Lai K-H, Chen C-M, Tsai M-F, et al. Navwalker: information augmented network embedding. In: 2018 IEEE/WIC/ACM international conference on web intelligence (WI), Santiago, Chile, 3–6 December 2018, pp. 9–16. New York: IEEE.

39. Bandyopadhyay, Kara, Biswas, et al. Sac2vec: information network representation with structure and content. arXiv, abs/1804.10363, 2018, https://arxiv.org/abs/1804.10363

40. Levy, Goldberg. Neural word embedding as implicit matrix factorization. In: Advances in neural information processing systems 27 2014, https://proceedings.neurips.cc/paper/2014/file/feab05aa91085b7a8012516bc3533958-Paper.pdf

41. Gregory. Finding overlapping communities in networks by label propagation. N J Phys 2010; 12(10): 103018.

42. Kremer, Mansour, Perry. Implementing the ‘wisdom of the crowd’. J Pol Eco 2014; 122(5): 988–1012.

43. Sen, Namata, Bilgic, et al. Collective classification in network data. AI Mag 2008; 29(3): 93.

44. Namata, London, Getoor, et al. Query-driven active surveying for collective classification. In: 10th international workshop on mining and learning with graphs, vol. 8, Brussels, 1 July 2012.

45. Yang, Cohen, Salakhudinov. Revisiting semi-supervised learning with graph embeddings. In: Balcan, Weinberger (eds) Proceedings of the 33rd international conference on machine learning, vol. 48 of proceedings of machine learning research, New York, 20–22 June 2016, pp. 40–48. Groveland, CA: PMLR.