Abstract
Machine learning approaches have become a crucial tool in graph analysis. Despite the accurate results of the existing approaches, most of them are not scalable enough to be used in real-world problems. Networks provide two different kinds of information, nodes contents and nodes relations (network structure). Training deep graph neural networks (GNN) over large-scale graphs is challenging due to the limitation of the message passing framework. Graph Convolutional Networks (GCN) work on all node neighbours at once. Furthermore, it is usual to transform node features with a deep neural network before the GC operation. Therefore, the deep transform operation may apply up to hundreds of times for each target node which is heavy computation and hard to batch. This paper presents an abstract framework with two embedding components, the first component embeds node relations, and the second one embeds node contents. The model makes predictions by aggregating these embeddings through a combination component. The presented approach limits the deep transform only to the target node and uses random walk-based embedding instead of the GC operator to reduce the cost. The main goal of the proposed approach is to provide a light framework for the task. To this aim, node relations are embedded based on node neighbourhood structure by a biased variant of the DeepWalk model, called GuidedWalk, and an autoencoder embeds node contents. The experimental results on three well-known datasets show the superiority of the proposed model compared to the state-of-the-art GraphSAGE and TADW models with less computational complexity. On the Citeseer, Cora, and PubMed datasets, the model has achieved 3.23%, 0.88%, and 7.63% improvement in Macro-F1 and 3.25%, 0.7%, and 6.34% improvement in Micro-F1, respectively. Although GNNs are state-of-the-art models, considering node content is their main advantage. This paper shows that even a simple integration of node content to available random walk-based methods improves their performance up to GCNs without increasing the complexity.
1. Introduction
Graph representation learning is the process of finding latent vector representations of a part of a graph, including nodes, edges, and subgraphs [1]. Many previous applications encode nodes as low-dimensional vectors that summarise their graph position, the structure of their local graph neighbourhood [2], and their content. Graphs representation learning is applied in many real-world problems such as molecular fingerprints [3], recommender systems in online markets [4], citation networks classification, machine translation [5], online reviews analysis, community question answering, and social networks.
Graph representation learning for node embedding can be categorised into three categories based on the information they process. The first category does not consider node relations and performs analysis only based on node contents. The second one only uses entity relations and graph theories. The third category combines contents and relations data to reach state-of-the-art performance. Top performance methods are based on convolution neural networks operating on the graph Laplacian matrix (Laplacian matrix size is
The importance, size, and growth of publication networks motivated researchers in recent years to focus on this dataset [7,8]. Document classification [9] is one of the basic tasks in text mining and information retrieval, which has also been used to address the problem. Many of the previous works focused on either textual data [10–12] (equal to first category) or graph-based models [13–15] (equal to second category) to solve the classification problems. Applying joint models (equal to the third category) has received researchers’ attention in recent years [16,17], where publications’ contents and their citations are used jointly to improve the classification accuracy. While most of the joint models depend on GC [18], the proposed framework separates node content data from graph structure to make it simpler. The proposed framework is evaluated over three well-known citation networks, namely Citeseer, Cora, and PubMed, in the task of publication classification. Experimental results show that aggregation of components can result in a competitive performance while the model has fewer parameters to train and can be generalised better. This resulted in a new framework Disjoint Contnet and Network Embedding (DjCaNE) which has three components: a Network Embedding (NE) component, a Content Embedding (CE) component, and a combination component to do the final prediction.
The contributions of DjCaNE are as follows:
A new lightweight, scalable graph embedding framework is proposed that can be trained component by component with final superior performance. While previous models are based on manual feature engineering or end-to-end deep learning architecture, the proposed model consists of aspect-oriented modules that can be trained component by component. Therefore, the proposed framework achieves superior performance without facing either manual feature engineering or the complex training process of deep models.
The framework can extend random walk-based methods with a very low cost to achieve results compatible with scalable deep learning models.
An unsupervised content embedding is presented that improves the classification performance even without any advanced knowledge about data. Furthermore, the proposed framework is general and can be used with any content data, such as images, videos, and even a media sequence (e.g. Instagram pages).
The rest of the paper is organised as follows: Section 2 reviews related works for node classification problem, with a focus on publication classification in citation networks. The proposed DjCaNE is presented in Section 3, and evaluated in Section 4. Section 5 summarises DjCaNE model and future works.
2. Related works
This section reviews the main studies in network embedding. Traditional network embedding algorithms required manual feature engineering by system experts. The representation learning methods, however, automatically extract features. Representation learning methods can be categorised in different ways. This section focuses on machine learning-based methods and divides them into two categories based on whether they use a random walk strategy. Goyal and Ferrara [19] and Cai et al. [20] presented detailed surveys on the network embedding problems and solutions.
2.1. Random walk-based methods
DeepWalk [21] is the pioneer in machine learning-based network embedding. Inspired by word embedding in natural language processing, DeepWalk sets up a limited number of random surfers from each node and considers the trace of each random surfer as a sentence. Then, it uses a language model, called Skip-gram [22,23], to embed nodes. In random walk-based methods, random surfing starts at a node chosen by a uniform probability. Then, the random surfer decides to move to a neighbour node or teleport to other nodes based on a transition strategy,
where
Tang et al. [24] proposed the Large-scale Information Network Embedding (LINE), which embeds adjacent nodes and second-order proximity of each node separately. The final embedding in LINE is the concatenation of both embeddings. While DeepWalk considers uniform transition probability distribution between two adjacent nodes, Grover and Leskovec [13] defined Node2Vec with a different sampling strategy which is based on breadth-first search and depth-first search algorithms. Many researchers followed DeepWalk and defined other strategies based on this technique. Ribeiro et al. [25] proposed Struct2Vec, a model that uses random walks to embed nodes such that structurally similar nodes are placed close to each other in the embedding space. Dong et al. [26] used meta paths to ease heterogeneous network embedding. Random walk-based methods show a good performance in network embedding, and they can scale up to large graphs. These algorithms’ complexity is equal to the sum of random surfers complexity and Skip-gram complexity which is bounded by
2.2. Non-random walk-based methods
The first Non-Negative Matrix Factorization (NMF)-based methods were presented years before random walk-based approaches. Although they achieved good performance, they are not as scalable as random walk-based methods. Therefore, the recent advances in this category are integrating the random walk-based methods concept. Equation (2) shows the basic concept of NMF-based graph embedding
where
A graph is connected with matrices which are
Kipf and Welling [32] designed convolutional neural networks to operate directly on graphs. Their work provided the basis for further research studies. GraphSAGE [6] embeds nodes in a multi-stage deep structure in which, at each stage, node embedding is concatenated with an aggregated embedding of node neighbourhood and is passed forward to the next stage. GraphSAGE is one of the few works that use inductive learning. Veličković et al. [33] used a multi attention mechanism for neighbourhood aggregation. Xu et al. [34] generalised the aggregation process in convolutional neural networks and showed that the sum operator is the most powerful aggregator.
PyG [35] and DGL [36] are famous libraries based on message passing framework to implement graph neural networks. Each stage of the message passing framework is theoretically shown in equation (3)
where
Message Passing Algorithm.
The main drawback of all mentioned methods is their complexity or incompleteness; on the one hand, complex methods cannot be performed on conventional processing machines. On the other hand, lightweight, scalable models, such as DeepWalk, do not consider rich node content information for embedding. Therefore, a scalable machine learning framework is still an open research question [37]. Random walk-based methods are scalable and fast, but they cannot use node content in the embedding process. A limited number of research studies, such as NavWalker [38] and SaC2Vec [39] offer tricks to merge node content in random walk-based embedding. Nevertheless, it makes the model as complex as deep learning models without improving performance significantly.
This study separates network structure from node content embedding, resulting in a simpler architecture than joint frameworks. It is possible to utilise any algorithms mentioned above in the presented framework to embed the network. Since random walk-based models cannot use node data and deep learning methods are computationally expensive, the proposed DjCaNE framework uses two lightweight components in parallel to embed graphs with high quality and low complexity. Experimental results show that the proposed framework extends random walk-based methods efficiently to achieve accuracy near to deep methods.
3. DjCANE
In designing machine learning models, a general rule of thumb is that joint optimization and end-to-end models achieve better performance. Joint models, however, are computationally expensive, especially on large-scale graphs. Furthermore, it is hard to train many parameters; that is, the training process needs more training examples to overcome the over-fitting problem. In graphs, the node’s content and neighbourhood contain information, and a model can advantage both/one of them to predict the label. The prediction must be more accurate if the model uses both information. Joint matrix factorization [16] and GCN [32] are two possible solutions for integrating them. However, their performance improvement is always compared to methods of the same category or the methods that cannot process one of the information sources. For example, consider the DeepWalk method; it cannot process node content and is outperformed by GNNs, which is not a fair comparison. This paper provides a way to add node information to any method like DeepWalk and measures how it improves the results. Interestingly, this fills the performance gap.
DjCaNE is based on components that can be trained separately. It captures two embedding components called NE and CE. NE embeds the relation information of nodes, for example, citation information of publication, and CE embeds node content, for example, textual information of publication. A third component aggregates results and makes final predictions based on the output of NE and CE. DjCaNE is modular in architecture and training. It is trained component by component to avoid complexity in the model. As a result, DjCaNE is very light and does not require any strategy to handle over-fitting. The component-based architecture is more interpretable and makes the model flexible for other node content types such as media. Figure 1 shows the structure of the DjCaNE. The NE module contains a random walk-based network embedding method (e.g. DeepWalk or Node2Vec). These models run many random surfers on the graph and embed the traces with the Skip-gram language model. Therefore, the NE component embeds proximity-based similarities in a

The architecture of the DjCaNE model.
The following subsections describe the detailed structure of each module. It has to be mentioned that each module can be replaced with any other network/content embedding algorithms. This work uses simple neural models and keeps the model as simple as possible to show the capability of the proposed framework in node classification tasks without complex processing. The proposed implementation can be used as an extension of all random walk-based network embedding algorithms with shallow neural networks that improve them without increasing the complexity.
3.1. NE component
The architecture is not limiting modules, and NE can be any embedding algorithm that can learn a low-dimensional representation of nodes in a network. Random walk-based methods, such as DeepWalk and Node2Vec, which learn network topology, fit the module NE as they do not consider node attributes, which are placed in a separate module in DjCaNE. Since evaluation is done on classification tasks, this work uses an improved version of DeepWalk, named GuidedWalk. This random walk-based algorithm only uses link information and is optimised for node classification tasks. Random walk-based algorithms use Skip-gram to embed nodes using random surfers traces. Moreover, Skip-gram is designed to assign similar vectors to nodes that have been seen near together [22,23,30,40]. Consequently, the random surfer propagates latent features through the surfing process. Therefore, the base idea of GuidedWalk is defined as follows:
Given a partial node labelling function
Random walk strategy does not use label counts directly. Instead, it just uses predicted labels to generate paths and embed the nodes,
Random walk strategy does not only use adjacent nodes. It uses
Random walk strategy uses many random walks, which removes the effect of the initial state and processing order.
Considering the above assumptions, equation (4) sums up transition probabilities of random walk strategy in GuidedWalk
where
It is important to mention that although equation (4) calculation is very time-consuming, they are repetitive and can be preprocessed. Then, GuidedWalk uses memorization techniques instead, as many of the cases will not happen. The GuidedWalk memory complexity is
3.2. CE component
CE is the content embedding component in the DjCaNE architecture. Generally, node content has high dimensions. CE aims to learn low dimension representation to be processed in an aggregator module. Auto-encoders (AE) are automated nonlinear dimensionality reduction tools that DjCaNE utilises for the CE component. One of the advantages of AEs is that they can encode any metadata available in addition to media content.
The proposed GuidedWalk explores the graph using paths consistent with the partially observed labelling function. Many successful studies use node content to improve the results as well. Topology-based node embeddings are behind the models that also benefit from node contents. In the CE module, DjCaNE takes advantage of content, textual data in the present paper, since the main goal is to keep DjCaNE lightweight and practical, a two-layer AE is used.
Consider content matrix
where
3.3. Combination and classification
The Combination module concatenates previous modules and forwards them to its internal classifier. Having an
For classification, two different classifiers are used: (1) Support Vector Machine (SVM) as a conventional supervised learning algorithm, and (2) Multi-Layer Perceptron (MLP) as a deep neural network model. The two alternatives can show the flexibility of DjCaNE framework in using different algorithms in each module. The complexity of SVM is
4. Experimental results
This section evaluates the proposed DjCaNE framework against state-of-the-art algorithms over three datasets.
4.1. Datasets
To evaluate performance of DjCaNE model, three well-known citation datasets that previously used by TADW [16], namely Citeseer, Cora, and PubMed 2 are used. The two former datasets are presented by Sen et al. [43], and the latter one is presented by Namata et al. [44] which are well-studied in literature. The goal of the experiments is to determine the category of each publication, having a portion of them labelled as the train set [45]. All datasets are citation networks where publications are presented by nodes and citations are presented by links. The bag of words model describes the content of each publication (node) in datasets. The size of the dictionary and network vary from one dataset to another. Table 1 shows general attributes of these datasets. The Citeseer dataset that includes publications in computer and information science has six classes: Human Computer Interaction, Machine Learning, Multi-Agent Systems, Artificial Intelligence, Information Retrieval, and DataBase. The Cora dataset that includes publications in machine learning consists of 7 classes: Neural Networks, Rule Learning, Reinforcement Learning, Probabilistic Methods, Theory, Case-Based, and Genetic Algorithms. The PubMed dataset that includes publications about diabetes has three classes: Diabetes Mellitus Experimental, Diabetes Mellitus Type 1, and Diabetes Mellitus Type 2. Table 1 defines the Sparsity metric, which is the number of observed links against the number of missing links; that is, the ratio of nonzero elements to zeros in the adjacency matrix of the dataset.
Statistics of Cora, Citeseer, and PubMed data-sets. The number of nodes in the datasets represents the number of documents, and the number of edges is the number of citations.
4.2. Evaluation metrics and baselines
Following the previous studies in the field, F1 measures (micro and macro) are reported for each experiment. DjCaNE performance is evaluated against the following state-of-the-art approaches:
DeepWalk [21]: DeepWalk uses short uniform random walks to learn representations for vertices in graphs.
TADW [16]: TADW uses a matrix factorization equivalent of DeepWalk and associates it with text features. TADW is one of the examples that use joint optimization to embed nodes.
GraphSAGE [6]: GraphSAGE is a decent algorithm for network classification and link prediction. GraphSAGE is one of few inductive network embedding algorithms. It incorporates node neighbourhoods to improve results. A non-inductive version of GraphSAGE is used, in which the adjacency matrix is fed to the model to have a fair comparison. The non-inductive version of GraphSAGE is a very heavy algorithm but increases the accuracy.
SaC2Vec [39]: SaC2Vec presents a node embedding technique by considering network structure and node content. SaC2Vec builds a multi-edge graph and embeds this graph. The problem with this model is the complexity of building the second layer of the graph, which is
4.3. Experimental setups
In DjCaNE, each of the NE and CE components creates 128-dimensional embeddings. These vectors concatenate to produce 256-dimensional vectors that will be fed to a classifier for prediction. SVM and MLP are used as classifiers. This setting allows comparing a conventional machine learning model with a deep neural model. SVM uses radial basis function kernels with automatic coefficients and no iterations hard limit defined by Python Sklearn. The MLP has three hidden layers with sizes 256, 153, and 76 neurons, which are activated by
The baseline methods have been tested based on their reported parameters in the corresponding literature. DeepWalk and TADW methods are implemented by openNE, 3 and GraphSAGE PyTorch implementation is available at the Github repository. 4
In evaluation, the training set size varies from
Evaluation of DjCaNE on Citeseer dataset.
4.4. Results
4.4.1. DjCaNE performance
Evaluation results over Citeseer are presented in Table 2. As can be seen, DjCaNE outperforms all baseline models, except for the micro-F1 at
Evaluation results over Cora are presented in Table 3. While it is not easy to find a winner when the training set is small, DjCaNE is the absolute winner in larger training sets. The limited performance of DjCaNe on small training data set generally comes from the size of Cora, where the network is small and has a limited textual feature vector. There are difficulties in training NE components and the combination layer when using a smaller portion of the network as training data. In Citeseer, the network is large and textual features are more informative, which can cover the small size of the network in most cases. According to this reasoning, superior performance on PubMed is expected.
Evaluation of DjCaNE on Cora dataset.
Analysing the experimental results on PubMed (Table 4), it is notable that when the graph is large, DjCaNE has perfect results. It significantly outperforms other methods by a large margin. It is worth mentioning that on this dataset, simple methods work better; that is, DeepWalk outperforms TADW and GraphSAGE without considering textual data. The DjCaNE framework is as simple as DeepWalk and considers nodes’ content which helps to outperform all other methods. DjCaNE’s NE component and DeepWalk consider graph higher-order proximities, while TADW and GraphSAGE are limited to first and second-order proximities. In Citeseer and Cora, the dimension of textual data is near to graph size, while PubMed consists of 19717 publications with only 500 features. Due to the TADW factorization formula, it is hard for the model to extract features from PubMed.
Evaluation of DjCaNE on PubMed dataset.
Comparing the performance of DjCaNE-SVM with DjCaNE-MLP on the three datasets, the SVM classifier has better performance on Citeseer and Cora, while the MLP performs the best on PubMed. It is also due to the different sizes of networks. The MLP configuration has much more parameters to learn and works better on large training data. Citeseer and Cora are approximately seven times smaller than PubMed and their training data is not large enough to tune the parameters of a deep neural network. Having more data in PubMed helps the model to show its capability when classifying publications with advanced classifiers.
DjCaNE’s performance against state-of-the-art models restates two points: (1) despite trending deep learning models, there are domains in which well-designed machine learning models can achieve equal or better performance, and (2) joint analysis are not necessarily better than separate components. It is possible to achieve state-of-the-art results by combining accurate disjoint components.
4.4.2. Performance of DjCaNE components
In this experiment, the effectiveness of the DjCaNE is compared to its two base components. Table 5 compares the results of DjCaNE components on different datasets and metrics using 80% of nodes as train data and SVM as the classifier. The tabulated results indicate the following points:
CE and NE components can result in promising results independently, and DjCaNE always shows superior results.
In the classification task, it is not apparent which component is more informative. In the Cora dataset, comparing NE and CE components illustrates that network topology is much more informative. While in other datasets, the results of NE and CE are almost equal.
Simple CE and NE components can be trained on PubMed while TADW and GraphSAGE faced problems in fitting to data due to the architecture complexity.
Effectiveness of DjCaNE components on different datasets and metrics using 80% of nodes as train data using SVM.
NE: Network Embedding; CE: Content Embedding.
4.4.3. Performance of DjCaNE with alternative modules
In this section, experiments are done on different NE and CE modules variations. GuidedWalk, DeepWalk, and three variants of Node2Vec are used in the NE module. PCA and AE are used in the CE module. The results are reported based on the SVM classifier on 80% train rate. Table 6 shows the results of experiments. The main observations of these results are as follows:
The proposed framework is general enough to use any method in the NE and CE components. The main issue is combining a random walk-based model with a content representation algorithm.
The DjCaNE components are important in performance. However, each combination of components are still better than other competitor methods mentioned in Tables 2–4. This comparison proves that the main performance gap between random walk-based methods and complex methods, such as GraphSAGE, originated from neglecting node features. Therefore, even a simple concatenation of these data can improve them significantly.
Performance of DjCaNE with Alternative Modules.
AE: Auto-encoders; PCA: Principal Component Analysis.
4.5. Time complexity
Time complexity is another aspect that must be considered in representation learning methods. Table 7 summarises the complexity of each algorithm. In Table 7,
Complexity of baselines and DjCaNE sub-modules.
AE: Auto-encoders; PCA: Principal Component Analysis.
In the graph domain, the scalability of the convolution-based methods is limited as they work on the normalised Laplacian matrix. The adjacency matrix order is
5. Conclusion and future works
GNNs have outperformed random walk-based methods. The main advantage of GNNs over the random walk-based methods is their feature smoothing. The state-of-the-art random walk-based methods, however, cannot use node features. Therefore, the fundamental question is how much performance gap comes from node features. This work presents DjCaNE, a new framework for node classification that combines topology and embedding content using separate modules. In contrast to GNNs that propagate and combine features across the graph topology. DjCaNE uses a deep model to embed nodes features and embeds topology independently. Finally, concatenating these two representations makes the final representation of each node.
Although there is no limitation for algorithms in DjCaNE modules, DjCaNE used GuidedWalk, an extension of DeepWalk to embed nodes by considering labelled nodes, and a two-layer AE to embed nodes’ content. Experiments on three citation networks show that DjCaNE outperforms state-of-the-art models in real-world citation networks and has superior performance while performing with lower complexity. It is also important to mention that DjCaNE complexity is closer to DeepWalk than deep learning models. Then it can be considered a successor of random walk-based methods. The present research contribution and findings are classified as follows:
The superior performance of GNNs and MFs over random walk-based models is generally due to considering node features. DjCaNE is a framework that concatenates node context and features embedding to close this gap with lower complexity.
DjCaNE is not component dependent. Experiments on different components and datasets show that it significantly improves all random walk-based methods and outperforms GNN and MF models.
DjCaNE uses unsupervised feature embedding components and component level training rather than end-to-end training, which helps control over-fitting and over-smoothing. Therefore, individual components can be studied in more detail and make the final model more understandable.
Despite the superior performance of GNNs and the focus of studies on them, the proposed framework shows that other methods can perform as well as GNNs in graph machine learning. Therefore, studies on alternative methods are a promising future direction. Furthermore, the author’s future work is combining scalable graph algorithms in GNNs to improve their performance and efficiency. Although DjCaNE outperforms GraphSAGE, using no neighbourhood feature by DjCaNE (it considers neighbourhood nodes without features) may limit its performance in other datasets. The DjCaNE can be extended by considering neighbourhood nodes’ final embedding. This extension results in shallow neural networks that can capture long-range relations like DeepWalk. To this end, DjCaNE must use a single convolution layer in the combination layer to aggregate nodes’ content and context efficiently. This convolution must take in the compact representation of features to keep the model scalable. Finally, short-range feature smoothing with lightweight sparsified random walk-based kernels and long-range contrastive learning are keys to future light and efficient models in the graph domain.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
