Abstract
Static node embedding algorithms applied to snapshots of real-world applications graphs are unable to capture their evolving process. As a result, the absence of information about the dynamics in these node representations can harm the accuracy and increase processing time of machine learning tasks related to these applications. Aiming at fill the gap regarding the inability of static methods to capture evolving processes on dynamic networks, we propose a biased random walk method named Evolving Node Embedding (
Introduction
Real-world data and processes are often modeled as graphs, since these abstractions can express the relationships between entities of interest in an intuitive and useful way. Relationships in online social networks ([17]), protein-protein interactions in biological systems ([33]), telco churn prediction ([20]), location-based networks and route graphs in roadmaps ([3,24]), credit card fraud detecction ([29]) are some examples of data intuitively modeled as graphs (also referred to as networks).
However, in many real networks, access to data associated to nodes and edges is often difficult and/or costly. For this reason, it is not uncommon that, at all times, a large fraction of the data we desire to model is unobserved. In most cases, data is acquired through some sort of online search or exploration of the graph, which can be seen as an evolving process that increases the available knowledge about the network as the search progresses. At each step, information about topology, nodes and edges’ is collected, but the full knowledge may never be attained ([21]). Analyzing these networks is essential to extract hidden knowledge and meaningful patterns, since many machine learning tasks on graphs involve making predictions over nodes and edges ([8]. Effective graph analysis can consequently improve both these predictions and the performance of machine learning algorithms, which are highly influenced by the discriminate representation of the information extracted in the observed data ([1].
In this work, we focus on addressing issues that arise with evolving graphs, whose network topology is partially unknown throughout the graph exploration. In this scenario, we start from a small set of nodes and edges and, at each step of the search (i.e. network snapshot), nodes and edges may be added or removed. As an example of evolving graphs, we have the process of spreading fake news on social media. ([34]). As an example of evolving graphs, we have the process of spreading fake news on social media [33]. In this context, imagine that a few Twitter users issue a initial tweet with a fake news. Each of these users would then be modeled as a node in the graph, which is composed only by nodes in the beginning of the evolving process. After some time, other users start to posting or retweeting the same fake news. In this case, users posting the fake news are modeled as new nodes in the network and users retweeting the fake news not only are modeled as new nodes but also create and edge with the users from the original tweet. As the dissemination of the fake news goes on, the graph keeps evolving adding new nodes and edges. As mentioned before, networks evolve over time and capturing this evolution process can be decisive in several applications, as described in the example of fake news spread on Twitter. Therefore, it is essential that models are capable of not only capturing this temporal evolution but also doing so as quickly as possible, even when this dispersion event is still small (and also the spread network generated by this event). By doing this quickly, it is possible to avoid having greater and more effective damage control in these scenarios.
By modeling the way entities in the data relate to each other as a graph, we can extract useful information that cannot be obtained by analyzing the data in isolation. As an example, in PageRank, the importance of structure goes beyond first and second order neighborhoods: the centrality of a node is a recursive function of the entire graph ([2]). In this direction, a slew of machine learning methods have been developed for graph data in the past few years, mostly for tasks involving predictions over nodes and edges ([8]), by improving node and graph learning representations. Effective representations can improve both these types of predictions and the performance of machine learning algorithms, which are dependent on how discriminative is the information extracted from the observed data ([1,5]).
In the context of graph analytics problems graph embeddings were proven to be an excellent alternative for graph representation ([17]). Graph embedding is a framework for building low dimensional representations of the entire graph, or parts of it, e.g. nodes, edges and sub-graphs, while preserving structural information and graph properties ([5]). In this work, we focus on node embedding techniques, which aim to represent each node as a d-dimensional feature vector (d is an input) that accurately captures its relationships to other nodes ([5]).
Seminal node embedding methods focused on applying linear and non-linear transformations to a graph similarity or adjacency matrix, aiming to reduce the high dimensionality of non-relational data modeled as networks. These methods perform poorly on real-world networks due to high computational cost and lack of robustness ([10,27]). More recently, other node embedding approaches for static graphs ([10,23,27]) were proposed in order to learn feature representations for each node scalably and with a more consistent performance. These approaches outperform the seminal methods in all prediction tasks for graphs. Among these methods is node2vec ([10]), a static embedding technique for node representations, is often the best performing for node classification, although it also performs well on other tasks.
node2vec builds on the Skip-Gram model ([18]) by generating “sentences” of nodes through sampling sequences of nodes using a biased random walk, and learning node embeddings via an optimization process. Biased random walks can balance between behaving more as a BFS (breadth first search) or as a DFS (depth first search) depending on how node2vec parameters are chosen, defining a flexible notion of neighborhood ([10]). It is worth mentioning that there are also static node embedding techniques based on recurrent and convolutional neural networks, such as GE ([12]) and SDNE ([32]), but these tend to work better when larger volumes of data are available (which is not a characteristic of the scenarios hereby explored) and are, therefore, out of the scope of this paper.
All of the methods mentioned above assume that the graph topology is fully observable and static (i.e., does not change over time and thus do not consider evolving processes). In the scenario where the graph evolves as it is explored described above, applying one of these methods to generate node embeddings consists of running it on snapshots taken from the network. In this case, each snapshot and respective embedding represents an independent network. This approach disregards the fact that each snapshot may contain relevant information about the next, and that incorporating past information into the learning procedure could improve the accuracy of downstream inference tasks. Moreover, the resources spent in computing previous embeddings are entirely wasted. As a result, this approach leads to decreased performance in machine learning tasks mainly due to the characteristic of adding or removing nodes and edges of growing graphs, which directly impacts on the stability of the generated embedding, since the embeddings can change significantly from one snapshot to the next.
In order to compensate for the shortage of limitations and gaps mentioned before, in this work:
We conduct a comprehensive experimental study to evaluate the quality of the embeddings obtained for an evolving graph by
We evaluate the embeddings according to their performance on the node classification task using four different network datasets. Furthermore, we conduct an ablation study to show the impact of each of the new mechanisms incorportated in our model.
Our results show that our approach generates better embeddings: in a downstream node classification task, our embeddings achieve better performance than node2vec and the other baselines in almost all scenarios, with numbers up to 20% better.
The rest of the paper is organized as follows. In Section 2, we review the related work. We develop our method in Section 3 and report the experimental results in Section 4. We conclude the paper in Section 5.
Related work
Graph embedding techniques have drawn a lot of attention in the recent past, becoming the standard feature engineering paradigm for nodes features for graphs ([1,5]). In essence, they provide low dimensional representations based on large adjacency matrices. These techniques condense information from the nodes and edges of a graph into reduced structures, called embeddings.
Seminal works on graph for generating features for nodes include linear and non-linear dimensionality reduction algorithms, such as PCA, ISOMAP and Laplacian Eigenmaps ([1]), which have been applied to graphs’ adjacency or similarity matrices. These methods became popular due to their relative simplicity and effectiveness. However, they rely on finding eigenvectors of the (often sparse) affinity matrix, which can be extremely costly. The complexity of such methods is at least quadratic on the number of nodes, which does not scale well to large networks ([27]). Moreover, such approaches are not very robust, since applying these features as input to machine learning algorithms could reduce the values of accuracy, in different tasks ([3,8]).
In the last decade, a range of methods were introduced to address the high cost and low robustness of these seminal works by applying non-linear techniques, mainly inspired on random walks and deep learning, to learn representations for nodes in a graph ([10,23,27]). These approaches generate node embeddings which represent each node of a graph as a vector in a low dimensional space. These methods are known to scale well for very large networks, containing up to millions of nodes and billions of edges, and to outperform the first cited methods in many machine learning tasks, such as node classification, link prediction and anomaly detection.
The remainder of this section discusses the state-of-art methods for two classes of node embedding methods: (i) the static node embeddings, which generate a single representation for each node of the graph; and (ii) the dynamic node embeddings, which generate a temporal representation for each node through a series of vectors.
Static node embeddings
The very first methods for static node embeddings were proposed in the past five years, when researchers on this field changed the focus from traditional dimensionality reduction algorithms to scalable graph embedding techniques which leverage sparsity from real-world networks ([8]). Since then, a number of methods for generating embeddings have been proposed, such as HOPE ([22]), SDNE ([32]), and DNGR ([4]). Both SDNE and DNGR are methods for generating static embeddings based on deep learning models. In this case, a reasonably large data network is necessary for the correct training of these models. Additionally, HOPE was built to preserve high-order proximities of large scale graphs and capable of capturing the asymmetric transitivity. As the scenario in this work involves networks evolving from extremely small graphs (from 20 nodes onwards), these models are not applicable as baselines.
Among the most prominent methods, we select three methods which were used as baselines during our experimental study: DeepWalk ([23]), Large-scale Information Network Embedding (LINE) ([27]) and node2vec ([10]).
The methods described above have been developed for scenarios where the graph topology is completely observable upfront. We call these “static graphic scenarios” because the topology remains fixed. However, it has been shown that most large scale real-world networks are rarely fully available ([21,25]). Furthermore, they may evolve over time, for instance, with the addition or removal of nodes and edges ([7]). The most common way of representing evolving networks is through a sequence of snapshots, each containing the nodes and edges observed until that specific step of the exploration process.
Previous works have applied methods for generating static embeddings to evolving graph datasets. In this case, an embedding is generated for each snapshot of a network, and, at the end, the embeddings are compared to identify possible changes over the exploration of the network ([7,8]). However, in this case, each snapshot is treated by the algorithm as an independent network, not leveraging any information regarding the evolution of the network. In addition, the absence of relevant information related to the temporal evolution of the network impairs the learning procedure and the performance of downstream machine learning algorithms for node classification, link prediction and other tasks ([8,9]).
Dynamic node embeddings
Dynamic node embedding techniques have been proposed as extensions to methods originally designed for static embeddings in order to capture the temporal evolution of real-world networks. The main idea behind these methods is to update the embeddings using information from the previous embeddings along with information from the current state of the graph. By accounting for these two sources of information, these methods attempt to capture the time patterns present in the evolution of the graph. When these patterns get encoded in the embeddings, the performance of downstream machine learning tasks is expected to improve ([7]).
Several methods for generating embeddings in dynamic graphs have been proposed in the past years. Some of them are random walk based methods adapted to work with dynamic networks, markedly, dynnode2vec ([15]), HIN–DRL ([16]), DNE ([6]), tcc2vec ([19]) and tNodeEmbed ([26]). However, such approaches do not outperform the standard node2vec in well-know datasets. Refer to ([30]), for more details. On the other hand, most of the recently methods are based on deep learning techniques, particularly, DynGEM ([9]), dyngraph2vec ([7]), DyRep ([28]), SGNN ([14]) and GraphSAGE ([12]). Although they present competitive results in machine learning tasks when compared to static counterparts, the applicability of deep neural networks and deep autoencoders is restricted to graphs that are large enough to allow for learning a large number of parameters without overfitting. Hence, although they present a good alternative to treat dynamic graphs, they are only feasible for sufficiently large graphs. In the graph exploration scenario, the observed network can be rather small in the beginning. Hence these techniques cannot be applied.
This section introduces the definition of evolving graph used in this paper and describes the proposed model, named Evolving Node Embedding (
Definitions
Let
Given a graph
In particular, we consider evolving graphs that start relatively small and evolve by the addition or removal of new nodes and edges. This process can represent, for instance, social networks where users accept new friends (addition of edges), or neighbors who sometimes remove or delete contacts each time they move to a new neighborhood, or even other networks that are being discovered on-the-fly through a search algorithm.
Proposed method
In contrast to mechanism (i), in the original node2vec, random walks are independently generated for each time step, rendering it unable to capture temporal patterns across graph snapshots. To address this issue, we use both the walks sampled in step
Mechanism (ii) changes the way we start the training phase of Skip-Gram. The original Skip-Gram initializes the learning process from a vector of random weights. Instead, we initialize the weight vectors at time t using embeddings generated at time
node2vec provides a flexible notion of neighborhood through the use of biased random walks that interpolate between a BFS and a DFS-like behavior ([8]). This interpolation is governed by two parameters: the return parameter p controls how far from the source node the walk will go, by defining the likelihood of returning to a node that was visited in the previous step; the in-out parameter q directs the walk towards nodes that are further away from the source node. Random walks are in turn used to define the neighborhoods of the nodes by feeding the sequences of visited nodes to an extension of the Skip-Gram model ([18]). The extended model tries to learn a feature representation f for each node v that minimizes the negative log-likelihood of observing a neighborhood
We modify the original p and q parameters to change their values dynamically, allowing the balance between BFS and DFS biases to change over time. At the beginning, when the graph is small, we weaken the DFS behavior, reinforcing the BFS bias. Conversely, when the observed network becomes larger,
Algorithm 1 presents

When processing snapshot t,
In this section we describe the experiments performed to evaluate
Experimental setup
The proposed method was analyzed using five networks datasets representing undirected and unweighted networks. The classification task is defined by choosing one node subpopulation of interest and defining them as targets and the remaning nodes as non-targets, yielding highly unbalanced classes.
Table 2 summarizes the network characteristics of each dataset below:
Description and basic statistics of each network. “Targets”refers to the subpopulation of interest,
is the number of snapshots,
are the number of nodes and edges in the last snapshot, respectively, and
is the percentage of target nodes in the last snapshot of the network
Description and basic statistics of each network. “Targets”refers to the subpopulation of interest,
The evolving networks were modeled as a sequence of graph snapshots representing the graphs previously described at consecutive observation steps ([13]). We consider snapshots generated by two different network search algorithms: Selective Harvesting ([21]) and a modified version o Breadth-First Search (BFS).
Selective Harvesting (SH) performs a type of online network search. The goal of SH is to maximize the number of nodes found, belonging to a certain target subpopulation, given a partial view of the graph and assuming the cost to query node labels is high. At each step, SH makes a prediction, based on the graph structure observed so far and the queried node labels, about which unlabeled node is more likely to belong to the target subpopulation. It covers scenarios where good temporal embeddings for nodes could boost performance in a given machine learning task.
In this sense, SH resembles a DFS, with the addition of a guiding procedure, which decides which node to query next. Figure 1(left) illustrates two consecutives steps of SH execution. On the left snapshot, four nodes (black) have already been queried for labels. Four other nodes (gray) are known but unqueried and the remaning nodes (white) are still unknown at this point. Solid edges represent the known graph structure in this step, dashed edges are still unseen. In the next snapshot, on the right, a node is queried for its label, also revealing its outgoing edges and neighboring nodes. Nodes above the traced line are included in the current snapshot.
The second algorithm modifies the BFS original search procedure in order to explore one node of the same level at each query. Figure 1(right) illustrates an example of two consecutives steps of BFS procedure execution. On the left, the gray node indicates the node to be explored in the current query. After the query, the explored node turns to black and on the right, the gray nodes indicates the nodes to be query next.

Example of two consecutives steps of selective harvesting (left) and BFS (right) search algorithms. Black, gray and white colors represent queried, unqueried and unknown nodes respectively. Solid and dashed lines represent known and unknown edges respectively.
Figures 2 and 3 show the growth in number of observed nodes and edges, respectively, across snapshots for SH’s and BFS. With either algorithms, we observe that the evolving process in the networks snapshots start with very few nodes and edges for all datasets. Thus, as mentioned before, existing dynamic embedding techniques cannot be applied to this scenario since most of them are based on convolutional and recurrent neural networks. These models require large training sets in order to avoid overfitting.

Node count for each dataset through the snapshots for SH’s (top) and BFS (bottom) search procedure.

Edge count for each dataset through the snapshots for SH’s (top) and BFS (bottom) search procedure.
The number of snapshots varies for each search procedure. As the search begins from a random node, the neighborhood exploration differs for each node. Table 3 presents the number of generated snapshots for both SH’s and BFS. For reproducibility propuses, all graph snapshots used in this work are publicly available.1
Number of snapshots generated for each search procedure
We obtain node embeddings for each of the snapshots of the five networks using four static node embedding models as baselines, namely DeepWalk,
The embeddings were trained with 128 dimensions. DeepWalk and node2vec have three commom parameters, namely, size of the context window and number and length of the random walks. The size of the context window was set to
We tested all variants of our method using the same parameters as node2vec’s. The only difference is the value of the parameters p and q, which are static in node2vec and adaptive in our methods. We designed a strategy for changing these parameters as a function of the knowledge about the network. We define the start values for
Node classification
Our experimental study focuses on the node classification task. We use the node embeddings generated for each snapshot and the corresponding node labels to train a classifier along the graph evolution process. We consider four standard classifiers, namely Adaboost (AD), Logistic Regression (LR), Naive Bayes (NB) and Random Forest (RF). Classifiers were run with their default parameters, since the goal is to compare the embedding techniques. The performance of each combination dataset × embedding technique × classifier was computed over 5 executions of a 10-fold cross-validation procedure, each corresponding to a sequence of snapshots obtained from a different initial node. Since class examples are highly unbalanced, we chose Macro-F1 as our evaluation metric. The F1 measure is the harmonic average of precision and recall. The Macro averaging calculates metrics for each class, and finds their unweighted mean.
Also, to provide a holistic assessment of the embeddings over the entire set of experiments, we compute two evalutation metrics that account for all executions over all datasets used in the experimental study:
Roughly speaking, Mean Rank determines which embedding yields the best results more often. Mean Penalty, in turn, determines which embedding is best taking into account scores. Therefore, an embedding e can achieve the lowest Mean Penalty even if e does not yield the best results for all datasets and classifiers, as long as the score difference to the best embedding is never too large. Hence, Mean Penalty identifies embeddings that are robust and consistent to changes in datasets and classifiers throught the evolution process. By robustness, we mean the method should have small variations of Mean Penalty values across different datasets and classifiers. By consistency, a robust method should also be consistent regardless of changes in the network’s evolution scenario, i.e., the method should be consistent over time.
Next, we present and discuss the results found for the proposed research questions.
RQ1: Among the existing static node embeddings under consideration, which one achieves the best node classification performance in the described evolving network scenarios?
More specifically, using which static methods are we able to achieve the best results in the node classification task? Does node2vec still provide better results than the classical baselines DeepWalk,
Figure 4 presents the results of Mean Rank and Mean Penalty for all 5 datasets and each classifier using DeepWalk,
Figure 4(left) indicates that, for BFS, node2vec outperforms all other methods except at the first search stage. Also, node2vec’s performance is superior to LINE1 and LINE2 in all cases. Figure 4(left) also indicates that, for SH, node2vec outperforms LINE2 in all cases, LINE1 in 2 cases and Deepwalk in 1 out of 4 cases. Particularly, node2vec outperforms LINE2 in all cases, for both SH and BFS. Thus, we conclude that node2vec works best for BFS. For SH, Deepwalk yields competitive to superior results when compared to node2vec.
Figure 4(right) presents the Mean Penalty results for each baseline and search procedure. Note that the methods’ performance is consistent across classifiers and interation steps. From the reported results, node2vec achieves lower values of Mean Penalty in 5 out of 8 cases. Particularly, node2vec’s performance tends to be more robust than the other baselines. In other words, the deviation associated with the baselines’ performance is greater than the deviation associated with the node2vec’ performance. All baselines are consistent over time.
From the results discussed above, we find that node2vec displays great robusteness across datasets and classifiers, and great consistency over time. For this reason, from now on we focus the comparisons of our method with node2vec.

Results of mean rank (left) and mean penalty (right) for all datasets on all classifiers, regarding 4 iteration steps and 2 search methods (BFS and SH). Dw, l1, l2 and n2v represent, respectively, DeepWalk,
We showed in the previous section that node2vec is typically the best among the static methods and since
Figure 5(left) presents the Mean Rank results for the seven combinations of
The results of the BFS indicate that in each iteration step, at least one variation of
From the results in Fig. 5(left), we observe that

Mean rank (left) and mean penalty (right) of node2vec and the seven combinations of
From these analysis, we highlight 3 key facts: (i) any improvement in classification performance results only from the embedding obtained and not from the classifier used, (ii) the mechanisms of
We analyze the Mean Penalty results to assess the consistency and robusteness of all
Figure 5(right) presents the results of Mean Penalty for all
The similar values of Mean Penalty observed at different iteration steps indicate that changes during the evolution of the network do not interfere with the consistency of the method over time. In particular, we observe that our method and its variations are substantially more consistent than the other baselines analyzed. According to Fig. 5(right),
RQ4: Can we quantify and understand the effect of the modifications on node2vec over time?
From Fig. 5, we observe that leveraging information on the network evolution process through each of
Therefore, to awnser RQ4: (i) we analyze how the aggregation of evolving information affects the node classification performance during the exploitation process when compared to node2vec and (ii) we quantify this effect at different points of the evolving process. We show Macro-F1 values for different points in time on Tables 4 and 5.
Numerical results for the BFS search method, on four different iteration steps. In bold the best results of each dataset for each embedding method according to 4 different classifiers
Numerical results for the BFS search method, on four different iteration steps. In bold the best results of each dataset for each embedding method according to 4 different classifiers
Numerical results for the SH search method, on four different iteration steps. In bold the best results of each dataset for each embedding method according to 4 different classifiers
Table 4 shows the Macro-F1 achieved at four different iteration steps, after 25%, 50%, 75% and 100% of the evaluation period for the BFS search process. For each dataset, classifier and snapshot, we highlight in bold the best performing embedding method. We group the analysis of the table according to each snapshot.
Table 5 shows the numbers of Macro-F1 achieved at four different stages (25%, 50%, 75% and 100%) of the evaluation period for the SH search process. For each dataset, classifier and snapshot, we highlight the results of the best performing embedding method. We organize the analysis of the table based on the snapshot (25%, 50%, 75% or 100%).
To provide an overview of

Mean absolute macro-F1 results of node2vec and the seven variations of
The results of Macro-F1 for the BFS search are generally lower than the values of Macro-F1 for the SH search process. In addition, the values of Macro-F1 tend to increase over time for both BFS and SH search.
Also from Fig. 6, we observe that, for both SH and BFS, the variations dpq, dpq+wgt and wgt tend to provide the best results, while the variation all tends to provide lower values of Macro-F1 for both BFS and SH search process.
In this work we proposed
Our proposals to EVNE are generic enough to be applied to all SGNE and Random Walks based methods in a dynamic setting. However, in this work
After an extensive analysis of objective results in different types of datasets, we can conclude that
There are several directions of future work.
Footnotes
Acknowledgements
This work was partially funded by projects EUBra-BIGSEA (H2020-EU.2.1.1 690116, Brazil/MCTI/RNP GA-000650/04), ATMOSPHERE (H2020 777154 and MCTIC/RNP 51119), FAPEMIG (grant no. CEX-PPM-00098-17), MPMG (project Analytical Capabilities), CNPq (grant no. 310833/2019-1), CAPES, MCTIC/RNP (grant no. 51119) and H2020 (grant no. 777154).
