Abstract
Link prediction is an important sub-task in link mining area. This paper discusses link prediction in dynamic networks and proposes a new link prediction method which can learn from the long-term graph evolution of networks. The method first represents the variation of the structural properties in a dynamic network. Then, a classifier is trained for each property. It finally conducts link prediction process using an ensemble result of all the classifiers. Experiments in three realistic collaboration networks show that the evolution information of the network is beneficial for the improvement of link prediction performance and different structural property has different capability to describe dynamics of the network.
Introduction
Link prediction has been an area with intensive research in recent years driven by its ubiquitous application in a variety of scenarios such as collaboration networks, terrorist networks, communication networks and online social interaction. In recommendation system, it can predict links between users and items in a user-item bipartite graph representing preferences or purchases [1]. In bioinformatics area, the prediction of relation and regulation of proteins can guide the experiment designers to discover new biological factors [2]. In citation networks, scholars can find the papers potentially useful to cite and managers can identify future core papers with the help of link prediction [3]. According to Liben-Nowell and Kleinberg [4], link prediction is defined as “Given a snapshot of a social network at time t, we seek to accurately predict the edges that will be added to the network during the interval from time t to a given future time t’.”
However, most of the current researches in link prediction adopt the static graph representation where a single snapshot of the network is analyzed to predict hidden or future links [5]. These methods do not consider too much on the long-term graph evolution of networks, while in general the relationships in a network are dynamic and evolve over time. The sequence of snapshots of an evolving graph often carries additional temporal information such as the creation and deletion times of nodes and edges. Recently, time-aware techiniques [6–15] have been used for link prediction in dynamic networks. These methods achieved good results for different application purposes. In this context, we proposed a link prediction method based on machine learning from the variation of the structural properties in dynamic networks.
One of the classic link prediction frameworks [4] is under the supervised learning setup, where a prediction model is trained on the dataset composed of instances (i.e. node pairs) annotated with certain properties. Node pairs in a snapshot of a social network at time t are sampled as training instances. If a node pair forms a new link in the snapshot at the next time step t + 1, it is marked as positive instance. Otherwise, it is a non-positive instance. Different from this approach, our method focuses on the evolution of the network. Firstly, each instance can be described by a number of structural properties in the network. For each property, its values in a series of network snapshots are recorded as a property vector. A classifier is then trained by a set of positive and negative instances represented by the assigned property vectors. Finally, an optimization algorithm is used to weigh all the classifiers in order to get the final prediction.
The rest of the paper is organized as follows: Section 2 describes the background of dynamic link prediction problem. With the brief overview of existing work, the motivation of our work is explained. Section 3 describes the algorithm of the proposed method and how it differs from traditional link prediction methods. Experimental results using three datasets for DyLiP and several comparative methods are reported and discussed in Section 4. Section 5 concludes the paper and proposes some future work to be conducted.
Backgrounds
Dynamic link prediction problem
Recently, dynamic link prediction problem has been discussed by the following researchers and they concluded that temporal information can bring performance gain to link prediction task.
Related works
Sarkar et al. [6] proposed a nonparametric algorithm NonParam for link prediction in dynamic graphs which exhibits important advantages in the presence of nonlinearities such as seasonality patterns. Huang and Lin [7] introduced the time series link prediction and proposed a hybrid link prediction approach utilizing both the intra-link time series patterns and inter-link dependencies. They concluded that time series model can produce significantly improved prediction results than static graph models. Tylenda et al. [8] developed graph-based link prediction techniques that incorporate the temporal information in evolving social networks. Their results demonstrated that timestamps significantly improve the prediction accuracy of new and recurrent links. Sharan and Neville [9] presented an initial approach modeling dynamic relational data graphs in predictive models. They found that when considering dynamic data, their approach outperformed thebaseline snapshot approach. Potgieter et al. [10] extracted valuable temporal trends from the Pussokram online dating network dataset, and used logistic regression to perform link prediction. Their results showed that temporal metrics are a valuable contributor to link prediction. Da Silva Soares et al. [11] built time series for each pair of non-connected nodes by computing their similarity scores at different past time steps. They found that using temporal information, the results of the proposed method were more satisfactory than their traditional counterparts. Bliss et al. [12] proposed an evolutionary algorithm CMA-ES to link prediction in dynamic networks. CMA-ES is used to optimize the best weights for the linear combination of topological similarity indices and similarity indices of individual characteristics. The result showed that some indices are possible to capture the evolution of the network. Krzysztof et al. [13] defined the Triad Transition Matrix to discover and quantify the evolution patterns. Based-on the matrix, they proposed a dynamic link prediction algorithm called TTM-predictor. Richard et al. [14] evaluated the network evolution in an autoregressive framework. Their method outweighed several traditional heuristics methods. Aggarwal et al. [15] studied the problem of dynamic link inference in temporal and heterogeneous information networks. They proposed a two-level scheme which makes macro- and micro-decisions for combining structure and content in a dynamic and time-sensitiveway.
Most of the above methods [6, 9] studied the link prediction problem with structural hypotheses on the graph generation process. The evolution patterns are quantified by regularization and optimization on sparse low-rank adjacent matrix [13, 14] or by regression on graph features [10]. These methods are based on mathematical assumptions and require prior knowledge dependent on network type. Some work [11] performs the prediction of new links by exploring the evolution of topological metrics. But non of them take the variation of topological metrics as training data to learn as our method does. We believe the variation of metrics can be used to model trends in the network dynamics. The unsupervised methods [15] aim to rank the non-linked pairs of nodes and predict the top L pairs of nodes as new links. But how to define the threshold L and how to combine information provided by more than one metric is still a problem. Also, the accuracy of unsupervised methods is normally difficult to win that of supervised methods.
Notations of link prediction in dynamic networks
A dynamic network can be represented by a series of graphs (G1, G1, . . . , G
T
), where G
t
=< V, E
t
> is the network at time t. denotes a set of nodes or data instances (persons, locations, products, et al.) and E
t
denotes a set of observed links at time t. Let Label
t
=< v
i
, v
j
> denotes the label of an arbitrary pair of nodes, <v
i
, v
j
>representing the existence of a link at time t.
Each G t can be represented as a symmetric N × N adjacency matrix M t , where N is the total number of nodes in V. In M t , each element M t (v i , v j ) is the value of Label t =< v i , v j > or a frequency of edge occurrences of <v i , v j > attime t.
The link prediction task is to predict how likely an unobserved link e ij ∉ E t exists between <v i , v j > in the future network (for example at time T + 1). An N × N score matrix ST+1 is calculated, where each element ST+1 (v i , v j ) is a link score denoting the predicted link occurrence probability of <v i , v j > at time T + 1.
Time-aware properties vs. temporal properties
A property or a metric is usually calculated from a network graph and it can describe a relationship between <v i , v j >. For example, the number of common neignbors and the shortest path are always used as structural properties. Several temporal metrics such as Recency, Return and Moving Averages were used in link prediction [10] in order to improve the traditional methods based on structural properties. Actually, these temporal properties are calculated in one snapshot of network so they are generally used in static methods rather than in dynamic methods.
In a dynamic network, the property value could change upon time and therefore we call them time-aware properties. Let denotes a set of structural properties describing node pairs in the network. X k < v i , v j > t denotes the value of the k-th property of <v i , v j > at time t. ΔX k < v i , v j > t denotes the variation of X k < v i , v j > between two adjacent time slices t and t + 1 for <v i , v j >. is a sequence of ΔX k < v i , v j > t , where the value of t is in the interval [1, T - 1]. As a result, each <v i , v j > corresponds to a K × T matrix , where each value is X k < v i , v j > t . Moreover, <v i , v j > also corresponds to a K × (T - 1) variation matrix , where each value is ΔX k < v i , v j > t .
Learning in dynamic networks
Link prediction is regarded as a binary classification problem from the machine learning point of view [16]. The linked node pairs are marked as positive instances and the selected unlinked node pairs are marked as negative instances. A binary classifier is trained with these instances and can later determine the likelihood of an edge between node v i and v j that are currently unlinked. The classifier is usually based on a learning model like decision tree, k-NN, RBF network, SVM et al. Among them, SVM shows the best performance according to Al Hasan [16]. It is noticed that most of the supervised learning based link prediction methods were developed for static networks and focused on the selection of structural properties.
Currently, learning techniques have not been extensively used in dynamic link prediction. The work of Da Silva Soares et al. [11] is most relevant to ours in this paper. Their approach first builds time series for each pair of non-connected nodes by computing its topological metric scores at different past time. Then, a prediction model is used to get the scores in the next time step, and a hybrid vector is formed to train the classifier. Both the supervised and the unsupervised link prediction methods are used. It is noticed that the classifier model in their method is trained from non-realistic predictive values. Different to their method, our method pays more attention to the evolutionary process of the network. The variation of the structural properties is studied and trained by the classifier, which can better describe the evolution of the network. Besides, our method can predict new links as well as recurrent links in future.
The learning model in our method is flat SVM. However, it is not used to train the property values at each time step as used in the traditional link prediction methods, but to train the variation of property values before nodes are linked at some time-point. If the properties of an unlinked node pair demonstrate the variation compliant to the model learnt, a new link will be more likely formed.
The proposed method DyLiP
In this paper, a new link prediction method DyLiP (a Dynamic Link Prediction approach) is proposed for dynamic networks. DyLiP captures the variation of several structural properties in the network under a supervised learning framework. First of all, a series of the network evolutionary graphs (G1, G2, . . . G T ) is generated from the original network data. For an arbitrary node pair <v i , v j >, the change values of all its properties are calculated from the graphs and finally complete the matrix . For each property X k , one classifier is trained using a set of instances represented by and their labels Label T < v i , v j >. Positive and negative training instances can be selected from all the instances either in a random way or using a selective sampling technique. Finally, the prediction can be achieved by weighting all the classifiers to get an ensemble result. The weight of each classifier can be obtained by an optimization algorithm.
Different to the traditional static link prediction methods [16] based on supervised learning, a training instance in DyLiP is not the vector of all the properties describing a node pair, but a vector concerning one property. Different to the time-series based link prediction [11], the vector is not composed of the property value in each time step but the variation of the property between adjacent time steps. The most distinguishing feature of DyLiP is that through learning it could find the intrinsic relationship between the variation of structural properties and the formation of links.
Algorithm
The pseudo code of DyLiP method is listed as follows.
For X k in ∥
For each <v i , v j > in the train set
Calculate X k < v i , v j > 1, X k < v i , v j > 2, . . . , X k < v i , v j > T-1 ∥
Get∥ End For∥ Generate a new training set ∥ Train with a classification model∥ Get a classifier
End For
Optimize Label T < v i , v j >) 2 and get vector
Given an arbitrary <v i , v j >, predict the link possibility with
In the code description, denotes a training set generated for each X k , which is a sequence of variation values of the k-th property for <v i , v j >. , where i, j ∈ {1 . . . N}. is a classifier learnt by training all the instances in . Vecw = (w1, w2, . . . , w k , . . . , w s ) is a vector of weights, where w k is the weight assigned for.
The definition of ΔX k < v i , v j > t is important. In our work, it can be described as either
1. the difference
ΔX k < v i , v j > t |diff = X k < v i , v j > t+1 - X k < v i , v j > t or
2. the rate-of-change
In practice, X k < v i , v j > t is not necessary to be calculated from the beginning (i.e. t = 1), as the value of a structural property usually begins to change at a later time step. Therefore, a segment of [T - 1 - h, T - 1] can be truncated from the whole time period, implying that the next state of a network is related to the previous h states more tightly. The exact value of h will be determined byexperiments.
An illustrative example of DyLiP
DyLiP in a supervised link prediction method that can learn the connection between the variation of structural properties and the formation of links. Figure 1 illustrates a simple example on how DyLiP works within a series of network graphs. We can notice that some new links formed in the network at time t2 and t3. For the pair node <u, v>, two common properties CN (Common Neighbors) and JC (Jaccard’s Coefficient) are calculated. Using the difference measure, we get 4 values for <u, v>: ΔCN < u, v > t 1 = 0, ΔCN < u, v > t 2 = 1, ΔJC < u, v > t 1 = -0.1 and ΔJC < u, v > t 2 = 0.2, which forming two vectors (0,2) and (-0.1, 0.2). If node u and v are connected in time t4, then the labels of the two vectors are 1, otherwise 0. For n node pairs in the training set, we have n vectors on CN and on JC, respectively. Then, two classifiers Clf CN and Clf JC are trained based on two training sets and weighted by optimization. Finally, the possibility of forming a link for an arbitrary <u′, v′> will be predicted in time t5.
Experiments
Data sets and properties
In experiments, three collaboration networks (nucl-th, hap-lat and hep-ex) on the arXiv website are used to evaluate DyLiP and other several comparative methods. There are three main reasons to use these three datasets. Firstly, collaboration networks evolve with time and the formation of new links follows the dynamic law of network. Secondly, the scale of three datasets are suitable to experimental research. The datasets with a large-scale concerns the problem of time and space consumption while data selectively sampled is difficult to be consistent with the original distribution of data. Finally, they are also the datasets used in a comparative method [11].
In a collaboration network, a node represents an author and a link indicates that two authors once cooperated with each other (for example, they contributed a paper together). The life time of the networks and the total number of nodes and edges for each network are listed in Table 1.
In different networks, the division manner of time granularity is also different for the convenience of experiments. For nucl-th and hep-lat, the sequence is (G1994, G1996, . . . , G2008, G2010) by the time step of two years. For hep-ex, the sequence is (G2001, G2002, . . . , G2011, G2012) by the time step of one year.
We use the same experimental settings with Da Silva Soares′ method for a fair comparison. The structural properties used are: Preferential Attachment (PA), Common Neighbors (CN), Adamic Adar (AA) and Jaccard’s Coefficient (JC). Table 2 lists the description of the properties, where Γ (v) is the set of neighbors of a node v and |Γ (v) | is the cardinality of Γ (v).
Experiment settings
An evolutionary sequence of a network is (G1, G2, . . . , GT-1, G T ). All newly formed links from GT-1 to G T are grouped as set A. Each <v i , v j > satisfying the following conditions is added into set B: 1) the distance from v i to v j is 2-hop; 2) v i and v j have not be connected until GT-1. Thus, we have set C (C = A ∩ B) and set D (D = A - B). The instances in set C are regarded as the positive instances in the test set and the same quantity of instances selected from set D are regarded as the negative instances in the test set. The training set is formed in the same way, except that the set A here is generated from GT-2 to GT-1 and the set B is generated from GT-2.
The reasons to select the nodes which are 2-hop away from each other as training instances are: Firstly, the nodes with 2-hop from each other are more likely to connect since they have more common neighbors. Secondly, one of the comparative methods [11] selected 2-hop pairs to generate the training set and for the sake of comparison, we use the same way to generate training instances. Thirdly, If the distance within two nodes is beyond 2-hop, the value of structual properties based on common neighbors are usually zero which will greatly affect the results of the predictor.
Training instances and testing instances are randomly selected from the entire data with about 1: 1 ratio. The size of training sets and test sets in three networks are listed in Table 3.
The h value and the measure of dynamics mentioned in Section 3.1 should be determined in the experiments. Initially, h is assigned as 1. Two different training sets are generated using the difference method and the rate-of-change method respectively, from which two classifiers and are learned. If there was a time slice h′, where h′ > h and the AUC values calculated in the different settings or , h would be replaced by h′. At the same time, the measure of dynamics (difference or rate-of-change) would be decided depending on the larger one between and . In order to avoid the division by 0 in the rate-of-change method, each property value is preprocessed by adding 1 before calculating.
After finding an appropriate h value and an efficient measure of dynamics for each structural property, an ultimate classifier is obtained (in this paper, we use four classifiers for four properties). The Sequential Least Squares Programming algorithm is then used to optimize to get the weight vector Vecw.
Comparative methods
To verify the effectiveness of DyLiP, three comparative methods are coded. The first is a static link prediction baseline method based on the supervised learning [16] named Static-LP. The second is a dynamic link prediction method proposed by Da Silva Soares et al. [11]. Their work is most related to our work. The third is dynamic link prediction method based on the evolutionary algorithm proposed by Bliss et al. [12] (named EA-LP in this paper). It is a dynamic link prediction method based on matrix evolution.
In Static-LP, the experimental settings are the same as DyLiP except that each structural property is calculated only in GT-1. In the second comparative method, the property value at each time step is calculated, resulting in a sequence of property values for each structural property. A model is then used to predict its value in the next time step. The prediction values of all properties form a hybrid vector which will be trained by a learning model. The method uses four predictive models including Moving Average (MA), Average (Av), Linear Regression (LR) and Simple Exponential Smoothing (SES). In the third comparative method [12], only topological similarity indices are used, since the networks in our experiments do not contain individual characteristics. The CMA-ES algorithm is used to optimize the linear combination.
The division of training sets and test sets is the same in all four methods.
Experiment results and analysis
Table 4 to 6 list the measure of dynamics, the optimal training time slice (the value of h) and the weight of the classifier for each property in three networks. All the settings are decided by experimentation.
Figures 2 to 4 demonstrate the ROC curves of DyLiP and three comparative methods in each network. The red and black curves represent respectively the performance of two predictive models in [11] who achieved better performances in all four models. The purple curve represents the performance of EA-LP method [12].
The AUC values of all the methods in three networks are listed in Table 7.
The experimental results show that the performance of link prediction has been greatly improved by considering the variety of structural property values with time. For example, in nucl-th, the AUC value of DyLiP is 8.5% higher than that of the static method, and 13.4% higher than the best result of Da Silva Soares’ method. The AUC value of DyLiP is much higher than EA-LP in nucl-th and hep-lat. In hep-ex, the AUC value of two methods are nearly equal. It shows that the performance of EA-LP is unstable and sensitive to the network structure. In nucl-th and hep-lat, the Static-LP performs better than Da Silva Soares’ method. It is verified that sometimes the traditional simple methods could outperfom their sophisticated counterparts [4, 8]. Although Da Silva Soares’ method uses the dynamic information of the network, the prediction performance is worse than expected. The possible reason is that the use of predictive models seems a little simple and only the artificial predictive values are trained. In contrast, DyLiP uses the real information by training the evolutionary sequence for each property, yielding improved performance.
It is also noticed that the value of h is always small (between 2 and 4) in the three networks, which signifies that the evolution information of the network within a short time before the current time step is already sufficient for the prediction. Moreover, the different weights assigned to the classifiers in DyLiP show that the weights of Classifer-CN and Classifier-AA are overall higher than those of Classifer-PA and Classifer-JC in spite of the slight difference in different networks. It implies that the CN and AA properties are more suitable for describing the evolution of the network structure than PA and JC.
Overall, DyLiP is an effective attempt to explore the network dynamics using machine learning techniques. The result shows that in most cases, the performance of DyLiP is superior to the comparative methods, in particular two state-of-art dynamic link prediction methods. The advantage of DyLiP is that it can learn by capturing the dynamic evolution of network structure and correlating it with the formation of links. Moreover, DyLiP learns one predictive model for each property. So when a new property comes, it simply trains a model for this property without the need to retrain the whole data set as the traditional supervised link prediction methods do [16]. However, with the evolution of the network, the retraining of the model for each property is inevitable, which is the disadvantage of DyLiP. In addition, like other dynamic link prediction methods, with the increase of network size, the training time complexity is also a problem to be handled.
Conclusion
Due to the limitation of static link prediction methods, a new dynamic link prediction method named DyLiP is proposed. The main contributions of the paper are: 1) Supervised learning is used in dynamic link prediction method and DyLiP can predict links based on a model learnt from the variation of properties; 2) Two measures of dynamics are defined to describe the variation of properties; 3) From experiments, we found that different structural property has different capability to describe dynamics of the network.
In addition to the difference and the rate-of-change, other measures to characterize the variation of the network properties will be introduced in future. Currently, we only compared our method to two state-of-the-art methods [11, 12]. The comparison to other temporal link prediction methods such as the method based on time series model [8] or the method using matrix factorization [19] will be conducted.
