A meta-heuristic framework based on clustering and preprocessed datasets for solving the link prediction problem

Abstract

This study presents a solution to a problem commonly known as link prediction problem. Link prediction problem interests in predicting the possibility of appearing a connection between two nodes of a network, while there is no connection between these nodes in the present state of the network. Finding a solution to link prediction problem attracts variety of computer science fields such as data mining and machine learning. This attraction is due to its importance for many applications such as social networks, bioinformatics and co-authorship networks. Towards solving this problem, Evolutionary Link Prediction (EVO-LP) framework is proposed, presented, analysed and tested. EVO-LP is a framework that includes dataset preprocessing approach and a meta-heuristic algorithm based on clustering for prediction. EVO-LP is divided into preprocessing stage and link prediction stage. Feature extraction, data under-sampling and feature selection are utilised in the preprocessing stage, while in the prediction stage, a meta-heuristic algorithm based on clustering is used as an optimiser. Experimental results on a number of real networks show that EVO-LP improves the prediction quality with low time complexity.

Keywords

Data clustering data under-sampling feature extraction feature selection link prediction meta-heuristic algorithms similarity measures

1. Introduction

This study introduces an approach to optimise a solution for link prediction problem by utilising the advantageous features of clustering and meta-heuristic algorithms, such as Moth-Flame Optimisation algorithm (MFO), Genetic Algorithm (GA), Particle Swarm Optimisation (PSO), Whale Optimisation Algorithm (WOA) and Differential Evolution (DE). The target is to get a potential solution to the problem of predicting the possibility of a future connection between two nodes of a network, knowing that there is no connection between these nodes in the present state of the network.

Generally, graphs provide a natural abstraction to represent interactions between different entities in a network [1]. A real or synthetic network in the world can be represented as a graph of nodes and edges. As in social network, users are represented as nodes, while the interactions between these users whether are associations, collaborations or influences are presented as edges between these nodes [2]. For instance, if there is a snapshot of a social network at time t₀, link prediction looks for predicting the edges that will be added to the network during the interval from time t₀ to a given future time t accurately [2]. So, the link prediction problem is related to the problem of inferring missing links from an observed network [2].

Importance of link prediction can be found in its wide variety of applications. Graphs could be represented as social networks, transportation networks or disease networks [1]. Link prediction can specifically be applied on these networks to analyse and solve interesting problems like predicting outbreak of a disease, controlling privacy in networks, detecting spam emails, suggesting alternative routes for possible navigation based on the current traffic patterns and so on [1].

More formally, the link prediction task can be formulated as follows [2]. An un-weighted and undirected graph G = (V, E) represents a topological structure of a network, such as social networks, where V is the set of nodes in G, and E is the set of existed edges in G. In this G, each edge e = (u, v) ∈ E and nodes u, v ∈ V represents an interaction between u and v that took place at a particular time t(e). For two time instances, t and t′, where t′ > t, let G[t, t′] denotes a sub-graph of G consisting of all edges with timestamps between t and t′, and let t₀, $t'_{0}$ , t₁ and $t'_{1}$ be four time instances, where t₀ < $t'_{0}$ ≤ t₁ < $t'_{1}$ . Then for a given network G[t₀, $t'_{0}$ ], the output is a list of edges does not present in G[t₀, $t'_{0}$ ] that are predicted to appear in the network G[t₁, $t'_{1}$ ] (see Figure 1). It is important to note that in the link prediction problem, the value of V remains the same with time.

Figure 1.

Different snapshots of the network G at two time intervals. (a) A snapshot of a network at time [t₀, $t'_{0}$ ] and(b) a snapshot of a network at [t₁, $t'_{1}$ ] with a predicted link (dashed line).

There are three types of link prediction problems [3]. (1) Only adding edges to the existing network, (2) only removing edges from the existing structure and (3) both, adding and removing edges at the same time. This study focuses on the first type of link prediction, only adding edges to the existing network.

To understand why link prediction is considered as a problem, suppose a network G’(V, E’) be an undirected, and un-weighted graph that represents the observed network. Let the complete graph denoted by G(V, E), then G’ is a sub-graph of G and the set of missing edges is E – E’. For an edge e not belonging to E’, the job is to estimate the likelihood that e ∈ (E – E’).

Consequently, there are (|V|*|V− 1|/2) - |E’| possible missing edges to predict from the existing observed network G’. If G’ is a dense graph, then |E’| will be close to ((|V|*|V− 1|/2) − b), where b is some constant between 1 and |V|. Thus, there is a constant number of missing edges to choose from, c, and O(1/c) probability of choosing these missing edges correctly at random. If G’ is a sparse graph, then |E’| will be close to (|V| − 1). Thus, we have a ((|V|*|V− 1|/2) edges to choose from, and O(2/(|V|*|V− 1|)) probability of choosing correctly at random. This is summarised in Table 1 and presented in Figure 2.

Table 1.

A comparison between dense and sparse graph based on the probability of choosing at random.

	\|E’\|	Probability of choosing at random
Dense	((\|V\|\|V −* 1\|/2) – b)	O(1/c)
Sparse	\|V\| − 1	O(1/\|V\|²)

V is a set of nodes, E is a set of edges, b is a constant between 1 and |V| and c is number of edges to choose from.

Figure 2.

Predicting the correct edge in a sparse graph versus a dense graph. (a) A dense graph with two edges, out of 28, to predict from (red dashed lines) and (b) a sparse graph with 22 edges to predict from.

Unfortunately, there are many networks, such as social networks, that are sparse. So, there is a need to find a way to narrow this probability down and make link prediction a more feasible problem. In this case, the need for meta-heuristic algorithms actually rises to guarantee finding the optimal prediction quality in a reasonable time.

The structure of this study is as follow: Section 2 presents the related works of link prediction. Section 3 describes the methodology of EVO-LP framework with illustrated example. Section 4 is about the complexity analysis of EVO-LP. Section 5 presents the data setup. Section 6 shows the experimental results and their discussions. Finally, conclusions and future work are in Section 7.

2. Related works

Due to the importance of link prediction problem in network science and its applications in different field, various approaches were proposed to provide a potential solution for prediction. This study will focus mainly on approaches used static, undirected and un-weighted networks type.

Liben-Nowell and Kleinberg [2] formalised the question ‘Given a snapshot of a social network, can we infer which new interactions among its members are likely to occur in the near future?’ as the link prediction problem. They developed approaches to link prediction based on measures for analysing the ‘proximity’ of nodes in a network. In their work, they focused mainly on static link prediction problem especially for social networks [2].

Their proposed methods are adapted from techniques used in graph theory and social network analysis. All these methods assign a connection weight score(x, y) to pairs of nodes (x, y), based on the input graph G, and then produce a ranked list in decreasing order of score(x, y). This score represents the local similarity measure between the nodes x and y. Some of these methods are based on node neighbourhoods, such as Common Neighbours (CN), Jaccard’s Coefficient (JC) [4], Adamic/Adar (AA) and Preferential Attachment (PA). Others are based on the ensemble of all paths, such as Katz, Hitting time, PageRank and SimRank [2]. In their work, five datasets from which co-authorship networks were constructed are used: astro-ph (astrophysics), cond-mat (condensed matter), gr-qc (general relativity and quantum cosmology), hep-ph (high energy physics–phenomenology) and hep-th (high energy physics–theory). To more meaningfully represent predictor quality, they used a random predictor as baseline. Each predictor’s performance on each dataset is measured in terms of the factor improvement over this random predictor. As a result, some of the very simple measures perform surprisingly well, including CN and the AA measure [2].

Li et al. [5] proposed a similarity index for link prediction. Unlike most of the similarity-based algorithms which only focus on the roles of paths connecting two nodes but ignore the influences of two nodes themselves, they considered the roles of paths to end nodes in the network and the roles of end nodes themselves. Their proposed method not only distinguishes the roles of different paths but also integrates the roles of the end nodes. Consequently, their method obtained better performance on accuracy.

Lichtenwalter et al. [6] presented an effective flow-based predicting algorithm called PropFlow. PropFlow is unsupervised and a global similarity measure based on information flow. It represents the amount of information that flows from the source to the destination node across different paths. The higher the information gathered, the larger the similarity value between source and destination node. In their work, they extracted features such as in-degree, out-degree, PropFlow, Katz, CN and shortest path from two different datasets called cond-mat and phone. Then, they applied two different ensemble methods: bagging and random forests. The performance is measured based on the value of area under receiver operating characteristic (ROC) curve. The general framework they proposed achieves major improvements over existing methods. They did not make any feature selection methods to choose which feature set will be the suitable set. However, they use only features that are mentioned in Lichtenwalter et al. [6] for generality, but any other available features, including measures of reciprocity or node attributes, could be used.

Many other works take the advantages of supervised machine learning methods or algorithms for link prediction problem. Al Hasan et al. [7] studied link prediction as a supervised learning task. They identified a short list of features for link prediction in the co-authorship domain (BIOBASE, and DBLP datasets). These features include Proximity Features, Aggregated Features and topological features. For BIOBASE dataset, nine features are used, and for the DBLP dataset only four features are used. The algorithms that they used are SVM, Decision Tree, Multilayer Perceptron, K-Nearest Neighbours, Naive Bayes, RBF Network and Bagging. They evaluate the result using accuracy, precision, recall, F-value and squared error. They showed that a small subset of features always plays a significant role in link prediction.

Kashima et al. [8] presented an efficient incremental supervised learning algorithm based on a probabilistic model. This model used topological features of the network structure for predicting links between the nodes.

Raymond and Kashima [9] proposed fast and scalable algorithms compared with Kashima et al. [8] for link propagation by introducing efficient procedures to solve large linear equations that appear in the method.

Cukierski et al. [10] incorporated 94 distinct graph features, as in Srinivas and Mitra [1], used as input for classification with Random Forests on flickr graph and using area under the curve (AUC) as performance metrics. One of their conclusions is, it is common practice to abstract social networks as graphs and develops highly general methods for characterising them.

Feng et al. [11] tried to understand how the network structure affects the performance of link prediction methods in the view of clustering. Their experiments on both synthetic and real-world networks show that as the clustering grows, the accuracy of these methods could be improved remarkably, while for the sparse and weakly clustered network, they perform poorly.

Guimerà and Sales-Pardo [12] supposed the stochastic block models empirically grounded because they capture fundamental properties of complex networks. The six similarity-based measures include CN, AA index, Resource Allocation (RA) index, Katz index, rooted PageRank and superposed Random walk. They used Political blogs, NetScience and Power grid as real-world datasets for testing. Precision and AUC are the performance measures for their results.

Thi and Hoang [13] concentrated on the extraction of more features from user to improve the quality of predicted links. They proposed a measure calculating the similarity of individuals. The measure considers not only network topology but also the personal features. They performed experiments on ArXiv HEP-Th. This dataset is a collection of abstract paper in theoretic high energy physics in 12 years (1992–2003), which includes 29,555 publications. Performance of prediction is evaluated on AUC score. The higher similarity score, the more ability a new link appears.

Feyessa et al. [14] used a backpropagation neural network to predict existence or emergence of a link between pairs of nodes using node pair properties such as reciprocity, transitivity and shared neighbours. A limited network visibility by individual nodes is assumed; hence, the size of the node pair feature vector varies with the given visibility range. Their approach is tested on a large social object centred trust network where visibility is limited to two hops, and 828 accurate predictions out of 1000 pair of nodes are achieved. The performance of link prediction approaches is evaluated by constructing a ROC curve as most classification algorithms.

Furthermore, nature-inspired-based approaches are used to resolve the link prediction problem. These techniques mostly include meta-heuristic optimisation algorithms. These algorithms can be defined as a higher level procedure that intended to discover, generate or choose a heuristic that will be responsible to offer a proper good solution to an optimisation problem [15].

There are number of works that take the advantages of evolutionary algorithms for link prediction such as in previous works [16 –18]. Chen and Chen [16] proposed a link prediction algorithm using the well-known ant colony optimisation algorithm (ACO). By exploiting the swarm intelligence, the algorithm employs simulated ants to pass through on a reasonable graph. The ant pheromone and heuristic information are assigned in the edges of the reasonable graph. The pheromone on each link is used as the final score of the similarity between the nodes. Experimental results on a number of real networks show that using ACO algorithm improves the prediction precision while maintains low time complexity in comparison with other traditional link prediction approaches such as Katz, CN and Jaccard.

Bliss et al. [17] provided an approach to predicting future links by applying the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to optimise weights which are used in a linear combination of 16 neighbourhood and node similarity indices. They used topological similarity indices. For node similarity, they calculated four indices: Twitter Id similarity, tweet count similarity, word similarity and happiness similarity. One of the most interesting parts of this work is the detection of similarity indices which develops to have large, positive weights in their link predictors. Perhaps the most remarkable similarity index for which this is the case is the RA index.

For most sparse networks, link prediction becomes a difficult task. The difficulty comes from the question: How to get high prediction accuracy while maintaining low execution time? Therefore, this study proposes a framework that includes dataset preprocessing approach and a meta-heuristic algorithm based on clustering for prediction. Meta-heuristic (or evolutionary) algorithms are GA, MFO, PSO, WOA and DE. This framework is called ‘EVO-LP’ for Evolutionary–Link Prediction and will be explained in the next section.

3. Methodology of EVO-LP

This section describes the methodology of the proposed EVO-LP framework. EVO-LP is a meta-heuristic framework based on preprocessed datasets and data clustering for Link prediction. This framework is divided into two stages. First, a preprocessed stage is applied on the original dataset for a particular network. This dataset consists of a list of the network’s existing edges. Second, link prediction stage is handled by a meta-heuristic algorithm that is mapped as a classifier using data clustering to predict links and evaluate the prediction results. Note that this framework considers only the prediction of link appearance.

Example 1. To illustrate ‘EVO-LP’ stages, an example of applying these stages on a small and simple network N of six nodes and five existed edges, as in Figure 3(a), is given. This network has the corresponding edge list (pair of nodes). Edge list, as in Figure 3(b), represents an example of a raw dataset used as an input for the preprocessing stage.

Figure 3.

An example of a network N: (a) a network of six nodes and five edges and (b) corresponding existing edge instance list of the graph in (a).

3.1. Preprocessed stage

This stage includes four steps: data division into training and testing sets, feature extraction, under-sampling data instances and feature selection. Preprocessed stage of EVO-LP is modelled in detail in Figure 4.

Figure 4.

The details of EVO-LP preprocessed stage.

3.1.1. Data division into training and testing sets

Link prediction is a time-related activity; therefore, time-stamped dataset should be used [3]. To achieve that, the raw dataset should be separated into two sets: training set which represents the current state of the network for time stamp [t, t₁], and testing set for the network’s future state for the time stamp [t₁, t₂], where t, t₁ and t₂ are different time instances and t < t₁ < t₂. Consequently, given a network graph G(V, E) where V is the set of vertices and E is the set of existed edges in graph G. Then, $G_{t, t_{1}} (V, E_{1})$ represents the network for training and $G_{t_{1}, t_{2}} (V, E_{2})$ is the unknown future network for testing. Where E₂ is a subset selected from the existed edges for testing, and (E − E₂) is the remaining set of existed edges for training represented by E₁. These networks should consist of the same number of vertices V. As a result, the training set will include the existed edges part for training in addition to considering all other possible edges in the network as non-existed edges. So, there will be two classes in the resulted dataset: existed edges which are denoted by 1 and non-existed edges which are denoted by 0. Note that the resulted testing set represents the actual classification of the network.

Example 1 (continue). To apply this step, network N which is depicted in Figure 3 can be divided into training and testing sets of edges. For instance, both existed edges 2–5 and 2–6 are selected for testing and the remaining ones are for training, as shown in Table 2.

Table 2.

The result from dividing the network N, illustrated in Figure 3, into training and testing sets (actual classification).

All possible edgein network N	1–2	1–3	1–4	1–5	1–6	2–3	2–4	2–5	2–6	3–4	3–5	3–6	4–5	4–6	5–6
Training set	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0
Testing set (actualclassification)	1	1	1	0	0	0	0	1	1	0	0	0	0	0	0

3.1.2. Feature extraction

Feature is defined as a function of the basic measurement variables or attributes that specify some quantifiable property of an object and is useful for classification and/or pattern recognition [19]. Feature extraction is the process to represent raw data in a reduced form to facilitate decision making such as pattern detection, classification, prediction or recognition [19]. Obtaining a good data representation is a very domain-specific task and it is related to the available measurements.

Similarity measures such as CN [2], AA [20], JC [4], PA [21], RA [22], Graph Distance (GD) [2] and PropFlow [6] are considered as features for each edge instance in the training dataset. These measurements could be considered as the topological features for the underlying network. As in Yu et al. [23], the topological features of networks may contribute more to improving performance than other type of features such as node features. Throughout this section, the symbols x, y denote nodes, N denotes number of nodes in the network. Γ(x) and Γ(y) denote the neighbour sets of these nodes and k is the average degree.

CN method assumes that if there are two nodes with many CN, they will be connected in the future. CN can be calculated based on formula (1) [2]

Common Neighbors = | Γ (x) \cap Γ (y) |

(1)

In AA method, the CN of a pair of nodes with few neighbours contributes more to the AA score value than this with large number of relationships. AA can be calculated based on formula (2) [2]

Adamic / Adar = \sum_{z \in Γ (x) \cap Γ (y)} \frac{1}{\log | Γ (z) |}

(2)

The JC is a statistic measure used for comparing similarity of sample sets. This method is based on CN method and is computed using formula (3) [2]

Jaccard Coefficient = | \frac{Γ (x) \cap Γ (y)}{Γ (x) \cup Γ (y)} |

(3)

PA method will calculate similarity score for each pair of nodes within the network rather than only the neighbour of nodes and is computed using formula (4) [2]

Preferential Attachment = | Γ (x) | * | Γ (y) |

(4)

In RA, the similarity between the nodes x and y can be defined as the amount of resource y received from x, which is computed using formula (5) [22]

Resource Allocation = \sum_{z \in Γ (x) \cap Γ (y)} \frac{1}{k (z)}

(5)

More details about similarity measures such as probFlow and GD can be found in previous works [2, 6, 22]. All these seven measurements are calculated for each edge instance of the training dataset, either it is existed or not. As a result, each edge instance has its own values of the corresponding similarity measures. The result is a new dataset that consists of number of rows and columns. Each row represents an edge instance, while each column represents the computed feature’s scores.

There is an important and additional step after feature extraction, which is called Weighting scores. In order to increase the prediction quality, either in feature selection process or in prediction process, it is better to give the existed edges class a weight to facilitate the differentiation and reduce the overlapping between the corresponding two classes.

Example 1 (continue). As a sample result, a dataset resulted from applying feature extraction process on the training dataset of Network N is presented in Table 3. F represents a set of features. F = {F_i, where i = 1, …, m}, where m is the number of features. The entries in the table are S = {score_ij, where i = 1, …, n, and j = 1, …, m}, where n is number of edge instance.

Table 3.

A sample dataset resulted from applying feature extraction process on the training dataset of Network N.

Edge instance	F ₁	F ₂	F ₃	………….	…………	F_m	Actual classification (y)
1–2	Score₁₁	Score₁₂	Score₁₃	…	…	Score_1m	1
1–3	…	…	…	…	…	…	1
1–4	…	…	…	…	…	…	1
1–5	…	…	…	…	…	…	0
1–6	…	…	…	…	…	…	0
2–3	…	…	…	…	…	…	0
2–4	…	…	…	…	…	…	0
2–5	…	…	…	…	…	…	1
2–6	…	…	…	…	…	…	1
3–4	…	…	…	…	…	…	0
3–5	…	…	…	…	…	…	0
3–6	…	…	…	…	…	…	0
4–5	…	…	…	…	…	…	0
4–6	…	…	…	…	…	…	0
5–6	Score_n1	Score_n2	Score_n3	…	…	Score_nm	0

3.1.3. Under-sampling the majority class

In order to process the dataset for prediction, the dataset should be balanced to get suitable values for quality metrics [24 –26]. If there is a dataset that contains two classes of instances, one of them is major, while other is minor, then this dataset should be adjusted. This adjustment is made to achieve balancing and ensure fair distribution for both classes. This balancing can be achieved by under-sampling the majority class or over-sampling the minority class in the particular dataset.

Oversampling randomly duplicates the minority class samples, while under-sampling randomly discards the majority class samples in order to modify the class distribution [27]. Because of the sparse nature of most of the social networks, the number of non-existed edges is very large in comparison with existed edges. Therefore, the resulted dataset is imbalanced. To solve the imbalance in the dataset, some instances of the major class should be discarded, and then both classes have the same number of instances.

Solving imbalance in the dataset is a necessary step for both feature selection and link prediction processes. When the dataset is balanced, the selected subset of features through feature selection will be more reliable, and both specificity and sensitivity of dataset’s classes in the link prediction stage will be improved.

3.1.4. Feature selection

Bellman [28] referred to a problem which is called ‘curse of dimensionality’. Curse of dimensionality can rise because the amount of data required to provide a reliable analysis grows exponentially with the dimensionality of the data [29]. A possible solution to this problem is feature selection. Feature selection searches for a projection of the data onto a subset of features which maintains the information as much as possible [29]. Also, feature selection helps to avoid over-fitting the data in order to make further analysis possible [29].

Feature selection works by removing redundant and irrelevant features and selecting only the most significant according to some criterion such as high prediction quality [23]. There are many of feature selection algorithms such as in Sheydaei et al. [30]. Commonly, feature selection algorithms are separated into three categories [29, 31, 32]: (1) the filters which extract features from the data without any learning involved, (2) the wrappers that use learning techniques to evaluate which features are useful and (iii) the embedded techniques which combine the feature selection step and the classifier construction.

In this study, the filter method is used besides GA which is utilised to optimise feature selection quality results measured by accuracy. The GA [33, 34] can be used as a filter to find the better subset of features [35, 36] wherein the chromosome bits represent whether the feature is included or not. Figure 5 shows an example for this chromosome design. The parameters and operators of the GA can be modified within the general idea of an evolutionary algorithm to suit the data to get the best performance [36].

Figure 5.

The chromosome design used for feature selection: 1 = for remaining features, and 0 = for filtered out features, where F={F_i, where i = 1, …, n}.

The dataset will be filtered out by deleting features’ column labelled by 0 in the particular chromosome. Then a learning model will be built based on the neural network for this filtered dataset. The result from this model is the predicted classification. By comparing the actual classification with predicted one, the accuracy can be computed. The objective function computes the accuracy as a feature selection performance measurement [36].

As a result, each dataset will have its own subset of features, and then, they will be ready, as inputs, for the next step, prediction stage.

3.2. Link prediction stage

Link prediction stage is the main process in this study. Prediction process is made through clustering the dataset instances into two separated classes; existed or non-existed edges, and optimised through utilising a meta-heuristic algorithm. To build a general framework for link prediction based on meta-heuristic algorithms, link prediction stage should be divided mainly into three functions. These functions are initialisation, optimisation and cost function. Figure 6 shows an overview for the link prediction stage.

Figure 6.

EVO-LP’s prediction stage.

3.2.1. Initialisation function

The initialisation function focuses on generating number of specified individuals to build a population. These individuals should be designed in precise and suitable way to support the objective of clustering. Based on clustering, the single individual should represent the centroids. Centroids can be considered as the candidate solution that the instances can be classified based on it, and since there are two possible clusters in this study, there are two centroids to be generated randomly for each individual.

As in Figure 7, each individual is composed of two rows; each row represents a distinct centroid. Each centroid contains number of columns equal to the number of dataset features. The values of each centroid are generated randomly but within the range of upper bound and lower bound for each feature. Figure 8 shows the pseudo-code of the initialisation function.

Figure 7.

The design of individual I, each I consists of two centroids, c₁ and c₂, of features F. Each value in the cells should be within the interval of the lower bound (F_i,Lb) and upper bound (F_i,Ub) of the specified feature F_i, where i = 1, 2, …, n.

Figure 8.

EVO-LP initialisation function.

3.2.2. Optimisation

In this study, the misclassification rate of prediction should be minimised to obtain more accurate prediction result. So, we need a meta-heuristic algorithm to optimise this value of misclassification. As known, optimisation can be by maximising or minimising the fitness value obtained so far from meta-heuristic iterations.

Commonly, most meta-heuristic algorithms have specified number of iterations. Their termination depends on some criteria or based on the maximum number of iterations. During these iterations, each generation of population, generated by the initialisation function, is updated through exploration and exploitation of the search space. As in Crepinsek et al. [37], ‘Exploration is the process of visiting entirely new regions of a search space, while exploitation is the process of visiting those regions of a search space within the neighbourhood of previously visited points’. With addressing these two processes in meta-heuristic algorithm, the best solution will be closer to find soon.

This study deploys number of meta-heuristic algorithms for link prediction and conducts a comparison between them. These algorithms are as follows: MFO [38], and its modified versions MFO2 and MFO3 [39], WOA [40], PSO [41], GA [42] and DE [43].

MFO, as defined in Mirjalili [38], is a novel nature-inspired meta-heuristic and optimisation paradigm. MFO optimiser is inspired from the navigation methods of moths in nature. The first is called transverse orientation. In this navigation, a moth moves by trying to keep the same angle with regard to the light resource (as known that moths are attracted by light resources). Given that the light resource is distant from the moth, keeping the same angle with respect to the light guarantees moth flying in a straight line. In addition to keep moving in a straight line, moth usually flies spirally around the region of the light, which is considered as the second. Therefore, moth in the long run will converge on its way to the light. Mirjalili [38] utilised these navigation methods, transverse and spiral navigation, and developed the MFO. A conceptual model of these two navigation methods is illustrated in Figure 9.

Figure 9.

The transverse in a straight line (the red arrow) and the spiral movement of a moth.

The general framework of MFO is as follows: First, in the initialisation stage, MFO generates a random population of moths, and corresponding fitness values. Second, the iteration stage, where the main function is executed and the moths move around the search space. Final stage returns true if the termination criterion is satisfied and false if the termination criterion is not satisfied.

In MFO algorithm [38], a logarithmic spiral is selected as the main update method of moths. This spiral movement is the main component of the MFO because it dictates how the moths update their positions around flames. The spiral movement allows a moth to fly around a flame and not necessarily in the space between them. Therefore, the exploration and exploitation of the search space can be reliable. Logarithmic spiral is defined as in formula (6) [38]

M (i, j) = D . e^{bt} . \cos (2 Π t) + F (i, j)

(6)

where D indicates the distance of the ith moth from the jth flame, b is a constant for defining the shape of the logarithmic spiral and t is a random number in [−1, 1]. D is calculated as in formula (7)

D = | (F (i, j) - M (i, j)) |

(7)

where M(i, j) indicates the jth position for the ith moth, F(i, j) indicates the jth position for the ith flame and D indicates the distance of the ith moth from the jth flame.

Soliman et al. [39] proposed new versions of MFO, which are called MFO2 and MFO3. They used them to select optimal feature subset for the classification purposes. Instead of using the logarithmic spiral formula as in original MFO, they used the hyperbolic (reciprocal) spiral as in formula (8a) to produce MFO2, and the Archimedean spiral as in formula (8b) to produce MFO3

M (i, j) = \frac{D . \cos (2 Π t)}{t} + F (i, j)

(8a)

M (i, j) = D . t . \cos (2 Π t) + F (i, j)

(8b)

In addition, Sayedali and Anderew [40] proposed the WOA, which is a recent meta-heuristic optimisation algorithm, which mimic the social activities of whales. The algorithm is inspired from the hunting mechanism of the humpback whales.

PSO, as proposed by Kennedy and Eberhart [41], is a population-based stochastic optimisation technique, inspired by social behaviour of bird flocking or fish schooling, while GA [42] is a meta-heuristic inspired by the process of natural selection that belongs to the larger class of evolutionary algorithms. GAs are commonly used to generate high-quality solutions to optimisation and search problems by relying on bio-inspired operators such as mutation, crossover and selection.

DE [43] is a method that optimises a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. The principal difference between GA and DE is that GA relies on crossover, while DE strategies use mutation as the primary search mechanism. More details about MFO, MFO2, MFO3, WOA, PSO, GA and DE can be found in previous works [38 –43], respectively.

For each of these meta-heuristic algorithms, the main loop that represents the iterations loop and contains their main processes is used to optimise the link prediction results. Figure 10 shows the pseudo-code of MFO main function as an example for the optimisation stage.

Figure 10.

The main steps of MFO algorithm [38].

3.2.3. Cost function

The main objective of data clustering is to classify the most similar instances or objects into a same group. Each instance in the dataset is defined by a set of features. To determine similar instances, a similarity measure between them is required [44]. There are many similarity measures used in the literature. This study uses Euclidean distance to calculate the similarity between the instances and the suggested centroids. Euclidean distance is defined as in formula (9). Note that the terms instance and object can be used interchangeably.

distance (o_{i}, o_{j}) = {({\sum_{p = 1}^{m} | o_{ip} - o_{jp} |}^{1 / 2})}^{2}

(9)

Where o is for instance, m is the number of features and o_ip is the value of the feature number p of the instance number i, o_jp is the value of the feature number p of the instance number j.

Each individual of the population generated in the initialisation function is represented as centroids. There will be two centroids, because there are two classes of edges, existed and non-existed edges. Euclidean distance is used to calculate the distance between each instance in the dataset and the specified centroids (individual). According to Euclidean distance, each instance belongs to a specific centroid, when its distance to this centroid is lesser than that for other centroid.

The result of a clustering should be evaluated. This is done by using validity indexes [44]. They are classified into internal and external one [44, 45]. This study uses sum of squares error (SSE), which is an internal validity index. The SSE is a prototype-based cohesion measure where the squared Euclidean distance is used. SSE is defined by formula (10) [44]

SSE = \sum_{i = 1}^{k} \sum_{j = 1}^{n} W_{ij} * \sqrt{\sum_{p = 1}^{m} {(o_{jp} - c_{ip})}^{2}}

(10)

where W_ij = 1 if the instance is in the cluster and 0 otherwise, k is the number of clusters, n is the number of instances, m is the number of features and c_ip is the value of the feature number p of the centroid of the cluster number i (c_i) [44].

Each candidate solution has its own SSE value. Calculating SSE helps in classifying instances into their related classes, the resulted classification represents the predicted result. To ensure that this predicted result is validated and reliable, the misclassification rate of its related solution should be calculated. The misclassification rate is an external validity index for the solution and defined by formula (11) [44]

ER = \frac{number of misclassified instances}{total number of instances} * 100

(11)

We use misclassification rate in this study because SSE value is not enough to identify the best solution obtained so far. We can have a solution that has the minimum value of SSE but with high misclassification rate. So, after computing SSE for each solution, misclassification rate is required to specify the most validated solution among other solutions. After that, the best solution obtained so far is examined for performance using the following metrics: sensitivity or recall, AUC, G-mean (geometric mean), entropy and the execution time.

To compute the values of these measurements, confusion matrix is required. As in Liu [46], ‘It is convenient to introduce these measures using a confusion matrix, see Table 4. A confusion matrix contains information about actual and predicted results given by a classifier’.

Table 4.

Confusion matrix of a classifier [46].

	Predicted positive	Predicted negative
Actual positive	TP	FN
Actual negative	FP	TN

TP: number of correct classifications of the positive instances (true positive); FN: number of incorrect classifications of positive instances (false negative); FP: number of incorrect classifications of negative instances (false positive); TN: number of correct classifications of negative instances (true negative).

Based on the confusion matrix, the recall or sensitivity of the positive class (existed edges in this study) is defined as in formula (12) [46]. Recall measures how precise and how complete the classification is on the positive class. In other words, recall is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set [46]

recall = \frac{TP}{TP + FN}

(12)

On the contrary, specificity of the negative class (non-existed edges in this study) is the number of correctly classified negative examples in the test set. Specificity measures how precise and complete the classification is on the negative class. Specificity is defined as in formula (13) [46]

specificity = \frac{TN}{TN + FP}

(13)

Because the social networks are sparse in their nature, the number of instances in the negative class is much larger than that for positive class. G-mean measures the balanced performance of a learning algorithm between positive and negative classes [47]. G-mean is defined in formula (14) [47]. In this formula, recall represents the accuracy of positive class, while specificity represents negative accuracy. Greater G-mean means better prediction results

G - mean = \sqrt{recall * specificity}

(14)

A ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR) [48]. It is also commonly used to evaluate classification results on the positive class in two-class classification problems. The classifier needs to rank the test cases according to their likelihoods of belonging to the positive class with the most likely positive case ranked at the top [46, 48]. The TPR represents recall, while FPR is defined as in formula (16) [46]. Accuracy is measured by the area under ROC curve or AUC as abbreviation. Greater AUC means better prediction model. AUC could be computed using formula (15)

AUC = \frac{1}{2} (1 + TPR) (1 - FPR)

(15)

FPR = \frac{FP}{FP + TN}

(16)

In addition, entropy is considered as performance measure. Entropy is an external validation measure [49]. The lower entropy means better clustering. The greater entropy means that the clustering is not good. The quantity of disorder is found by using entropy. So, it is expected that every cluster should have low entropy to maintain the quality of clustering results [49]. Entropy is defined as in formulas (17) through (19). Figure 11 shows the pseudo-code for the cost function

entropy = w_{1} (\frac{TP + FP}{TP + FP + TN + FN}) + w_{2} (\frac{TN + FN}{TP + FP + TN + FN})

(17)

where

w_{1} = - [(\frac{TP}{TP + FP} * \log (\frac{TP}{TP + FP})) + (\frac{FP}{TP + FP} * \log (\frac{FP}{TP + FP}))]

(18)

w_{2} = - [(\frac{TN}{TN + FN} * \log (\frac{TN}{TN + FN})) + (\frac{FN}{TN + FN} * \log (\frac{FN}{TN + FN}))]

(19)

Figure 11.

EVO-LP cost function.

4. EVO-LP complexity analyses

To determine the time complexity for EVO-LP, its main functions should be analysed. As illustrated in Section 3, EVO-LP mainly consists of initialisation, optimisation and cost functions. Throughout this section, the symbol x denotes the number of instances in the processed dataset, k is the number of classes, s is the size of population, dim is the number of features in the processed dataset and t is the maximum number of iterations. The overall time complexity for EVO-LP is expressed as in formula (20)

O (EVO - LP) = O (initialization function) + O (t (O (cost Function) + O (optimization function)))

(20)

The time complexity for initialisation function can be equal to O(k s dim). Because there are just two classes in this problem, k will be equal to 2. Therefore, time complexity for the initialisation will be approximated to O(2 s dim), which can be approximated to O(s dim). See formula (21)

O (initialization function) = O (s \dim)

(21)

In the cost function, the pairwise distance between each instance x in the dataset and the individual I is calculated with time complexity of O(x). Assigning each instance x to its closest centroid in I is equal to O(x). So, the total complexity for the cost function will be O(x) + O(x) = O(2x) which can be approximated to O(x). Because the cost function is applied on each population s, the complexity will be as in formula (22)

O (cost function) = O (s x)

(22)

The time complexity for the optimisation function depends mainly on the utilised optimiser for the optimisation process, which can be MFO, GA, PSO or other optimiser. As an example, the time complexity for MFO, as in Mirjalili [38], is as in formula (23)

O (MFO) = O (s^{2} + s \dim)

(23)

Based on formulas (20)–(22), the overall time complexity for EVO-LP is as in formula (24). As an example, by utilising MFO in the optimisation function, and based on formula (23), the overall time complexity for MFO-LP can be represented as in formula (25)

O (EVO - LP) = O (s \dim) + O (t (O (s x) + O (optimization function)))

(24)

O (MFO - LP) = O (s \dim) + O (t s x) + O (t s^{2}) + O (t s \dim)

(25)

Because the term O(s dim) is smaller than the term O(t s dim), the latter dominates. In addition, the term O(t s2) is larger than O(t s dim), because population size s usually is much larger than the number of features dim. So, the term O(t s2) dominates the term O(t s dim) in formula (25). This study needs number of population s much smaller than the number of instances x. Then, the value of O(t s2) is still smaller than the term O(tsx). Therefore, the term O(tsx) dominates the term O(t s2). So, the time complexity of MFO-LP can be approximated as in formula (26)

O (MFO - LP) = O (t s x)

(26)

From this formula, we can conclude that EVO-LP time complexity depends mainly on the number of instances x in the processed dataset and the maximum number of iterations t.

5. Data setup

For experiments, we keen to select the number of networks that satisfies these criteria: (1) ensure the diversification of the network’s from different fields, such as social networks, biological networks, information networks and publication networks; (2) ensure gradation in the number of already existed edges between the selected networks. Based on these criteria, we chose five benchmark datasets [50, 51]. These datasets are as follows: US airport network (USAir), network of the US political blogs (Political blogs), co-authorships network between scientists (NetScience), network or protein interaction (Yeast) and the King James Bible and information about their occurrences (King James) [51].

In order to make these datasets ready for the prediction process, preprocessed stage, described in Section 3.1, is applied on them separately. All processes in this stage are implemented and tested in MATLAB R2017a.

5.1. Dataset division

Each dataset will be divided into training and testing sets as 80% for training set and 20% for testing set. The ratio 80/20 is used based on the Pareto Principles [52] which states that ‘for many events, roughly 80% of the effects come from 20% of the causes’.

5.2. Feature extraction

After that, features are computed for each pair of nodes in the training set either this pair already has an edge or not. So there will be two classes in the dataset, existed and non-existed edges. These pairs of nodes are considered as instances in the dataset. As a result, similarity scores resulted from applying the similarity measures are considered as instance’s features. Similarity measures used in feature extraction are AA, PA, CN, JC, RA, GD and PropFlow.

5.3. Weighting scores

It is important to take care about the distribution of the instances in the dataset. Dataset distribution helps in specifying whether instances of each class are disjoints or overlapped. Availability of overlapping in the dataset reduces the prediction quality. So, overlapping should be reduced as possible. This had been done by multiplying class1 (existed edges) scores by a factor b, where b is an integer randomly selected such that b = {2, 3, …, 10}. Figures 12(a)–(j) show the class distribution, of USAir, NetScience, Political blogs, Yeast, King James datasets, before and after weighting the scores is presented. In these figures, blue coloured area is for class1, existed edges, while pink coloured area is for class2, non-existed edges. The area coloured in purple represents the overlapping between instances of class1 and class2 of the dataset. It is very clear, overlapping is very high before weighting which will affect the prediction result latter on. Therefore, weighting scores reduce these overlapping.

Figure 12.

(a) The distribution of USAir dataset before weighting scores, (b) the distribution of USAir dataset after feature’s weighting, (c) the distribution of NetScience dataset before feature’s weighting, (d) the distribution of NetScience dataset after feature’s weighting, (e) the distribution of Political blogs dataset before feature’s weighting, (f) the distribution of Political blogs dataset after feature’s weighting, (g) the distribution of Yeast dataset before feature’s weighting, (h) the distribution of Yeast dataset after feature’s weighting, (i) the distribution of King James dataset before feature’s weighting and (j) the distribution of King James dataset after feature’s weighting.

5.4. Under-sampling data instances

As described in Section 3.1.3., to solve the imbalance in the dataset, some instances of the major class (non-existed edges) should be discarded. As a result, a balance is achieved.

5.5. Feature selection

In order to increase the quality of the evaluation results, feature selection is required. This is done by using GA as a feature’s filter as described in Section 3.1.4.

Table 5 shows summery of each benchmark dataset. This summery includes number of nodes and edges of the original dataset, number of instances resulted from under-sampling process, and suitable subset of features, for each dataset, resulted from feature selection process. After applying feature selection, USAir, NetScience, Political Blogs, Yeast and King James datasets will have four, three, four, five and four number of features, respectively. Narrowing down features number for each dataset depending on feature selection helps improving the prediction results latterly in the prediction stage. Each selected feature is checked with a tick under it with respect to the dataset. AA is for Adamic/Adar, CN for Common Neighbour, RA for Resource Allocation, PA for Preferential Attachment, JC for Jaccard Coefficient, GD for Graph Distance and PF for PropFlow. As a result, five processed datasets are ready for EVO-LP algorithm.

Table 5.

The original datasets details (number of nodes and edges) and their resulted number of instances and selected features.

				Features (similarity measures)
Dataset name	Nodes	Edges	Instances	AA	CN	RA	PA	JC	GD	PF
USAir	332	2126	4252	✓	✓		✓		✓
NetScience	1589	2742	5484		✓	✓				✓
Political Blogs	1222	5366	10732	✓				✓	✓	✓
Yeast	2375	6136	12272	✓	✓	✓	✓	✓
King James	1773	9131	18262		✓	✓		✓		✓

AA: Adamic/Adar; CN: Common Neighbour; RA: Resource Allocation; PA: Preferential Attachment; JC: Jaccard Coefficient; GD: Graph Distance; PF: PropFlow.

Each selected feature is checked with a tick.

6. Experimental results and discussions

In this section, the results of implementing EVO-LP using the five processed datasets are presented, evaluated and discussed. All experiments are implemented in MATLAB R2017a and conducted using Intel^® Core™ i7-5500 CPU @ 2.40 GHz processor, 8.00GB RAM, and Microsoft Windows8 of 64-bit Operating System.

The evolutionary algorithms: MFO, MFO2, MFO3, WOA, GA, PSO and DE, are chosen, as an optimised for EVO-LP, to test their prediction quality and conduct a comparison between them. Hence there are MFO-LP, MFO2-LP, MFO3-LP, GA-LP, PSO-LP, WOA-LP and DE-LP algorithms for testing. In addition, AA, which is one of the traditional similarity–based link prediction methods, is implemented and compared with EVO-LP results. AA is chosen for comparison, because it is one of traditional link prediction methods that performs surprisingly well [2].

For each EVO-LP algorithm, the population size is set to 50, the maximum number of iterations is set to 200 and the stopping criteria are the maximum number of iterations. Each EVO-LP algorithm had been executed 30 times independently. We focus on testing the prediction quality measured by the misclassification rate (cost function) and computing the performance metrics such as recall, G-mean, AUC, entropy and the computational time. Each of these measurements is stored for each independent 30 runs, and the best, average and standard deviation values are taken among these runs.

6.1. Cost function

As described in Section 3.2.3., misclassification rate is the cost function used to evaluate algorithm’s solution obtained so far. Lowest misclassification rate for one algorithm means highest prediction quality among other algorithms. Table 6 shows the misclassification rate for each EVO-LP algorithm applied on the five processed datasets. The Best, Average and STDEV represent the best, average and standard deviation of misclassification rate over 30 independent runs, respectively. In Table 6, the lowest Best, Average and STDEV are emphasised in bold-face. Bold STDEV represents the best standard deviation over EVO-LP algorithms, while the underlined represent the best standard deviation among all the algorithms including AA.

Table 6.

The misclassification rate for each EVO-LP algorithm.

Dataset/Algorithm		AA	MFO-LP	MFO2-LP	MFO3-LP	WOA-LP	GA-LP	PSO-LP	DE-LP
Processed USAir	Best	0.1410	0.0719	0.0061	0.1141	0.1101	0.1284	0.1242	0.1279
	Average	0.1410	0.1195	0.0962	0.1377	0.1135	0.2459	0.2293	0.1603
	STDEV	0.0000	0.0163	0.0688	0.0392	0.0075	0.0266	0.0572	0.0306
Processed NetScience	Best	0.1902	0.0573	0	0.1827	0.1501	0.1831	0.1831	0.1085
	Average	0.1902	0.1579	0.0549	0.2117	0.1606	0.2946	0.2886	0.2029
	STDEV	0.0000	0.0513	0.0853	0.0560	0.0177	0.0504	0.0538	0.0334
Processed Political blogs	Best	0.2142	0.1884	0.0926	0.1901	0.2001	0.1936	0.1901	0.1705
	Average	0.2142	0.1913	0.1619	0.2029	0.2003	0.2251	0.2045	0.1935
	STDEV	0.0000	0.0179	0.0508	0.0175	0.0346	0.0163	0.0487	0.0608
Processed Yeast	Best	0.3531	0.3434	0.3000	0.3463	0.3015	0.3498	0.3464	0.3017
	Average	0.3531	0.3784	0.3111	0.3797	0.3646	0.3845	0.3853	0.3759
	STDEV	0.0000	0.0137	0.0293	0.0117	0.0251	0.0081	0.0074	0.0171
Processed King James	Best	0.4935	0.3523	0.3501	0.3542	0.4276	0.4168	0.4132	0.4132
	Average	0.4935	0.3547	0.3508	0.3582	0.4296	0.4233	0.4201	0.4169
	STDEV	0.0000	0.1043	0.1042	0.1443	0. 0116	0.1362	0.1456	0.1296

AA: Adamic/Adar; MFO: Moth-Flame Optimisation algorithm; EVO-LP: Evolutionary Link Prediction; WOA: Whale Optimisation Algorithm; GA: Genetic Algorithm; PSO: Particle Swarm Optimisation; DE: Differential Evolution.

Best, Average and STDEV represent the best value, the average and the standard deviation of the misclassification rate over 30 independent runs, respectively. Italic STDEV represents the best standard deviation over EVO-LP algorithms, while the underlined represent the best standard deviation among all the algorithms including AA.

As shown in Table 6, all EVO-LP algorithms (MFO-LP, MFO2-LP, MFO3-LP, WOA-LP, GA-LP, PSO-LP, DE-LP) have average misclassification rate scores lower than that for traditional method AA. Among all EVO-LP algorithms, MFO2-LP has the lowest average misclassification rate score. Even for the largest King James dataset, the MFO2-LP is still the best with 0.3508, while the other algorithms get average misclassification rate scores from 0.3547 to 0.4296. MFO2-LP has a standard deviation not as good as that for other algorithms. This means MFO2-LP is less stable than other EVO-LP algorithms. AA has a 0 standard deviation because it is not a meta-heuristic algorithm and gives same results for each independent run for the same set of data.

From Tables 5 and 6, we can find that the misclassification rate scores of the datasets are proportional to their number of instances; an algorithm can get better misclassification rate on datasets with lesser number of instances.

Figure 13(a)–(d) indicates the behaviour of each EVO-LP algorithms during their iterations until getting the best solution that has the lowest misclassification rate. If we take a look at these figures, MFO2-LP clearly has the fastest convergence behaviour among other algorithms across the majority of the datasets. So, it has the ability to find the best solution early in comparison with other EVO-LP algorithms because it uses the hyperbolic (reciprocal) spiral to update search space instead of the logarithmic spiral found in the original MFO. DE-LP is the slowest. MFO-LP, MFO3-LP, GA-LP, PSO-LP and WOA-LP have a convergence speed and behaviour approximately close to each other.

Figure 13.

(a) The convergence curves of misclassification rate for each EVO-LP algorithms applied on the processed USAir dataset, (b) the convergence curves of misclassification rate for each EVO-LP algorithms applied on the processed NetScience dataset, (c) the convergence curves of misclassification rate for each EVO-LP algorithms applied on the processed Political blogs dataset and (d) the convergence curves of misclassification rate for each EVO-LP algorithms applied on the processed Yeast dataset.

6.2. Performance metrics

To evaluate the quality of the prediction results, performance metrics such as recall, G-mean, AUC, entropy and computational time are computed for each EVO-LP algorithm over the five processed datasets and 30 independent runs.

6.2.1. Recall

Recall measures how precise and how complete the classification is on the positive class, the existed edges. We need to measure recall which helps to quantify how much is the TPR in the prediction results. If we have high recall, then we can guarantee high values for AUC and G-mean. Highest recall for one algorithm means highest prediction quality among other algorithms. Table 7 shows the recall for each EVO-LP algorithm applied on the five processed datasets. The Best, Average and STDEV represent the best, average and standard deviation of recall over 30 independent runs, respectively. In Table 7, the highest Best and Average are emphasised in bold-face. Bold STDEV represent the best standard deviation over EVO-LP algorithms, while the underline STDEV represents the best standard deviation among all algorithms including AA.

Table 7.

The recall for each EVO-LP algorithm.

Dataset		AA	MFO-LP	MFO2-LP	MFO3-LP	WOA-LP	GA-LP	PSO-LP	DE-LP
Processed USAir	Best	0.6061	0.8998	1.0000	0.8448	0.8579	0.8086	0.8288	0.8231
	Average	0.6061	0.8354	0.8612	0.7991	0.8139	0.6720	0.6319	0.7177
	STDEV	0.0000	0.0228	0.1148	0.0833	0.0153	0.0538	0.1177	0.0843
Processed NetScience	Best	0.5257	0.8899	1.0000	0.7476	0.8279	0.7217	0.7127	0.7834
	Average	0.5257	0.7357	0.9096	0.6446	0.7228	0.5547	0.5699	0.6614
	STDEV	0.0000	0.0788	0.1405	0.1290	0.0513	0.1118	0.1135	0.0827
Processed Political blogs	Best	0.5821	0.7282	0.9584	0.7280	0.7464	0.6944	0.7092	0.7082
	Average	0.5821	0.7039	0.7813	0.6796	0.6934	0.6292	0.6754	0.6891
	STDEV	0.0000	0.0169	0.1294	0.0382	0.0477	0.0327	0.1119	0.0299
Processed Yeast	Best	0.5473	0.6294	0.6508	0.6233	0.6017	0.6094	0.6242	0.6153
	Average	0.5473	0.6233	0.6365	0.6101	0.5944	0.5991	0.6077	0.6019
	STDEV	0.0000	0.1156	0.1144	0.1213	0.0543	0.1078	0.1234	0.0906
Processed King James	Best	0.2934	0.5764	0.5876	0.5722	0.5748	0.5720	0.5012	0.5199
	Average	0.2934	0.3504	0.3612	0.3443	0.3326	0.3276	0.3350	0.3175
	STDEV	0.0000	0.0691	0.0779	0.0552	0.0732	0.0474	0.0691	0.1350

Best, Average and STDEV represent the best value, the average and the standard deviation of the recall over 30 independent runs, respectively. Italic STDEV represents the best standard deviation over EVO-LP algorithms, while the underlined represent the best standard deviation among all the algorithms including AA.

As shown in Table 7, all EVO-LP algorithms have average recall scores higher than that for the traditional method AA. Among all EVO-LP algorithms, MFO2-LP has the highest average recall score. Even for the largest King James dataset, the MFO2-LP is still the best with 0.3612, while the other algorithms get average recall scores from 0.3175 to 0.3504, but MFO2-LP has a standard deviation not as good as that for other algorithms.

6.2.2. G-mean

The G-mean measures the balanced performance of a learning algorithm between positive class (class1: existed edges) and negative class (class2: non-existed edges). Highest G-mean for one algorithm means highest prediction quality among other algorithms. Table 8 shows the G-mean for each EVO-LP algorithm applied on the five processed datasets.

Table 8.

The G-mean for each EVO-LP algorithm.

Dataset/Algorithm		AA	MFO-LP	MFO2-LP	MFO3-LP	WOA-LP	GA-LP	PSO-LP	DE-LP
Processed USAir	Best	0.6954	0.9276	0.9939	0.8850	0.8894	0.8693	0.8745	0.8707
	Average	0.6954	0.8793	0.9012	0.8585	0.8833	0.7310	0.7481	0.8286
	STDEV	0.0000	0.0166	0.0732	0.0465	0.0068	0.0320	0.0734	0.0397
Processed NetScience	Best	0.6425	0.9413	1.0000	0.8089	0.8496	0.8085	0.8085	0.7834
	Average	0.6425	0.8346	0.9425	0.7695	0.8302	0.6546	0.6648	0.6614
	STDEV	0.0000	0.0543	0.0893	0.0798	0.0204	0.0720	0.0728	0.0827
Processed Political blogs	Best	0.6133	0.8045	0.9059	0.8035	0.7981	0.7986	0.8035	0.8171
	Average	0.6133	0.8018	0.8324	0.7880	0.7914	0.7608	0.7809	0.7974
	STDEV	0.0000	0.0095	0.0543	0.0210	0.0065	0.0197	0.0856	0.0089
Processed Yeast	Best	0.5982	0.7392	0.7537	0.7304	0.7109	0.7099	0.7348	0.7271
	Average	0.5982	0.7292	0.7480	0.7241	0.6922	0.6964	0.7206	0.7130
	STDEV	0.0000	0.0748	0.0739	0.0824	0.0456	0.0876	0.0870	0.0581
Processed King James	Best	0.5117	0.6524	0.6778	0.6491	0.6760	0.6455	0.6490	0.6414
	Average	0.5117	0.5557	0.5685	0.5532	0.5352	0.5422	0.5432	0.5634
	STDEV	0.0000	0.0332	0.0491	0.0284	0.0250	0.0211	0.0249	0.0470

Best, Average, and STDEV represent the best value, the average, and the standard deviation of the G-mean over 30 independent runs, respectively. Italic STDEV represents the best standard deviation over EVO-LP algorithms, while the underlined represent the best standard deviation among all the algorithms including AA.

As shown in Table 8, all EVO-LP algorithms have average G-mean scores higher than that for the traditional method AA. Among all EVO-LP algorithms, MFO2-LP has the highest average G-mean score, because MFO2-LPs has the highest recall. Even for the largest King James dataset, the MFO2-LP is still the best with 0.5685, while the other algorithms get average G-mean scores from 0.5352 to 0.5634. On the contrary, WOA-LP has the best standard deviation over other algorithms.

6.2.3. AUC

Greater AUC means better prediction model. Highest AUC for one algorithm means highest prediction quality among other algorithms. Table 9 shows the AUC for each EVO-LP algorithm applied on the five processed datasets.

Table 9.

The AUC for each EVO-LP algorithm.

Dataset/Algorithm		AA	MFO-LP	MFO2-LP	MFO3-LP	WOA-LP	GA-LP	PSO-LP	DE-LP
Processed USAir	Best	0.7301	0.9084	0.9878	0.8551	0.8853	0.8452	0.8539	0.8504
	Average	0.7301	0.8495	0.8818	0.8323	0.8697	0.7358	0.7522	0.8249
	STDEV	0.0000	0.0203	0.0775	0.0347	0.0167	0.0250	0.0540	0.0230
Processed NetScience	Best	0.6487	0.9408	1.0000	0.7982	0.8498	0.7976	0.7961	0.8914
	Average	0.6487	0.8239	0.9384	0.7652	0.8229	0.6948	0.7000	0.7740
	STDEV	0.0000	0.0595	0.0958	0.0496	0.0230	0.0450	0.0492	0.0341
Processed Political blogs	Best	0.6211	0.7829	0.8385	0.7791	0.7879	0.7870	0.7791	0.8200
	Average	0.6211	0.7783	0.7953	0.7680	0.7664	0.7498	0.7662	0.7799
	STDEV	0.0000	0.0071	0.0322	0.0148	0.0187	0.0155	0.0427	0.0107
Processed Yeast	Best	0.5871	0.7630	0.7640	0.7594	0.6139	0.7033	0.7597	0.7438
	Average	0.5871	0.6965	0.7098	0.6739	0.5739	0.6183	0.6623	0.6996
	STDEV	0.0000	0.0569	0.0576	0.0603	0.0263	0.0534	0.0614	0.0445
Processed King James	Best	0.4982	0.6173	0.6966	0.6236	0.7000	0.6175	0.5989	0.6962
	Average	0.4982	0.6016	0.6176	0.6013	0.5746	0.5990	0.5950	0.5808
	STDEV	0.0000	0.0081	0.0330	0.0098	0.0418	0.0067	0.0167	0.0427

AUC: area under the curve; AA: Adamic/Adar; MFO: Moth-Flame Optimisation algorithm; EVO-LP: Evolutionary Link Prediction; WOA: Whale Optimisation Algorithm; GA: Genetic Algorithm; PSO: Particle Swarm Optimisation; DE: Differential Evolution.

The best, average and standard deviation of the AUC over 30 independent runs. Italic STDEV represents the best standard deviation over EVO-LP algorithms, while the underlined represent the best standard deviation among all the algorithms including AA.

As shown in Table 9, all EVO-LP algorithms have average AUC scores higher than that for the traditional method AA. Among all EVO-LP algorithms, MFO2-LP has the highest average AUC score. This is due to the large value of MFO2-LP’s recall, which means large TPR, and then large AUC. Even for the largest King James dataset, the MFO2-LP is still the best with 0.6176, while the other algorithms get average AUC scores from 0.5808 to 0.6016. The WOA-LP has the best standard deviation over other algorithms.

6.2.4. Entropy

Better prediction results can be obtained, when Entropy becomes closer to zero. Lowest entropy for one algorithm means highest prediction quality among other algorithms. Table 10 shows the entropy for each EVO-LP algorithm applied on the five processed datasets. As shown in Table 10, all EVO-LP algorithms have average entropy scores lower than that for the traditional method AA. Among all EVO-LP algorithms, MFO2-LP has the lowest average entropy score. Even for the largest King James dataset, the MFO2-LP is still the best with 0.8089, while the other algorithms get average entropy scores from 0.8366 to 0.9406. The WOA-LP has the best standard deviation over other algorithms.

Table 10.

Entropy for each EVO-LP algorithm.

Dataset/Algorithm		AA	MFO-LP	MFO2-LP	MFO3-LP	WOA-LP	GA-LP	PSO-LP	DE-LP
Processed USAir	Best	0.7741	0.3667	0.3142	0.5046	0.4434	0.5380	0.5241	0.5337
	Average	0.7741	0.5176	0.5085	0.5576	0.4810	0.7593	0.7220	0.5784
	STDEV	0.0000	0.0517	0.0804	0.0723	0.0331	0.0483	0.1021	0.0418
Processed NetScience	Best	0.8765	0.6495	0.2786	0.6483	0.5085	0.6495	0.6523	0.4133
	Average	0.8765	0.6548	0.5792	0.6987	0.5818	0.8108	0.8043	0.6821
	STDEV	0.0000	0.1530	0.0020	0.0757	0.0559	0.0688	0.0771	0.0636
Processed Political blogs	Best	0.8524	0.6747	0.4291	0.6810	0.6657	0.6703	0.6810	0.5994
	Average	0.8524	0.6822	0.6040	0.7021	0.6947	0.7370	0.6995	0.6786
	STDEV	0.0000	0.0135	0.1268	0.0288	0.0226	0.0278	0.0559	0.0186
Processed Yeast	Best	0.8105	0. 7351	0. 6635	0. 7601	0.7662	0.7654	0.7577	0.7652
	Average	0.8105	0.7594	0.7013	0.7891	0.7757	0. 7736	0.7693	0.7811
	STDEV	0.0000	0.0258	0.0254	0.0262	0.0083	0.0200	0.0264	0.0179
Processed King James	Best	0.9760	0.9224	0.7683	0.9157	0.7883	0.9216	0.9290	0.7895
	Average	0.9760	0.9364	0.8089	0.9357	0.8366	0.9390	0.9406	0.9351
	STDEV	0.0000	0.0064	0.0495	0.0074	0.0585	0.0049	0.0040	0.0297

6.2.5. Execution time

Figure 14 indicates the execution time for each algorithm over the five processed datasets. As shown from the figure, EVO-LP algorithms spend less time for execution in comparison with time spending by the traditional algorithm AA. As in formula (26), EVO-LP algorithms depend mainly on the number of instances in the processed datasets, while AA depends mainly on the number of nodes in the observed network. AA complexity is O(K2 N) [3], where K is the average degree and N is number of nodes in the network. Number of nodes in the network is usually much larger than the number of instances produced from preprocessing the network, as in our work. Therefore, EVO-LP algorithms performed faster than AA algorithm.

Figure 14.

Execution time of the algorithms.

Over the EVO-LP algorithms, we can observe that MFO-LP and their versions MFO2-LP and MFO3-LP have the minimal execution time. This is because MFO-LP has a complexity of O(t s x), where t is the maximum iterations, s is the population size and x is the number of instances. Due to s and t are the same for each algorithm, x will affect the execution time. Comparing Table 5 with Figure 14, when the number of instances in the dataset increases, the execution time will increase proportionally.

7. Conclusion and future work

This study presents a meta-heuristic framework based on clustering and using a processed datasets for solving link prediction problem. This framework is divided into preprocessing stage and link prediction stage. For each stage, there are a number of experiments conducted, as discussed previously, and conclusions are built upon these experiments.

The merits of this study lie in each step of the methodology of EVO-LP framework. Each step can be considered valuable in itself and complement to other steps. This section discusses these merits as follows. To ensure the gradation in the number of instances in the processed dataset, a variety of networks from different fields, such as social, biological, information and publication networks, are selected with varied number of nodes and already existed edges. In our approach, the number of instances in the processed dataset depends on the already existed edges in the original network proportionally.

Furthermore, dividing the original dataset into training and testing sets supports the prediction process and takes the benefits of supervised learning besides the unsupervised learning that can be found in the prediction stage by using data clustering process. Choosing 80/20 ratio based on the Pareto Principle [52] was successful. EVO-LP algorithms proved their success in the prediction quality results and execution time.

In addition, dataset distribution helps in specifying whether instances of each class are disjoints or overlapped. Availability of overlapping in the dataset reduces the prediction quality. So, overlapping should be reduced as possible. Therefore, the weighting score process should be applied to increase both the recall and the specificity of the processed dataset’s classes. As a result, the overall prediction quality will be improved well.

Data under-sampling and feature selection are very important processes because both of them help in increasing the prediction quality. Under-sampling the major class leads to improve both recall, TPR of the first class (existed edge) and specificity, true negative rate for the second class (non-existed edges) and then prevents one class dominate other class. Therefore, the performance of prediction measured by G-mean and AUC will increase significantly.

Preprocessing stage makes the complexity of our approach depends on the number of instances in the processed dataset, not on the number of nodes or edges in the original dataset. This is the reason that EVO-LP algorithms outperform the traditional similarity–based link prediction methods such as AA in both prediction quality and execution time.

The misclassification rate was chosen as a cost function to evaluate the prediction result. As illustrated in Section 3.2.3, this result is produced by applying data clustering through using SSE. We let misclassification rate decides whether the prediction result is approved or not because we cannot just depend on SSE as a cost function. This is because in some cases we can have small SSE but with large misclassification rate.

Also, we focused to measure the recall which helps to quantify how much is the TPR in the prediction results. If we have high recall, then we can guarantee high values for AUC and G-mean. Highest recall for one algorithm means highest prediction quality among other algorithms.

The experimental results show that the misclassification rate scores of the datasets are proportional to their number of instances; an algorithm can get better misclassification rate on datasets with lesser number of instances. This brings our attention to the need for dividing the dataset instances into number of subsets and sends each subset to one processor in parallel approach.

EVO-LP was applied using number of well-regarded meta-heuristic algorithms which are MFO-LP, MFO2-LP, MFO3-LP, GA-LP, PSO-LP, WOA-LP and DE-LP. A comparison between these algorithms was conducted in terms of recall, G-mean, AUC, classification error rate, entropy and the execution time. The experimental results show that applying MFO2-LP for link prediction outperforms other regarded meta-heuristic algorithms in terms of prediction results and execution time. Consequently, utilising the MFO2 for EVO-LP framework to produce MFO2-LP achieves the best prediction results in a reasonable amount of time.

Finally, processing dataset as in this work, make the dataset composed of number of rows. Each row represents an instance. These instances are independent of each other. This composition of processed dataset facilitates the division process into number of subsets. So, each subset can contain number of rows easily. By division, we can get higher prediction quality and less execution time. This division can be done through applying parallel approach and distribute each subset of data on processors. It is our future work to develop parallel EVO-LP for link prediction problem efficient in both prediction quality and execution time.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship and/or publication of this article.

References

Srinivas

Mitra

Link prediction in social networks: role of power law distribution. 1st ed. Springer, 2016, p. 57.

Liben-Nowell

Kleinberg

. The link prediction problem for social networks. In: Proceedings of the ACM CIKM 2003 international conference on information and knowledge management (CIKM’03), New York, 3–8 November 2003, pp. 556–559. New York: ACM.

Gao

Musial

Cooper

, et al. Link prediction methods and their accuracy for different social networks and network metrics. Sci Programming 2015; 2015: 172879.

Jaccard

Etude comparative de la distribution florale dans une portion des Alpes et du Jura. B Soc Vaudoise Sc N 1901; 37(142): 547–579.

Qian

Cheng

, et al. Accurate similarity index based on the contributions of paths and end nodes for link prediction. J Inf Sci 2015; 41(2): 167–177.

Lichtenwalter

Lussier

Chawla

New perspectives and methods in link prediction. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining KDD’10, Washington, DC, 25–28 July 2010, pp. 243–252. New York: ACM.

Al Hasan

Chaoji

Salem

, et al. Link prediction using supervised learning. In: Proceedings of the workshop on link analysis, counter-terrorism and security, on SDM06, Washington, DC, 22 April 2006. Bethesda, MD: Hyatt Regency Bethesda.

Kashima

Kato

Yamanishi

, et al. Link propagation: a fast semi-supervised learning algorithm for link prediction. In: Proceedings of the SIAM international conference on data mining, SDM 2009, Sparks, NV, 30 April–2 May 2009, pp. 1100–1111. Philadelphia, PA: SIAM Press.

Raymond

Kashima

. Fast and scalable algorithms for semi-supervised link prediction on static and dynamic graphs. In: Proceedings of machine learning and knowledge discovery in databases, European conference, ECML PKDD 2010, Part III, Barcelona, Spain, 20–24 September 2010, pp. 131–147. Berlin: Springer-Verlag.

10.

Cukierski

Hamner

Yang

. Graph-based features for supervised link prediction. In: Proceedings of international joint conference on neural networks, San Jose, CA, 31 July–5 August 2011, pp. 1237–1244. New York: IEEE.

11.

Feng

Zhao

Link prediction in complex networks: a clustering perspective. Eur Phys J B 2012; 85: 1–9.

12.

Guimerà

Sales-Pardo

Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences of the United States of America 2009; 106(52): 22073–22078.

13.

Thi

Hoang

T-A

. Features extraction for link prediction in social networks. In: Proceedings of the 13th international conference on computational science and its applications ICCSA’13, Washington, DC, 24–27 June 2013, pp. 192–195. Washington, DC: IEEE Computer Society.

14.

Feyessa

Bikdash

Lebby

. Node-pair feature extraction for link prediction. In: Proceedings of the IEEE 3rd international conference on social computing (SocialCom), Boston, MA, 9–11 October 2011, pp. 1421–1424. Cambridge, MA: MIT Press.

15.

Shi

Liang

Lee

, et al. Particle swarm optimization-based algorithms for TSP and generalized TSP. Inf Process Lett 2007; 103(5): 169–176.

16.

Chen

A link prediction algorithm based on ant colony optimization. Appl Intell 2014; 41: 694–708.

17.

Bliss

Frank

Danforth

, et al. An evolutionary algorithm approach to link prediction in dynamic social networks. J Comput Sci 2014; 5(5): 750–764.

18.

Barham

Aljarah

. Link prediction based on whale optimization algorithm. In: The international conference on new trends in computing sciences (ICTCS 2017), Amman, Jordan, 11–13 October 2017, pp. 55–60. New York: IEEE.

19.

Elnemr

Zayed

Fakhreldein

MA.

Feature extraction techniques: fundamental concepts and survey. In: Kamila

(ed.) Handbook of research on emerging perspectives in intelligent pattern recognition, analysis, and image processing. Hershey, PA: IGI Global, 2016, pp. 264–294.

20.

Adamic

Adar

Friends and neighbors on the web. Soc Networks J 2003; 25(3): 211–230.

21.

Newman

MEJ

. Clustering and preferential attachment in growing networks. Phys Rev E Stat Nonlin Soft Matter Phys 2001; 64: 025102.

22.

Zhou

Lü

Zhang

YC.

Predicting missing links via local information. Eur Phys J B 2009; 71: 623–630.

23.

Zhao

, et al. Similarity-based link prediction in social networks: A path and node combined approach. J Inf Sci 2016; 43(5): 683–695.

24.

Sleit

Al-Akhras

Juma

, et al. Applying ordinal association rules for cleansing data with missing values. J Am Sci 2009; 5(3): 52–62.

25.

Sleit

Dalhoum

Al-Dhamari

, et al. Efficient enhancement on cellular automata for data mining In:Proceedings of the 13th WSEAS international conference on systems (WSEAS), Rodos, Greece, 22–24 July 2009, pp. 616–620. Stevens Point, WI: WSEAS.

26.

Hammo

Sleit

El-Haj

Enhancing retrieval effectiveness of diacritisized Arabic passages using stemmer and thesaurus. In: Proceedings of the 19th Midwest artificial intelligence and cognitive science conference (MAICS 2008). Cincinnati, OH, 12–13 April 2008, pp. 189–196. Lancaster, UK: Lancaster University.

27.

Yap

Rani

Rahman

, et al. An application of oversampling, undersampling, bagging and boosting in handling imbalanced datasets. In: Proceedings of the 1st international conference on advanced data and information engineering (DaEng-2013), 15 December 2013, pp. 13–22. Singapore: Springer.

28.

Bellman

RE.

Dynamic programming. Princeton, NJ: Princeton University Press, 1957.

29.

Hira

Gillies

A review of feature selection and feature extraction methods applied on microarray data. Adv Bioinformatics 2015; 2015: 198363.

30.

Sheydaei

Saraee

Shahgholian

A novel feature selection method for text classification using association rules and clustering. J Inf Sci 2015; 41(1): 3–15.

31.

Blum

Langley

Selection of relevant features and examples in machine learning. Artif Int 1997; 97(1–2): 245–271.

32.

Das

. Filters, wrappers and a boosting-based hybrid for feature selection. In: Proceedings of the 18th international conference on machine learning (ICML’01), San Francisco, CA, 28 June–1 July 2001, pp. 74–81. San Francisco, CA: Morgan Kaufmann.

33.

Goldberg

Genetic algorithms in search, optimization and machine learning. Boston, MA: Addison-Wesley, 1989.

34.

Onan

Korukoğlu

A feature selection model based on genetic rank aggregation for text sentiment classification. J Inf Sci 2015; 43(1): 25–38.

35.

Sun

Babbs

Delp

. A comparison of feature selection methods for the detection of breast cancers in mammograms: adaptive sequential floating search vs. genetic algorithm. Conf Proc IEEE Eng Med Biol Soc 2005; 6: 6532–6535.

36.

Chandrashekar

Sahin

A survey on feature selection methods. Comput Electr Eng 2014; 40(1): 16–28.

37.

Crepinsek

Liu

S-H

Mernik

Exploration and exploitation in evolutionary algorithms: a survey. ACM Comput Surv 2013; 45(3): 35:1–35:33.

38.

Mirjalili

Moth-flame optimization algorithm: a novel nature-inspired heuristic paradigm. Knowl-Based Syst 2015; 89: 228–249.

39.

Soliman

Khorshid

Abou-El-Enien

Modified moth-flame optimization algorithms for terrorism prediction. Int J Appl Innov Eng Manage 2016; 5(7): 47–58.

40.

Seyedali

Andrew

The whale optimization algorithm. Adv Eng Softw 2016; 95: 51–67.

41.

Kennedy

Eberhart

Particle swarm optimization. In: Proceedings of the IEEE international conference on neural networks, IV, Piscataway, NJ, 1995, pp. 1942–1948. Piscataway, NJ: IEEE Service Center.

42.

Genetic algorithm, https://en.wikipedia.org/wiki/Genetic_algorithm (2017, accessed 1 September 2017).

43.

Price

Storn

Differential evolution – a simple evolution strategy for fast optimization. Dr Dobbs J 1997; 22(4): 18–24.

44.

Saida

Nadjet

Omar

A new algorithm for data clustering based on cuckoo search optimization. In: Pan

Krömer

Snášel

(eds) Genetic and evolutionary computing: advances in intelligent systems and computing, Vol. 238. Berlin: Springer, 2014, pp. 55–64.

45.

Rendón

Abundez

Arizmendi

, et al. Internal versus external cluster validation indexes. Int J Comput Commun 2011; 5(1): 27–34.

46.

Liu

Supervised learning. In: Liu

(ed.) Web data mining. 2nd ed. Berlin: Springer, 2011, pp. 63–132.

47.

Kubat

Matwin

. Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the 14th international conference on machine learning, Nashville, TN, 8–12 July 1997, pp. 179–186. San Francisco, CA: Morgan Kaufmann.

48.

Receiver operating characteristic, http://en.wikipedia.org/wiki/Receiver_operating_characteristic (2018, accessed 1 January 2018).

49.

Sripada

Rao

MS.

Comparison of purity and entropy of K-means clustering and fuzzy C means clustering. Indian J Comput Sci Eng 2011; 2(3): 343–346.

50.

Link prediction group (LPG), http://www.linkprediction.org/index.php/link/resource/data (2016, accessed 15 September 2017).

51.

Lü

Chen

Ren

, et al. Vital nodes identification in complex networks. Phys Rep 2016; 650: 1–63.

52.

Pareto principle, https://en.wikipedia.org/wiki/Pareto_principle (2017, accessed 20 September 2017).