Abstract
Link prediction plays a predominant role in complex network analysis. It indicates to determine the probability of the presence of future links that depends on available information. The existing standard classical similarity indices-based link prediction models considered the neighbour nodes have a similar effect towards link probability. Nevertheless, the common neighbor nodes residing in different communities may vary in real-world networks. In this paper, a novel community information-based link prediction model has been proposed in which every neighboring node’s community information (community centrality) has been considered to predict the link between the given node pair. In the proposed model, the given social network graph can be divided into different communities and community centrality is calculated for every derived community based on degree, closeness, and betweenness basic graph centrality measures. Afterward, the new community centrality-based similarity indices have been introduced to compute the community centralities which are applied to nine existing basic similarity indices. The empirical analysis on 13 real-world social networks datasets manifests that the proposed model yields better prediction accuracy of 97% rather than existing models. Moreover, the proposed model is parallelized efficiently to work on large complex networks using Spark GraphX Big Data-based parallel Graph processing technique and it attains a lesser execution time of 250 seconds.
Keywords
Introduction
A social network is an interaction of social objects and a set of connections between these objects. A graph data structure represents the social network in that vertices means actors and edges (links) illustrate the relationships or interactions among actors [1]. The cooperation and interactions among the people have become convenient due to the speedy improvement of the internet. Online social networks like Facebook, WhatsApp, and Twitter play a vital role in our daily lives which provide us a platform to transfer details with each other [2].
The social networks generate vast volumes of data every day and have some distinct characteristics like quality, semi-structure, and direct interaction between real-world entities on social networks. Link prediction is a significant factor in estimating link mining and forecast the information about the links to be formed in the future based on node interactions [3]. It is not solely utilized in social networks however it may be applied in alternative fields. Moreover, it has often accustomed to discover interactions among proteins in the bioinformatics field [4].
Link prediction has been utilized to produce advice systems in electronic commerce. It facilitates to search the hidden terrorist criminal gangs in the security field. Anomaly detection is one of the ways to identify malicious users over the network [5]. This model presented an advanced community detection system to perform link prediction techniques and a Bulk Synchronous Parallel programming model. Subsequently, the information diffusion and influences the network happenings to predict cascade related to viral events. It is associated with several areas and there are many types of correlation algorithms planned to resolve link prediction [6].
A few of the existing link prediction methods developed so far are based on the neighboring nodes considering only the adjacent information rather than the community information of each neighboring node. Even though some of the methods considering the community information, but they have categorized entire communities as inter and intra-communities and they failed to concentrate on each and every community individually. Because of, not considering each community separately, the participation of the individual communities for predicting the new link between the nodes is missed.
Typically, in real-world social networks, the participation of every individual node for predicting the future link between the two nodes is based on the impact of communities of the neighboring nodes. For example, in real-world social networks such as Twitter, Facebook, and LinkedIn, the friend request is not only based on simply nearby locations additionally occupation, interests, working area are also considered and also the friend request may be based on the current trends on social networks. Similarly, the proposed work considering the neighborhood community information to predict the future link among the two nodes, and the node that belongs to the community having the high importance in the entire network can participate more than the other neighboring nodes during the link prediction process. So in this paper, the community-based linkage prediction approach has been proposed by restructuring the existing similarity indices by using every community importance. To find the community importance, the new Community Centrality (CC) measure is proposed based on basic graph measures.
The various novel contribution of this paper is as follows: A novel CC measure has been proposed which is based on degree, closeness, and betweenness centralities of graph network. Due to this novel consideration, the proposed method achieved a higher precision value for larger datasets. According to the new CC, each neighboring node’s importance is determined towards their community. This strategy helps to attain maximum accuracy during the evaluation process. The standard similarity metrics are redefined based on CC and the new parallel community-based link prediction method is proposed. The 13 real-world dataset networks has been employed to evaluate the proposed method.
The structure of this paper is organized as follows: The related work of link prediction is summarized in section 2. The preliminary concept utilized in the proposed method is explained in section 3. Section 4 depicts the proposed methodology. The performance evaluation is described in Section 5. The results and performance analysis are demonstrated in section 6. Finally, the conclusion is enumerated in section 7.
Related works
The following section summarizes the various background studies related to link prediction. A novel graph-based convolutional neural network model has been proposed for multi-relative link prediction in multimodal networks [7]. In this model, the side effects of two drugs are identified when they are used together by considering the side effect as the link prediction problem to find the side effect (link), the features of the drugs only utilized rather than neighboring information of the represented graph.
Most of the link prediction techniques are roughly categorized into two sets: probabilistic-based and similarity-based methods. A few of the existing probabilistic methods require the information of vertex attributes that are added to the identified network formation. However, it requires more computational time to derive the link prediction which is applied in massive networks [8]. Next, similarity-based methods usually allocate a score to each pair of nodes to calculate its existing probability. It is a widely unitized technique due to its fairness of prediction and its effectiveness in computation [9]. These methods are often categorized into native data-based (local indices) and global data-based (global indices). Local indices are effective than global indices as lesser calculation complexity, but the local index accuracy is smaller than global indices.
Most of the existing works concentrated on local indices since local indices are easier to implement. It takes lesser time complexity even the given real-world networks are extremely large. Nevertheless, most of the classical local indices except Resource Allocation (RA) and Adamic-Adar (AA) based on familiar neighbours assume that every common neighbour equally participates in fining the link likelihood among the vertex that it constraints the index performance. It is observed from the artificial and real-world network experimentation that the similarity-based index performance enhances due to the growth of the community structural networks [10]. So, the intra-community neighbors alone has been considered for identifying further link among the nodes. The significance of network structure is identified and several researchers have projected numerous link prediction strategies depends on cluster or community information to enhance the prediction accuracy [11, 12].
Another way to improve the prediction accuracy, common neighbors, and RA indices is the utilization of the intra-community information rather than inter-community neighbours of the node pair [13]. This model failed to concentrate on recall metrics in the network which paves the way to attain the lower efficiency. In [14], a new WIC methodology has been developed to enhance the prediction efficiency using data from within cluster common neighbors instead of considering all neighbors from the clusters. This method causes lesser accuracy in the social network.
In [15], a novel model has been introduced to address the link prediction in the social network. It depends on the conditional probability that both the vertices belong to the same cluster by using the neighbors from the given cluster. This model failed to concentrate on the precision metrics in the network. In [16], a novel methodology has been presented that depends on Intra-Com data and RA index with an accuracy of 10 real networks. The model leads to attaining lower prediction accuracy due to higher algorithm complexity. Furthermore, the missing link is not properly identified throughout the network. In [17], the community compliant link prediction has been presented in which the links are divided into an inter-community and intra-community category. By giving more importance to intra-community links with the tuning parameter, the missing link is identified.
Although most of the existing models employed the effect of Intra community familiar neighbors but neglect the impact of inter-community neighbors and also they have used community labels for identifying inter and intra community neighbors. Further, they do not focus on each neighboring node based on their community importance. To overcome this, the proposed model considered both inter-community and intra-community neighborhood nodes of each pair. It also anticipated the significance of the neighboring nodes based on their community importance. The significance of communities will be identified through the CC measurements of their leader node.
Preliminaries
Initially, the preliminaries describe the concepts of the link prediction problem and the objective of the proposed work. Then, three commonly utilized centrality measures of the graph and community detection technique have been discussed in this section.
Link prediction problem
Assume an undirected graph G (V, E) where V and E are the set of vertices and edges. In this graph, the proposed model taking all possible edges

(a). Network Data E.

(b). Training Set E T .

(c). Nonexistent Set E non .

(d). (d). Probe Set E P .
Consider the missing link set E non , which contains all unavailable (non-existent) links (x, y), where x, y ∈ V are disconnected node pairs. Then, the set of common neighbours between node pairs x, y is denoted by λx,y = Γ (x) ∩ Γ (y). Wherever Γ (x) indicates set of x’s neighbours and Γ (y) indicates set of y’s neighbours. To evaluate the probability of forming the new link betwe disconnected nodes, the similarity Score Sx, y is calculated between every pair of vertices by using a link prediction algorithm. Hence, the possibility of linking nodes (x, y) is raised because of the risen score Sx, y. The whole graph links are randomly split into 2 sets to calculate the accuracy of a link prediction technique. One is training set E T , which is considered know information, and another one is prob set E P which is utilized for testing the algorithm. Certainly, E T ∪ E P = E, E T ∩ E P = φ and E T ∪ E P ∪ E non = S. The main objective is to define an efficient technique that generates the similarity score of every edge in E P greater than the links in E non .
Degree centrality is a simple centrality measure that counts the number of neighbors node. This centrality can be mathematically represented in Equation (1) which can be normalized between 0 and 1 by dividing it by several anticipated links (n-1) of each node [18].
In a graph network, the centrality of the closeness of a node is the shortest path with an average length among neighborhood nodes. The central node is closer to other nodes can be illustrated as [19]
Betweenness centrality establishes the node functions as a bridge among the shortest path between two nodes. It is the measure quantifying control measurement for human enabled communication between the humans. The vertices have a high probability of choosing the random condition of the shortest path among n-linear selected vertices with a consideration of high betweenness.
The betweenness centrality can be mathematically described as follows [20]:
Girvan-Newman algorithm has been introduced for community detection to detect the communities on a large-scalreal-world network graph [21]. It is a greedy optimization method that attempts to enhance the modularity of a partition of the graph. The algorithm consists of two phases, which are repeated until the modularity cannot be improved. During the first phase, every node is observed as an individual cluster.
For each node p (p = 1, …, N) its neighbours q (q = 1, …, N) are examined for whether the modularity improves if p is detached from its cluster and into the cluster of node q is allocated. Then the node p is assigned to the respective cluster, which improves the growth in modularity. Nevertheless, it employs the method within the case of positive growth. If no positive growth of modularity is not generally attained, the node p stays in the same cluster. The operation will be performed repeatedly and successively executed for all nodes until no enhancement in modularity can be attained. A node is often considered and allocated several times. Hence the first phase terminates when a local maximum has been reached that is if no single node movement can increase the modularity.
Based on the clusters derived within the first phase, a replacement network is made within the 2nd phase, whose nodes are now the clusters derived from the earlier phase. To derive the weights for the links among the clusters, the sum of the weights of the links among the nodes of two respective clusters is employed. If such a replacement network was derived with meta-cluster then the process of the first phase is provided to the successive network and therefore modularity score can be further improved. The combination of both phases is named as pass. Such passes are repeatedly administered until there is no more change within the cluster and maximal modularity is attained.
GraphX
GraphX is HADOOP’s Spark-based Parallel graph processing API that is utilized for graph-relevant techniques such as PageRank, Shortest path, community detection, recommendation system, and link finding [23]. It integrates the benefits of graph-parallel and data-parallel systems by representing the graph-relevant operations inside Spark Framework. It improves the Spark RDD by launching a special Graph at the high level of abstraction; that graph is called a directed multigraph in which properties are attached with both vertices and edges [24, 25].
The other speedy graph processing method such as Graph Lab and Giraph maintains Spark’s flexibility, fault tolerance. Graphs are used to represent or store different information such as geospatial data, social network data, email communication networks, co-author networks, paper citation networks, and web page link structures. In GraphX property, the Graph concept is introduced in the edge table and vertex table. The vertex table will be associated with vertices of a graph where it contains vertex id and properties such as community label, degree, and other Meta information. The edge table is associated with the entire graph, and it contains source node id, destination node id, and properties of edges such as relationship, similarity score.
Proposed methodology
In the proposed model, the given social network graph is initially partitioned into communities using the proposed parallel Louvain community detection algorithm. From the derived communities, the CC of each community is calculated using the degree, closeness, and betweenness centralities of basic graph measures. By using the calculated CC, the modified CC-based similarity indices are derived. In the CC-form of local indices, more importance is specified to the intra-community neighbors than inter-community neighbors. The derived CC-form of the local index is employed in the proposed parallel Spark-based link prediction to predict the future link among the nodes pair for the given input graph.
Proposed parallel louvain community detection algorithm
This section introduces the parallel Louvain method to find out the communities using GraphX API. The core part of our similar Louvain method is depicted in Algorithm 1. Note that whenever the modularity value is largest, a good community separation is achieved. In the Louvain method, every vertex needs to investigate its neighbours communities and select a new community depends on function to extend the evaluated switch in modularity value. The parallel method considers every node as the choice concurrently in lieu of a sequential manner and updating the graph status after every variation. Since the obtained result is from a concurrent way where a few results will be inaccurate and do not increase modularity. After successive iterations, the community selections are more stable and hence, the outcome is achieved which is nearly close to the results of the sequential algorithm.
In Algorithm 1, lines 1-2 reads the input graph from edge list files into the Spark environment in RDD format. Like the original serial Louvain algorithm, the proposed parallel version is also a greedy method, so it continuously performs the graph partition process until getting a good modularity score. So that the initial value for the modularity score is set as low as possible Q new = -1 in line 3. Lines 4-17 represent the entire process of the Louvain algorithm with a parallel approach. In lines 5-6 the entire graph is randomly partitioned as K subgraphs and the current modularity scores of all subgraphs are calculated separately and aggregated as Q curr before the Louvain phases are started.
The two core phases of the Louvain method are performed on each subgraph in a parallel manner is described in lines 7-10. In line 11, a new modularity score is calculated combined by merging all sub-graphs. In line 12, the improvement in the partition will be checked. If the improvement in new partitions is not more than the threshold value (ρ = 0.01) the algorithm is stopped and returns the final partitions as communities. If the improvement is more than ρ (ρ = 0.01), then in line 15, the new modularity score is updated, and in line 16, the new complete graph with newly identified clusters is constructed from the new subgraph partitions generated after the complete iteration of the Louvain algorithm in order to take the next iteration of the complete Louvain phases. All sub-graphs are aggregated to generate a new clustered graph.
The parallel Louvain algorithm is also generating the same number of communities as a serial algorithm with a similar structure. Since the serial Louvain algorithm operates on each node of the graph individually, so we have parallelized the Louvain method in order to generate the communities as quickly as possible. And also, this community information is the input of our link prediction process. The execution time of serial Louvain and the proposed parallel Louvain algorithms on five larger data sets are demonstrated in Fig. 2. Since the proposed method uses the same idea of serial Louvain, the number of communities derived by parallel Louvain is similar to the serial version.

Comparison between Serial and Parallel Louvain Algorithm.
From Fig. 2, it is observed that the execution time of the serial Louvain algorithm is increased rapidly from few minutes to few hours when the network size is increased, and the possibility to reaches the time limit and memory limit exceeds while processing the huge real-time data sets having millions of nodes. To mitigate this problem, the proposed Spark-based parallel Louvain method has been employed to derive the communities and the execution time of our parallel Louvain algorithm on a 4-node spark cluster on five large data sets are depicted in Fig. 2. The execution time of this cluster is rapidly reduced and nearly 10 times better than the serial algorithm since the spark is based on in-memory computation.
Network Graph G is divided into k number of communities by the proposed parallel Louvain method, and the labels C1, C2, … Ck (k > 1) are utilized to definite the derived communities, and x
Ci
represents that node x associated to a distinct community with label Ci. βx,y specifies the set of familiar neighbors of node pair (x, y), described by βx,y = {w|w ∈ Γ (x) ∩ Γ (y)}. Then, the set of common neighbors belong to community Ci of (x, y) is described by
For example, consider a simple network with 2 communities C1, C2 in Fig. 3. A node pair (3, 5) having three common neighbors, i.e., λx,y = {2, 4, 6},

A simple network with 2 community structures.
After deriving the communities using a parallel Louvain algorithm, the significance of each community is calculated using three centralities (degree, closeness, betweenness) as demonstrated in Algorithm 2. Here, communities derived from parallel Louvain algorithm are taken as input and the community centralities of each community are generated. The degree, closeness, and betweenness centralities of each node of every community are calculated in lines 5-8. In this work, the average of betweenness and closeness is taken as global centrality since it is based on entire graph structures and it is estimated at line 9. Since the betweenness and closeness centrality are the global importance of a particular node in the entire graph so that we have named their average value as global centrality. Subsequently, we have considering the both local and global importance of each node in the graph, we have combined both by taking the average of them and we named as structural centrality of a node. The structural centrality can be evaluated at line 10. The largest structural centrality of a node that belongs to a particular community Ci is taken as community centrality of that community Ci; this entire process is described in lines 11-14. Finally, lines 16-18 represent the community centrality value of each community is normalized between 0 and 1.
Then consider the community centrality of each community will be CC1, CC2, …, CC
n
. The CC-based similarity measure is illustrated in Equation (4). Afterward, the entire communities are divided into two categories intra-community C
intra
and inter-community Cinter.
Even in Equation (4), different communities participated during the link prediction process but it provides equal importance to all the communities. But in a real-world scenario, the neighboring nodes within the same community influences more than other neighboring nodes during the link prediction process. So that we have given more importance to intra-community neighbors rather than inter-community neighbors. So that we have restructured Equation (4) as Equation (5). The term cc intra . represents the community centrality of the Intra community to which the predicting node pair belongs.
In order to predict the links between the nodes, 9 local similarity indices depend on familiar neighbors are frequently utilized in link prediction problems and they are listed in Table 1. The proposed work identified the new link prediction indices CC formulated by connecting the similarity measure suggested in Equation (5) with the local similarity indices described in Table 1. Unlike the community-based link prediction techniques [26–29], the CC-based similarity algorithm can be applied on some specific local indices but applied for all 9 local indices and evolve the family of CC-based local similarity indices. The local indices (9) and CC forms are demonstrated in Table 2.
Proposed CC-based 9 local indices
The conventional link prediction techniques can be carried out serially and contain a wide variety of matrix calculations. For massive graphs, the matrix calculations yield more time and are even hard to compute, and the memory limit will be exceeded. Therefore, the link prediction method has been parallelized based on GraphX [24]. The principle process of parallel CC-based link prediction is depicted in Algorithm 3. Every node transmits and receives the node information as well as community details to and from their neighborhoods during every super step. After that, the similarity score of every predicted link can be calculated during every core step.
Once communities are derived using the Parallel Louvain algorithm, the edge RDD is split into two sets as training RDD edges (EDT) and testing RDD edges (EDP). Next, the input graph is split into K Processors where each processor gets one partition and performs similarity calculations for links that include both non-existent and testing edges. There are 4 predominant super steps in this parallel technique, as shown in Fig. 4. From Fig. 4, it can be noticed that during super step 1, every node sends its precise ID and community detail to all its neighbourhood nodes.

Graphx CC-based link prediction process.
During super step 2, every node composes a message based on the details collected from its neighbour nodes, including neighbours list and their ID, community label, and node degree, and then pass this composed message to all its neighbour’s node. Meantime, a new edge RDD (ED
New
) is constructed to sre these testing edges along with non-existing edges and the link has similarity attributes. During super step 3, every node gets both node and community details about all the feasible triplets. The neighbor details collected over various paths mean that there are many familiar neighbours among the current node and neighbors.During super step 4, at every node x, the number of familiar neighbours belong to every possible community Ci is counted as
The empirical evaluation of the proposed CC-based link prediction on 10 real word networks from different fields is demonstrated in the following section. The running time evaluations of the proposed parallel approach on 4 large networks are discussed.
Simulation setup
The experiment can be conducted using a 10-fold CV to validate the prediction efficiency of the anticipated technique. Consequently, the noticed links set E is randomly partitioned as 2 sets in every round where one part will be Training set E T that contains 90 percent of the links and another part is testing set E P that includes the remaining 10 percent of links.
The training set has been employed to produce the communities for every network by using a parallel Louvain algorithm. Afterward, the similarity scores are computed for all probe links (E p ), non-existent links using every local index, W forms, and CC forms. Here the value for the parameter L is set based on the number of links having a similarity value larger than the average similarity score among all the links from E p and E N on . Finally, the AUC measurement is exploited to measure the prediction efficiency of all indices. The matching pairs are randomly selected from non-existent links and probe links n = 8000 times for the AUC computation process.
Dataset
The 13 real-world networks have been employed to evaluate the proposed CC-based local indices are enumerated in Table 3. The basic graph network properties are depicted in Table 4. |V| and |E| represents the number of nodes and edges, D avg is the average network graph degree, the clustering coefficient of the network is signified by C coef and PLavg implies the average path of the social network.
Description of 13 real world data sets
Description of 13 real world data sets
Graph Properties of 13 real world data sets
To measure the accuracy of the proposed link prediction model, the popular metric AUC is utilized and observed for the given score of non-observed edges. The probability that randomly selected testing edges from set E
P
yields a larger score than a randomly chosen non-existent edge from E
N
on
.
The proposed CC-based indices model is initially compared with 9 primary local indices. Additionally, the proposed model has been compared against the W-form link prediction technique in which only within-cluster familiar neighbours are utilized [37]. The investigation can be performed on the CC form of each local indices with αarameter from α = 0 to α = 1 to examine how well the intra-community common neighbours contribute to the link prediction process. The CC form produces results nearly the same as local indices at α = 0.5. Typically, when the α value increases from 0 to 1, it may be observed that the AUC value initially increases drastically a then gradually slows down and finally drops. For example, the CN-CC and RA-CC are taken for consideration and this drift will be noticeable among the majority of 10 data sets designated in Figs. 5 and 6.

a. AUC results of RA-CC on first 5 data sets.

b. AUC results of RA-CC on second 5 data sets.
From Figs. 5 and 6, it is manifest that the AUC value of CN-CC increases faster than RA-CC values. When the α value exceeds 0.1, the AUC value of CN-CC increases immediately and reaches the highest value at α = 0.8. It gradually decreases when the α value goes beyond the value of 0.8 and reaches the least value when α = 1. The AUC value of the first 5 real-time datasets achieved the value of 0.8702 whereas the next 5 different datasets obtained the value of 0.8675.

a. AUC results of CN-CC on first 5 data sets.

b. AUC results of CN-CC on second 5 data sets.
Similarly, the AUC score of RA-CC reaches the highest value when the α value reaches 0.6 and gradually decreases when α > 0.6 and reaches to least value at α = 1. Moreover, both CN-CC and RA-CC have good prediction efficiency on each of the 13 network data sets when α ∈ (0.6, 0.8). Specifically, in RA-CC, there is a remarkable enhancement on AUC score when α > 0.5 rather than when α < 0.5. Nonetheless, the prediction efficiency of both CN-CC and RA-CC reaches the much least value when α = 1, which is less than the prediction accuracy of primary indices.
Since all other inter-community common neighbours are not considered during the link prediction task at α = 1, some salient details are missing for the prediction process and finally result in the worst prediction accuracy. The AUC results of CC forms of other indices are nearly identical to those of CN-CC and RN-CC, indicating that intra-community neighbours have more effect on link prediction than neighbours from other communities even they also participated in the link prediction process. It is valuable to involve every neighbour from different communities with their community importance individually and prioritize familiar intra-community neighbours from other community neighbours to quantify the performance of the overall link prediction process.
Table 5 illustrates all CC forms’ prediction accuracy using AUC on 10 real-world networks. An enormous value of AUC on each data set is represented in bold font. Every CC-based local index’s specified results are achieved based on the optimal parameter α∈ (0, 1) subject to the maximum AUC on each network. Initially, all CC-based indices have better accuracy on link prediction than their standard indices on the 13 networks data sets.
AUC values of each similarity indices on ten larger real-world data sets
The enhancement is apparent among Metabolic and karate, which can be associated with their topological structures. Next, the CC forms of local indices have the best accuracy on link prediction rather than corresponding W forms on the 13 social networks data sets. Because the W forms of indices only considering the intra familiar neighbours, which are merely equivalent to CC forms during α = 1. Additionally, all W forms have poor accuracy on link prediction than their primary indices on 9 of the 13 networks excluding the Karate, Ca-AstroPh, Soc-Epinions 1, and Amazon 0601 when α = 1. Thus, the proposed CC-based local indices are experimentally verified to be an efficient technique for link prediction problems.
Tables 6 and 7 describes the prediction accuracy of our proposed modified Similarity indices based on the precision and recall results along with basic default indices and their W form. The proposed CC form has superior precision and recall value on 9 data sets among the ten small-scale networks, particularly the CN-CC form has attained the best outcomes on 5 datasets. Moreover, nearly all CCs form having superior prediction outcomes than their primary standard indices and W form on 10 datasets. Finally, the proposed CC-based local indices are demonstrated experimentally to be an efficient technique for link prediction.
Precision values of each similarity on ten larger real-world data sets
Recall values of each similarity on ten larger real-world data sets
The experiments have been implemented on Spark environment with 1,2,4,8 and 16 nodes cluster. Each system has similar hardware: Intel i5 3.1 GHz processor, 16 GB RAM, and 1TB HD. The Spark version utilized is spark 1.6. The proposed parallel CC-Based Link Prediction method is simulated with an overall of 13 datasets. The execution time of the proposed approach over 3 larger data sets, Ca-AstroPh, Soc-Epinions 1, Amazon0601 is depicted in Fig. 7 and the AUC value of these data sets for RA, CA, RA-W, and CA-W along with the proposed RA-CC and CA-CC are exhibited in Table 8. Here, the execution time ranges from a few hundred seconds to a few thousand seconds and it rises relatively linearly with the increases in the size of the network.

Execution time of proposed method on 3 large networks.
AUC value of the proposed method over existing methods
The running time reduces with the growth of cluster size, which is clearer on the larger datasets. The running time of all the datasets is decreased at a cluster with 4 nodes less than 50% of the execution time at a single node cluster. Moreover, the speed of running time is reduced slowly when the cluster size exceeds 8. These results inferred that the proposed CC-form has a massive advantage compared to the serial algorithm. The improvement increases more as the parallelism grows where the number of nodes to be chosen is based on the size of the datasets. When the number of nodes within the cluster goes beyond a particular number then the performance enhancement is not apparent.
The execution efficiency of the proposed parallel CC-based link prediction method has been evaluated using the speedup metric. The speed of every spark cluster with different sizes can be calculated using the following equation.
The speedup of the proposed parallel algorithm for different size clusters is demonstrated in Fig. 8. According to Fig. 8, it is evident that the speedup of spark clusters is started to double at cluster size = 2 and the doubling is continued till to reach the cluster size = 8. Afterward, the speedup is increased only gradually for the cluster size with 10 and 16. From this, it is clear that if the number of processors in the spark clusters is increased, the speedup is reached the maximum value. The speedup is not improved even the processors further increase in the spark clusters.

Speedup of different clusters on 3 larger datasets.
Another vital contribution of the current research is to compare the proposed method over the other existing methods like Graph Neural Network (GNN) [48], Direct-Indirect Common Neighbours (DICN) [49], and Node-Coupling Clustering (NCC) [50]. We have compared one of our proposed Community based metric RA-CC with the above-mentioned existing method in terms of the AUC value that is illustrated in Table 9. From Table 9, it is clear that the proposed RA-CC having a superior AUC value on 7 data sets than the other 3 methods. Since GNN only uses the direct neighboring nodes to find the link likelihood between the node pairs, so community information has not participated during the link prediction process. Even though DICN utilizes the second-order common neighboring nodes along with direct common neighboring nodes during the link prediction process, the performance is not better than the proposed CC-based method.
AUC values of 3 large-scale data sets
The results also expose that the GNN method having a better AUC value in two larger data sets Soc-Epinions1 and Amazon0601 compare to other methods. Since nodes are sparsely distributed in the large-scale networks, the score-based methods including the proposed method are struggling to reach a high score than the GNN method. Since GNN is an unsupervised learning method, the feature vector for GNN is formed using 9 standard similarity indices. Finally, it is manifest that the proposed RA-CC provides better prediction accuracy in all small-scale and medium-scale networks, it also yields good accuracy in large-scale networks than the other 2 score-based methods.
In this work, the proposed model was derived from a group of link prediction indices based on CC. The neighbour from different communities are participated along with their community importance and specified the preference for intra-community neighbours than other community neighbours. The proposed model yields better predictions about the future link by restructuring the 9 classical indices. The experiments have been evaluated on 13 real-world datasets. It demonstrates that the prediction performance is enhanced with the influence of all common community neighbours and it turns into poor while considering the intra-community common neighbours. It was compared with the w-form of local similarity indices then the proposed CC-based link prediction has achieved better prediction accuracy. Furthermore, the proposed model has superior precision and recall value on 9 datasets among the ten small-scale networks. These experiment results manifest that the proposed model has been considered an efficient technique for link prediction.
