Analysis and construction by convolution neural network of link prediction model on social network

Abstract

Estimating similarity using multiple similarity measures or machine learning prediction models is a popular solution to the link prediction problem. The Relation Pattern Deep Learning Classification (RPDLC) technique is proposed in this study, and it is based on multiple neighbor-based similarity metrics and convolution neural networks. The RPDLC first calculates the characteristics for a pair of nodes using neighbor-based metrics and impact nodes. Second, the RPDLC creates a heat map using node characteristics to assess the similarity of the nodes’ connection patterns. Third, the RPDLC uses convolution neural network architecture to build a prediction model for missing relationship prediction. On three separate social network datasets, this method is compared to other state-of-the-art algorithms. On all three datasets, the suggested method achieves the greatest AUC, hovering around 99 percent. The use of convolution neural networks and features via relational patterns to create a prediction model are the paper’s primary contributions.

Keywords

Link prediction problem, convolution neural network, relation pattern, social network

1 Introduction

Online social networks such as Facebook, Twitter, and YouTube are growing explosively, and locating missing ties has become a critical problem that is drawing academic and industrial researchers [4, 57]. Applications include a recommender system on GitHub that helps users locate software tools that fit their interests [58] and friend recommendations on various social networks, such as Twitter and the location-based Foursquare [19]. Furthermore, this field has had data security issues, which have been addressed in earlier studies [6, 53–55].

A social network can be represented as a topological graph whose nodes are individuals or organizations. Link prediction asks whether or not a pair of nodes will have a connection [52]. The three most prevalent types of methods are node-based, topology-based, and social theory-based techniques. Based on these metrics, several link prediction algorithms have been proposed, and various machine learning algorithms have been used to improve the efficacy of link prediction [11, 44].

In recent years, social network theorists and researchers have suggested a number of methods for solving the link prediction problem. For example, Liang and Shuai proposed an ensemble technique based on neighbor-based metrics that decomposes conventional link prediction issues into subproblems [11]. Leskovec et al. developed a logistic regression model based on node and triad degrees to predict links in directed social networks with good accuracy [29]. Both techniques employed a machine learning architecture to solve the link prediction problem.

Most earlier studies on link prediction using machine learning algorithms based on similarity metrics have ignored the entire pattern of relationships. In addition, due to memory and algorithm constraints, traditional link prediction algorithms cannot manage huge social networks and lose a lot of information in big data environments. To enhance link prediction in real-world social networks, this study presents a deep learning framework based on novel feature extraction techniques.

To improve link prediction accuracy, a deep learning model and a new feature extraction method are proposed. To extract connection features from the target node’s social network, this study uses standard neighbor-based metrics computed on related edges. These features are then assembled into heat maps that capture the similarity between nodes. Using a convolution neural network, the proposed technique detects consistency between the connection patterns of two nodes.

By assessing path length, the suggested technique reduces the sample size on a vast social network: if the length of the shortest path between two nodes exceeds a threshold, analyzing their connection is pointless. The proposed approach enhances link prediction performance by extracting more essential characteristics and leveraging strong deep learning architectures to build prediction models.

In this article, the proposed approach combines deep learning architecture with a new feature extraction method to improve the performance of solving the link prediction issue. The suggested algorithm’s major contributions are mentioned below.

This paper makes the first attempt to use a convolution neural network, a deep learning framework generally used for image classification, to solve this problem.

The proposed novel feature extraction method not only considers the similarity between two nodes in each edge, but also considers the similarity with the most important nodes on the social network.

The procedure for decreasing the sample size by the path length between two nodes can effectively remove the weak relationships and prune the inefficient information.

There are five key components to this article. The first section introduces the topic of link prediction and offers some background information. The second section examines the literature as well as related works. The proposed algorithm is shown in Section 3. The experimental findings are presented in Section 4, and the conclusions and future work are discussed in Section 5.

2 Related work

This section reviews link prediction approaches and conventional deep learning architectures. The first part introduces topology-based metrics, social theory-based metrics, and machine learning-based prediction methods; topology-based and social theory-based metrics are frequently employed to compare the similarity of two nodes. The second part examines traditional convolution neural network designs and methods and highlights the contributions that can increase link prediction performance and efficiency.

2.1 Link prediction problem of social network

The link prediction issue is concerned with predicting missing social network relationships. The associated link prediction approaches are discussed in this section.

2.1.1 Topology-based metrics

Most similarity measures rely on social network topology, and earlier work led to numerous topology-based measures [31]. The three most often used families of topological metrics are discussed here. First, neighbor-based metrics assess the similarity of two nodes from their neighbors. Second, path-based metrics estimate similarity from the paths of different lengths between the two nodes. Finally, random walk based metrics evaluate similarity using random walk steps.

Neighbor-based Metrics. The neighbor-based metrics employ each node’s neighbors to compute similarity. Let N_x be a node in the social network and Γ (N_x) the set of neighbors of N_x. In earlier research, 11 measures were developed to compare two nodes in social networks.

Common Neighbors

Common Neighbors (CN) [8] counts the neighbors shared by two nodes, that is, the size of the intersection of their neighbor sets, and it underlies the neighbor-based metrics that follow. The more common neighbors two nodes have, the closer they are.

Jaccard Coefficient

The Jaccard Coefficient (JC) normalizes the size of shared neighbors [21]. JC is similar to CN, but it also considers the total neighbors of a pair of nodes.

Sørensen Index

Sørensen Index (SI) normalizes the number of common neighbors by the sum of the two degrees, so it favors low-degree nodes [46]. For the same number of common neighbors, SI is greater when both nodes have low degree.

Salton Cosine Similarity

Salton Cosine Similarity (SCS) was proposed by Salton and McGill in 1983 [10]. It divides the number of common neighbors by the square root of the product of the two degrees.

Hub Promoted

Hub Promoted (HP) measures the topological overlap of a pair of nodes, and its value is determined by the node with the lower degree [42].

Hub Depressed

Hub Depressed (HD) is analogous to HP, except that its value is determined by the node with the higher degree [59].

Leicht-Holme-Newman

Leicht-Holme-Newman (LHN) assigns a high similarity to two nodes that have many common neighbors relative to the product of their degrees [28].

Parameter-Dependent

Parameter-Dependent (PD) was proposed by Zhu et al. [3]. They incorporated a free parameter λ to take into account a higher number of neighbors.

Adamic-Adar Coefficient

Adamic and Adar proposed the Adamic-Adar (AA) coefficient [1], which is currently utilized in social networks. It sums over the common neighbors of two nodes and gives more weight to common neighbors that themselves have fewer neighbors.

Preferential Attachment

Barabási et al. introduced Preferential Attachment (PA) [2]. It indicates that more adjacencies lead to a greater similarity assessment. It relies solely on the number of neighbors of each node in a pair, not on the shared neighbors.

Resource Allocation

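Similarly to the Adamic-Adar Coefficient, Zhou et al. suggested Resource Allocation (RA) in 2009 [59]. Resource Allocation weights each common neighbor by the inverse of its degree rather than the inverse of the logarithm of its degree, so it penalizes high degree neighbors more heavily than Adamic-Adar.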
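To make the contrast between some of these scores concrete, the following Python sketch computes CN, SCS, AA, RA, and PA in their commonly used forms. The function name `neighbor_scores`, the dictionary-of-sets graph representation, and the toy graph are assumptions introduced for illustration and are not taken from the original study.

```python
import math

def neighbor_scores(adj, x, y):
    """Return CN, SCS, AA, RA and PA scores for nodes x and y.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    common = adj[x] & adj[y]
    kx, ky = len(adj[x]), len(adj[y])
    return {
        # Common Neighbors: size of the intersection of the neighbor sets
        "CN": len(common),
        # Salton cosine: shared neighbors over the square root of the degree product
        "SCS": len(common) / math.sqrt(kx * ky) if kx and ky else 0.0,
        # Adamic-Adar: each common neighbor z weighted by 1 / log(degree of z)
        "AA": sum(1.0 / math.log(len(adj[z])) for z in common),
        # Resource Allocation: each common neighbor z weighted by 1 / degree of z
        "RA": sum(1.0 / len(adj[z]) for z in common),
        # Preferential Attachment: product of the two degrees
        "PA": kx * ky,
    }

# Hypothetical toy graph with edges a-b, a-c, b-c, c-d
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
scores = neighbor_scores(adj, "a", "b")
```

A common neighbor of two distinct nodes always has degree at least 2, so the logarithm in the AA term is never zero. For the same common neighbor, the RA weight shrinks faster with degree than the AA weight.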

Path-based Metrics. Along with neighbor-based metrics, path-based metrics are widely used to assess node similarity. They score a pair of nodes highly when many short paths connect them. The distance between two nodes is the number of edges between them; if the distance is 2, two edges separate them. Four metrics have been suggested for evaluating the similarity of two nodes via their paths.

Local Path

In 2009, Lü et al. developed the notion of Local Path [34], which considers not just the set of paths of length 2 between two nodes but also the set of paths of length 3.

Katz Index

The Katz Index, proposed by Katz [23], measures the similarity between node pairs by aggregating all paths of different lengths between them.

FriendLink

FriendLink, proposed by Papadimitriou et al. in 2012 [40], is a similarity metric based on counting all paths of different lengths between two nodes; it is similar to Local Path.

Relation Strength Similarity

Relation Strength Similarity (RSS), proposed by Chen et al. in 2012 [7], is a similarity score based on the relation strength R (N_x, N_y) between two non-neighboring nodes.
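As a minimal sketch of two of the path-based scores above, the following Python code evaluates Local Path as A^2 + εA^3 and a truncated Katz score as the sum of β^l A^l over path lengths l, where A is the adjacency matrix. The parameter values, the truncation at `max_length`, the helper names, and the toy path graph are assumptions made for this example. Powers of the adjacency matrix count walks, which may revisit nodes.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def local_path(adj_matrix, epsilon=0.01):
    """Local Path score: A^2 + epsilon * A^3."""
    a2 = mat_mul(adj_matrix, adj_matrix)
    a3 = mat_mul(a2, adj_matrix)
    n = len(adj_matrix)
    return [[a2[i][j] + epsilon * a3[i][j] for j in range(n)] for i in range(n)]

def katz_truncated(adj_matrix, beta=0.1, max_length=5):
    """Katz score truncated at max_length: sum over l of beta^l * A^l."""
    n = len(adj_matrix)
    power = [row[:] for row in adj_matrix]      # holds A^l for the current l
    score = [[0.0] * n for _ in range(n)]
    for length in range(1, max_length + 1):
        for i in range(n):
            for j in range(n):
                score[i][j] += (beta ** length) * power[i][j]
        power = mat_mul(power, adj_matrix)
    return score

# Hypothetical path graph 0-1-2-3
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
lp = local_path(A)
katz = katz_truncated(A)
```

In this toy graph, nodes 0 and 2 are joined by one walk of length 2, while nodes 0 and 3 are joined only by a walk of length 3, so the Local Path score of the second pair is smaller by the factor ε.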

Random Walk based Metrics. Beyond the above-mentioned indexes, random walk based metrics assume that a walker starts from a node and moves to one of its neighbors with some probability at each step. The resulting random walk model is then used to measure the similarity between two nodes.

Hitting Time

Hitting Time (HT) is the number of steps that a random walk starting from node N_x takes to reach N_y [13]. HT (N_x, N_y) is not symmetric, because the set of paths starting from N_x is not the same as the set of paths starting from N_y; in general, HT (N_x, N_y) ≠ HT (N_y, N_x). A lower value of HT means that the two nodes are more likely to establish a link.

Average Commute Time

The Average Commute Time (ACT), which is based on Hitting Time, was proposed by Liu and Lü in 2010 [33]. ACT is defined as the average number of random walk steps from node N_x to node N_y and from node N_y back to node N_x.

SimRank

SimRank (SR), proposed by Jeh and Widom in 2002 [22], is a random walk method that computes similarity from the neighbors of two nodes: if their neighbors are similar, the two nodes are also similar. For the physical definition, a lower value of SR means that the two nodes meet faster and the link probably exists.

Random Walk with Restart

Pan et al. introduced Random Walk with Restart (RWR) [39] for estimating the importance of node N_b to node N_a. A walker starting from N_a moves to a random neighbor with probability (1 - c) or returns to N_a with probability c. It is similar to HT, but the walker has a probability of returning to the starting node.
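The restart mechanism can be sketched with a simple power iteration. In the Python code below, the function name, the default restart probability `c=0.15`, the fixed number of iterations, and the toy graph are assumptions for illustration; the sketch also assumes that every node has at least one neighbor, so that no probability mass is lost.

```python
def random_walk_with_restart(adj, start, c=0.15, iterations=100):
    """Approximate RWR scores from `start` by power iteration.

    adj maps each node to the set of its neighbors; c is the restart probability.
    """
    prob = {node: 0.0 for node in adj}
    prob[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in adj}
        for node, mass in prob.items():
            # With probability 1 - c the walker moves to a uniformly chosen neighbor.
            for nb in adj[node]:
                nxt[nb] += (1.0 - c) * mass / len(adj[node])
        # With probability c the walker returns to the start node.
        nxt[start] += c
        prob = nxt
    return prob

# Hypothetical toy graph with edges a-b, a-c, b-c, c-d
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
rwr = random_walk_with_restart(adj, "a")
```

The returned values form a probability distribution over nodes; a node that is close to the start node, such as b here, receives a higher score than a more distant node such as d.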

Social Theory based Metrics. In addition to topology-based metrics, modern social theories such as node centrality [32], structural balance [5, 29], community [41] and closure are used to improve algorithms for predicting missing links or solving other problems in social networks.

Structural Balance

Social psychology cites the Structural Balance hypothesis [5]. The theory distinguishes two kinds of triads: 1) balanced triads, e.g. two friends share a friend or a foe; 2) imbalanced triads, e.g. two enemies share a friend or a foe. Leskovec et al. presented a structural balance link prediction model [29].

Weak Tie and Node Centralities

Liu et al. proposed a link prediction model [32] whose similarity measure is based on weak ties and on the degree centrality of a node, which is given by its set of neighbors. The larger the degree centrality, the more important the node.

2.2 Deep learning

Deep learning is based on multi-layer artificial neural networks trained with gradient descent. Several main types are popular, including the convolution neural network, recurrent neural networks such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), and the Generative Adversarial Network (GAN).

This section presents the state-of-the-art convolution neural network architectures and techniques, including LeNet5, AlexNet, VGG net, dropout, Adam Optimization, and other popular convolution neural network architectures or techniques in recent years.

2.2.1 Activation functions

This part introduces activation functions, which convert the input value of a layer into an output value with different properties for the next layer. Many activation functions can be applied to the output of each convolution or fully-connected layer. A suitable choice can help avoid the vanishing gradient problem, in which the gradients propagated back through the layers become extremely small, and the exploding gradient problem, in which they become extremely large. This part presents several common activation functions and notes the characteristics of the rectified linear units used in the proposed algorithm.

Sigmoid

The standard logistic function, called sigmoid, maps a real value into the range [0, 1]. However, because the sigmoid squashes its input into a very small output range, it leads to the vanishing gradient problem in multilayer neural networks.

Tanh

The Tanh function squashes a real value into the range [-1, 1] and also suffers from the vanishing gradient problem. It is related to the sigmoid σ by tanh (x) = 2σ (2x) - 1.

ReLU

To address the vanishing gradient problem of the sigmoid and Tanh functions, the Rectified Linear Unit (ReLU) was proposed. ReLU, which was first proposed by Nair et al. [37], has been widely used in CNN architectures in recent years. ReLU differs from the sigmoid and Tanh functions because it does not squash the input value: ReLU (x) = max (0, x). ReLU has two main properties: 1) an input value x > 0 is not squashed, which removes the vanishing gradient problem; and 2) the output value is 0 if the input value x is negative, which makes the deep neural network train faster. However, ReLU has a drawback called the dead ReLU problem, in which some units are never activated. Hence, many variations of the ReLU function have been proposed to solve the dead ReLU problem, including the Leaky Rectified Linear Unit (LReLU) [35], the Scaled Exponential Linear Unit (SeLU) [25], and the Exponential Linear Unit (ELU) [9].
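The behavior of these functions can be checked numerically. The short Python sketch below implements sigmoid, the Tanh identity quoted above, and ReLU with their derivatives; the function names are chosen for this example only.

```python
import math

def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """Tanh written through the identity tanh(x) = 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

def sigmoid_grad(x):
    """Derivative of the sigmoid; it is at most 0.25 and shrinks for large |x|."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

# For a large positive input the sigmoid gradient is nearly zero,
# while the ReLU gradient stays at 1.
small_gradient = sigmoid_grad(10.0)
unit_gradient = relu_grad(10.0)
```

Because the sigmoid derivative never exceeds 0.25, multiplying many such factors during backpropagation shrinks the gradient, which is the vanishing gradient effect described above.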

2.2.2 Dropout

To address overfitting, a general problem in training CNN models, dropout was proposed by Srivastava et al. in 2014 [47]. The primary idea of dropout is to randomly drop neurons from the neural network during the training step.
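A minimal sketch of the idea is given below in Python. It uses the "inverted" variant, which rescales the surviving units during training so that nothing needs to change at test time; this variant, the function signature, and the seeded generator are assumptions for illustration rather than the exact formulation of [47].

```python
import random

def dropout(activations, drop_prob=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Surviving units are scaled by 1 / keep_prob so the expected value is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

# Hypothetical usage with a seeded generator for reproducibility
rng = random.Random(0)
out = dropout([1.0] * 1000, drop_prob=0.5, rng=rng)
```

With a drop probability of 0.5, roughly half of the units are set to zero and the remaining ones are doubled; at test time (`training=False`) the activations pass through unchanged.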

2.2.3 Adam optimizer

The learning rate is one of the most important hyperparameters for gradient descent. Among the many approaches for setting it, the Adam optimizer is one of the most powerful because it achieves good results, is fast, and adjusts the learning rate dynamically. It was proposed by Kingma and Ba in 2015 [24], who combined the advantages of two popular optimizers, AdaGrad [12] and RMSProp [18].
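The update rule of Adam can be sketched for a single scalar parameter as follows. The objective function, the starting point, and the step count are hypothetical choices for this example; the default values of `beta1`, `beta2`, and `eps` follow the commonly used settings.

```python
import math

def adam_minimize(grad, x0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a one-dimensional function with Adam, given its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # running mean of the gradient
        v = beta2 * v + (1 - beta2) * g * g    # running mean of the squared gradient
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Hypothetical objective f(x) = (x - 3)^2 with gradient 2 * (x - 3)
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0, lr=0.05, steps=2000)
```

Dividing by the square root of the second-moment estimate is what makes the effective step size adapt to the scale of recent gradients.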

2.2.4 LeNet5

LeNet5 was proposed by LeCun et al. in 1998 [27] as a seven-layer CNN architecture. It established the foundation of CNN architecture with three main layer types, namely convolution, pooling, and fully-connected layers, and was used to recognize the MNIST dataset of hand-written digits, which contains 60,000 training and 10,000 test samples. It obtained high performance using the sigmoid or Tanh non-linear function and used average pooling to reduce the computational cost.

2.2.5 AlexNet

In 2012, AlexNet was constructed by Krizhevsky et al. [26] and won the ImageNet ILSVRC 2012 challenge by reducing the top-5 error to 15.3%. AlexNet is quite similar to LeNet5, but it is more complex, bigger, and deeper. It used several new techniques:

ReLU trains faster than Tanh and sigmoid and alleviates the vanishing gradient problem.

Spreading the network across two Nvidia GeForce GTX 580 GPUs in parallel, which drastically reduces the cost of computation.

Using Local Response Normalization (LRN). Because ReLUs do not normalize the input features, a normalization layer is added to aid generalization. LRN reduced the test error rate from 13% to 11% on the CIFAR-10 dataset.

Replacing average pooling with max pooling in an overlapping scheme, where the stride is smaller than the kernel, which helps avoid overfitting.

Data augmentation, which generates new images by translations and horizontal reflections.

Using dropout to randomly drop neuron units to avoid overfitting.

2.2.6 VGG

VGG net was introduced in 2014 by Simonyan and Zisserman [45]. It is similar to AlexNet but uses only 3 × 3 kernels in convolution layers and 2 × 2 kernels in pooling layers. A bigger kernel is replaced by a stack of 3 × 3 kernels, e.g. a 5 × 5 kernel by two 3 × 3 kernels and a 7 × 7 kernel by three 3 × 3 kernels, which reduces the number of parameters and improves the feature extraction ability. VGG16 and VGG19 are commonly used nowadays.
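The parameter saving can be verified with simple arithmetic. The Python sketch below counts convolution weights, ignoring biases and assuming the same number of input and output channels in every layer; the helper names and the channel count `C = 64` are hypothetical.

```python
def conv_weights(kernel_size, channels, layers=1):
    """Weights in `layers` stacked k x k convolutions with `channels` in and out."""
    return layers * kernel_size * kernel_size * channels * channels

def receptive_field(kernel_size, layers):
    """Receptive field of `layers` stacked stride-1 convolutions of equal kernel size."""
    return 1 + layers * (kernel_size - 1)

C = 64  # hypothetical channel count
two_3x3 = conv_weights(3, C, layers=2)     # 18 * C^2 weights, covers a 5 x 5 region
one_5x5 = conv_weights(5, C)               # 25 * C^2 weights
three_3x3 = conv_weights(3, C, layers=3)   # 27 * C^2 weights, covers a 7 x 7 region
one_7x7 = conv_weights(7, C)               # 49 * C^2 weights
```

Under these assumptions, two stacked 3 × 3 layers see the same 5 × 5 region with 18C² instead of 25C² weights, and three stacked layers see a 7 × 7 region with 27C² instead of 49C² weights.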

2.2.7 Other state-of-the-art CNN architectures

Many other CNN architectures have accelerated the growth of this field. For example, Google Inception v1, proposed by Szegedy et al. in 2014 [48], introduced the inception module, which combines max pooling and several convolution kernels of varying sizes, and removed the fully-connected layers. Google Inception v2, proposed by Szegedy et al. in 2016 [49], applies batch normalization in every layer of the convolution neural network to scale the output values and improve performance. Google Inception v3, proposed by Szegedy et al. in 2016 [50], factorizes the convolution kernels, e.g. a 7 × 7 kernel into 7 × 1 and 1 × 7 kernels. ResNet, proposed by He et al. in 2016 [17], is a very deep network of 152 layers that introduced residual blocks to improve performance. Huang et al. proposed DenseNet in 2016, which connects each layer to every other layer in a feed-forward fashion [20].

3 The proposed method: RPDLC algorithm

This study proposes a deep learning method called Relation Pattern Deep Learning Classification (RPDLC) to identify missing edges on a social network. It involves two stages: extracting features and building a deep learning classifier.

3.1 The neighbor-based metrics

The proposed approach uses five neighbor-based metrics: JC, SI, HP, HD, and LHN. These measurements are simple, quick, and widely used.
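As an illustration, the five metrics can be computed for a pair of nodes as in the Python sketch below, using their commonly cited definitions. The function name `pair_features`, the dictionary-of-sets graph, the toy graph, and the convention of returning 0 when a denominator is zero are assumptions for this example.

```python
def pair_features(adj, x, y):
    """Return [JC, SI, HP, HD, LHN] for nodes x and y; adj maps node -> neighbor set."""
    common = len(adj[x] & adj[y])
    union = len(adj[x] | adj[y])
    kx, ky = len(adj[x]), len(adj[y])

    def ratio(num, den):
        # Assumed convention: a score is 0 when its denominator is 0.
        return num / den if den else 0.0

    return [
        ratio(common, union),          # JC: common neighbors over all neighbors
        ratio(2 * common, kx + ky),    # SI: twice the common neighbors over the degree sum
        ratio(common, min(kx, ky)),    # HP: normalized by the lower degree
        ratio(common, max(kx, ky)),    # HD: normalized by the higher degree
        ratio(common, kx * ky),        # LHN: normalized by the product of degrees
    ]

# Hypothetical toy graph with edges a-b, a-c, b-c, c-d
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
features = pair_features(adj, "a", "c")
```

All five values depend only on set sizes, which is why they are cheap to compute for many node pairs.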

3.2 The proposed pseudo code

In Algorithm 1, the neighbor-based features, namely the JC, HD, HP, SI, and LHN metrics, are generated between N_a and each N_i ∈ N, and in the same way for N_b.

Algorithm 1
The Pseudo Code of RPDLC Algorithm

Input: social network dataset, S (the positive samples with edges, S_p; the negative samples without edges, S_n); the set of nodes, N; the threshold value of N_influence, δ; the threshold value of the shortest path between nodes, $\bar{L}$

Output: The accuracy of the results; AUC

1: set the predefined δ value, where 1 ≤ δ ≤ |N|;

2: define N_influence as the collection of important nodes;

3: n = |N|;

4: define N_d as the collection of CentralityDegree (N);

5:

6: while |N_influence| < δ do

7: N_influence adds the node N_h with the highest value in N_d;

8: N_d removes the node N_h;

9: end while

10:

11: for a from 0 to n do

12: for i from 0 to n do

13: P_a = (N_a, N_i) excerpts the values from (JC, HD, HP, SI, LHN);

14: end for

15: for b from 0 to n do

16: if the shortest distance between N_a and N_b < $\bar{L}$ then

17: for i from 0 to N_.length do

18: P_b = (N_b, N_i) excerpts the values from (JC, HD, HP, SI, LHN);

19: end for

20: end if

21: end for

22: normalize values by z-score normalization;

23: H = P_a + P_b;

24: end for

25:

26: while Accuracy and AUC do not achieve the termination condition do

27: train CNN prediction model by Section 3.3

28: end while

29: return Accuracy and AUC;

The algorithm then combines the N_a and N_b feature vectors and uses z-score normalization (Equation 1) to reshape them into a new matrix Features (m, len), where m = 5 × 2 represents the five distinct neighbor-based metrics (JC, HD, HP, SI, LHN) multiplied by the two nodes of each edge, and len represents all N in the social network. Finally, the missing link prediction model employs a convolution neural network architecture like AlexNet or VGG16.

$Z_{i} = \frac{X_{i} - \mu}{\sigma}$ (1)

where μ is the mean and σ is the standard deviation.
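Equation 1 can be applied to a list of feature values as in the following Python sketch. The use of the population standard deviation and the handling of constant input are assumptions for this example, since the text does not specify them.

```python
import math

def z_score(values):
    """Z-score normalization: subtract the mean and divide by the standard deviation."""
    mu = sum(values) / len(values)
    # Population standard deviation (assumed; the sample version divides by n - 1).
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    if sigma == 0.0:
        return [0.0 for _ in values]   # assumed convention for constant input
    return [(v - mu) / sigma for v in values]

# Hypothetical feature values with mean 5 and standard deviation 2
z = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

After normalization the values have zero mean and unit standard deviation, which puts the different metrics on a comparable scale before they are combined.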

Implementing Algorithm 1 on a large network revealed a problem: if N is very big, the computational cost and resource consumption grow steeply. Besides, social network datasets are usually imbalanced, with a large disparity between the numbers of samples in the majority and minority categories. As a result, feature extraction takes an extremely long time, and it is difficult to train a good enough convolution neural network classifier on the imbalanced data.

Therefore, Algorithm 1 includes an approach to improve efficiency and reduce computation time. The approach is based on the shortest path distance between a pair of nodes: a higher shortest path distance indicates a weaker relationship, and vice versa. Hence, $\bar{L}$ is defined as the threshold on the shortest path distance; it reduces the size of the S_n set, which alleviates the imbalanced data problem and decreases the computation time.
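A minimal sketch of this sample reduction, assuming an unweighted graph stored as a dictionary of neighbor sets, is shown below in Python. The function names and the toy graph are hypothetical; the strict comparison with the threshold follows the condition in Algorithm 1, and excluding pairs of a node with itself is an assumption.

```python
from collections import deque

def shortest_path_lengths(adj, source):
    """Breadth-first search distances from source; unreachable nodes are omitted."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def candidate_pairs(adj, max_length):
    """Ordered pairs (a, b), a != b, whose shortest distance is below max_length."""
    pairs = []
    for a in adj:
        dist = shortest_path_lengths(adj, a)
        pairs.extend((a, b) for b, d in dist.items() if b != a and d < max_length)
    return pairs

# Hypothetical graph: a path a-b-c-d and a separate edge e-f
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"},
       "e": {"f"}, "f": {"e"}}
pairs_3 = candidate_pairs(adj, max_length=3)
pairs_2 = candidate_pairs(adj, max_length=2)
```

Pairs in different components, such as (a, e), never appear because breadth-first search does not reach them, and lowering the threshold shrinks the candidate set.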

Algorithm 1 also limits the cost of each sample. The number of nodes increases with the size of the social network, which makes extracting characteristics from the dataset prohibitively expensive. Degree centrality is therefore used to reduce the number of parameters in each sample. The degree centrality of a node, one of the most often used centrality measures, is the number of edges attached to the node. In directed social networks it has two types, in-degree (the number of incoming edges) and out-degree (the number of outgoing edges), while in undirected social networks there is only one kind. The proposed approach ranks all nodes by degree centrality and uses δ as the threshold that determines the number of influential nodes. The most significant nodes in the social network dataset are thus defined as N_influence. Algorithm 1 uses this method to substantially decrease computing resources.
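The selection of influential nodes by degree centrality can be sketched in a few lines of Python. The function name, the undirected dictionary-of-sets graph, and the tie-breaking by original node order are assumptions for this example.

```python
def select_influence_nodes(adj, delta):
    """Return the delta nodes with the highest degree (ties keep the original order)."""
    ranked = sorted(adj, key=lambda node: len(adj[node]), reverse=True)
    return ranked[:delta]

# Hypothetical toy graph with edges a-b, a-c, b-c, c-d
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
influence = select_influence_nodes(adj, delta=2)
```

Sorting once and slicing gives the same set as repeatedly removing the highest-degree node, as the loop in Algorithm 1 does.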

3.3 Convolution neural network

After feature extraction, the features extracted by the above algorithm are classified. The following section introduces the deep learning architecture used to build the proposed prediction model.

The Convolution Neural Network (CNN) has become a major research topic in computer vision over the past few years. It is a special type of neural network based on two primary ideas: 1) Images are treated as 2D or 3D structures with a high correlation between neighboring pixels; a CNN relies on feature sharing, and each channel uses the same kernel function at all pixel locations. 2) Pooling lets the network gradually see larger portions of the input image, which decreases the effect of small variations in position. The following paragraphs describe the important layers and concepts of CNN, including the convolution layer, pooling layer, and fully-connected layer.

Convolution Layer The convolution layer is one of the most important steps in any CNN architecture. Three primary properties of the convolution layer can improve performance: sparse interactions, parameter sharing, and equivariant representations [15]. Sparse interaction refers to making the kernel smaller than the input image. Parameter sharing lets the same set of parameters be reused across the model, and equivariant representation means that a change in the input leads to a corresponding change in the output.

Pooling Layer The pooling layer is often placed after a convolution layer and progressively reduces the size of its input to decrease the number of parameters and the computation. The pooling function outputs a summary statistic that represents a given location [15].

Fully-connected Layer The fully-connected layer flattens all neurons in the previous layer and computes a matrix multiplication, as in a regular neural network. After the feedforward pass, the CNN uses backpropagation and gradient descent to approach an approximately optimal solution.

After examining the strong approaches for CNN in Section 2.2, the deep learning framework of the proposed method applies these principles to create the CNN architecture for predicting missing links. First, the framework constructs layers similar to AlexNet and VGG nets, stacking convolution layers with only 3 × 3 kernels and max pooling layers. It uses SeLU as the activation function between layers, which normalizes the output values and addresses the vanishing gradient problem. Second, the framework employs dropout to reduce overfitting and improve the generalization of the prediction model. Finally, the framework uses the Adam optimizer to regulate the training speed. To summarize, the proposed technique uses a unique feature extraction methodology and CNN framework to predict potential links in the social network domain.
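The convolution and pooling operations can be illustrated without any deep learning library. The Python sketch below applies one small kernel to a single-channel input and then max-pools the result; the function names, the stride of 1, the absence of padding, and the toy input are assumptions for this example and do not describe the architecture used in this study.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation (the 'convolution' used in CNNs), stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [[max(feature_map[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

# Hypothetical 4 x 4 input and a 3 x 3 kernel that responds to a diagonal pattern
image = [[1, 0, 2, 1],
         [0, 1, 0, 2],
         [2, 0, 1, 0],
         [1, 2, 0, 1]]
kernel = [[1, 0, 0],
          [0, 1, 0],
          [0, 0, 1]]
feature_map = conv2d(image, kernel)
pooled = max_pool2d(feature_map, size=2)
```

The same nine kernel weights are reused at every position of the input, which is the parameter sharing described above, and pooling keeps only the strongest response in each window.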

4 Experiments

This section describes our experimental setup and outcomes. The experiment’s goal is to anticipate missing connections in three social network datasets.

4.1 Experimental setup

4.1.1 Description of datasets

To evaluate the suggested approach of link prediction in diverse social networks, we used three real-world datasets.

Jazz is the collaboration network between jazz musicians compiled by Pablo M. Gleiser and Leon Danon [14]. Each node is a jazz musician, and an edge exists if two musicians played in at least one common band.

NetScience is the co-authorship network of scientists working on network theory and experiments, as compiled by Newman [38].

Facebook is the friendship network of participants using Facebook, the well-known social networking service. This dataset consists of friends lists, ego networks, and node features (profiles); an edge exists if one participant follows another [36].

The detailed information about these datasets is described in Table 1. Each original dataset is split into ten equal-sized subsamples. Two subsamples were kept as test data, while the other eight were used as training data.

Table 1
Statistics of three datasets

DBs	Nodes	Edges	Avg. Degree	Avg. Shortest Dist.	Type
Jazz	198	5484	27.70	2.24	Undirected
NetScience	1589	2742	1.73	5.53	Undirected
Facebook	4039	2742	21.85	4.70	Undirected

4.1.2 Evaluation metrics

The experiments employ a common metric, the Area Under the Receiver Operating Characteristic Curve (AUC) [16], to assess the accuracy of link prediction models. The AUC is the probability that a randomly chosen missing edge receives a higher score than a randomly chosen non-existent edge [59]. Among n independent comparisons, let n^′ be the number of times the missing edge has a higher score than the non-existent edge and n^″ the number of times the two scores are equal. Equation 2 is used to define the AUC.

$AUC = \frac{n' + 0.5\,n''}{n}$ (2)

If the scores are assigned at random, the AUC will be about 0.5. AUC > 0.5 implies greater algorithm performance.
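The AUC computation can be sketched in Python as follows, using the usual form of this definition, (n′ + 0.5 n″) / n. For reproducibility the sketch compares every missing edge with every non-existent edge instead of drawing n random comparisons; this exhaustive variant, the function name, and the example scores are assumptions for illustration.

```python
def auc_score(missing_scores, nonexistent_scores):
    """AUC over all pairwise comparisons: (n' + 0.5 * n'') / n."""
    higher = ties = 0
    for m in missing_scores:
        for s in nonexistent_scores:
            if m > s:
                higher += 1      # contributes to n'
            elif m == s:
                ties += 1        # contributes to n''
    n = len(missing_scores) * len(nonexistent_scores)
    return (higher + 0.5 * ties) / n

# Hypothetical similarity scores produced by some link predictor
missing = [0.9, 0.6, 0.4]
nonexistent = [0.5, 0.4, 0.1, 0.1]
auc = auc_score(missing, nonexistent)
```

Here 10 of the 12 comparisons favor the missing edge and one is a tie, so the result is 10.5 / 12 = 0.875.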

4.2 Experimental results

The experimental results are divided into four parts. First, the experiment examines the effect of the shortest path threshold used by the proposed algorithm. Second, the experiment varies the number of influence nodes and reports the effect on the computation cost of feature extraction, together with a visualization of the feature map for each number of influence nodes. Third, the experiment reports the AUC of the RPDLC algorithm on the three datasets, Jazz, NetScience, and Facebook, and compares it with other link prediction algorithms. Finally, the experiment evaluates each of the five similarity metrics used independently with the proposed algorithm and compares the results across the three datasets.

4.2.1 Result for different threshold of shortest path

The threshold on the shortest path influences the number of samples, as shown in Fig. 1 and Table 2 for the three datasets: a bigger threshold results in a larger number of samples. The proposed approach also excludes the special case in which N_b cannot be reached from N_a by any path. Such pairs have no link; they occur, for example, in the NetScience dataset, a very low degree network, and the proposed approach discards these abnormal samples. To sum up, the threshold serves two main functions: reducing the number of samples by limiting the path length and excluding pairs of nodes without any path.

Fig. 1

Effect of different shortest path thresholds.

Table 2

Sample size by each threshold of shortest path

	2	3	4	5	6
Jazz	26,986	37,120	38,879	39,188	39,204
NetScience	15,033	27,763	47,283	71,451	96,547
Facebook	2,896,641	6,878,493	12,740,053	15,305,223	15,982,437
	7	8	9	10	11
Jazz	39,204	39,204	39,204	39,204	39,204
NetScience	119,057	133,755	142,461	148,085	151,385
Facebook	16,297,901	16,313,521	16,313,521	16,313,521	16,313,521

4.2.2 Result of different amount of influence nodes

This section presents the effect of varying the number of influence nodes. A larger number of influential nodes increases the computation time, as depicted in Fig. 2. The number of influential nodes also affects the performance of the trained CNN link prediction model: a larger number results in lower accuracy, probably because a large influence set contains unimportant nodes or outliers. Hence, a balance must be found between the computation cost and the performance of the trained model.

Fig. 2

Extraction time for different numbers of influence nodes.

4.2.3 Comparisons with other algorithms

Five baseline link prediction methods were considered: CN [8], AA [1], RWR [39], FL [40], and PNC [56]. In RWR, c = 0.001, l = 3, and p = 1. The setup of the RPDLC algorithm is detailed in Table 3. Table 3a shows the feature extraction configuration for the three datasets with their shortest path thresholds. Because the Jazz dataset was small, all samples were taken and 6 was used as the shortest path threshold. For NetScience, a low degree social network, the experiment used 11 as the threshold; this value discarded the samples with no path between the two nodes and kept nearly all samples with any path distance. Finally, because the Facebook dataset was huge, a small threshold was employed to minimize the sample size. The hyperparameters of the proposed CNN models are provided in Table 3b. The CNN training environment uses TensorFlow 1.5.0 and a GTX 1080Ti GPU.

Table 3
Details of the setup for constructing the model

	(a) Setup for feature extraction
Dataset	Shortest path threshold	Influence nodes
Jazz	6	10
NetScience	11	10
Facebook	3	10
	(b) Setup for training CNN models
Hyperparameter	Jazz	NetScience	Facebook
Learning rate	0.001	0.001	0.001
Batch size	128	128	128
Optimizer	Adam	Adam	Adam
Iterations	3000	3000	3000
Activation	SELU	SELU	SELU
Dropout	0.5	0.5	0.5
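The effect of the shortest path threshold in Table 3a on the sample count can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pairs_within_cutoff`, the toy graph, and the choice to count ordered pairs without self-pairs are all assumptions made for illustration. The sketch runs a depth-limited breadth-first search from every node and keeps only the node pairs whose shortest path length does not exceed the cutoff, so pairs with no connecting path are discarded.

```python
from collections import deque


def pairs_within_cutoff(adjacency, cutoff):
    """Return ordered pairs (u, v), u != v, with shortest path length <= cutoff.

    Hypothetical helper for illustration; `adjacency` maps each node to a
    list of its neighbors in an undirected graph.
    """
    pairs = []
    for source in adjacency:
        # Breadth-first search from `source`, limited to depth `cutoff`.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if distance[node] == cutoff:
                continue  # do not expand nodes already at the cutoff depth
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        # Every node reached (other than the source) forms a retained sample.
        pairs.extend((source, target) for target in distance if target != source)
    return pairs


# Toy undirected graph (an assumption): a path 0-1-2-3 plus an isolated node 4.
toy_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}

for cutoff in (1, 2, 3):
    print(cutoff, len(pairs_within_cutoff(toy_graph, cutoff)))
```

On the toy graph the number of retained pairs grows with the cutoff and then saturates, which mirrors the behavior of the sample counts reported for the three datasets; the isolated node never appears in any retained pair.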

As shown in Figs. 3, 4 and 5, the RPDLC algorithm achieved the highest AUC on all datasets, and the PNC algorithm achieved the second highest. FL ranked third on the Jazz dataset, which has the highest average degree. RWR ranked third on the NetScience and Facebook datasets, which have lower average degrees than Jazz, so a random walk is less likely to wander onto incorrect paths. The similarity-based methods using neighbor-based metrics, AA and CN, consider only the set of common neighbors and therefore predict links only approximately, which resulted in average performance. By combining simple features based on neighbor-based metrics with CNN models, the RPDLC algorithm achieved a substantially higher AUC than the PNC, FL, RWR, AA and CN algorithms in the experiment.

Fig. 3

AUC values of different algorithms on the Jazz dataset.

Fig. 4

AUC values of different algorithms on the NetScience dataset.

Fig. 5

AUC values of different algorithms on the Facebook dataset.
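To make the comparison concrete, the following minimal sketch scores node pairs with the two simplest baselines, CN and AA, and evaluates them with a pairwise AUC. It is only an illustration under stated assumptions: the toy training graph, the held-out edges, and the sampled non-edges are invented, and the AUC is computed in the form commonly used for link prediction (the probability that a held-out edge scores higher than a non-existent edge, with ties counted as one half), which may differ in detail from the evaluation protocol used in the experiments.

```python
import math
from itertools import product


def common_neighbors(adj, x, y):
    """CN score: number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])


def adamic_adar(adj, x, y):
    """AA score: each shared neighbor z contributes 1 / log(degree(z)).

    A neighbor shared by two distinct nodes has degree >= 2, so log > 0.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])


def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) score pairs ranked correctly; ties count 0.5."""
    wins = ties = 0
    for pos, neg in product(positive_scores, negative_scores):
        if pos > neg:
            wins += 1
        elif pos == neg:
            ties += 1
    return (wins + 0.5 * ties) / (len(positive_scores) * len(negative_scores))


# Toy training graph as symmetric neighbor sets (an assumption).
train = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
    "e": {"f"},
    "f": {"e"},
}
held_out_edges = [("a", "d"), ("c", "e")]  # hypothetical removed (missing) links
non_edges = [("a", "e"), ("d", "f")]       # hypothetical non-existent links

for name, score in (("CN", common_neighbors), ("AA", adamic_adar)):
    pos = [score(train, x, y) for x, y in held_out_edges]
    neg = [score(train, x, y) for x, y in non_edges]
    print(name, auc(pos, neg))
```

In this toy case the held-out edge between `c` and `e` has no common neighbors, so both baselines score it no higher than the non-edges. This is the kind of case in which methods that rely only on the common-neighbor set lose accuracy.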

4.2.4 Comparison of individual metrics

In the final part of the results, we compare the performance of each neighbor-based metric when used on its own, as depicted in Figs. 6, 7 and 8. First, Fig. 6 shows that every metric achieves a high AUC on the Jazz dataset, a social network with a high degree. Second, Fig. 7 shows that HP, HD and LHN achieve only average performance on the NetScience dataset, probably because this social network has a very low degree. In a low-degree network, HP, HD and LHN often take degenerate values such as 1 or 0 and thus lose their discriminative character. Finally, Fig. 8 shows that SI, HP and LHN achieve only average performance on the Facebook dataset, probably because its degree distribution is uneven.

Fig. 6

AUC values of the RPDLC algorithm for each metric on the Jazz dataset.

Fig. 7

AUC values of the RPDLC algorithm for each metric on the NetScience dataset.

Fig. 8

AUC values of the RPDLC algorithm for each metric on the Facebook dataset.
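The degenerate values mentioned above can be reproduced with a short sketch. It assumes the standard textbook definitions of the five neighbor-based metrics (JC, SI, HP, HD and LHN) in terms of the neighbor sets and degrees of the two nodes; the function name `neighbor_metrics` and both toy graphs are hypothetical, and the convention of returning 0 when a denominator is 0 is an assumption made to keep the example total.

```python
def neighbor_metrics(adj, x, y):
    """Five neighbor-based similarity scores for the pair (x, y).

    Uses the standard definitions (assumed here):
      JC  = |N(x) & N(y)| / |N(x) | N(y)|
      SI  = 2 |N(x) & N(y)| / (k_x + k_y)
      HP  = |N(x) & N(y)| / min(k_x, k_y)
      HD  = |N(x) & N(y)| / max(k_x, k_y)
      LHN = |N(x) & N(y)| / (k_x * k_y)
    """
    kx, ky = len(adj[x]), len(adj[y])
    common = len(adj[x] & adj[y])
    union = len(adj[x] | adj[y])

    def ratio(numerator, denominator):
        # Assumption: define the score as 0 when the denominator is 0.
        return numerator / denominator if denominator else 0.0

    return {
        "JC": ratio(common, union),
        "SI": ratio(2 * common, kx + ky),
        "HP": ratio(common, min(kx, ky)),
        "HD": ratio(common, max(kx, ky)),
        "LHN": ratio(common, kx * ky),
    }


# Sparse toy graph (an assumption): hub h with two leaves, plus a separate edge p-q.
sparse = {"h": {"u", "v"}, "u": {"h"}, "v": {"h"}, "p": {"q"}, "q": {"p"}}
print(neighbor_metrics(sparse, "u", "v"))  # both leaves have degree 1
print(neighbor_metrics(sparse, "u", "p"))  # no shared neighbor

# Denser toy pair (only the two neighbor sets that the function reads are given).
dense = {"x": {1, 2, 3, 4}, "y": {3, 4, 5}}
print(neighbor_metrics(dense, "x", "y"))
```

In the sparse graph every metric collapses to exactly 1 for the two leaves and to exactly 0 for the unrelated pair, whereas the denser pair yields distinct intermediate values for each metric. This is consistent with the explanation that low-degree networks erase the differences between the metrics.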

5 Conclusions and future work

Influential nodes and neighbor-based metrics were used to extract distinctive features for the convolution neural network model. The main contribution is a deep learning link prediction method that can accurately predict missing relationships. This study used deep learning, a strong classification approach, to improve the effectiveness of predicting relationships in social networks. The proposed framework outperforms AA, CN, RWR, PNC, and FL. The proposed technique also lowers computing costs by leveraging influential nodes and the shortest paths connecting nodes, which reduces the sample size and the number of parameters per sample and increases the efficiency of model training. To train the convolution neural network model for predicting missing links, the proposed technique employs five neighbor-based metrics simultaneously: JC, SI, HD, HP, and LHN. According to the experimental results, the number of influential nodes affected the feature extraction time, the shortest path threshold reduced the sample size (from 16,313,521 to 6,878,493 on the Facebook dataset), and training on each metric independently resulted in worse performance than training on all five metrics simultaneously.

5.1 Future work

Despite the strong performance of the RPDLC algorithm, there is still room for improvement. First, the RPDLC method uses a CNN, which raises the cost of computation even with the proposed influential nodes strategy. Moreover, the RPDLC method employs only neighbor-based metrics to extract features and ignores alternative similarity metrics. The RPDLC algorithm will be extended to include similarity-based metrics (e.g. FL, Katz, and RSS), random walk-based metrics (e.g. ACT, SR, and RWR), social theory-based metrics (e.g. homophily, structural balance, and node centrality), and combinations of different sets of metrics. We should also test the RPDLC method on further real-world social network datasets. Finally, more study on complex networks is needed to determine the stability and versatility of the RPDLC algorithm.

References

1. Adamic L.A. and Adar, Friends and neighbors on the web, Social Networks 25(3) (2003), 211–230.

2. Barabási A.L. and Albert, Emergence of scaling in random networks, Science 286(5439) (1999), 509–512.

3. Barabási A.L., Jeong, Néda, Ravasz, Schubert and Vicsek, Evolution of the social network of scientific collaborations, Physica A: Statistical Mechanics and Its Applications 311(3-4) (2002), 590–614.

4. Barbieri, Bonchi and Manco, Who to follow and why: link prediction with explanations, In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1266–1275, ACM, 2014.

5. Cartwright and Harary, Structural balance: a generalization of Heider’s theory, Psychological Review 63(5) (1956), 277.

6. Chen C.-M., Chen L.-L., Gan W.-S., Qiu and Ding W.-P., Discovering high utility-occupancy patterns from uncertain data, Information Sciences 546 (2020), 1208–1229.

7. Chen H.-H., Gou, Zhang X.L.L. and Giles C.L., Discovering missing links in networks using vertex similarity measures, In The 27th Annual ACM Symposium on Applied Computing, 138–143, ACM, 2012.

8. Chen J.-L., Geyer, Dugan, Muller and Guy, Make new friends, but keep the old: recommending people on social networking sites, In The SIGCHI Conference on Human Factors in Computing Systems, 201–210, ACM, 2009.

9. Clevert D.A., Unterthiner and Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), arXiv preprint arXiv:1511.07289, 2015.

10. Dillon and Martin, Introduction to modern information retrieval: G. Salton and M. McGill. McGraw-Hill, New York (1983), xv+ 448 pp., 32.95 ISBN 0-07-054484-0, 1983.

11. Duan, Ma, Aggarwal, Ma T.-J. and Huai J.-P., An ensemble approach to link prediction, IEEE Transactions on Knowledge and Data Engineering 29(11) (2017), 2402–2416.

12. Duchi, Hazan and Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research 12 (2011), 2121–2159.

13. Fouss, Pirotte, Renders J.M. and Saerens, Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation, IEEE Transactions on Knowledge and Data Engineering 19(3) (2007), 355–369.

14. Gleiser P.M. and Danon, Community structure in jazz, Advances in Complex Systems 6(4) (2003), 565–573.

15. Goodfellow, Bengio and Courville, Deep Learning, MIT Press, 2016.

16. Hanley J.A. and McNeil B.J., The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology 143(1) (1982), 29–36.

17. He K.-M., Zhang X.-Y., Ren S.-Q. and Sun, Deep residual learning for image recognition, In The IEEE Conference on Computer Vision and Pattern Recognition, 770–778, 2016.

18. Hinton, Srivastava and Swersky, Neural networks for machine learning lecture 6a overview of mini-batch gradient descent, 14(8) (2012), 2.

19. Hristova, Noulas, Brown, Musolesi and Mascolo, A multilayer approach to multiplexity and link prediction in online geo-social networks, EPJ Data Science 5(1) (2016), 24.

20. Huang, Liu and Weinberger K.Q., Densely connected convolutional networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4700–4708, IEEE, 2016.

21. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et des Jura, Bull Soc Vaudoise Sci Nat 37 (1901), 547–579.

22. Jeh and Widom, SimRank: a measure of structural-context similarity, In The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 538–543, ACM, 2002.

23. Katz, A new status index derived from sociometric analysis, Psychometrika 18(1) (1953), 39–43.

24. Kingma D.P. and Ba, Adam: A method for stochastic optimization, The 3rd International Conference for Learning Representations, 2014.

25. Klambauer, Unterthiner, Mayr and Hochreiter, Self-normalizing neural networks, Proceedings of the 31st International Conference on Neural Information Processing Systems, 972–981, 2017.

26. Krizhevsky, Sutskever and Hinton G.E., ImageNet classification with deep convolutional neural networks, In Advances in Neural Information Processing Systems, 1097–1105, 2012.

27. LeCun, Bottou, Bengio and Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11) (1998), 2278–2324.

28. Leicht E.A., Holme and Newman M.E., Vertex similarity in networks, Physical Review E 73(2) (2006), 026120.

29. Leskovec, Huttenlocher and Kleinberg, Predicting positive and negative links in online social networks, In The 19th International Conference on World Wide Web, 641–650, ACM, 2010.

30. Li and Chen, Recommendation as link prediction in bipartite graphs: A graph kernel-based machine learning approach, Decision Support Systems 54(2) (2013), 880–890.

31. Liben-Nowell and Kleinberg, The link-prediction problem for social networks, Journal of The Association for Information Science and Technology 58(7) (2007), 1019–1031.

32. Liu H.-F., Hu, Haddadi and Tian, Hidden link prediction based on node centrality and weak ties, EPL (Europhysics Letters) 101(1) (2013), 18004.

33. Liu W.-P. and Lü L.-Y., Link prediction based on local random walk, EPL (Europhysics Letters) 89(5) (2010), 58007.

34. Lü L.-Y., Jin C.-H. and Zhou, Similarity index based on local paths for link prediction of complex networks, Physical Review E 80(4) (2009), 046122.

35. Maas A.L., Hannun A.Y. and Ng A.Y., Rectifier nonlinearities improve neural network acoustic models, In Proc. ICML 30 (2013), 3.

36. McAuley and Leskovec, Learning to discover social circles in ego networks, NIPS 2012 (2012), 548–556.

37. Nair and Hinton G.E., Rectified linear units improve restricted Boltzmann machines, In The 27th International Conference on Machine Learning (ICML-10), 807–814, 2010.

38. Newman M.E.J., Finding community structure in networks using the eigenvectors of matrices, Physical Review E 74(3) (2006), 036104.

39. Pan J.-Y., Yang H.-J., Faloutsos and Duygulu, Automatic multimedia cross-modal correlation discovery, In The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 653–658, ACM, 2004.

40. Papadimitriou, Symeonidis and Manolopoulos, Fast and accurate link prediction in social networking systems, Journal of Systems and Software 85(9) (2012), 2119–2132.

41. Pirouz, Zhan and Tayeb, An optimized approach for community detection and ranking, Journal of Big Data 3(1) (2016), 22.

42. Ravasz, Somera A.L., Mongru D.A., Oltvai Z.N. and Barabási A.L., Hierarchical organization of modularity in metabolic networks, Science 297(5586) (2002), 1551–1555.

43. Scellato, Noulas and Mascolo, Exploiting place features in link prediction on location-based social networks, In The 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1046–1054, ACM, 2011.

44. Sherkat, Rahgozar and Asadpour, Structural link prediction based on ant colony approach in social networks, Physica A: Statistical Mechanics and Its Applications 419 (2015), 80–94.

45. Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.

46. Sørensen, A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons, Biol Skr 5 (1948), 1–34.

47. Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15(1) (2014), 1929–1958.

48. Szegedy, Liu, Jia Y.-Q., Sermanet, Reed, Anguelov, Erhan, Vanhoucke and Rabinovich, Going deeper with convolutions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9, IEEE, 2015.

49. Szegedy, Vanhoucke, Ioffe, Shlens and Wojna, Rethinking the inception architecture for computer vision, In The IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826, 2016.

50. Szegedy, Vanhoucke, Ioffe, Shlens and Wojna, Rethinking the inception architecture for computer vision, In The IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826, 2016.

51. Tang J.L., Chang S.-Y., Aggarwal and Liu, Negative link prediction in social media, In The Eighth ACM International Conference on Web Search and Data Mining, 87–96, ACM, 2015.

52. Wang G.N., Gao, Chen, Mensah D.N. and Fu, Predicting positive and negative relationships in large social networks, PloS One 10(6) (2015), e0129530.

53. Wu T.Y., Chen Y.Q. and Chen C.M., An enhanced pairing-based authentication scheme for smart grid communications, Journal of Ambient Intelligence and Humanized Computing, 2021.

54. Wu T.Y., Yang, Lee and Chen C.M., Improved ECC-based three-factor multiserver authentication scheme, Security and Communication Networks 2021 (2021), 6627956.

55. Wu T.Y., Lee Z.Y., Obaidat M.S., Kumari, Kumar and Chen C.-M., An authenticated key exchange protocol for multi-server architecture in 5G networks, IEEE Access 8 (2020), 28096–28108.

56. Yu C.-M., Zhao X.-L., An and Lin, Similarity-based link prediction in social networks: A path and node combined approach, Journal of Information Science 43(5) (2017), 683–695.

57. Zhao Z.-Y., Zhang X.-J., Zhou, Li, Gong M.-G. and Wang Y.-Q., HetNERec: Heterogeneous network embedding based recommendation, Knowledge-Based Systems 204 (2020), 106218.

58. Zhou and Kwan, Missing link prediction in social networks, In International Symposium on Neural Networks, 346–354, Springer, 2018.

59. Zhou, Lü L.-Y. and Zhang Y.-C., Predicting missing links via local information, The European Physical Journal B 71(4) (2009), 623–630.

