Prediction optimization of diffusion paths in social networks using integration of ant colony and densest subgraph algorithms

Abstract

One of the most important challenges in social networks is predicting information diffusion paths. Studying and modeling the propagation routes is important for optimizing social network-based platforms. In this paper, a new method is proposed to increase the prediction accuracy of diffusion paths by integrating the ant colony and densest subgraph algorithms. The proposed method consists of three steps: clustering nodes, creating propagation paths with the ant colony algorithm, and predicting information diffusion on the created paths. The densest subgraph algorithm creates a subset of maximally independent nodes from the input graph, which serve as the centers of the clusters. Once the clusters are identified, the final information diffusion paths in the network are predicted using the ant colony algorithm. Four real social network datasets were used to evaluate the performance of the proposed method. In these evaluations, the proposed method outperformed the compared methods.

Keywords

Diffusion paths prediction; information diffusion patterns; densest subgraphs; ant colony algorithm; centrality

1. Introduction

Social networks are a new way of presenting connections and interactions among people by using the latest web technologies. These cross-border networks have attracted many users and their efficiency has been improving over time [27]. They are known as a suitable place for producing and sharing important materials and topics. A social network is a social structure composed of various individual or organizational groups. In other words, it is a mapping of all related edges among the vertices under study and can be used to identify the social position of users. These concepts are often analyzed using graph theory, which plays an important role in investigating and analyzing social networks [20,30].

One of the recent research interests in the field of information diffusion is how information is distributed and what the diffusion patterns are [18,24]. The challenge is to find an efficient way to predict the paths of diffusion based on real data, which has many applications in areas such as ecommerce, virus resources detection, posting in blogs, gossip news, etc. [43]. Many approaches have been proposed for predicting information diffusion paths so far. For example, in [16,32] the popularity of news is used to find the nodes that will be influenced in the future: the nodes which were influenced in the past are used to find the diffusion pattern based on a series of parameters, and the nodes that will be influenced in the future are then predicted as a function of time. The common issue with previous works is that the large number of nodes in high-dimensional networks makes it necessary to divide the nodes into clusters. For instance, the studies in [8,42] used the Louvain community detection algorithm to group nodes into clusters and then predict the diffusion paths. One of the issues of the Louvain algorithm is that there is no control over the centers of clusters and their number. The centers of clusters are important and can be used in information propagation. To improve the prediction of diffusion paths, we use a modified densest subgraph algorithm [2] instead of Louvain to identify the centers of clusters. This algorithm creates a subset of nodes with maximum independence and determines the primary centers of clusters, so it is not necessary to predefine the number of clusters. After creating the densest subgraphs, the final paths of information diffusion in the network are predicted by running the ant colony algorithm on the created clusters.

In the remainder of this paper, Section 2 introduces the preliminary knowledge of the techniques and technologies used in this paper. Section 3 reviews the previous works and findings. The proposed method is described in Section 4. Section 5 discusses the performance evaluation of our approach and previous works and finally, the conclusion is given in Section 6.

2. Preliminary knowledge

2.1. Machine learning

Machine learning, as one of the broadly used applications of artificial intelligence, explores and regulates the ways in which computer systems can obtain the learning capability. The purpose of machine learning is to enable a computer system to learn gradually from increasing amounts of training data and improve its performance in doing tasks. Each machine learning program needs a “training dataset” to learn what kind of information it should expect and what kind of information the planner is looking for [5]. The majority of machine learning methods use the supervised learning approach. In this approach, the system tries to learn from the prior examples that are available to it. In other words, it tries to learn and detect patterns based on the given examples [33].

Machine learning provides a significant improvement in various fields such as text processing, risk assessment, intrusion detection, face recognition, recommender systems, image retrieval, medical diagnosis, case-based reasoning, bioinformatics, social network analysis, etc. For instance, Angelo et al. [7] used association rules, a machine learning technique, to find unfair tuples in recommender data.

2.2. Information diffusion

Mark Granovetter first investigated the issue of information dissemination in social networks [14]. He assumed information is disseminated in social networks, without considering any specific mechanisms. Information such as news, innovations, and viruses starts with a collection of seed nodes and spreads across the network [3]. The dissemination of information has been investigated in a wide range of fields including health care [19] and complex networks. One of the most important tasks on networked systems is to understand, model and predict the rapid events and developments in the network body. This is mainly due to the well-known fact that discovering the network structure helps to predict the patterns of social events, such as their shape, size and growth, which is known as information diffusion [23]. Various techniques and methods have been proposed for modelling the diffusion of information on homogeneous and heterogeneous networks, such as those discussed in [17].

2.3. Ant colony optimization

Swarm intelligence is a systemic property in which agents interact locally and the collective behavior of all agents converges to a point close to the global optimum. In these algorithms, each particle has relative autonomy: it can move around the solution space but must cooperate with other particles. One of the well-known swarm intelligence algorithms is ant colony optimization, which is widely used in optimization to search for the global optimum [36].

In the early 1990s, the Ant System (AS) algorithm was proposed by Dorigo et al. [9] as a new heuristic to solve difficult optimization problems. Later, Dorigo proposed the Ant Colony Optimization (ACO) algorithm as a multi-agent approach to optimization problems such as the Travelling Salesman Problem (TSP). This algorithm is inspired by the behavior of ants, which are able to find the shortest path between the nest and a food source and also adapt to environmental changes [10].

A graph is used in the ant colony algorithm to represent the problem that needs to be solved or optimized. In this graph, the nodes represent the problem states, and the edges represent the transition between states. The spilled pheromones on the paths are the information collected by ants during the food search which is represented by a value on edges (for example, $τ (i, j)$ represents the pheromone between node i and j). An exploratory value is also represented on each edge as basic information in the problem (for example, $η (i, j)$ represents the value of the edge between node i and j). In general, the ant colony optimization algorithm works as follows. Initially, some ants are randomly placed on graph nodes. Each ant creates a possible solution to the problem by applying a State Transition Rule continuously. Ants prefer to move to states which are connected by shorter edges. A general Pheromone Update Rule applies when all ants finished their routes. In this rule, some pheromones are evaporated on all edges. Also, ants put pheromones on the edges that exist in their solutions. In other words, better paths receive more pheromone. This process continues until a predefined stopping condition is reached [28].

2.4. Densest subgraph

The issue of identifying a maximum-weighted subgraph with some size constraint is a well-known and widespread problem in data mining and has been studied in social network analysis. In this problem, a subset of nodes that have the highest ratio of edges between pairs of nodes should be found on the given input graph [38]. In 1984, Goldberg developed a polynomial-time algorithm to find the maximum density subgraph using a max flow technique [31]. There are many variations on the densest subgraph problem. One of them is the densest k subgraph problem, where the goal is to find the maximum density subgraph on exactly k vertices.

Let $G = (V, E)$ be an unweighted undirected graph. The density of a subgraph $S \subseteq V$ is defined as $d (S) = \frac{| E (S) |}{| S |}$ where $E (S)$ is the set of edges of the subgraph S and $| S |$ is its cardinality. The maximum density of a graph is denoted as $d^{*} (G) = \max_{S \subseteq V} d (S)$ . Similarly, the density of a subgraph $S \subseteq V$ in a weighted graph $G = (V, E)$ is defined as $d (S) = \frac{\sum_{e \in E (S)} W_{e}}{| S |}$ , where $E (S)$ is the set of edges of subgraph S and $W_{e}$ is the weight of edge $e \in E (S)$ .
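As a small illustration of these definitions, the following Python sketch computes $d (S)$ for a vertex subset and evaluates $d^{*} (G)$ by exhaustive search on a toy graph. The function names, the edge-list representation and the toy graph are assumptions made for this example. The exhaustive search is exponential in the number of vertices and is shown only to make the definition concrete, not as a practical algorithm.

```python
from itertools import combinations


def subgraph_density(edges, nodes, weights=None):
    """Return d(S): total weight of edges inside S divided by |S|.

    edges   -- list of (u, v) pairs of an undirected graph
    nodes   -- the vertex subset S
    weights -- optional dict mapping each (u, v) pair in `edges` to W_e;
               when omitted, every edge counts as 1 (unweighted case)
    """
    nodes = set(nodes)
    if not nodes:
        raise ValueError("S must be non-empty")
    total = 0.0
    for u, v in edges:
        if u in nodes and v in nodes:
            total += 1.0 if weights is None else weights[(u, v)]
    return total / len(nodes)


def max_density_bruteforce(edges, vertices):
    """Evaluate d*(G) by trying every non-empty subset (tiny graphs only)."""
    vertices = list(vertices)
    best_density, best_set = 0.0, None
    for size in range(1, len(vertices) + 1):
        for subset in combinations(vertices, size):
            d = subgraph_density(edges, subset)
            if d > best_density:  # strict: keeps the smallest maximiser found first
                best_density, best_set = d, set(subset)
    return best_density, best_set


# Toy graph: a triangle {a, b, c} with a pendant vertex d attached to c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
```

In the toy graph, the triangle has three edges on three vertices, so its density is 1, while the pair {c, d} has density 0.5.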

2.5. Node centrality

Identifying the nodes that are more central than others has been a key issue with many uses in network analysis [6,25]. Various centrality measures have been proposed for unweighted networks, such as degree centrality, betweenness centrality, closeness centrality, eigenvector centrality, and subgroup centrality. There have been several attempts to generalize these centrality measures to weighted networks, but they still inherit the weaknesses of their unweighted counterparts. For example, degree centrality is very simple but has little relevance. Global metrics such as betweenness and closeness centrality can better identify important nodes but are not ideal for large-scale networks due to their computational complexity. To deal with this issue, the Laplacian centrality measure [6] is applied to measure the node centrality in this paper. For a weighted graph G, the matrices W and X are defined as follows:

$\begin{matrix} (1) & W (G) = \begin{bmatrix} 0 & w_{1, 2} & \dots & w_{1, n} \\ w_{2, 1} & 0 & \dots & w_{2, n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n, 1} & w_{n, 2} & \dots & 0 \end{bmatrix} \end{matrix}$

$\begin{matrix} (2) & X (G) = \begin{bmatrix} X_{1} & 0 & \dots & 0 \\ 0 & X_{2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & X_{n} \end{bmatrix} \end{matrix}$

where $X_{i} = \sum_{j = 1}^{n} w_{i, j} = \sum_{u \in N (v_{i})} w_{v_{i}, u}$ is the sum of the weights of the edges incident to vertex $v_{i}$ and $N (v_{i})$ is the set of $v_{i}$ ’s neighbors. The Laplacian energy of G, which is used for calculating the final centrality value, is defined as follows [6]: $\begin{matrix} (3) & E_{L} (G) = \sum_{i = 1}^{n} X_{i}^{2} + 2 \sum_{i < j} w_{i, j}^{2} \end{matrix}$

Finally, the Laplacian centrality $C_{L} (v_{i}, G)$ of vertex $v_{i}$ is defined as follows [6]: $\begin{matrix} (4) & C_{L} (v_{i}, G) = \frac{{(Δ E)}_{i}}{E_{L} (G)} = \frac{E_{L} (G) - E_{L} (G_{i})}{E_{L} (G)} \end{matrix}$ where $G_{i}$ is the graph obtained by deleting $v_{i}$ from G.

The complexity of computing the Laplacian centrality for a network G with n vertices and maximum degree Δ is at most $O (n \cdot Δ^{2})$ .
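The definitions of the Laplacian energy and centrality can be checked on a small example with the following Python sketch. It uses the squared vertex strengths $X_{i}^{2}$ in the energy, which is the usual form of the Laplacian energy (the sum of the squared eigenvalues of the graph Laplacian). The function names and the dense list-of-lists weight matrix are assumptions for illustration; the sketch recomputes the energy from scratch for each deleted vertex and therefore does not achieve the complexity bound stated above.

```python
def laplacian_energy(W):
    """E_L(G) = sum_i X_i^2 + 2 * sum_{i<j} w_ij^2 for a symmetric weight matrix W."""
    n = len(W)
    strengths = [sum(row) for row in W]  # X_i: total weight at vertex i
    pair_term = sum(W[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
    return sum(x * x for x in strengths) + 2 * pair_term


def remove_vertex(W, i):
    """Weight matrix of G_i, i.e. G with vertex i (row and column i) deleted."""
    return [
        [w for col, w in enumerate(row) if col != i]
        for r, row in enumerate(W)
        if r != i
    ]


def laplacian_centrality(W, i):
    """C_L(v_i, G) = (E_L(G) - E_L(G_i)) / E_L(G)."""
    total = laplacian_energy(W)
    if total == 0:
        raise ValueError("graph has no weighted edges")
    return (total - laplacian_energy(remove_vertex(W, i))) / total


# Path a - b - c with unit weights (vertex order: a, b, c).
path_W = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
```

For this path, the strengths are 1, 2, 1, so $E_{L} (G) = 6 + 4 = 10$. Deleting the middle vertex leaves no edges, which gives it centrality 1, while deleting an end vertex leaves a single edge with energy 4, which gives centrality 0.6.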

3. Related work

The diffusion of information in networks is similar to the distribution of epidemics but there are differences. For example, time, relationship strength, content, social metrics, and network structure are important factors that can have impacts on the diffusion of information [4,37].

The study of information exchanging processes is one of the most important aspects of Online Social Networks (OSN) analysis. There are three main research categories about information dissemination in OSNs [12]. The first category deals with detecting popular contents which have high probability of distribution among users. In the second category, diffusion routes are modeled by detecting the paths which are more likely to be used in the network (e.g. the tree-shaped paths which are known as cascade models) and finally in the third category, the influential nodes which have high distribution potential are identified. The more influential nodes can generate big cascades because of their properties. In this paper, we focus on the second and third categories, as they are directly influenced by the human cognitive limitation. The first category is more about the type of contents circulating in the network, so it should not be influenced by that limitation (at least not directly) [1].

Several approaches have been proposed for modeling information diffusion through explicit relationships in social networks. Those approaches are mostly based on the linear threshold model [15], the independent cascade model [13], and the general cascade model [29]. The idea of these models is that nodes may influence each other, so a node may have been influenced by other nodes and exhibit similar characteristics. This means that nodes with high influence (high centrality) can influence other nodes (with lower centrality) on the diffusion paths.

In the linear threshold model, a node is infected in a given time period if the sum of the weights of its edges from already infected neighbors is above a given threshold. The diffusion process stops when no new nodes are infected during a certain time period. In spite of the simplicity of this model, it has been widely used to model information diffusion in OSNs [35]. In this approach, either the weights of the edges along which diffusion is possible are all the same, or the weights are set by maximum likelihood estimation from observations of real diffusion cascades.

Granovetter et al. [15] proposed the linear threshold model which determines a threshold value for each node and a weight for each edge. When the accumulated value of all neighbors’ weights of a specific node is greater than the threshold value, it is linked to the diffusion path. The assumption in cascade based models is that the distribution of contents between users in the social network is based on explicit links [34,39].
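A minimal Python sketch of the linear threshold process described above is given below. The dictionary layout (`in_weights[v]` maps each in-neighbor of `v` to an edge weight), the function name and the toy values are assumptions for illustration, and the strict comparison follows the "greater than the threshold" wording used here.

```python
def linear_threshold(in_weights, thresholds, seeds, max_steps=100):
    """Simulate linear threshold diffusion and return the set of active nodes.

    in_weights -- dict: node v -> {in-neighbor u: weight of edge (u, v)}
    thresholds -- dict: node v -> activation threshold of v
    seeds      -- initially active nodes
    """
    active = set(seeds)
    for _ in range(max_steps):
        newly_active = {
            v
            for v, neighbors in in_weights.items()
            if v not in active
            and sum(w for u, w in neighbors.items() if u in active) > thresholds[v]
        }
        if not newly_active:  # no new infections in this period: stop
            break
        active |= newly_active
    return active


# Toy network: a -> b, (a, b) -> c, c -> d.
toy_in_weights = {
    "b": {"a": 0.6},
    "c": {"a": 0.3, "b": 0.3},
    "d": {"c": 0.2},
}
toy_thresholds = {"b": 0.5, "c": 0.5, "d": 0.5}
```

Starting from seed `a`, node `b` is activated first; `c` follows once both `a` and `b` are active, and `d` never reaches its threshold.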

The independent cascade model assumes that the selection of a node in the diffusion path is influenced not by all of its neighbors jointly but by a single neighbor [13,22]. This means that the next node is chosen using the independent relationship between two nodes. The model assigns a probability value to each edge, and the selection takes place based on that probability [29]. Furthermore, non-explicit relationships are not used in this model.
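The independent cascade process can be sketched in Python as follows. The adjacency layout (`out_probs[u]` maps each out-neighbor of `u` to an activation probability), the function name and the toy values are assumptions for illustration. Each newly activated node gets a single chance to activate each inactive neighbor, independently of all other neighbors.

```python
import random


def independent_cascade(out_probs, seeds, rng=None):
    """Simulate one independent cascade and return the set of active nodes.

    out_probs -- dict: node u -> {out-neighbor v: activation probability of (u, v)}
    seeds     -- initially active nodes
    rng       -- optional random.Random instance for reproducible runs
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in out_probs.get(u, {}).items():
                # one independent attempt per edge (u, v)
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


# Toy network with deterministic probabilities (1.0 always fires, 0.0 never does).
toy_out_probs = {"a": {"b": 1.0, "c": 0.0}, "b": {"d": 1.0}}
```

With the deterministic toy probabilities, a cascade seeded at `a` always reaches `b` and `d` and never reaches `c`.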

Lerman et al. [40] recently argued that the diffusion paths generated by the general cascade and linear threshold models are far from realistic, as they often differ considerably from real diffusion paths in OSNs. In the general cascade model, the independence condition of the independent cascade model is removed, so all neighbors can influence the selection of the next node; in this way, the characteristics of the linear threshold model and the independent cascade model are generalized [1].

4. Proposed method

In this section, we describe our proposed method for enhancing the prediction accuracy of diffusion paths in social networks using the integration of ant colony and densest subgraph algorithms.

In some previous works, community detection algorithms like Louvain were used to improve information diffusion in the network [42]. One of the issues of this method is that it continuously tries to optimize and merge the clusters. In other words, the number of clusters is reduced as the optimization continues. Moreover, the cluster heads are not predefined and there is no control over them. The benefit of detecting cluster heads is that they can be used for optimizing information diffusion. Thus, we use the densest subgraph algorithm to detect the initial cluster heads and then cluster the nodes based on them.

4.1. Creating primary clusters based on densest subgraph algorithm

In this stage, the graph nodes are divided into several clusters by using the densest subgraph algorithm. Assume $G = (V, E)$ is an undirected graph. For a cluster $S \subseteq V$ , we denote the set of edges of the subgraph induced by S as $E (S)$ . The density of cluster S is defined as $ρ (S) = \frac{| E (S) |}{| S |}$ where $| S |$ is the size of cluster S. The degree of a node $i \in S$ is denoted $\deg_{S} (i)$ and is defined as $\deg_{S} (i) = | \{ j \mid (i, j) \in E (S) \} |$ . The maximum density of graph G is defined as $ρ^{*} (G) = \max_{S \subseteq V} ρ (S)$ [35]. In weighted graphs, the density is defined as $ρ (S) = \frac{\sum_{e \in E (S)} W_{e}}{| S |}$ where $W_{e}$ is the weight of edge $e \in E (S)$ . For an undirected graph G, the maximum density over subgraphs with at least k vertices is defined as $ρ_{⩾ k}^{*} (G) = \max_{S \subseteq V, | S | ⩾ k} ρ (S)$ where $k > 0$ is a fixed parameter [41].

Finding the densest subgraph under a minimum-size constraint is NP-hard, and several approximation algorithms have been proposed for directed and undirected graphs [41]. We use one of these algorithms, which finds the densest subgraph subject to a minimum subgraph size, $ρ_{⩾ k}^{*} (G)$ . Algorithm 1 shows the modified version of the densest subgraph algorithm in [41].

Algorithm 1:

Pseudo code of the modified densest subgraph algorithm

The subgraph obtained by the modified algorithm is used as the set of primary centers. In this setting, the centers of clusters should be as far from each other as possible in terms of similarity. Therefore, the main goal of this stage is to find a subset $\tilde{S} \subseteq V$ with at least k vertices whose density is minimum. To this end, the algorithm proposed in [41] is changed so that it outputs a subgraph with the lowest density; Algorithm 1 shows the modified pseudo code. In each pass, the vertices whose degrees exceed the threshold θ are detected as candidate vertices that can be removed from the input graph. The candidate vertices are denoted as $\tilde{A} (S)$ and the threshold is defined as $θ = 2 ρ (S)$ . Then, some vertices of the candidate set, denoted as $A (S)$ , are removed. If the remaining subgraph is not empty, the algorithm is applied again to it. Finally, the set of vertices is reduced to $S ∖ A (S)$ , and the algorithm guarantees that the obtained subgraph has at least the minimum number of vertices, k.
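One possible reading of the procedure above is sketched below in Python for an unweighted graph. It is not Algorithm 1 itself: the function name, the adjacency-set representation, and in particular the rule for choosing $A (S)$ from the candidates (highest degree first, capped so that at least k vertices remain) are assumptions made for illustration.

```python
def peel_high_degree(adj, k):
    """Repeatedly remove vertices whose degree exceeds theta = 2 * rho(S).

    adj -- dict: node -> set of neighbors (undirected, unweighted graph)
    k   -- minimum number of vertices that must remain
    Returns the remaining vertex set, used here as candidate primary centers.
    """
    if k < 1:
        raise ValueError("k must be positive")
    S = set(adj)
    while len(S) > k:
        deg = {i: len(adj[i] & S) for i in S}        # deg_S(i)
        rho = sum(deg.values()) / 2 / len(S)          # rho(S) = |E(S)| / |S|
        theta = 2 * rho
        candidates = [i for i in S if deg[i] > theta]  # candidate set A~(S)
        if not candidates:
            break
        # Assumed selection rule: drop the highest-degree candidates first,
        # but never so many that fewer than k vertices remain.
        candidates.sort(key=lambda i: (-deg[i], str(i)))
        removed = candidates[: len(S) - k]             # A(S)
        S -= set(removed)
    return S


# Star graph: the hub is adjacent to four leaves.
star = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"},
}
```

On the star graph, the hub is the only vertex whose degree exceeds $2 ρ (S)$ , so a single pass removes it and leaves the four mutually non-adjacent leaves, which matches the goal of obtaining a low-density set of well-separated centers.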

4.2. Updating primary clusters

The primary centers obtained from the previous stage might not be the best results. Therefore, in the second phase of the clustering algorithm, a new technique is used to find more accurate centers. To this end, in the first step, each node of the graph is assigned to its nearest center to form the primary clusters. Then, to find the new possible center, the sum of similarity values between a candidate center in the cluster, $v_{i} \in c_{j}$ ( $c_{j}$ is the set of nodes in cluster j), and the other nodes is calculated by $sum (v_{i}) = \sum_{v_{t} \in c_{j}, v_{t} \neq v_{i}} sim (v_{i}, v_{t})$ . This value is computed for all members of the cluster (the similarities between nodes are predefined in the dataset). Finally, the node with the highest value is chosen as the new center of the cluster: $\begin{matrix} (5) & newcenter = \arg \max_{v_{i} \in c_{j}} sum (v_{i}) \end{matrix}$

After determining new centers of all clusters, other vertices will join the nearest clusters. This process is continued repeatedly until centers of the clusters remain unchanged. Also, vertices that have not been chosen as centers will join the cluster of the nearest center.
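The assign-and-update loop described above resembles a medoid-style refinement and can be sketched in Python as follows. The function names are hypothetical, `sim` is any user-supplied similarity function, and "nearest center" is interpreted as the center with the highest similarity, which is an assumption.

```python
def assign_to_centers(nodes, centers, sim):
    """Attach every non-center node to the center it is most similar to."""
    clusters = {c: [c] for c in centers}
    for v in nodes:
        if v in clusters:
            continue
        nearest = max(centers, key=lambda c: sim(v, c))
        clusters[nearest].append(v)
    return clusters


def update_center(members, sim):
    """Equation (5): the member with the largest summed similarity to the others."""
    return max(
        members,
        key=lambda vi: sum(sim(vi, vt) for vt in members if vt != vi),
    )


def refine_clusters(nodes, centers, sim, max_iter=100):
    """Alternate assignment and center update until the centers stop changing."""
    centers = list(centers)
    for _ in range(max_iter):
        clusters = assign_to_centers(nodes, centers, sim)
        new_centers = [update_center(members, sim) for members in clusters.values()]
        if set(new_centers) == set(centers):
            return clusters
        centers = new_centers
    return assign_to_centers(nodes, centers, sim)


# Toy data: nodes placed on a line; similarity is the negative distance.
positions = {"a": 0, "b": 1, "c": 2, "x": 10, "y": 11, "z": 12}


def line_sim(u, v):
    return -abs(positions[u] - positions[v])
```

Starting from the end points `a` and `z` as primary centers, the first update moves the centers to the middle nodes `b` and `y`, and the second pass leaves them unchanged, so the loop stops.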

Fig. 1.

Different stages of densest subgraph algorithm; (a) the input graph, (b) the output of the modified densest subgraph algorithm; (c) assigning vertices to clusters (d) updating centers of the clusters and (e) merging clusters.

At the end of the second stage, some clusters may not have enough members to be used in the diffusion path prediction stage. A cluster with few members not only yields less accurate results but also reduces the coverage rate of the predicted diffusion paths. To overcome this issue, in the third stage of the algorithm, each cluster with fewer members than a threshold is merged into its nearest cluster and the original small cluster is removed.

Figure 1 displays an example of the densest subgraph algorithm. In part (a), nodes of a system are modeled as a graph. The edges’ weights in the graph show the similarity between vertices. Part (b) shows the output of the modified algorithm, which finds four vertices as the primary centers. Part (c) depicts the primary clusters of the graph, which are formed based on the primary centers. In the next stage, the primary centers are replaced with the new centers using equation (5). The new clusters and their members are displayed in part (d). As can be seen, $C_{2}$ and $C_{4}$ have one and two members respectively. Finally, in the last stage, $C_{2}$ and $C_{4}$ are merged into $C_{1}$ and $C_{3}$ , which is displayed in part (e).
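The merging step of the third stage can be sketched in Python as follows. The function name and the `min_size` parameter are hypothetical, and measuring the "nearest cluster" by the similarity between cluster centers is an assumption made for illustration.

```python
def merge_small_clusters(clusters, sim, min_size):
    """Merge every cluster with fewer than `min_size` members into the
    remaining cluster whose center is most similar to its own center.

    clusters -- dict: center -> list of member nodes (including the center)
    sim      -- similarity function sim(u, v)
    """
    large = {c: list(m) for c, m in clusters.items() if len(m) >= min_size}
    if not large:  # nothing to merge into: leave the clustering unchanged
        return {c: list(m) for c, m in clusters.items()}
    for center, members in clusters.items():
        if len(members) < min_size:
            target = max(large, key=lambda other: sim(center, other))
            large[target].extend(members)
    return large


# Toy clustering on a line; "d" forms a singleton cluster next to {a, b, c}.
coords = {"a": 0, "b": 1, "c": 2, "d": 3, "x": 10, "y": 11, "z": 12}
toy_clusters = {"b": ["b", "a", "c"], "y": ["y", "x", "z"], "d": ["d"]}


def coord_sim(u, v):
    return -abs(coords[u] - coords[v])
```

With `min_size=2`, the singleton cluster of `d` is absorbed by the cluster centered at `b`, and the two larger clusters keep their centers.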

4.3. Initialization of nodes pheromone based on centrality criterion

The information gathered by ants during the search process is put on edges as pheromone, so $τ (i, j)$ denotes the pheromone between groups i and j. Meanwhile, an exploratory value is put on each edge as a representation of primary information. For example, $η (i, j)$ denotes the value of the edge between two groups i and j. In this paper, we use the sum of similarities of the two nodes as the value of primary pheromone on their edge. $\begin{matrix} (6) & η (i, j) = sum (i) + sum (j) \end{matrix}$

4.4. Creating paths based on ant colony algorithm

In this step, the optimal dissemination paths are determined according to the ant colony algorithm. To achieve this goal, each node of the social network is considered as a node in the ant colony algorithm. In general, the ant colony optimization algorithm proceeds as follows. First, some ants are placed randomly on some of the nodes of the graph. Then, each ant builds its own solution by applying the state transition rule repeatedly. Ants prefer edges that carry more pheromone and have a higher exploratory value (for example, shorter edges). When all ants have finished their transitions and their solutions have been obtained, a general pheromone update rule is applied. In this rule, some pheromone evaporates on all edges and each ant deposits pheromone on the edges of its solution, so the edges of better solutions receive more pheromone. This process continues until the predefined stop condition is met. The transition rule combines the exploratory information and the pheromone. When an ant is in node r, it chooses node s for the transition by using the rule in equation (7) [42]. $\begin{matrix} (7) & s = \arg \max_{u \in J_{r}^{k}} [τ {(r, u)}^{α} \cdot η {(r, u)}^{β}], & \text{if } q ⩽ q_{0} \end{matrix}$ where $J_{r}^{k}$ is the set of nodes not yet visited by ant k when it is in node r. Also, α and β are parameters that set the importance of the pheromone, $τ (r, u)$ , and the exploratory information, $η (r, u)$ . When $β = 0$ , the exploratory information is not used, and when $α = 0$ , the pheromone on the edges is not used. q is a random number between zero and one, and $q_{0}$ is a parameter with $0 ⩽ q_{0} ⩽ 1$ .

In the probability mode, $q > q_{0}$ , the next node s is chosen with probability $P_{k} (r, s)$ using equation (8) [42]. $\begin{matrix} (8) & P_{k} (r, s) = \begin{cases} \frac{τ {(r, s)}^{α} \cdot η {(r, s)}^{β}}{\sum_{u \in J_{r}^{k}} τ {(r, u)}^{α} \cdot η {(r, u)}^{β}}, & \text{if } s \in J_{r}^{k} \\ 0, & \text{otherwise} \end{cases} \end{matrix}$

Equation (9) represents an example of the pheromone update rule [42]. $\begin{matrix} (9) & τ (r, s) = (1 - ρ) τ (r, s) + \sum_{s \in S_{upd}} g (s) \end{matrix}$ Here $ρ \in [0, 1]$ is the pheromone evaporation parameter and $g (s)$ is a function that determines the solution quality, known as the evaluation function ( $S_{upd}$ is the set of all nodes in the network). The first stage of ant colony optimization deals with computing the probability of choosing the next node for continuing the path of each ant. In other words, the next node is determined via the transition rule, using a function that reflects the node’s suitability and its similarity with all previous nodes. Thus, in each step, the more suitable nodes which have less redundancy with the previous nodes are more likely to be chosen.

The next node $F_{j}$ is chosen by ant k from cluster i using either the greedy or the probability rule. In the greedy rule, the next node is obtained through equation (10) [42]: $\begin{matrix} (10) & F_{j} = \arg \max_{F_{u} \in U F_{i}^{k}} \{ {[τ_{u}]}^{α} {[η (F_{u}, V F_{k})]}^{β} \}, & \text{if } q ⩽ q_{0} \end{matrix}$ where $U F_{i}^{k}$ is the set of nodes in cluster i not yet chosen by ant k, $τ_{u}$ is the pheromone of node $F_{u}$ , $V F_{k}$ is the set of nodes chosen so far by ant k, and $η (F_{u}, V F_{k})$ is an exploratory function which shows the suitability of node $F_{u}$ . As mentioned before, α and β are two parameters that control the importance of the pheromone and the exploratory function, $q_{0}$ is a predefined parameter and q is a random number between 0 and 1.

In probability mode, node $F_{j}$ is chosen with the probability $P_{k} (F_{j}, V F_{k})$ obtained via equation (11) [42]. $\begin{matrix} (11) & P_{k} (F_{j}, V F_{k}) = \begin{cases} \frac{{[τ_{j}]}^{α} {[η (F_{j}, V F_{k})]}^{β}}{\sum_{F_{u} \in U F_{i}^{k}} {[τ_{u}]}^{α} {[η (F_{u}, V F_{k})]}^{β}}, & \text{if } F_{j} \in U F_{i}^{k} \text{ and } q > q_{0} \\ 0, & \text{otherwise} \end{cases} \end{matrix}$

The state transition rule depends on the two parameters q and $q_{0}$ , which create a balance between exploiting the available information and exploring new candidate paths. If $q ⩽ q_{0}$ , the ant chooses the next node greedily based on the available information; otherwise, the next node is drawn from the non-chosen nodes according to the probabilities of equation (11). It should be noted that exploring new options in this way prevents the algorithm from being stuck in a local optimum.
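The two-mode transition rule can be sketched in Python as follows, here in the edge-based form of equations (7) and (8). The function name, the dictionaries keyed by `(r, u)` pairs and the default parameter values are assumptions for illustration and are not taken from the experiments.

```python
import random


def choose_next_node(r, unvisited, tau, eta, alpha=1.0, beta=1.0, q0=0.9, rng=None):
    """Pick the next node for an ant standing on node r.

    unvisited -- candidate nodes J_r^k
    tau, eta  -- dicts mapping (r, u) to pheromone and exploratory value
    q0        -- with probability q0 act greedily, otherwise sample
    """
    rng = rng or random.Random()
    nodes = list(unvisited)
    if not nodes:
        raise ValueError("no unvisited nodes left")
    scores = [(tau[(r, u)] ** alpha) * (eta[(r, u)] ** beta) for u in nodes]
    if rng.random() <= q0:
        # Greedy mode (q <= q0): take the best-scoring candidate.
        return nodes[scores.index(max(scores))]
    # Probability mode (q > q0): sample proportionally to the scores.
    # random.Random.choices raises ValueError if all scores are zero.
    return rng.choices(nodes, weights=scores, k=1)[0]


toy_tau = {("r", "a"): 1.0, ("r", "b"): 2.0}
toy_eta = {("r", "a"): 1.0, ("r", "b"): 1.0}
```

Setting `q0=1.0` makes every step greedy, while `q0=0.0` makes (almost surely) every step a proportional draw; intermediate values trade off exploitation against exploration.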

4.5. Calculating the value of information diffusion in created paths

In our approach, a special exploratory function is proposed for computing the next suitable node. This function involves the node’s suitability and its redundancy with the previous nodes. Therefore, unrelated nodes with redundancy will have less chance to be chosen. $\begin{matrix} (12) & η (F_{i}, V F_{k}) = [F S (F_{i}) + \frac{1}{| V F_{k} |} \sum_{F_{x} \in V F_{k}} sim (F_{i}, F_{x})] \end{matrix}$ where $F S (F_{i})$ is the centrality of node $F_{i}$ , $sim (F_{i}, F_{x})$ is the similarity between nodes $F_{i}$ and $F_{x}$ , and $| V F_{k} |$ is the number of nodes chosen so far by ant k. The first term of the exploratory function captures the node’s suitability, and the second term captures its average similarity to the previously chosen nodes. The combination of these two parts is intended to select the nodes most relevant to the target node with minimum redundancy.
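Equation (12) translates directly into a few lines of Python. In the sketch below, the function name and the argument layout are hypothetical, and returning the centrality alone when no node has been chosen yet is an assumption, since equation (12) is undefined for an empty $V F_{k}$ .

```python
def exploratory_value(candidate, chosen, centrality, sim):
    """eta(F_i, VF_k) = FS(F_i) + mean similarity of F_i to the chosen nodes.

    candidate  -- node F_i being evaluated
    chosen     -- nodes already selected by the ant (VF_k)
    centrality -- dict: node -> centrality value FS
    sim        -- similarity function sim(u, v)
    """
    if not chosen:
        # Assumption: with no previous nodes, only the centrality term is used.
        return centrality[candidate]
    mean_similarity = sum(sim(candidate, x) for x in chosen) / len(chosen)
    return centrality[candidate] + mean_similarity


toy_centrality = {"a": 0.5, "b": 0.1, "c": 0.3}
toy_similarity = {("a", "b"): 0.2, ("a", "c"): 0.4}


def pair_sim(u, v):
    return toy_similarity.get((u, v), toy_similarity.get((v, u), 0.0))
```

For node `a` with chosen nodes `b` and `c`, the value is $0.5 + (0.2 + 0.4) / 2 = 0.8$.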

4.6. Updating pheromone

At the end of each iteration of the ant colony algorithm when all ants reach the end of their path, the pheromone on each node is updated. The pheromone updating rule plays an important role in the algorithm optimization. According to this rule, more pheromone is allocated to the nodes involved in a better solution. So, they will have more chances to be chosen in the next iterations. This rule is applied to update the pheromone of each node after each iteration [42]. $\begin{matrix} (13) & τ_{i} (t + 1) = (1 - ρ) τ_{i} (t) + \sum_{k = 1}^{A} Δ_{i}^{k} (t) \end{matrix}$ where $τ_{i} (t)$ and $τ_{i} (t + 1)$ are pheromone on node $F_{i}$ in the iterations of t and $t + 1$ . ρ is the pheromone evaporation parameter, A is the number of ants and $Δ_{i}^{k}$ is the pheromone which is poured on node $F_{i}$ by ant k. $Δ_{i}^{k}$ is calculated via equation (14): $\begin{matrix} (14) & Δ_{i}^{k} (t) = \{\begin{matrix} J (F S^{k} (t)), & if F_{i} \in F S^{k} (t) \\ 0, & otherwise \end{matrix} \end{matrix}$ where $F S^{k} (t)$ is a set of chosen nodes by kth ant in iteration t. Also, $J (F S^{k} (t))$ is the evaluation function for calculating the suitable subset of $F S^{k} (t)$ which is calculated based on the distance between the source and destination nodes. The shorter the distance, the more suitable the information diffusion.
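A minimal Python sketch of the node-based update in equations (13) and (14) is given below. The function name and the default evaporation rate are assumptions, and the quality function used in the toy example (the reciprocal of the path length) is only a hypothetical stand-in for the evaluation function J.

```python
def update_pheromone(tau, ant_solutions, quality, rho=0.2):
    """Apply tau_i(t+1) = (1 - rho) * tau_i(t) + sum_k Delta_i^k(t).

    tau           -- dict: node -> current pheromone
    ant_solutions -- list with one collection of chosen nodes FS^k(t) per ant
    quality       -- evaluation function J applied to an ant's solution
    rho           -- evaporation parameter in [0, 1]
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    new_tau = {node: (1.0 - rho) * value for node, value in tau.items()}
    for solution in ant_solutions:
        deposit = quality(solution)      # J(FS^k(t))
        for node in set(solution):       # only nodes in the ant's solution
            new_tau[node] += deposit
    return new_tau


def inverse_length(solution):
    """Hypothetical quality: shorter paths receive a larger deposit."""
    return 1.0 / len(solution)


toy_tau_nodes = {"a": 1.0, "b": 1.0, "c": 1.0}
toy_solutions = [["a", "b"], ["b"]]
```

With `rho=0.5`, every node first keeps half of its pheromone; node `b` then receives deposits from both ants, node `a` from one, and node `c` from none.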

5. Evaluation of the proposed method

The performance evaluation consists of two experiments, a classification-based experiment and an information diffusion experiment, on four real datasets: the Karate, Dolphin, political books and American college football networks.

5.1. Datasets

We used four datasets for our experiments: the Karate club [44], dolphin [26], political books [21] and American college football [11] networks. Table 1 shows the details of those datasets.

Table 1
Details of the used datasets

Network	No. of nodes	No. of edges	Network description
Karate	34	78	Zachary’s karate club
Dolphin	62	159	Dolphin social network
Politic Books	105	441	Books on US politics
Football	115	6594	American college football

5.2. Classification-based experiment

In the classification-based experiment, three metrics, Precision, Recall and F1 [11], are used to measure the quality of the prediction results. Precision is a very important criterion in evaluating the prediction of information diffusion, because incorrect predictions can provide wrong contents for the dissemination. The Recall metric is also important in performance evaluation because it estimates the coverage of the nodes involved in diffusion. F1 is another suitable metric, which is the harmonic mean of Precision and Recall [11].
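For concreteness, the three metrics can be computed from a set of predicted nodes and a set of actually infected nodes as in the following Python sketch. The function name and the set-based interface are assumptions for illustration; the node sets in the example are made up.

```python
def precision_recall_f1(predicted, actual):
    """Return (precision, recall, F1) for predicted vs. actual node sets."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Four nodes are predicted, two of them are among the three truly infected nodes.
example = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})
```

In the example, precision is 2/4, recall is 2/3, and F1 is their harmonic mean, 4/7.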

In addition to the above evaluation metrics, the AUC criterion [41] is also used to evaluate the performance. AUC is the area under the ROC (Receiver Operating Characteristic) curve. ROC curves are used to study the efficiency of classifiers: they are 2D curves in which the True Positive Rate (TPR) is drawn on the Y axis and the False Positive Rate (FPR) is drawn on the X axis. A larger AUC value means a better classification outcome. Tables 2 and 3 provide the evaluation results of all approaches on the four datasets.
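The AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, which equals the area under the ROC curve. The following Python sketch computes it by pairwise comparison; the function name and the toy scores are assumptions for illustration, and the quadratic-time comparison is chosen for clarity rather than efficiency.

```python
def auc_score(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores -- predicted scores, higher means "more likely positive"
    labels -- ground-truth labels, 1 for positive and 0 for negative
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
```

In the toy example, three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.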

Table 2
Evaluation results based on F1 metric

	Bayesian method [41]	Community detection [42]	Proposed method
Karate	68.13	69.75	70.23
Dolphin	64.78	66.43	69.86
Politic Books	59.14	62.91	67.75
Football	65.98	69.09	71.23

Table 3

Evaluation results based on AUC

	Bayesian method [41]	Community detection [42]	Proposed method
Karate	75.87	76.98	78.09
Dolphin	73.19	75.05	79.98
Politic Books	70.67	73.87	75.16
Football	74.78	76.64	79.82

The results in Tables 2 and 3 show that the community detection approach scores higher than the Bayesian approach but lower than our proposed approach on both F1 and AUC. This is because in the proposed method the number of created clusters and their centers are controllable, which leads to a larger improvement in diffusion prediction.

5.3. Diffusion-based experiment

Since classification-based experiments alone are not enough to evaluate diffusion path prediction, we use the Independent Cascade Model (ICM) [11] to observe and compare the performance of all compared approaches. Tables 4, 5 and 6 show the experimental results on the four datasets based on the Precision, Recall and F1 metrics.
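For readers unfamiliar with the ICM, a single simulation run can be sketched as follows. The sketch assumes one uniform activation probability for every edge, which is a simplification (the probabilities used in the experiments are not stated in this section); the toy graph, the seed set and the function name are hypothetical.

```python
import random


def independent_cascade(graph, seeds, prob, rng):
    """Run one Independent Cascade simulation and return the activated nodes.

    graph: dict mapping each node to a list of its out-neighbours.
    prob:  activation probability, assumed identical for every edge.
    rng:   a random.Random instance, so that runs are reproducible.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            # Each newly activated node gets a single chance to activate
            # each of its still-inactive neighbours.
            for v in graph.get(u, []):
                if v not in active and rng.random() < prob:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


# Hypothetical toy network: node 4 points to the seed but is never reached.
toy_graph = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [0]}
activated = independent_cascade(toy_graph, seeds=[0], prob=0.5, rng=random.Random(42))
print(sorted(activated))
```

Because a single run is random, a predicted path would normally be compared against the outcome averaged over many runs.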

Table 4
The evaluation results based on precision criterion

	Bayesian method [13]	Community detection [42]	Proposed method
Karate	88.12	90.81	91.92
Dolphin	85.67	86.78	90.81
Politic Books	82.19	83.13	88.12
Football	87.91	88.09	92.18

Table 5

The evaluation results based on recall criterion

	Bayesian method [13]	Community detection [42]	Proposed method
Karate	84.56	88.19	91.01
Dolphin	83.12	87.09	88.19
Politic Books	84.91	88.94	91.12
Football	88.71	90.19	93.82

Table 6

The evaluation results based on F1 criterion

	Bayesian method [13]	Community detection [42]	Proposed method
Karate	86.3	89.48	91.46
Dolphin	84.38	86.93	89.48
Politic Books	83.53	85.94	89.59
Football	88.31	89.13	92.99
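As a consistency check, the F1 values in Table 6 can be reproduced as the harmonic mean of the corresponding Precision and Recall entries in Tables 4 and 5. For the proposed method on the Karate dataset, for example:

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R}
    = \frac{2 \times 91.92 \times 91.01}{91.92 + 91.01}
    \approx 91.46
```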

As the results show, the proposed method achieves better results than the other methods on all three metrics. This is because the local and global information of nodes is used at the same time. Therefore, the main diffusion paths can be detected accurately and then the final paths are determined for all nodes.

6. Conclusion

As social networks play an important role in distributing information at large scale, many efforts have been made to predict and model information diffusion paths. With the growth of social networks, studying and analyzing their structures and behaviors has become one of the main requirements of the market. The results of such analyses can be used in many applications, such as social network management, market analysis, the study of user and fan behavior, and performance improvement of recommender systems.

Many approaches have been proposed to improve the accuracy of diffusion path prediction. A common problem with previous approaches is their high computational complexity on high-dimensional networks. One way to address this is to use evolutionary algorithms, so in this work the ant colony and densest subgraph algorithms were combined to improve the prediction accuracy of diffusion paths. The approach first creates a subset of maximum independent users and finds the centers of the clusters. Then, the ant colony algorithm is used to predict the final paths of information diffusion in the network. The evaluation results show that the proposed approach achieves higher performance than the representative approaches it was compared with.

References

1. Arnaboldi, Conti, Passarella and Dunbar, Online social networks and information diffusion: The role of ego networks, Online Social Networks and Media (2017), 44–55.

2. Bahmani, Kumar and Vassilvitskii, Densest subgraph in streaming and MapReduce, in: Proc. VLDB Endow, 2012, pp. 454–465.

3. Bakshy, Rosenn, Marlow and Adamic, The role of social networks in information diffusion, in: Proceedings of the 21st International Conference on World Wide Web, ACM, 2012, pp. 519–528. doi:10.1145/2187836.2187907.

4. Bandyopadhyay, Integration of dense subgraph finding with feature clustering for unsupervised feature selection, Pattern Recognition Letters 40 (2014), 104–112.

5. C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. ISBN 978-0-387-31073-2.

6. Chao, Yuanping, Chengyuan, Zhihong and Jianfeng, Stability analysis of information spreading on SNS based on refined SEIR model, China Communications 11 (2014), 24–33.

7. D’Angelo, Palmieri and Rampone, Detecting unfair recommendations in trust-based pervasive environments, Information Sciences 486 (2019), 31–51. doi:10.1016/j.ins.2019.02.015.

8. De Meo, Ferrara, Fiumara and Provetti, Generalized Louvain method for community detection in large networks, CoRR (2011). abs/1108.1502.

9. Dorigo, Maniezzo and Colorni, Ant system: Optimization by a colony of cooperating agents, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 26(1) (1996), 29–41.

10. L.M. Gambardella and Dorigo, Solving symmetric and asymmetric TSPs by ant colonies, in: Proceedings of IEEE International Conference on Evolutionary Computation, 1996, pp. 622–627.

11. Girvan and M.E.J. Newman, Community structure in social and biological networks, in: Proc Natl Acad Sci, 2001, pp. 7821–7826.

12. A.V. Goldberg, Finding a Maximum Density Subgraph, University of California at Berkeley, 1984.

13. Goldenberg, Talk of the network: a complex systems look at the underlying process of word-of-mouth, Marketing Letters (2001), 211–223.

14. Granovetter, The strength of weak ties, The American J. of Sociology 78(6) (1973), 1360–1380. doi:10.1086/225469.

15. Granovetter, Threshold models of collective behavior, American Journal of Sociology 42 (1978), 1420–1443. doi:10.1086/226707.

16. Guang, Assi and Benslimane, Modeling and analysis of predictable random backoff in selfish environments, in: Proc. ACM MSWiM, 2006.

17. Gui, Sun, Han and Brova, Modeling topic diffusion in multi-relational bibliographic information networks, in: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management – CIKM 14, ACM Press, New York, New York, USA, 2014, pp. 649–658.

18. Hosseini-Pozveh, Zamanifar and Reza Naghsh-Nilchi, Assessing information diffusion models for influence maximization in signed social networks, in: Expert Systems with Applications, 2019.

19. Hu, R.J. Song and Chen, Modeling for information diffusion in online social networks via hydrodynamics, IEEE Access 5 (2017), 128–135. doi:10.1109/ACCESS.2016.2605009.

20. Jakomin, Curk and Bosnic, Generating inter-dependent data streams for recommender systems, Simulation Modelling Practice and Theory 88 (2018), 1–16.

21. Krebs, A dataset for books of U.S.A. politics, http://www.orgnet.com/.

22. Y.-S. Kwon, S.-W. Kim, Park, S.-H. Lim and J.B. Lee, The information diffusion model in the blog world, in: Proceedings of the 3rd Workshop on Social Network Mining and Analysis, 2009.

23. Li, Ma, Guo and Mei, Deepcas: An end-to-end predictor of information cascades, in: Proceedings of the 26th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2017, pp. 577–586. doi:10.1145/3038912.3052643.

24. C.-T. Li, Y.-J. Lin and M.-Y. Yeh, Forecasting participants of information diffusion on social networks with its applications, Information Sciences 11 (2018), 432–446. doi:10.1016/j.ins.2017.09.034.

25. Li, Wang, Gao and Zhang, A survey on Information Diffusion in Online Social Networks: Models and Methods, Information, Switzerland, 2017.

26. Lusseau and Schneider, The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations, Behavioral Ecology and Sociobiology (2003), 396–405. doi:10.1007/s00265-003-0651-y.

27. Moradi and Rostami, A graph theoretic approach for unsupervised feature selection, Eng. Appl. Artif. Intell. (2015).

28. Moradi and Rostami, Integration of graph clustering with ant colony optimization for feature selection, Knowledge-Based Systems 84 (2015), 144–161. doi:10.1016/j.knosys.2015.04.007.

29. Moreno, J.B. Gómez and A.F. Pachec, Epidemic incidence in correlated complex networks, Marketing Letters (2003).

30. S.J. Park, Soo Lim and Woo Park, Comparing Twitter and YouTube networks in information diffusion: The case of the “occupy wall street” movement, Technological Forecasting and Social Change 95(2) (2015), 208–217. doi:10.1016/j.techfore.2015.02.003.

31. et al., Laplacian centrality: A new centrality measure for weighted networks, Information Sciences (2012), 240–253. doi:10.1016/j.ins.2011.12.027.

32. Rose, Social networks, information diffusion, network centrality, mixed integer programming, Economics Letters (2019), 67–70.

33. S.J. Russell and Norvig, Artificial Intelligence: A Modern Approach, 3rd edn, 2010, Prentice Hall. ISBN 9780136042594.

34. Saito and Nakano, Prediction of Information Diffusion Probabilities for Independent Cascade Model, Springer, Berlin Heidelberg, 2008, pp. 68–75.

35. Sudbury, The proportion of the population never hearing a rumour, Journal of Applied Probability (1985), 443–446. doi:10.2307/3213787.

36. Tabakhi, Moradi and Akhlaghian, An unsupervised feature selection algorithm based on ant colony optimization, Engineering Applications of Artificial Intelligence 32 (2014), 112–123. doi:10.1016/j.engappai.2014.03.007.

37. Wguille, Hacid, Favre and Zighed, Information diffusion in online social networks: A survey, ACM SIGMOD Record 42 (2013), 17–28. doi:10.1145/2503792.2503797.

38. Yan, Zhaia and Fan, C-index: A weighted network node centrality measure for collaboration competence, Journal of Informetrics 7(1) (2013), 223–239. doi:10.1016/j.joi.2012.11.004.

39. Yang and Leskovec, Modeling information diffusion in implicit networks, in: Proceedings of the 2010 IEEE International Conference on Data Mining, 2010, pp. 599–608. doi:10.1109/ICDM.2010.22.

40. Yang and Leskovec, Information is not a virus, and other consequences of human cognitive limits, Future Internet (2016).

41. Yang and Leskovec, Predicting information diffusion probabilities in social networks: A Bayesian networks based approach, Knowledge-Based Systems (2017), 66–76.

42. Yazdi Kasra, Yazdi Adel, Khodayi, Hou, Zhou and Saedy, Integrating ant colony algorithm and node centrality to improve prediction of information diffusion in social networks, in: 11th International Conference and Satellite Workshops, SpaCCS 2018, Melbourne, NSW, Australia, December 11–13, 2018, Proceedings, 2019, pp. 381–391.

43. Yi, Zhang and Gan, The effect of social tie on information diffusion in complex networks, in: Physica A: Statistical Mechanics and Its Applications, 2018, pp. 783–794.

44. Zachary, An information flow model for conflict and fission in small groups, Journal of anthropological research (1976).
