Abstract
One of the most important challenges of social networks is to predict information diffusion paths. Studying and modeling the propagation routes is important in optimizing social network-based platforms. In this paper, a new method is proposed to increase the prediction accuracy of diffusion paths using the integration of the ant colony and densest subgraph algorithms. The proposed method consists of 3 steps; clustering nodes, creating propagation paths based on ant colony algorithm and predicting information diffusion on the created paths. The densest subgraph algorithm creates a subset of maximum independent nodes as clusters from the input graph. It also determines the centers of clusters. When clusters are identified, the final information diffusion paths are predicted using the ant colony algorithm in the network. After the implementation of the proposed method, 4 real social network datasets were used to evaluate the performance. The evaluation results of all methods showed a better outcome for our method.
Keywords
Introduction
Social networks are new approaches of presenting connections and interactions among people by using the latest web technologies. These cross-border networks have attracted many users and their efficiency has been improving over time [27]. They are known as a suitable place for producing and sharing important materials and topics. A social network is a social structure composed of various individual or organizational groups. In other words, it is a mapping of all related edges among studying vertices and can be used to identify the social position of users. These concepts are often analyzed using the theory of graphs which plays an important role in investigating and analyzing social networks [20,30].
One of the recent research interests in the field of information diffusion is how information is distributed and what the diffusion patterns are [18,24]. The challenge is to find an efficient way to predict the paths of diffusion based on real data which has many applications in various areas such as ecommerce, virus resources detection, posting in blogs, gossip news, etc. [43]. Many approaches have been proposed for predicting information diffusion paths so far. For example, in [16,32] the popularity of news is used to find the future influenced nodes so that the nodes which influenced in the past are considered to find the diffusion pattern based on a series of parameters. Then nodes that will be influenced in the future are predicted as a function of time. The common issue with previous works is that a large number of nodes in high-dimension networks bring the necessity of dividing nodes into clusters. For instance, these researches [8,42] used the Louvain community detection algorithm to group nodes in clusters and then predict the diffusion paths. One of the issues of the Louvain algorithm is that there is no control over the centers of clusters and their numbers. The centers of clusters are important and can be used in information propagation. To improve the algorithm of predicting diffusion paths, we use the densest subgraph algorithm [2] instead of Louvain to identify the centers of clusters. The modified densest subgraph algorithm is used to find the centers of clusters. This algorithm creates a subset of nodes with maximum independency and determines the primary centers of clusters. So it is not necessary to predefine the number of input clusters. After creating the densest subgraphs, the final paths of information diffusion in the network are predicted through the ant colony algorithm on the created clusters.
In the remainder of this paper, Section 2 introduces the preliminary knowledge of the techniques and technologies used in this paper. Section 3 reviews the previous works and findings. The proposed method is described in Section 4. Section 5 discusses the performance evaluation of our approach and previous works and finally, the conclusion is given in Section 6.
Preliminary knowledge
Machine learning
Machine learning, as one of the broadly used applications of artificial intelligence, explores and regulates the ways in which computer systems can obtain the learning capability. The purpose of machine learning is to enable a computer system to learn gradually through increasing learning data and improve their performance in doing tasks. Each machine learning program needs a “training dataset” to learn what kind of information it should expect and what kind of information the planner is looking for [5]. Majority of machine learning methods use the supervised learning approach. In this approach, the system tries to learn from prior artifacts that are available to it. In other words, it tries to learn and detect patterns based on the given examples [33].
Machine learning provides a significant improvement in various fields such as text processing, risk assessment, intrusion detection, face recognition, recommender systems, image retrieval, medical diagnosis, case-based reasoning, bioinformatics, social network analysis, etc. For instance, Angelo et al. [7] used the association rule, which is a technique in machine learning, to find the unfair tuples recommender-data.
Information diffusion
Mark Granovetter first investigated the issue of information dissemination in social networks [14]. He assumed information is disseminated in social networks, without considering any specific mechanisms. Information such as news, innovations, and viruses start with a collection of seed nodes and spreads across the network [3]. The dissemination of information has been investigated in a wide range of fields including health care [19], and complex networks. One of the most important tasks on networked systems is to understand, model and predict the rapid events and developments in the network body. This is mainly due to a well-known fact that the discovery of the network structure results in predicting the patterns of social events such as their shape, size and growth known as information diffusion [23]. Various techniques and methods have been proposed for modelling the diffusion of information on homogeneous and heterogeneous networks such as those discussed in [17].
Ant colony optimization
Swarm intelligence is a systematic property in which agents interact locally and the collective behavior of all agents results in convergence at a point close to the optimal global response. In swarm intelligence, each particle has a relative autonomy in these algorithms that can move around the solution space and must work with other particles. One of the well-known algorithms of swarm intelligence is ant colony optimization which is widely used in optimizations to find the global optimum [36].
In the early 1990s, Ant System (AS) algorithm was proposed by Dorigo et al. [9], as a new heuristic to solve difficult optimization problems. Then, he proposed Ant Colony Optimization Algorithm (ACO) as a multi-agent solution for optimizing algorithms such as Travelling Salesman Problem (TPS). This algorithm is inspired by the behavior of ants that are able to find the shortest path between the nest and food source and also adapt to environmental changes [10].
A graph is used in the ant colony algorithm to represent the problem that needs to be solved or optimized. In this graph, the nodes represent the problem states, and the edges represent the transition between states. The spilled pheromones on the paths are the information collected by ants during the food search which is represented by a value on edges (for example,
Densest subgraph
The issue of identifying a maximum-weighted subgraph with some size constraint is a well-known and widespread problem in data mining and has been studied in social network analysis. In this problem, a subset of nodes that have the highest ratio of edges between pairs of nodes should be found on the given input graph [38]. In 1984, Goldberg developed a polynomial-time algorithm to find the maximum density subgraph using a max flow technique [31]. There are many variations on the densest subgraph problem. One of them is the densest k subgraph problem, where the goal is to find the maximum density subgraph on exactly k vertices.
Let
Node centrality
Identifying the nodes that are more central than others, has been a key issue which has many usages in network analysis [6,25]. Various centrality measurements have been proposed for unweighted networks such as degree centrality, betweenness centrality, closeness centrality, eigenvector centrality, and subgroup centrality. There have been several attempts to generalize the aforementioned centrality measurements for weighted networks, but they still inherit the suffering weaknesses of unweighted networks. For example, the degree centrality method is very simple but has little relevance. Global metrics such as betweenness and closeness centrality can better identify important nodes but are not ideal to be applied for large-scale networks due to the computational complexity. To deal with this issue, the Laplacian centrality measure [6] is applied to measure the node centrality in this paper. For a weighted graph G, W, and X matrices are defined as follows:
Finally, the Laplacian Centrality
The complexity of computing Laplacian centrality for network G with n vertices and maximum degree Δ would not be more than
Related work
The diffusion of information in networks is similar to the distribution of epidemics but there are differences. For example, time, relationship strength, content, social metrics, and network structure are important factors that can have impacts on the diffusion of information [4,37].
The study of information exchanging processes is one of the most important aspects of Online Social Networks (OSN) analysis. There are three main research categories about information dissemination in OSNs [12]. The first category deals with detecting popular contents which have high probability of distribution among users. In the second category, diffusion routes are modeled by detecting the paths which are more likely to be used in the network (e.g. the tree-shaped paths which are known as cascade models) and finally in the third category, the influential nodes which have high distribution potential are identified. The more influential nodes can generate big cascades because of their properties. In this paper, we focus on the second and third categories, as they are directly influenced by the human cognitive limitation. The first category is more about the type of contents circulating in the network, so it should not be influenced by that limitation (at least not directly) [1].
Several approaches have been proposed for modeling the information diffusion through explicit relationships in social networks. Those approaches are mostly based on linear threshold model [15], independent cascade model [13], and general cascade model [29]. The idea of these models is that nodes may influence each other, so a node may have been influenced by other nodes and represents similar characteristics. It means nodes with high influential impacts (high centrality) can influence on other nodes (with less centrality) on the diffusion paths.
In the linear threshold model, each node is infected at a given time period, if the sum of its incident edges’ weights (which are the neighbors of the last node in the diffusion path) is above a given threshold. The diffusion process is stopped in this model when no new nodes are infected during a certain time period. In spite of simplicity of this mode, it has been largely used to model information diffusion in OSNs [35]. In this approach, either the weights of edges which have diffusion possibilities are the same, or the weights are set based on the maximum estimation likelihood of real diffusion cascades observations.
Granovetter et al. [15] proposed the linear threshold model which determines a threshold value for each node and a weight for each edge. When the accumulated value of all neighbors’ weights of a specific node is greater than the threshold value, it is linked to the diffusion path. The assumption in cascade based models is that the distribution of contents between users in the social network is based on explicit links [34,39].
The independent cascade model, assuming the selection of a node in the diffusion path, is not influenced by neighbors but by a single neighbor [13,22]. It means, the next node is chosen using the independent relationship between two nodes. This model determines a probability value for edges, so the selection takes place based on that probability [29]. Furthermore, the non-explicit relationships are not used in this model.
Lerman et al. [40] recently discussed that the diffusion paths generated by general cascade and linear threshold models are far from being realistic, as they are often largely different from real diffusion paths at OSNs. In general cascade model, the condition of independent cascade mode is eliminated, so all neighbors have influence on the selection of the next path independently, therefore the characteristics of the linear threshold model and the independent cascade model are generalized [1].
Proposed method
In this section, we describe our proposed method for enhancing the prediction accuracy of diffusion paths in social networks using the integration of ant colony and densest subgraph algorithms.
In some previous works, community detection algorithms like Louvian were used to improve information diffusion in the network [42]. One of the issues of this method is that it tries continuously to optimize and merge the clusters. In other words, the number of clusters is reduced as the optimization is continued. Moreover, the head clusters are not predefined and also there is no control over them. The benefit of detecting head clusters is that they can be used for optimizing information diffusion. Thus, we used the densest subgraph algorithm to detect the initial head clusters and then clustering the nodes based on them.
Creating primary clusters based on densest subgraph algorithm
In this stage, the graph nodes are divided into several clusters by using the densest subgraph algorithm. Let’s assume
The complexity of finding the densest subgraph of a given graph is NP-hard. Several approximate algorithms have been proposed for directed and undirected graphs [41]. We use one of these algorithms which can find the densest subgraph by adding a limitation of minimum size of the subgraph

Pseudo code of the modified densest subgraph algorithm
The obtained subgraph of the modified algorithm is used as a set of primary centers. According to densest subgraph algorithms, the centers of clusters should be away from each other as much as possible in terms of similarity. Therefore, the main goal of this stage is to find a subset of
The primary centers obtained from the previous stage might not be the best results. Therefore, in the second phase of the clustering algorithm, a new technique is used to find more accurate centers. To this end, in the first step, each node of the graph is assigned to its nearest center to form the primary clusters. Then, to find the new possible center, the sum of similarity values between a candidate center in the cluster,
After determining new centers of all clusters, other vertices will join the nearest clusters. This process is continued repeatedly until centers of the clusters remain unchanged. Also, vertices that have not been chosen as centers will join the cluster of the nearest center.

Different stages of densest subgraph algorithm; (a) the input graph, (b) the output of the modified densest subgraph algorithm; (c) assigning vertices to clusters (d) updating centers of the clusters and (e) merging clusters.
At the end of the second stage, some clusters may not have enough members to be used in the diffusion paths prediction stage. The small number of members for a cluster, not only causes less accurate results but also reduces the coverage rate of the prediction diffusion paths, so to overcome this issue, clusters with smaller number of members than the threshold are merged to the nearest cluster and the previous cluster is removed in the third stage of the algorithm.
Figure 1 displays an example of the densest subgraph algorithm. In part (a), nodes of a system are modeled as a graph. The edges’ weights in the graph show the similarity between vertices. Part (b) shows the output of the modified algorithm in finding four vertices as the primary centers. Part (c) depicts the primary clusters of the graph which are formed based on the primary centers. In the next stage, the primary centers are replaced with the new centers using equation (5). The new clusters and their members are displayed in part (d). As it can be seen,
The information gathered by ants is put on edges along the search process as pheromone, so
Creating paths based on ant colony algorithm
In this step, the optimal dissemination paths are determined according to the ant colony algorithm. To achieve this goal, each node of the social network is considered as a node in the ant colony algorithm. In general, the process of ant colony optimization algorithm is as follows: first, some ants are placed on some of the nodes of the graph randomly. Then, ants create their own solutions to the problems using the state transition rule consecutively. Ants prefer to go through paths that have high pheromone (shorter edges). When the transition of all ants is finished and their solutions were obtained, a general pheromone update rule is used. In this rule, each ant puts pheromone on edges and also pheromone on some edges evaporates. In other words, the edges of better solutions receive more pheromone. This process continues until the predefined stop condition is met. The transition rule is determined by ant colony optimization algorithm which is a combination of the exploratory information and the pheromone. When an ant is in node r, it chooses node s for the transition by using the rule in equation (7) [42].
In the probability mode of
Equation (9) represent an example of pheromone update rule in Equation (8) [42]
The next node
In probability mode, node
The state transition rule depends on two parameters of q and
Calculating the value of information diffusion in created paths
In our approach, a special exploratory function is proposed for computing the next suitable node. In this function, the node’s suitability and redundancy with previous nodes are involved. Therefore, unrelated nodes with redundancy will have less chance to be chosen.
Updating pheromone
At the end of each iteration of the ant colony algorithm when all ants reach the end of their path, the pheromone on each node is updated. The pheromone updating rule plays an important role in the algorithm optimization. According to this rule, more pheromone is allocated to the nodes involved in a better solution. So, they will have more chances to be chosen in the next iterations. This rule is applied to update the pheromone of each node after each iteration [42].
Evaluation of the proposed method
The performance evaluation is conducted using two experiments of classification-based and information diffusion on four real datasets of Karate, Dolphin, political books and American college football networks.
Datasets
We used four datasets of Karate club, [44], dolphin [26], political books [21] and American college football [11] networks for our experiments. Table 1 shows the details of those datasets.
Details of the used datasets
Details of the used datasets
In classification based experiment, three metrics of Precision, Recall and F1 criterion [11] are used to measure the quality of the prediction results. Precision is a very important criterion in evaluating the prediction of information diffusion. Incorrect predictions can provide wrong contents for the dissemination. The Recall metric is also important in performance evaluation because it estimates the coverage extent of the nodes involved in diffusion. F1 is another suitable metric which is obtained from a harmonic average of Precision and Recall [11].
In addition to the above evaluation metrics, AUC criterion [41] is also used to evaluate the performance. AUC represents the area bellow the diagram which is ROC (Receiver Operating Characteristics). The ROC diagrams are used to study the efficiency of classifications. The bigger value of AUC means a better outcome for that classification. ROC curves are 2D curves in which True Positive Rate (TPR) is drawn on Y axis and also False Positive Rate (FPR) is drawn on X axis. Tables 2 and 3 provide the evaluation results of all approaches on the four datasets.
Evaluation results based on F1 metric
Evaluation results based on F1 metric
Evaluation results based on AUC
The obtained results of different methods show, the community detection approach has higher precisions compared to Bayesian approach but has lower precisions compared to our proposed approach. This is because in the proposed method, the number of created clusters and their centers are controllable, so it results in more improvement in diffusion prediction.
Since using classification based experiments is not enough for evaluating the diffusion paths prediction, we use Cascade Independent model (ICM) [11] to observe and compare the performances of all comparison approaches. Tables 4, 5 and 6 show the experiment results on four datasets based on Precision, Recall and F1 metrics.
The evaluation results based on precision criterion
The evaluation results based on precision criterion
The evaluation results based on recall criterion
The evaluation results based on F1 criterion
As the results show, the proposed method has better outcomes for all metrics in comparison to the others. This is because the local and global information of nodes is used at the same time. Therefore, the main diffusion paths can be detected accurately and then the final paths are determined for all nodes.
As social networks play an important role in distributing information in big scales, many efforts have been made to predict and model the information diffusion paths. With the growth of social networks, the importance of studying and analyzing the structures and behaviors of social networks have become one of the main requirements of the market. The analysis results and investigations can be used for many different applications such as social network managements, market analysis, studying people and fans behaviors, performance improvement of recommender systems and so on.
Many approaches have been proposed to improve the accuracy of diffusion path prediction. One of the common problems with previous approaches is that there is high computation complexity for high dimensional networks. One solution to improve this is to use evolutionary algorithms, so in this work ant colony and densest subgraph algorithms were combined to improve the prediction accuracy of diffusion paths. This approach firstly creates a subset of maximum independent users and finds the centers of clusters. Then, the ant colony algorithm is used to predict the final paths of information diffusion in the network. The evaluation results of the proposed method and the representative approaches show that this approach can achieve a higher performance in comparison with other methods.
