Community Detection in Social Networks Using Affinity Propagation with Adaptive Similarity Matrix

Abstract

Community detection problem is a projection of data clustering where the network's topological properties are only considered for measuring similarities among nodes. Also, finding communities' kernel nodes and expanding a community from kernel will certainly help us to find optimal communities. Among the existing community detection approaches, the affinity propagation (AP)-based method has been showing promising results and does not require any predefined information such as the number of clusters (communities). AP is an exemplar-based clustering method that defines the negative real-valued similarity measure sim(i, k) between data point i and exemplar k as the probability of k being the exemplar of data point i. According to our intuition, the value of sim(i, k) should not be identical to sim(k, i). In this study, a new version of AP using an adaptive similarity matrix, namely affinity propagation with adaptive similarity (APAS) matrix, is proposed, which could efficiently show the leadership probabilities between data points. APAS can adaptively transform the symmetric similarity matrix into an asymmetric one. It outperforms AP method in terms of accuracy. Extensive experiments conducted on artificial and real-world networks demonstrate the effectiveness of our approach.

Introduction

Analyzing various kinds of networks, especially social networks, has attracted a lot of attention in recent years due to the structural properties distinguishing these networks from the random ones. One of these distinguishing properties is the community structure.¹ As a primary definition, a community in a network is a set of nodes that the edges (or links) inside this set are much more than the edges between the set and outside of it. Generally, the nodes belonging to the same community have common features and play similar roles in the network. Detecting communities of a network reveals beneficial information about the structure and nature of the network.^2,3 Social networks as a major part of the complex networks present community structure, as well. A node in a social network can be an individual, a group or an organization. These nodes are connected by edges indicating social relationships. There are lots of methods dealing with the problem of community detection in social networks. Among the clustering-based community detection methods, the affinity propagation (AP) one has been used in various approaches.^4,5 Despite other clustering-based methods, AP does not require any predefined information about the under process network. It just takes the similarity matrix between data points as input and efficiently clusters the network in a relatively short time. AP is an exemplar-based clustering method. It defines one representative for each cluster, so that the clusters are recognized by their representatives. When the representative of a cluster is chosen among the real data belonging to that cluster, it is called an “Exemplar.”

Algorithm 1 highlights the major steps of AP. There are two kinds of messages exchanged in each iteration of the AP algorithm: Responsibility and Availability. The responsibility message $r (i, k)$ , sent from data point i to candidate exemplar point k, reflects the accumulated evidence for how well-suited point k is to serve as the exemplar for point i, taking into account other potential exemplars for point i. The availability message $a (i, k)$ , sent from candidate exemplar point k to point i, reflects the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points that point k should be an exemplar.⁴ In each iteration of the AP algorithm, the values assigned to the responsibility and availability matrices are updated using the following rules: $r (i, k) \leftarrow s i m (i, k) - {max}_{k' s . t . k' \neq k} \{a (i, k') + s i m (i, k')\}$ (1) $a (i, k) \leftarrow m i n \{0, r (k, k) + \sum_{i' s . t . i' \notin \{i, k\}} m a x \{0, r (i', k)\}\}$ (2)

a (k, k) \leftarrow \sum_{i' s . t . i' \neq k} m a x \{0, r (i', k)\}

(3)

The similarity $s i m (i, k)$ indicates how well the data point with index k is suited to be the exemplar for data point i.⁶ Due to a definition of AP being the solution for minimizing the energy function $E (c) = - \sum_{i = 1}^{N} s i m (i, c_{i})$ , the similarity matrix used in AP must have negative real-valued elements.⁷ The values assigned to $s i m (k, k)$ , also known as the preferences, effect the final number of clusters. Data points with high values of preference are more likely to be chosen as the exemplars. In this study, the preferences are set to a common value of median of the whole similarity matrix. AP algorithm works as follows:

The responsibility and availability of all the pairs are initially set to zero. In each iteration, the responsibility matrix is primarily updated using the availability matrix of the previous iteration. Then the availability matrix is updated. Finally, the exemplar index for each node is determined by the following rule: $i d x (i) = {argmax}_{k} \{r (i, k) + a (i, k)\}$ (4)

Termination condition of the algorithm could be set to the determined maximum iteration number, or no changing on exemplar indices for a defined number of iterations. To overcome the numerical oscillations during matrix calculations, update rules can be modified using a damping factor as follows: $\begin{matrix} r^{(t + 1)} (i, k) \leftarrow (1 - λ) \\ (s i m (i, k) - {max}_{k' s . t . k' \neq k} \{a (i, k') + s i m (i, k')\}) + λ r^{(t)} (i, k) \end{matrix}$ (5) $\begin{matrix} a^{(t + 1)} (i, k) \leftarrow (1 - λ) \\ (m i n \{0, r (k, k) + \sum_{i' s . t . i' \notin \{i, k\}} m a x \{0, r (i', k)\}\}) \\ + λ a^{(t)} (i, k) \end{matrix}$ (6)

\begin{matrix} a^{(t + 1)} (k, k) \leftarrow (1 - λ) (\sum_{i' s . t . i' \notin \{i, k\}} m a x \{0, r (i', k)\}) \\ + λ a^{(t)} (k, k) \end{matrix}

(7)

Algorithm 1: Affinity Propagation
Input: A similarity matrix sim, maxits, convits, and a damping factor lam
Output: idx
sim(k,k) ← median(s)
responsibility, availability ←0
Repeat
Compute responsibility matrix using Eq. (5)
Compute availability matrix using Eqs. (6) and (7)
Determine idx vector using Eq. (4)
Until the number of iterations reaches maxits, or idx does not change for convits iterations

As a clustering algorithm, AP needs the pairwise similarities between all nodes to put the most similar nodes in the same clusters. Each node is assigned to the most similar representative. However, AP gives the similarity matrix a more comprehensive functionality. It uses the pairwise similarities in each iteration to compute availability and responsibility matrices, and finally, to determine cluster exemplars. Therefore, the similarity matrix has a direct influence on choosing the cluster exemplars. AP defines the negative real-valued similarity $s i m (i, k)$ between the data points i and k as the probability of k being the exemplar of i. Most of the existing AP-based community detection methods utilize a similarity measure to gain the pairwise similarities.^8–14 These measures assume that the value of $s i m (i, k)$ is identical to $s i m (k, i)$ and achieve a symmetric similarity matrix. However, according to our intuition, using a symmetric similarity matrix as the input of AP algorithm does not make sense. As in AP, each element $s i m (i, k)$ on the similarity matrix demonstrates the leadership probability of the node k over the node i. Hence, in this study, an improved version of AP is proposed by applying an asymmetric similarity matrix that could efficiently show the leadership probabilities between data points.

The rest of the article is organized as follows. The Related Works section reviews current studies on AP-based community detection methods. In the Affinity Propagation with Adaptive Similarity Approach section, we describe our approach in detail. The Experiments section consists of the experimental results conducted on real-world and artificial networks. Finally, conclusions are discussed in the Conclusions section.

Related Works

There have been various approaches applying AP for community detection task since it was proposed by Frey and Dueck.⁴ Most of them try to reach more accurate clustering results using different similarity metrics. In Jia et al.⁸ and Lai and Lu,¹¹ authors claim that the measures based on the number of common neighbors do not consider the topological structure of the whole network. Accordingly, they exploit the similarity of regular equivalence (RE) proposed by Leicht et al.¹⁵ RE uses the global information, which considers different roles of short and long paths. Wang et al. proposed an improved AP algorithm named semi-supervised hierarchical optimization-based AP.¹⁶ This algorithm introduces the idea of semisupervision by setting a certain proportion of label data and using the AP for clustering purpose, then establishes the supervision and nonsupervision information matrix to optimize, and combines the result of AP algorithm. Liu introduced three useful similarity measures such as diffusion kernel (DK) feature matrix, shortest path-based similarity, and adjacency matrix-based similarity.¹⁰ These similarity measures are then applied to the AP algorithm and the clustering results are compared with each other. In Lai et al.,¹³ authors exploited random walk to accomplish a community membership-based node to node similarity. In Liu et al.,⁹ authors applied a number of traditional similarity measures that are based on the number of common neighbors. Ding et al. used a similarity based on commute-time kernel.¹² They first mapped the nodes of graph into another feature space. Then, the similarity is computed easily. Wang et al. proposed a new approach for clustering the social network's graph by considering the content and link information of the social network.¹⁷ Furthermore, they analyzed the centrality of the detected communities in a real social network class as a case study and they found that the detected communities could significantly help to improve the quality of the teaching and learning based on each community's properties. In Guo and Zhang,¹⁸ authors only calculated the responsibilities and availabilities between each node and its neighbors. Here, a recursive method is proposed: the clusters found in the last iteration of AP are treated as the new nodes and partitioned recursively until the clustering results no longer change. Zhu et al. proposed a new method to measure the similarities between data points and transfer the messages only among pairs with similarity values >0.¹⁴ Jia et al.¹⁹ proposed a fast version of AP in two stages. In the first stage, the message passing occurs only between each node and its k most similar nodes. Then, the exemplars from the first stage are considered the candidates for final exemplars and a new graph structure is constructed.

There are some other researches improving AP to exploit it in different fields. For example, Wang et al. developed a method based on the propagation of affinity and domain architecture to predict protein functions in the format of Gene Ontology (GO) terms.²⁰ In their method, the AP was used to cluster GO-directed acyclic graph after a preprocessing step. The K-AP method was proposed to generate specified K clusters using AP.²¹ Luo et al. proposed a local community detection method using a dynamic membership function.²² Guo and Zhang proposed a new local AP algorithm to discover communities in social networks by propagating affinity messages among members in a localized manner.²³ The affinity between any two members was computed by the gray relational analysis method. The authors in Zhu et al.¹⁴ proposed a semisupervised clustering scheme based on AP. It defines a new measure of similarity for the case there is prior knowledge about the membership of some of the nodes in the network. This similarity is then applied to the AP algorithm. In Lai et al.^13, and Wang et al.,²⁴ a new schema of preference setting for AP method was applied. In Guo and Zhang,²⁵ authors combined AP with modularity maximization to obtain an efficient clustering of network. It constructs a dissimilarity distance matrix for the network in hand. This matrix is then applied to the AP algorithm to obtain a candidate set of exemplars. This incremental manner continues until no more exemplar remains in the candidate set. The final community of the network is the partitioning with maximum value of modularity. Shang et al. proposed a multiobjective evolutionary algorithm using the AP algorithm to find the communities with higher efficiency than other compared community detection algorithms.²⁶ In their method, AP is initially used to partition the network, and then, to reach a fast convergence, the multiobjective evolutionary algorithm selects the nondominated solutions from the preliminary partitioning results of AP as its initial population.

All the mentioned AP-based methods apply a symmetric similarity matrix into AP, where the value of sim(i,k) is identical to the value of sim(k,i). AP defines the negative real-valued similarity sim(i,k) between data point i and exemplar k as the probability, indicating how well the data point with index k is suited to be the exemplar for data point i. According to our intuition, this value is not necessarily identical to the value of sim(k,i). In the next section, a new approach of AP, applying an asymmetric similarity matrix, is proposed.

There are also many popular community detection algorithms such as label propagation algorithm (LPA),²⁷ Louvain,²⁸ and fast Newman.²⁹ In the LPA algorithm every node is initialized with a unique label and at each step of the procedure iteration, each node adopts the label that most of its neighbors currently have. In this iterative process, densely connected groups of nodes form a consensus on a unique label to form the communities. LPA requires neither optimization of a predefined objective function nor prior information about the communities. Louvain²⁸ is another heuristic method based on modularity maximization. The algorithm consists of two phases. In the first phase, each node is considered a different community. For each node i, the change of modularity by removing i from its community and moving it into a neighboring community is computed. The node i is then moved to the community with the maximum modularity increment. This process is repeated until no more increment in modularity is possible. In the second phase, all the communities obtained in the first phase are treated as new nodes of the network and a merging process is applied on the new nodes. Louvain is fast when applying to the large-scale real-world networks. It is easy to implement. In Newman,²⁹ Newman proposed a greedy optimization method for the modularity maximization problem. In this method, each node is supposed to be a different community with one member. Afterward, the pair of communities with the largest increase in modularity merges repeatedly until all the nodes are merged together in a sole community. This Newman method proved to be a very fast algorithm for the modularity maximization problem.

Stepping community detection algorithm based on label propagation and similarity (stepping LPA-S) is a new label propagation-based algorithm, in which labels are propagated using the resource allocation similarity index. It updates the label of a node according to the most similar label of its neighbors and then initial communities with high similarity are merged. In Berahmand and Bouyer,¹ a new extended LPA-based algorithm was proposed to improve the stability and accuracy of the traditional LPA by defining two concepts—nodes and link strength based on semilocal similarity. Zhang et al. proposed a new label propagation algorithm based on node ability (NALPA) to discover communities in complex networks.³¹ NALPA used four factor as node abilities, including propagation ability, attraction ability, launch ability, and acceptance ability. Wang et al. proposed a new algorithm based on the location of community structural centers detection and expanding (LCCD) algorithm for detecting communities.³² This algorithm initially finds central nodes of communities based on the structural centrality that uses local density and distances between nodes to calculate it. After identifying central nodes, it expands the local community from inner central nodes to outer nodes. Node's label influence policy for label propagation algorithm was another LPA-based algorithm for detecting community structures using link strength value for edges and nodes' label influence value for nodes.³³ It used this measure to select core nodes and label updating. It mostly removed tiebreak state of traditional LPA. A robust two-stage local community detection algorithm (RTLCD) is a local community detection based on core detection and community expanding from these cores. It has near-linear time complexity but low accuracy.³⁴ Community detection based on expanding core nodes using extended similarity (ECES) algorithm is a new local method based on detecting and expanding the core nodes.³⁵ A centrality criterion is used that gives high scores to central nodes with a high degree of embeddedness. First, the weights between edges are calculated using a similarity index. Next, the core dominance of the kernel node is calculated to select central nodes. The next step is to expand the central nodes using a membership function. At that time, the merging phase is used to form final communities. Tasgin and Bingol proposed the community detection algorithm based on boundary nodes at group level common neighbors (G-CN) algorithm as a local community detection method that considers a label propagation process such as LPA.³⁶ However, G-CN only updates the labels for boundary nodes at each iteration to reduce the execution time. It has two weaknesses: resolution limit and low accuracy problem.³⁷ To overcome these problems, a hybrid algorithm based on node scoring and synchronous label updating of boundary nodes along with disabling random label updating at group level common neighbors was proposed that uses a hybrid node scoring and synchronous label updating of boundary nodes, along with disabling random label updating in initial updates. For this purpose, it updates the label of boundary nodes in a synchronous manner using a new node scoring measure.³⁷

Affinity Propagation with Adaptive Similarity Approach

AP takes a network as an input and clusters it based on the pairwise similarities between its nodes. According to our intuition, these similarities should not be symmetric as they determine the leadership probability of a node over another one. Furthermore, AP is an iterative method and the exemplar indices. However, AP only utilizes the information of the last iteration, and the information gained in previous iterations is wasted. Besides these facts, we claim that the exemplar of a cluster is a node with the highest centrality in that cluster. Considering these intuitions, a new approach of AP is proposed, in which the similarity matrix used in every iteration of AP is not a constant one, but it is adapted according to the information obtained in each iteration. The adapted similarity matrix is used as an input for the next iteration. Therefore, in the second iteration of AP, the similarity matrix returns the symmetric mode and gets more accurate values as the algorithm continues. To this end, we need to utilize the information obtained in each iteration of AP. At the end of each iteration, after the exemplar indices for all the nodes are computed, the centrality values of the exemplars in their clusters are calculated. An exemplar should have the highest centrality value in its cluster. If exemplar k in cluster C, idx(k) = C, has the highest centrality value, then k is an appropriate exemplar for the nodes in C, and the similarities between the nodes in C and exemplar k, sim(i,k)_i∈C, should increase. This increment is done using the following formula: $s i m {(i, k)}_{i \in C}^{(t + 1)} \leftarrow s i m {(i, k)}_{i \in C}^{(t)} * I n c e n t r a l i t y {(k)}^{(t)},$ (8)

where Incentrality(k) is the centrality value of exemplar k in cluster C. There are a number of centrality metrics that can be used here. As the utilized metric does not have much impact on the result our experiments, in this study, “degree centrality” is used to measure centrality values. This centrality metric is very simple and has a low time complexity. The degree centrality value of exemplar k in cluster C is the number of direct neighbors of k in cluster C divided by the cardinality of C. As the similarity sim(i,k) is negative and Incentrality(k) takes value in the interval [0,1], the Eq. (8) changes as follows: $s i m {(i, k)}_{i \in C}^{(t + 1)} \leftarrow s i m {(i, k)}_{i \in C}^{(t)} * (1 - I n c e n t r a l i t y {(k)}^{(t)})$ (9)

Similarly, if exemplar k in cluster C, idx(k) = C, does not have the highest centrality value, then k is not an appropriate exemplar for the nodes in C, and the similarities between the nodes in C and exemplar k, sim(i,k)_i∈C, should be decreased. This decrement is done using the following formula: $s i m {(i, k)}_{i \in C}^{(t + 1)} \leftarrow s i m {(i, k)}_{i \in C}^{(t)} ∕ I n c e n t r a l i t y {(k)}^{(t)}$ (10)

Finally, the preference values need to be updated using the new similarity matrix: $s i m {(k, k)}^{(t + 1)} = m e d i a n (s i m {(:, k)}^{(t + 1)}),$ (11)

where sim(:,k) is a vector demonstrating the similarity values between all nodes of the network and k. Algorithm 2 represents the similarity updating process of the affinity propagation with adaptive similarity (APAS) approach.

Algorithm 2: Similarity Update of the Affinity Propagation with Adaptive Similarity Approach
Input: Similarity matrix sim used in iteration (t), idx vector obtained in iteration (t)
Output: Similarity matrix of iteration (t + 1)
exemplars ← unique(idx)
for all exemplars do
compute degree incentrality for exemplar k in cluster C
if incentralty(k) = argmax _i∈C incentrality(i)
increase the similarities between the nodes in C and exemplar k using Eq. (9)
else
decrease the similarities between the nodes in C and exemplar k using Eq. (10)
end if
end for
for all the nodes do
update preference values using Eq. (11)
end for

Refinement of the idx vector

At the end of each iteration of AP, idx vector is obtained. This vector does not require any refinement process, since it is not used in the next iteration of AP. On the contrary, APAS refines idx vector before updating similarity matrix due to the fundamental role of the idx vector in the updating process. Refinement of the idx vector is needed when there is a conflict in the exemplar index of an exemplar. AP requires that the exemplar of a cluster is chosen among the data belonging to that cluster. When an exemplar does not choose itself as exemplar, idx(k) ≠ k, the idx vector needs to be refined before being transferred to the similarity update function. The refinement is done in two conditional steps as follows:

1. If there exist some nodes choosing k as their exemplars in this iteration, but k has not chosen itself as exemplar, then all the nodes choosing k will be joint to the k's selected exemplar:

if ∃ i, idx(i) = k & idx(k) ≠ k

idx(i) ← idx(k)

end if

2. If there exist some nodes choosing k as their exemplars in this iteration, but k has chosen another node named j as its exemplar, and j has chosen k as its exemplar, from k and j, the exemplar with fewer number of followers (along with its followers) will join to another one:

if ∃ i, idx(i) = k & idx(k) = j & idx(j) = k

if length(find(idx = j)) > length(find(idx = k))

idx(i) ← j

else idx(k) ← k end if

end if

Algorithm 3 presents the procedure of proposed APAS approach.

Algorithm 3: Affinity Propagation with Adaptive Similarity Approach
Input: A similarity matrix sim, maxits, convits, and a damping factor lam
Output: idx
sim(k,k) ← median(s)
responsibility, availability ←0
Repeat
Compute responsibility matrix using Eq. (5)
Compute availability matrix using Eqs. (6) and (7)
Determine idx vector using Eq. (4)
Refine idx vector using two conditional steps
Update similarity matrix using Algorithm 2
Until the number of iterations reaches maxits, or idx does not change for convits iterations

Experiments

APAS takes a similarity matrix between data points as an input. To measure the similarity in this study, we utilize metrics used in Liu,³⁸ and Jia and Yu³⁹ since they have proven to be efficient when applying to the AP algorithm,³⁸ apply three different similarity measures to AP. One of these measures, called “Diffusion Kernel,”⁴⁰ is a feature matrix that is commonly used in different fields. Given the graph of a social network, the (opposite) Laplacian of the network is defined as follows: $L_{i j} = \{\begin{matrix} 1, i \sim j \\ - d_{i}, i = j \\ 0, o t h e r w i s e \end{matrix},$ (12)

where i–j means that nodes i and j are connected by an edge, and d_i is the degree of node i. The exponential of the matrix L is defined as follows: $K = exp (β L) = {lim}_{n \to \infty} {(1 + \frac{β L}{n})}^{n},$ (13)

where β is a positive constant to control the degree of diffusion, and exp(L) represents the matrix exponential of the Laplacian matrix L. We set β to be 0.4 in our experiments. Finally, the non-negative similarity matrix DK is defined by normalizing the kernel matrix K: $D K (i, j) = \frac{K_{i j}}{\sqrt{K_{i i} K_{j j}}}$ (14)

The method in Jia and Yu³⁹ exploits the similarity of RE proposed by Leicht and Newman.⁴¹ RE uses the global information, which considers different roles of short and long paths. The idea is that two nodes are similar if they are connected to other similar nodes. The similarity matrix RE is obtained in an iterative method using the following equation: $T^{(i + 1)} = \frac{α}{λ_{1}} A * T^{(i)} + I,$ (15)

where T⁽ⁱ⁾ is a temporal matrix variable used for calculating similarity at ith iteration. Parameter α is a convergence factor controlling the weights of short and long paths. In real-world social networks, the parameter α usually takes values in the range of 0.90–0.99 (in this study, we set α to be 0.9). λ₁ is the largest eigenvalue of the adjacent matrix A of the network. Initializing T to 0, that is, T(0) = 0, the elements of the similarity matrix RE will converge to non-negative values in a very short time.

As stated before, the similarity matrix used in AP must have negative real-valued elements. Therefore, we turn the non-negative values of DK and RE matrices into negative ones using the following assignment: $s i m = s i m - M,$

where M is the maximum value of the similarity matrix sim. In the following, APAS is compared with two abovementioned qualified approaches applying the original AP method. For simplicity, we name these two approaches as AP-DK and AP-RE. In all the experiments, we have set the damping factor λ to 0.9, the maximum number of iteration maxits to 3000, and the termination parameter convits to 100.

Evaluation metrics

There are several metrics to evaluate a clustering of a network. Here, we apply the two most common metrics in community detection: “NMI” and “Modularity.” Normalized mutual information (NMI)⁴² compares two partitions of a network and is beneficial when the natural clustering of the network is available. Given X = (X₁,X₂,…,X_nx) with nx number of clusters and Y = (Y₁,Y₂,…,Y_ny) with ny number of clusters as two different partitions of a network, NMI is defined as follows: $\begin{matrix} I_{n o r m} (X, Y) & = \frac{2 I (X, Y)}{H (X) + H (Y)} \\ = \frac{- 2 \sum_{x \in X} \sum_{y \in Y} | x \cap y | log (\frac{| x \cap y | n}{| x | | y |})}{\sum_{x \in X} | x | log (\frac{| X |}{n}) + \sum_{y \in Y} | y | log (\frac{| y |}{n})} \end{matrix}$ (16)

NMI equals 1 when two partitions are identical.

Modularity⁴³ is another evaluation metric currently very often used in community detection. Modularity of the partition P is defined as follows: $Q (P) = \sum_{C \in P} e (c) - a {(c)}^{2},$ (17)

where e(c) is the fraction of internal edges of cluster c, and a(c) is the fraction of edges with at least an endpoint in cluster c.

Experiments for NMI measure on real-world networks

In this section, four popular real-world networks with ground-truth, including Zachary karate club, bottlenose dolphins, Football and Political books are used to examine the NMI measure of our proposed APAS approach. The results of the AP and APAS methods using DK and RE similarities on real-world networks are separately shown in Tables 1 and 2.

Table 1.

Normalized mutual information values of affinity propagation diffusion kernel and affinity propagation with adaptive similarity on four real-world networks using negative diffusion kernel similarity

	Karate	Dolphins	Football	Polbooks
AP-DK	0.831	0.35	0.911	0.393
APAS (DK)	1.0	0.811	0.924	0.579

Bold value indicates the best result in the table.

AP, affinity propagation; APAS, affinity propagation with adaptive similarity; DK, diffusion kernel.

Table 2.

Normalized mutual information values of affinity propagation diffusion kernel and affinity propagation with adaptive similarity on four real-world networks using negative regular equivalence similarity

	Karate	Dolphins	Football	Polbooks
AP-RE	0.926	0.454	0.889	0.535
APAS (RE)	1.0	1.0	0.921	0.617

Bold values indicate the best results in the table.

RE, regular equivalence.

As we can see, APAS outperforms AP when applying to the real-world networks whether using DK or RE similarity matrices. APAS achieves more accurate results due to utilizing the information obtained in all the iterations. Therefore, the elements of similarity matrix used in APAS get more accurate values compared with the original AP algorithm. In both methods, the final number of communities is simultaneously determined by user-defined preferences and message passing procedure. Data points with high values of preference are more likely to be chosen as exemplar. In most cases, there is no information about the leadership probability of data points. Hence, usually the preference values are set to a common value. Selecting this value is of great importance. It is hard to know which value of preference would give the most optimal communities. A large value of preference will result in a large number of small communities and vice versa. There are many methods presented to select an optimum value for the preference. Most of these methods search the optimal number of clusters to set the best preference value. Anyway, it is not an important issue here as we utilize the same method for setting the preference value in both AP and APAS methods. The preferences are set to a common value of median of the whole similarity matrix both in AP and APAS methods. Therefore, the way that is used to set the preferences does not influence the final result of comparison between these two methods.

On the contrary, we also compared the APAS algorithm with other well-known, newly proposed, and none AP-based methods such as the method of Clauset, Newman, and Moore (CNM), Infomap, LCCD, ECES, NALPA, Stepping LPA-S, RTLCD, and Louvain. Table 3 shows the experimental results on four ground-truth networks based on NMI measure. It is obvious that APAS (RE) has the best performance in all data sets, except the Football data set. In the Karate data set, both APAS (RE) and APAS (DK) along with NALPA and RTLCD have the best NMI equal to 1. In the Dolphins data set, only APAS (RE) has NMI = 1. Stepping LPA-S has the second-best NMI. In the Polbooks data set, APAS (RE) and APAS (DK) are the first- and second-best algorithms, respectively. NALPA has the third rank among the compared algorithms. However, in Football data sets, APAS (DK) has the best and APAS (RE) has the second-best performance. Besides performance, another advantage of APAS is its reliability in finding communities in comparison with other methods. In several runnings, it has reliable performance and NMI measure.

Table 3.

Normalized mutual information results obtained from experiments on small real-world networks with ground truth

Data set	CNM	Infomap	LCCD	ECES	NALPA	Stepping LPA-S	RTLCD	Louvain	APAS (RE)	APAS (DK)
Karate	0.69	0.7	0.71	0.63	1	0.92	1	0.59	1	1
Dolphins	0.55	0.54	0.53	0.49	0.81	0.88	0.45	0.52	1	0.811
Polbooks	0.53	0.50	0.52	0.50	0.57	0.57	0.48	0.45	0.59	0.579
Football	0.75	0.919	0.88	0.75	0.90	0.92	0.51	0.89	0.921	0.924

Bold values indicate the best results in the table.

CNM, the method of Clauset, Newman, and Moore; ECES, expanding core nodes using extended similarity; LPA-S, label propagation and similarity; LCCD, location of community structural centers detection and expanding; NALPA, label propagation algorithm based on node ability; RTLCD, robust two-stage local community detection.

Experiments on computer-generated networks

Computer-generated networks are the other type of benchmarks that are commonly used in community detection field. In these benchmarks, the built-in cluster structure can be arbitrarily designed. Here, we use two classes of the computer-generated networks. The first, introduced by Girvan and Newman,⁴⁴ is a group of networks with 128 nodes divided into four communities of 32 nodes. These networks are known as the “GN” benchmark networks. Each node in a GN network has on average an internal degree of Z_in and an external degree of Z_out where Z_in + Z_out = 16. We made several networks for the particular values of Z_out and compared the APAS approach with AP-DK and AP-RE methods on these networks. Figures 1 and 2 show the result of each method on GN networks.

FIG. 1.

NMI and modularity values of AP-DK and APAS (DK) methods on GN networks using negative DK similarity matrix. AP, affinity propagation; APAS, affinity propagation with adaptive similarity; DK, diffusion kernel; GN, Girvan and Newman network; NMI, normalized mutual information.

FIG. 2.

NMI and modularity values of AP-RE and APAS (RE) methods on GN networks using negative RE similarity matrix. RE, regular equivalence.

As shown in Figures 1 and 2, APAS has a higher performance even when the community structure of the GN network is not clear. It outperforms AP-DK and AP-RE in terms of NMI and modularity when applying to GN benchmark networks. The second class of computer-generated networks proposed by Lancichinetti and Radicchi is “LFR” benchmark.⁴⁵ In these networks, the degree distribution and community size follow power law distribution. Changing Lancichinetti–Fortunato–Radicchi (LFR) benchmark parameters, one can build different networks with arbitrarily built-in cluster structures. Some of these parameters are the network size N, average degree of nodes k, maximum degree of nodes maxk, minimum community size minc, maximum community size maxc, and mixing parameter μ. Each node shares a fraction 1-μ of its edges with the nodes of its community and a fraction μ with the nodes of the other communities. Table 4 shows the parameters of the LFR networks used in this study.

Table 4.

Parameters of the Lancichinetti–Fortunato–Radicchi networks used in this study

N	k	max_c	min_c	Mu	max_k
1000	20	50	20	0.1–0.9	50

Mixing parameter is varied from 0.1 to 0.9 to obtain networks with different community structures. Figures 3 and 4 show the results of the methods on the obtained LFR networks.

FIG. 3.

NMI and modularity values of AP-DK and APAS (DK) methods on LFR network using negative DK similarity matrix. LFR, Lancichinetti–Fortunato–Radicchi.

FIG. 4.

NMI and modularity values of AP-RE and APAS (RE) methods on LFR network using negative RE similarity matrix.

When the community structure of the LFR network is strong, APAS performs as accurately as the AP-DK method. As the community structure becomes weak (mixing parameters >0.3), both methods start to weaken, however, APAS manages to have better results than AP-DK. In Figure 4, negative RE similarity metric is used to measure the similarities between nodes of the LFR networks. APAS outperforms AP-RE both in weak and strong community structures.

Experimental results for modularity measure on real-world networks

In this section, the APAS method is compared with other state-of-the-art community detection algorithms on real-world data sets such as Zachary's Karate Club,⁴⁶ Dolphins,⁴⁷ U.S. political books,⁴⁸ Football,⁴⁹ E-mail,⁵⁰ NetScience,⁵¹ Power grid,⁵² Condmat-2005,⁵³ E-mail-Enron,⁵⁴ and pretty-good-privacy (PGP).⁵⁵ Table 5 shows the results of testing the modularity measure on eight real-world data sets. In this table, Q denotes the modularity value and Num shows the number of detected communities. According to the results, in the Karate, Dolphin, and Polbooks data sets, the highest modularity value belongs to Infomap. However, a higher modularity is not an advantage in many data sets such as Karate, Dolphin, Polbooks, and Football. To prove this, if we analyze the result in Tables 1 and 2, it is found that APAS has the best NMI value for Karate and Dolphins. Both APAS (RE) and APAS (DK) do not have any wrong detection in the Karate data set. In this case, the best modularity value is 0.371 for Karate (with two real communities) that belongs to APAS (RE) and APAS (DK). In addition, in the Dolphin data set, APAS (RE) has NMI value 1, and the second-best NMI value is obtained by APAS (DK). In Dolphin, our method has the best number of detected communities with the best real modularity value (Q = 0378). Other compared methods have weak performance in these data sets. In the Polbooks data set, APAS (RE) has the best performance with lower error and real modularity (Q = 0.495). In the Football data set, APAS (DK) has the highest modularity value and best NMI value. In Lesmis, E-mail, NetScience, and Power grid networks, APAS (DK) has the highest modularity among AP-based methods.

Table 5.

Results of modularity (Q) and the number of communities (n) obtained on algorithms on real-world data sets

Data set	AP-DK		APAS (DK)		AP-RE		APAS (RE)
Data set	Q	n	Q	n	Q	n	Q	n
Zachary's Karate⁴⁶	0.387	3	0.371	2	0.358	3	0.371	2
Dolphins⁴⁷	0.440	11	0.474	3	0.449	7	0.378	2
U.S. political books⁴⁸	0.454	11	0.515	4	0.495	5	0.512	4
Football⁴⁹	0.580	13	0.627	12	0.517	14	0.612	11
Lesmis	0.516	13	0.541	7	0.452	7	0.0782	7
E-mail⁵²	0.428	126	0.519	128	0.395	128	0.395	45
NetScience⁵³	0.794	562	0.966	539	0.737	539	0.939	330
Power grid⁵⁴	0.554	1983	0.679	1000	0.536	1000	0.433	1001

Bold values indicate the best results in the table.

On the contrary, we have compared the proposed APAS method with other none AP-based methods on several real-world data sets in Table 6. It is obvious that the best measure for evaluating methods on ground-truth data sets is NMI. Although modularity is not an exact and reliable measure in most data sets, however, the only acceptable criterion for data sets without ground truth is the modularity criterion. The higher the modularity value the better the quality of detected communities. For example, the real modularity for Karate and Dolphin data sets, respectively, is 0.371 and 0.378 in case of 100% correct community detection (CN = 2). However, the modularity of other compared algorithms is higher than real values due to incorrect detections. In the Polbooks data set, Infomap with Q = 0.525 has the highest modularity value. However, the real modularity based on ground truth is Q = 0.4149 with CN = 3. However, APAS (RE) with best accuracy has a lower modularity than Infomap. Hence, it proves that modularity is not a suitable measure in Karate, Dolphin, and Polbooks data sets. In the Football data set, APAS (DK) method has the highest modularity (Q = 0.627) and also the best accuracy in this data set. Besides, APAS is also evaluated on several data sets without ground-truth based on modularity measure. APAS (DK) only has the highest modularity values on NetScience and Condmat-2005 data sets. APAS (RE) only has the best modularity in PGP data set. In most of these data sets, Louvain, CNM, and Infomap have higher modularity than other methods.

Table 6.

Comparing the results of modularity (Q) of affinity propagation with adaptive similarity with other none affinity propagation-based methods for some real-world networks with unknown communities

	NALPA	CNM	Infomap	Louvain	ECES (s = 1)	LCCD	APAS (DK)	APAS (RE)
Zachary's Karate Club⁴⁶	0.371	0.380	0.419	0.414	0.368	0.413	0.371	0.371
Dolphins⁴⁷	0.400	0.495	0.524	0.516	0.475	0.506	0.474	0.378
U.S. political books⁴⁸	0.462	0.501	0.525	0.517	0.492	0.522	0.515	0.512
Football⁴⁹	0.601	0.549	0.607	0.596	0.516	0.601	0.627	0.612
E-mail⁵⁰	0250	0.507	0.523	0.540	0.423	0.538	0.519	0.395
NetScience⁵¹	0.928	0.852	0.74	0.917	0.794	0.913	0.966	0.939
Power grid⁵²	0.781	0.601	0.816	0.841	0.859	0.842	0.679	0.433
Condmat-2005⁵³	0.630	0.633	0.630	0.709	0.536	0.711	0.718	0.494
E-mail-Enron⁵⁴	0.307	0.510	0.519	0.552	0.440	0.545	0.412	0.409
PGP⁵⁵	0.823	0.839	0.748	0.873	0.721	0.867	0.864	0.879

PGP, pretty-good-privacy.

Figure 5 shows the result of APAS (RE), AP-DK, AP-RE, LPA, Louvain, and Fast Newman algorithms on an LFR network with 1000 nodes. Negative DK is used as the similarity metric for APAS method. The figure shows the NMI values of six community detection algorithms. The value of μ is varied in the range of 0.1–0.9 to build networks with different levels of fuzziness of community structure. According to the obtained results, we can claim that the APAS method has considerably high performance in community detection of social networks.

FIG. 5.

NMI values of APAS (RE), AP-RE, AP-DK, LPA, Louvain, and Fast Newman methods on an LFR network with 1000 nodes. LPA, label propagation algorithm.

Selecting similarity measures

APAS takes as input the pairwise similarities between data points. In this method, the obtained result strongly depends on the utilized similarity measure. Using an inappropriate measure can affect badly the outcome communities. Here, we have applied two efficient similarity measures named as “Diffusion Kernel” and “Regular Equivalence.” Figure 6 shows the result of APAS on an LFR network with 1000 nodes utilizing six different similarity measures: DK, RE, Cosine similarity, SimRank, shortest path-based similarity (10), and hamming distance similarity (10). Cosine similarity is a structural similarity measure based on the idea that two nodes in a social network are supposed to be similar if they have many neighbors in common. Denoting Ni as the neighbors of the node i, cosine similarity is defined as follows:

FIG. 6.

NMI values of the APAS method on an LFR network with 1000 nodes utilizing six different similarity measures.

S i m_{c o s i n e} (i, j) = \frac{|N_{i} \cap N_{j}|}{\sqrt{|N_{i}| |N_{j}|}}

(18)

SimRank is another useful similarity measure that says two nodes are similar if they connect to similar nodes. SimRank similarity Sim between two nodes a and b is defined as follows: $S i m (a, b) = \frac{C}{|N_{a}| |N_{b}|} \sum_{i = 1}^{|N_{a}|} \sum_{j = 1}^{|N_{b}|} S (N_{i} (a), N_{j} (b)),$ (19)

where C is a constant between 0 and 1.

Assuming the length of the shortest path between two nodes i and j as dij, the shortest path similarity is defined as follows: $S i m (i, j) = \sqrt{\sum_{k = 1}^{n} {(d_{i k} - d_{j k})}^{2}}$ (20)

Finally, hamming distance similarity is an adjacency matrix-based similarity defined as below: $S i m (i, j) = 1 - \frac{H a m D i s (A_{i}, A_{j})}{d_{i} + d_{j}},$ (21)

where HamDis(A_i,A_j) is the hamming distance between vectors A_i (the ith row of the adjacency matrix A) and A_j, and d_j is the degree of node j.

As we can see, DK and RE measures have the best performance when applying to the APAS method.

Conclusions

In this study, we proposed the APAS method, that is, a new approach of APAS matrix. APAS updates the similarity matrix using the information obtained in each iteration and achieves an asymmetric similarity matrix with the elements indicating the leadership probability of one node over another. This enables APAS to get more accurate clustering results compared with the original AP method. To evaluate the performance of APAS, we applied our proposed method to eight popular real-world networks and two classes of computer-generated networks. The experimental results demonstrate that APAS can effectively detect the community structure of networks. However, there exist some limitations and weaknesses inherited from the AP method. The main weakness of the APAS method is the problem of high time and space complexity when dealing with large social networks. Several scholars have been proposed to solve this problem.^14,18,19 These proposed solutions could be applied to APAS to make it an efficient community detection method in very large networks. Furthermore, along with the AP method, APAS also strongly depends on the utilized similarity measure. Using an inappropriate measure can badly affect the outcome communities. Selecting the best values for the preferences is another challenge in AP. Preferences influence the final number of communities. It is hard to know which values of preferences would give the most optimal communities. In this study, the preferences are set to a common value of median of the whole similarity matrix. Numerical oscillations are the last limitations of the AP method. They may occur when the exchanged messages are getting optimized. These oscillations prevent the algorithm from converging. AP cannot automatically eliminate the oscillations. One solution is to increase the damping factor manually and run AP until the algorithm converges. The other choice is to use a big damping factor close to 1. In this study, we have set the damping factor λ to 0.9 and the value is increased by 0.01 whenever oscillations occur.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

Funding Information

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Abbreviations Used

References

Berahmand

, Bouyer

. A link-based similarity for improving community detection based on label propagation algorithm. J Syst Sci Complex. 2019; 32:737–758.

Jiang

, Liu

, et al. Community detection in complex networks with an ambiguous structure using central node based link prediction. Knowl Based Syst. 2020; 195:105626.

B-D

, Shen

, Nguyen

, Falkner

. Improved network community detection using meta-heuristic based label propagation. Appl Intell. 2019; 49:1451–1466.

Frey

, Dueck

. Clustering by passing messages between data points. Science. 2007; 315:972–976.

Givoni

, Chung

, Frey

. Hierarchical affinity propagation. arXiv preprint arXiv:12023722. 2012.

Frey

, Dueck

. Clustering by passing messages between data points. Science. 2007; 315:972–976.

Wang

, Gao

. Detecting protein complexes by an improved affinity propagation algorithm in protein-protein interaction networks. JCP. 2012; 7:1761–1768.

Jia

, Jiang

, Yu

, (Eds.). Affinity propagation on identifying communities in social and biological networks. In: International Conference on Knowledge Science, Engineering and Management, 2010, Springer. pp. 597–602.

Liu

, Li

, Zheng

, Sun

Community detection by affinity propagation. Technical Report, 2008.

10.

Liu

, (Ed.). Community detection by affinity propagation with various similarity measures. In: 2011 Fourth International Joint Conference on Computational Sciences and Optimization (CSO), 2011, IEEE.

11.

Lai

, Lu

. Identification of community structure in complex networks using affinity propagation clustering method. Modern Phys Lett B. 2008; 22:1547–1566.

12.

Ding

, Luo

, Shi

, Fang

, (Eds.). Overlapping community detection by kernel-based fuzzy affinity propagation. In: 2010 2nd International Workshop on Intelligent Systems and Applications (ISA), 2010, IEEE.

13.

Lai

, Nardini

, Lu

. Partitioning networks into communities by message passing. Phys Rev E. 2011; 83:016115.

14.

Zhu

, Meng

, Zhou

. Semisupervised clustering for networks based on fast affinity propagation. Math Problems in Eng. 2013; 2013:385265.

15.

Leicht

, Holme

, Newman

. Vertex similarity in networks. Phys Rev E. 2006; 73:026120.

16.

Wang

, Wang

, Han

, Wang

. Semi-supervised hierarchical optimization-based affinity propagation algorithm and its applications. Int J Comput Appl. 2018; 40:1–10.

17.

Wang

, Peng

, Pei

, et al., editors. Detection of Social Groups in Class by Affinity Propagation. In: 2019 IEEE International Conferences on Ubiquitous Computing and Communications (IUCC) and Data Science and Computational Intelligence (DSCI) and Smart Computing, Networking and Services (SmartCNS), 21–23 October, 2019.

18.

Guo

, Zhang

. Detecting communities in social networks by local affinity propagation with grey relational analysis. Grey Syst Theory Appl. 2015; 5:31–40.

19.

Jia

, Wang

, Zhang

, Hua

X-S

, (Eds.). Finding image exemplars using fast sparse affinity propagation. In: Proceedings of the 16th ACM International Conference on Multimedia, 2008, ACM.

20.

Wang

, Zhao

, Wang

, et al. PANDA: Protein function prediction using domain architecture and affinity propagation. Sci Rep. 2018; 8:3484.

21.

Zhang

, Wang

, Norvag

, Sebag

, editors. K-AP: Generating specified K clusters by efficient affinity propagation. In: 2010 IEEE 10th International Conference on Data mining (ICDM), 2010, IEEE.

22.

Luo

, Zhang

, Jiang

, et al. Local community detection with the dynamic membership function. IEEE Trans Fuzzy Syst. 2018; 26:3136–3150.

23.

Guo

, Zhang

. Detecting communities in social networks by local affinity propagation with grey relational analysis. Grey Syst Theory Appl. 2015; 5:31–40.

24.

Wang

, Zhang

, Li

, et al. Adaptive affinity propagation clustering. arXiv preprint arXiv:08051096. 2008.

25.

Guo

W-F

, Zhang

S-W

. A general method of community detection by identifying community centers with affinity propagation. Physica A Stat Mech Appl. 2016; 447:508–519.

26.

Shang

, Luo

, Zhang

, et al. A multiobjective evolutionary algorithm to find community structures based on affinity propagation. Physica A Stat Mech Appl. 2016; 453:203–227.

27.

Raghavan

, Albert

, Kumara

. Near linear time algorithm to detect community structures in large-scale networks. Phys Rev E, 2007; 76:036106.

28.

Blondel

, Guillaume

J-L

, Lambiotte

, Lefebvre

. Fast unfolding of communities in large networks. J Stat Mech. 2008; 2008: P10008.

29.

Newman

MEJ.

Fast algorithm for detecting community structure in networks. Phys Rev E. 2004; 69:066133.

30.

, Huang

, Wang

, Chen

. Stepping community detection algorithm based on label propagation and similarity. Physica A Stat Mech Appl. 2017; 472:145–155.

31.

Zhang

, Liu

, Zhu

, et al. NALPA: A node ability based label propagation algorithm for community detection. IEEE Access. 2020; 8:46642–46664.

32.

Wang

, Liu

, Li

, Nees

. Locating structural centers: A density-based clustering method for community detection. PLoS One. 2017; 12: e0169355.

33.

Berahmand

, Bouyer

. LP-LPA: A link influence-based label propagation algorithm for discovering community structures in networks. Int J Mod Phys B. 2018; 32:1850062.

34.

Ding

, Zhang

, Yang

. A robust two-stage algorithm for local community detection. Knowl Based Syst. 2018; 152:188–199.

35.

Berahmand

, Bouyer

, Vasighi

. Community detection in complex networks by detecting and expanding core nodes through extended local similarity of nodes. IEEE Trans Comput Soc Syst. 2018; 5:1021–1033.

36.

Tasgin

, Bingol

. Community detection using boundary nodes in complex networks. Physica A Stat Mech Appl. 2019; 513:315–324.

37.

Zarezade

, Nourani

, Bouyer

. Community detection using a new node scoring and synchronous label updating of boundary nodes in social networks. J AI and Data Mining. 2020; 8:201–212.

38.

Liu

. Community detection by affinity propagation with various similarity measures. In: 2011 Fourth International Joint Conference on Computational Sciences and Optimization. Yunnan, China: IEEE, 2011, pp. 182–186.

39.

Jia

, Jiang

, Yu

. Affinity propagation on identifying communities in social and biological networks. In: Bi Y, Williams MA, (Eds.). Knowledge Science, Engineering and Management. Northern Ireland, UK: Belfast, 2010.

40.

Kondor

, Lafferty

, (Eds.). Diffusion kernels on graphs and other discrete input spaces. In: ICML’02: Proceedings of the Nineteenth International Conference on Machine Learning, San Francisco, CA: Morgan Kaufmann Publishers, Inc., 2002, pp. 315–322.

41.

Leicht

, Holme

, Newman

MEJ

. Vertex similarity in networks. Phys Rev E. 2006; 73:026120.

42.

Danon

, Duch

, Diaz-Guilera

, Arenas

. Comparing community structure identification. J Stat Mech Theory Exp. 2005; 2005: P09008.

43.

Newman

MEJ

, Girvan

. Finding and evaluating community structure in networks. Phys Rev E. 2004; 69:026113.

44.

Girvan

, Newman

, (Eds.). Community structure in social and biological networks. Proc Natl Acad Sci USA. 2002; 99:7821–7826.

45.

Lancichinetti

, Fortunato

, Radicchi

. Benchmark graphs for testing community detection algorithms. Phys Rev E. 2008; 78(4 Pt. 2):046110.

46.

Zachary

WW.

An information flow model for conflict and fission in small groups. J Anthropol Res. 1977; 33:452–473.

47.

Lusseau

, Schneider

, Boisseau

, et al. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behav Ecol Sociobiol. 2003; 54:396–405.

48.

Newman

ME.

Modularity and community structure in networks. Proc Natl Acad Sci USA. 2006; 103:8577–8582.

49.

Girvan

, Newman

. Community structure in social and biological networks. Proc Natl Acad Sci USA. 2002; 99:7821–7826.

50.

Guimera

, Danon

, Diaz-Guilera

, et al. Self-similar community structure in a network of human interactions. Phys Rev E. 2003; 68:065103.

51.

Newman

ME.

Finding community structure in networks using the eigenvectors of matrices. Phys Rev E. 2006; 74:036104.

52.

Watts

, Strogatz

. Collective dynamics of “small-world” networks. Nature. 1998; 393:440.

53.

Newman

ME.

The structure of scientific collaboration networks. Proc Natl Acad Sci USA. 2001; 98:404–409.

54.

Leskovec

, Kleinberg

, Faloutsos

, (Eds.). Graphs over time: Densification laws, shrinking diameters and possible explanations. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005, ACM.

55.

Boguná

, Pastor-Satorras

, Díaz-Guilera

, Arenas

. Models of social networks based on social distance attachment. Phys Rev E. 2004; 70:056122.