Abstract
In social network analysis, identifying the important nodes (key nodes) is a significant task in various applications. There are three most popular related tasks named influential node ranking, influence maximization, and network dismantling. Although these studies are different due to their own motivation, they share many similarities, which could confuse the non-domain readers and users. Moreover, few studies have explored the correlations between key nodes obtained from different tasks, hindering our further understanding of social networks. In this paper, we contribute to the field by conducting an in-depth survey of different kinds of key nodes through comparing these key nodes under our proposed framework and revealing their deep relationships. First, we clarify and formalize three existing popular studies under a uniform standard. Then we collect a group of crucial metrics and propose a fair comparison framework to analyze the features of key nodes identified by different research fields. From a large number of experiments and deep analysis on twenty real-world datasets, we not only explore correlations between key nodes derived from the three popular tasks, but also summarize insightful conclusions that explain how key nodes differ from each other and reveal their unique features for the corresponding tasks. Furthermore, we show that Shapley centrality could identify key nodes with more generality, and these nodes could also be applied to the three popular tasks simultaneously to a certain extent.
Introduction
Social network analysis is a wide and hot research topic that uses graph structures to represent people and their relationships from both real and virtual worlds. Different people own different social circles, leading to different individuals possessing different positions and importance in a network. For example, a person who owns 1,000,000 followers on Twitter is usually more important to the network than a person with 100 followers. In social network analysis, identifying a set of key nodes is a very important task which inspires various research studies with different motivations. In real life, key nodes are generally recognized as celebrities in social activities, such as business leaders, research experts, outstanding players, movie stars and famous politicians. Identifying these key nodes can be useful for many applications, e.g. online advertising [1], expert finding [2], virus propagation prevention [3] and network anomaly detection [4]. Currently, there are three relevant popular key node research fields in social network analysis, i.e. Influential Node Ranking (also Node Ranking), Influence Maximization and Network Dismantling. Node ranking aims to rank influential nodes correctly (especially high influential nodes). Influence maximization [5, 6] focuses on maximizing the coverage of information through dissemination with a small number of initial spreaders. Network dismantling [7] asks to break the network into many disconnected pieces by removing a minimal group of nodes.
Although lots of methods have been proposed for these three fields, they have two inherent drawbacks: (1) Non-expert readers may easily confuse between these closely related domains. In fact, these three kinds of studies are intuitively quite similar in that they all intend to identify some kind of key nodes. Also, the same concepts may be adopted and improved in these studies to evaluate the importance of nodes, such as degree centrality and community structures. Moreover, readers could be misled when the titles of these papers give a wrong impression. For example, some papers titling “identifying influential nodes/spreaders” actually deal with the task of influence maximization [8, 9, 10, 11]. Also, the term “top-k influential nodes” in [12, 13] has different meanings, where one [12] implies Node Ranking and the other [13] refers to Influence Maximization. Particularly, a quite notable paper titled “Influence Maximization in Complex Networks Through Optimal Percolation” [14] published in Nature essentially presents a novel method for network dismantling. (2) Non-expert users may feel overwhelmed when trying to get key nodes in networks without a specific purpose since these methods cannot answer a simple but general question, i.e., what are the key nodes in networks? Although many existing studies focus on key nodes, each of these studies rather identifies key nodes with its own single motivation and only related metrics, resulting in that the identified key nodes are biased. Therefore, they are unable to give a general idea of key nodes since the concept of key nodes is limited in their own tasks.
In addition, few studies have explored the correlations between these three key node relevant tasks. For example, are the identified key nodes from these three tasks similar to each other? If they are similar, could the identified key nodes from one of the tasks be also applied to another? If not, what are the unique features behind these different key nodes that cause their limited adaptability to their own tasks? Intuitively, Influential Node Ranking and Influence Maximization are both related to a node’s influence, and thus they could be similar. Also, since the activation of the key nodes from Influence Maximization could maximize the coverage of information spread, the removal of them may also block the spread and thus dismantle a network. By analogy, since the removal of the key nodes from Network Dismantling could limit the information outbreak to a sub-extensive size, the activation of them may also benefit the spread of influence. In other words, these two kinds of key nodes may all act as hubs throughout a network. Although Radicchi et al. [15] have studied the difference between key nodes from Influence Maximization and Network Dismantling, the comparison algorithms they chose are not representative and the datasets they used are rather small in size.
In this paper, we clarify and formalize these three research tasks, as well as present their motivation and metrics respectively. Through comparing and analyzing the features of key nodes derived from different tasks, we collect and build a series of crucial metrics under a uniform standard. Based on the above, we propose a comparison framework that is a fair measurement system which generally evaluates a key node set from different aspects. With the proposed framework, we further evaluate the key nodes identified by existing studies to explore the correlations between them. Besides, we also discover a cooperative-game-theory based network centrality that could identify the most representative key nodes under the general comparison framework.
There are some challenges in this work compared with previous studies. First of all, though many studies have been focused on these three popular fields, there is still a deficiency in describing and formalizing them under a uniform standard. Second, although these three tasks have their own different motivation, the correlations between their identified key nodes remain unknown. Third, key nodes identified by different research studies may be distinct in features, such as ordering, output sizes and dependent models. These differences make it a hard task to evaluate them fairly and generally. At last, exploiting a method that identifies a group of more general key nodes is still a gap for existing research.
Overall, the main contribution of our work is to solve the above problems by conducting an in-depth survey of different kinds of key nodes, which is elaborated point by point as follows.
There are already some excellent studies concentrating on reviewing one of these three research fields. For example, Lü et al. [17] extensively review the approaches in influential node ranking, emphasizing on the physical concepts and related methods. Arora et al. [18] conduct an in-depth benchmarking study w.r.t. the start-of-the-art algorithms in influence maximization while Li et al. [19] review these algorithms theoretically and give a rigorous theoretical comparison of them. Wandelt et al. [20] compare the algorithms from network dismantling on many different types of networks, the results of which could serve as a reference for selecting an appropriate network dismantling method for a given network, considering both accuracy requirements and run time constraints.
However, since our work aims to explore the correlations between key nodes from three different tasks rather than compare algorithms within the same task, we only give a brief review on the existing algorithms, focusing on analyzing and classifying the basic ideas behind them. As shown in Fig. 1, we summarize existing key node relevant research studies, giving attention to their approaches that deal with their corresponding tasks. We also introduce some of their representative algorithms.
Approaches and some representatives for identifying key nodes.
Algorithms for node ranking usually adopt classic network centralities and could be roughly classified into the local-view and the global-view. LocalRank [21] and H-index [22] are two classic local-view approaches which extend the degree centrality. Local-view approaches are fast and simple but less effective when used alone because they only focus on the local neighborhood. In terms of the global-view, global centralities often serve as the foundations. For example, LeaderRank [23] improves the classic PageRank, KS-IF [24] is an refinement of k-core and GLR [25] reduces the computational complexity of closeness.
Since this task aims to rank nodes on the basis of individuals’ influence spread, estimating single node influence is undoubtedly a feasible choice for this task. Chen et al. [26] summarize this kind of approach as single node influence (SNI) centrality and proposed ASNI-RR algorithm to fast estimate single node influence based on the reverse reachable (RR) set [27, 28, 29].
Influence maximization algorithms
Dealing with influence maximization, approximation algorithms and estimation algorithms are two kinds of main approaches. Based on the submodularity and monotonicity of this task, Kempe et al. [30] present the first approximation algorithm (Greedy) which iteratively picks node that could bring maximal gain in terms of influence as a seed node. The approximation algorithms could guarantee their performance to be at least
On the other side, estimation algorithms tend to estimate a score rather than influence spread for each node. Combined with certain seed selection strategies, these algorithms could be quite fast but without theoretical guarantees, such as DegreeDiscount [31], PMIA [34], EasyIM [35] and FNS [36].
Network dismantling algorithms
To dismantle a network, some algorithms iteratively remove nodes with the highest centrality, which we summarize as the node-level approach. The key to these algorithms is to design a structure-relevant centrality which could be calculated in a short time. Collective Influence (CI) [14] is a classic node-level method which presents its centrality based on optimal percolation. Also, Clusella et al. [37] propose Explosive Immunization (EI) algorithm that combines explosive percolation paradigm and the idea of maintaining a fragmented distribution of clusters. It is worth mentioning that the Recalculated Betweenness (RBW) [38] has shown great effectiveness in a comparative analysis [20] but demand heavy time cost.
The graph-level approaches firstly decycle (eliminate all cycles) a network into trees and then easily dismantle these trees with a greedy strategy. The key of these algorithms is to fast decycle a network with minimal cost. The representatives of this kind of approaches are Min-Sum [7], CoreHD [39] and BPD [40].
Tasks and their motivations
Social network and information propagation model
A social network with
In information dissemination, each edge
Influential node ranking
Influence Node Ranking (NR) aims to sort nodes monotonically in a network, whose result should be consistent with the ranking of single node influence as much as possible. Let
The Kendall’s coefficient
Improving existing centralities and estimating single node influence are two common directions in NR studies. For example, k-core centrality is frequently adopted and modified in the first direction of studies as it could identify a set of nodes that are tightly interlinked as the core of the graph. However, It could be easy for the second direction of algorithms to get a good ranking result as long as they could estimate
If required to present a set consisting of
Influence Maximization (IM) aims to find a node set
Compared with node ranking, influence maximization concentrates on the collective influence which is closely associated with marginal gain rather than single node influence. The marginal gain of a node
It has been proved that the influence function
It is worth noting that the Shapley value [16], derived from a fundamental cooperative-game theory, formalized as Shpaley centrality and implemented by ASV-RR [26], evaluates the importance of nodes from a unique perspective. The Shapley centrality measures a node by its expected marginal contributions over arbitrary node permutations and provides each node with a score. ASV-RR could be viewed as an NR algorithm which concentrates more on the expected marginal gain.
If required to present a set consisting of
Network Dismantling (ND) aims to find the minimal set of nodes whose removal leaves the network broken into many disconnected components with the size of their largest one limited to a preset threshold
The threshold
Decycling is an efficient technique to make ND task much easier. Decycling seeks to find a subset of nodes whose removal could leave a network acyclic, i.e. eliminating all cycles in a network. An acyclic graph could be easily dismantled by adopting a greedy tree-breaking strategy [7]. Besides, reinsertion is a critical step in ND algorithms. No matter how ND algorithms remove nodes to make
Unfortunately, it is impossible to require ND algorithms to present a set consisting of exactly
NR, IM and ND all intend to identify key nodes in a network with their own motivation and focus. In summary, NR aims to acquire the ranking of single node influence for all nodes (especially high influential nodes) whether actually calculates them or not. IM aims to find a given number of nodes whose co-acting influence spread would be maximal. ND tends to disintegrate a network by removing a minimal set of nodes from it. The features of these three key node sets are compared in the following Table 1.
Features of key nodes identified from three tasks
Features of key nodes identified from three tasks
First of all, key nodes identified by NR and IM are ordered while ND is not. Secondly, the number of key nodes is totally different under the three tasks. Thirdly, the correctness of key nodes identified by an NR algorithm is available while IM and ND are not. At last, NR and IM rely on the propagation model while ND is independent of the spreading model. Thus, building a fair and general comparison framework under which different kinds of key node sets could be evaluated together is a challenging work.
In this section, we present our comparison framework which fairly evaluates a key node set
General Comparison Framework. 
First, regarding the task of node ranking, key nodes of NR are viewed as the baseline. Since key nodes from ND have no ordering, i.e. all of them are almost equally important, it is only fair to include all of them in comparison. Thus, we use Jaccard index instead of the
where
Besides, we also test the top NR frequency for a key node set. Specifically, for certain proportions (e.g.
Though both Jaccard Index and top NR frequency focus on the union part of two sets, they calculate the similarity from different aspects. Jaccard Index generally evaluates the similarity between
For the task of influence maximization, influence spread should be undoubtedly involved in comparison. Similarly, for each network, top
Moreover, we also concern the relations within a key node set. The relations refer to the number of edges existing between these nodes. As rich-club effect (a small number of nodes with large numbers of links are very well connected to each other) [42] is vital to the IM task, it makes great sense to check the relations and how they affect the influence spread. The relations in a key node set
In view of the network dismantling task, we check the ability of a key node set to dismantle networks, namely, we check the size of the remaining largest connected component when top
The smaller the
Since reinsertion is a critical step for ND algorithms, we also keep removing nodes from the top in
Datasets and algorithms
In this subsection, we introduce twenty testing networks with different sizes and categories adopted for experiments. The summary of these datasets is shown in Table 2. All networks are pre-processed and their largest connected component is used instead of the original networks since an initially fragmented network would greatly reduce the significance of the ND task. The numbers of nodes and edges are represented by
Summary of twenty testing networks
Summary of twenty testing networks
(i) Dolphins (DP) [43]: This is a network of frequent associations between 62 dolphins in a community living off Doubtful Sound, New Zealand. (ii) Poolbooks (PB):1
(iii) Football (FB) [44]: This is a network of American football games between Division IA colleges during regular season Fall 2000.
(iv) Airline (AL):2
(v) NetScience (NS) [45]: This is a collaboration network compiled by M. Newman in May 2006, covering scientific collaborations between scientists working on network theory and experiment.
(vi) USAir (UA) [46]: This network depicts the air travel connections among 500 US airports with the largest amount of traffic from publicly available data.
(vii) Email (EM) [47]: This is an email communication network at the University Rovira i Virgili in Tarragona in the south of Catalonia in Spain.
(viii) Bible (BB):3
(ix) Hamster (HS)
(x) Yeast (YE) [48]: This is a biology network consists of protein-protein interactions.
(xi) CaGrQc (CG) [49]: This is a collaboration network from the arXiv, covering scientific collaborations between authors of papers submitted to General Relativity and Quantum Cosmology category.
(xii) Gnutella06 (GT) [49, 50]: This network is the snapshot of the Gnutella peer-to-peer file sharing network from August 2002 where nodes represent Gnutella hosts.
(xiii) CaHepTh (CH) [49]: This is a collaboration network from the arXiv, covering scientific collaborations between authors of papers submitted to High Energy Physics – Theory category.
(xiv) PGP (PP) [51]: This is the interaction network of users of the Pretty Good Privacy (PP) algorithm.
(xv) CaCondMat (CC) [49]: This is a collaboration network from the e-print arXiv, covering scientific collaborations between authors of papers submitted to Condense Matter category.
(xvi) CAIDA (CA) [49]: This is a network of autonomous systems of the Internet connected with each other from the CAIDA project collected in 2007.
(xvii) IntTopo (IT) [52]: This is a network of connections between autonomous systems, i.e. collections of connected IP routing prefixes controlled by network operators, of the Internet.
(xviii) Brightkite (BK) [53]: This network depicts user-user friendships from Brightkite, a former location-based social network where users share their locations.
(xix) WordNet (WN) [54]: This is a lexical network containing words from the WordNet dataset.
(xx) Douban (DB):4
In order to explore the correlations between key nodes derived from three popular research studies, we choose their representative algorithms with both effectiveness and efficiency. Throughout the whole experiments, ASNI-RR (shorten as SNI) [26] which approximates real single node influence with high probablity is chosen for Node Ranking (NR), IMM [27] which guarantees a
Intuitively, top NR nodes, SHP nodes and IM nodes are all related to the influence spread, which may cause their high similarity. However, top NR nodes usually possess a high degree and thus tend to connect with each other (rich-club effect) while IM nodes focus on collective influence and thus may tend to be scattered. Besides, removal of ND nodes leads to the fragmentation of the whole network and thus they may also scatter throughout the network. Moreover, since the activation of IM nodes causes the maximal of influence spread, the removal of them may also block the spread and thus dismantle a network, and so are the ND nodes.
Jaccard index between key node sets
Jaccard index between key node sets
As NR, IM and ND all identify the key nodes in a network, it is intriguing to check the intersections between them. We calculate Jaccard Index for key node sets identified by SHP, IMM and CHD with SNI as the baseline. We also calculate Jaccard Index for IMM and CHD to explore the similarity between IM and ND. The comparing size
Though SHP evaluates the importance of individuals by expected marginal gain, it still shows a high similarity with SNI. Surprisingly, CHD nodes and SNI nodes are also similar at a certain level while their motivations are irrelevant. Besides, even SNI and IMM are both designed based on the influence spread, there is not so much intersection between them. This somehow reflects that finding influential individuals and maximizing group influence are essentially two different tasks. Also, IMM nodes and CHD nodes are distinct from each other the most even they are similar intuitively.
Next, we calculate the top NR frequency for
Top SNI frequency of SHP, IMM and CHD.
Several findings can be taken from the above results. First, SHP nodes still show a high level of similarity with SNI nodes. Second, CHD nodes appear more frequently in top SNI nodes than IMM nodes, which implies that the nodes requested removal to dismantle a network are usually located in top ranks. At last, even provided with an adequate amount of top SNI nodes (e.g. top 10% for FB, UA, CG and BK, top 20% for AL, EM, YE and CC), the main body of IMM nodes is still unable to be covered. This also reveals that IM concerns more about how much marginal gain could an individual bring rather than how much influence spread it could achieve by itself.
In view of IM, we first run the Monte-Carlo simulations 10,000 times to approximate the influence spread of each key node set with a size of
Influence spread of key node sets
Influence spread of key node sets
Undoubtedly, IMM nodes achieve the best influence spread as that is what IM algorithms designed to maximize. Besides, due to that Shapley centrality measures the expected marginal contributions of an individual over arbitrary permutations, SHP nodes perform much better than SNI and CHD nodes. SNI and CHD are equally poor in influence spread even though their nodes possess different characteristics. Top SNI nodes possess high individual influence while CHD nodes are comparatively scattered in the networks. We may infer that just having the high individual influence alone or the scattered distribution of a key node set alone is not sufficient for maximizing the influence spread. Both are needed.
Then, we look into the relations within each key node set (i.e. the number of edges that connect these key nodes, as calculated in Eq. (7)) and how it affects the influence spread. The more relations in a node set, the more rich-club effect it may bring, and thus reduce the influence spread. For easier comparison, we show the ratios rather than the actual numbers of relations within SNI, SHP and CHD nodes to that within IMM nodes.
Ratios of relations between key node sets.
Obviously in Fig. 4, IMM nodes have much fewer relations than SNI, SHP and CHD nodes. The average ratios of SNI/IMM, SHP/IMM and CHD/IMM are 2.59, 1.83 and 1.85 respectively. SNI nodes have the most relations and it somehow explains why they act worse in terms of influence spread even though SNI is designed on the basis of single node influence. However, the relations in SHP and CHD nodes are rather close while SHP is much better than CHD in terms of the influence spread. It has strengthened our inference that single node influence and node distribution are both important for IM, lacking either of them will fail in this task.
The dismantling ability of SNI, SHP, IMM and CHD nodes are checked and shown in Table 5 with a size
Dismantling ability of key node sets
Dismantling ability of key node sets
Predictably, CHD is the best at dismantling a network as that is what ND algorithms designed for. Surprisingly, SHP beats both SNI and IMM in terms of the dismantling ability yet SHP nodes have shown to be similar with SNI nodes (regarding Jaccard index and top SNI frequency) and IMM nodes (regarding influence spread). IMM nodes are worst at this task, which once again reveals that IM and ND are totally different even that their tasks are similar intuitively.
The dismantling ability of CHD nodes in Table 5 is far better than other methods. The reason is that CHD contains the reinsertion step, which is a critical step for the ND task. Therefore, it is only fair for SNI, SHP and IMM to dismantle a network with the reinsertion step just like CHD. Thus, we check the minimum removal for SNI, SHP and IMM. Practically, we keep removing top SNI, SHP and IMM nodes until
Minimum removal of key node sets
In Table 6, the number indicates how many nodes that need to be removed to dismantle the network with the reinsertion step. The less the number, the better the performance. Though with reinsertion, IMM is still worse at this task. SNI and SHP show good performances and they are sometimes even better than CHD. We also include another ND algorithm, i.e. Recalculated Betweenness (RBW) [38] which achieves the smallest minimum removal in comparison. The RBW has shown to be superior among ND algorithms while rather time consuming [20]. As already revealed by Jaccard Index, NR nodes are more similar to ND nodes, and thus dismantling a network by removing top SNI and SHP nodes is kind of feasible.
Overall, several significant findings could be drawn from the whole experiment above. First, NR only focuses on the evaluation of every single individual while IM and ND both tend to achieve different kinds of collective effects. The nature of these two kinds of tasks is basically different. Second, though NR and IM are both designed to pursue the influence spread, these two key node sets are not that similar. Merely identifying influential nodes makes no sense for IM task as IM actually concentrates on the marginal gain of influence spread, which may result in a scattered node distribution. Third, though both related to information dissemination, IM and ND are unable to swap their jobs. IM nodes are worse at dismantling a network and ND nodes are also not good at maximizing the collective influence. At last, Shapley centrality shows a surprising capacity that SHP nodes are similar with SNI, IMM and CHD nodes simultaneously, and they could be applied to all of the three tasks with good performances.
Intrigued by these phenomena, we are curious about what causes all of the above. Thus, in this subsection, we focus on these key node sets themselves rather than their devoted tasks. Some more experiments are conducted to further explore the features of these different key node sets, including degree distribution, subgraph feature, and key nodes visualization.
Degree distribution
As the degree is practically the most basic and important centrality, we investigate the average degree of key node sets identified by SNI, SHP, IMM and CHD. For each network, we also select
Distribution of average degree for key nodes.
The distribution of average degree for these key nodes on 20 networks shows a strong regularity. Apparently, the
Whether for these key node sets themselves or for their neighbors, the differences between SNI, SHP, IMM and CHD are not that large. The reason may be that there is a bunch of high-degree nodes that are identified as key nodes in all four methods. As we are more interested in the differences between these different key node sets, we calculate the overlapping part of these four sets and then separate them from each set. Similar to Fig. 5, we check both the average degree and the mean value of the neighbors’ average degree for non-overlapping key nodes in each method. For better comparison and analysis, we also check these two measures for the overlapping part. Corresponding results are shown in Fig. 6, where OVP represents the overlapping nodes.
Distribution of Average Degree for Non-overlapping Key Nodes. OVP denotes the overlapping part identified by SNI, SHP, IMM and CHD simultaneously. The 
In Fig. 6, the differences between the four key node sets become more evident when the overlapping part is excluded. The average degree of these key nodes still follows SNI
In addition, we calculate the mean value of the two measures on twenty networks. Results are shown in Table 7. The “Case” indicates that whether the overlapping part is excluded from the key node sets (“OVP” means the overlapping part is excluded and “NOVP” means the opposite).
Mean distribution of average degree for key nodes on all networks
Generally, these mean results are in accordance with our former observation. When the overlapping part is not excluded (first two lines in Table 7), SNI shows the highest value in both two measures, proving that most SNI nodes are these clustered high-degree nodes in the networks. In contrast, IMM is the lowest in both. Compared with CHD, the degree of SHP nodes is even higher while the degree of their neighbors is somehow lower. Hence, there are more high-degree nodes in SHP while these nodes may not be so clustered. When only concentrating on the non-overlapping key nodes (last two lines in Table 7), the
For SNI, SHP, IMM and CHD, we extract their corresponding subgraphs constituted of only key nodes and edges between them. Therefore, on each network, the number of nodes in four subgraphs is the same while the number of edges varies.
First, to check the clustering level of different key nodes, we count the number of triangles in these subgraphs. As already shown in Fig. 4, the SNI nodes have the most relations. Thus, analogously, we calculate the ratios of triangle number within SHP, IMM, CHD to that within SNI. Results are shown in Fig. 7.
Except for Dolphins network, SNI always has the largest number of triangles, showing a high clustering structure. As previously inferred, IMM nodes are scattered and relatively lower in degree. Thus the number of triangles in IMM is always the smallest (averaging 0.2791). Since the number of relations in CHD is a little more than that in SHP, it is interesting to see that the number of triangles in SHP (averaging 0.6765) is slightly bigger than that in CHD (averaging 0.6101) in most cases. As shown in Table 7, although SHP nodes have a higher average degree than CHD nodes, their neighbors are relatively lower in degree than the neighbors of CHD, which leads to our previous assumption that SHP nodes do not tend to link with each other. In summary, SHP nodes have a high average degree, fewer edges but with more triangles, and lower-degree neighbors. We thus infer that SHP nodes consist of two parts, one is these nodes that are relatively clustered in the high-degree area of a network, and the other one is some high-degree nodes that are rather isolated. This could explain even with a high degree and a larger number of triangles, SHP nodes still possess lower
Number of connected components in key-node subgraphs
Number of connected components in key-node subgraphs
Distribution of triangles in key-node subgraph.
Then, to further explore the topology structure of these subgraphs, we calculate the number of connected components in each subgraph as well as the size of their largest one. Results are shown in Table 8. The digit denotes the number of connected components and the fraction in brackets stands for the ratio of the size of the largest connected component towards
As previously pointed out, SNI nodes are those clustered high-degree nodes and IMM nodes are rather scattered. Therefore, the number of connected components in SNI subgraphs is the smallest and that in IMM subgraphs is the biggest. Also, the largest connected component in SNI subgraphs is pretty huge, almost always bigger than 0.99. In contrast, the largest connected component in IMM subgraphs is smaller. For SHP and CHD, the size of their largest connected components is close. However, in most networks, the number of components in SHP subgraphs is bigger than (or at least equal to) that in CHD subgraphs. It agrees with our inference that SHP nodes consist of a majority of clustered nodes and some totally isolated nodes.
Key nodes visualization on airline network.
To study the distribution pattern of different key nodes in networks, we draw the topology of these networks and mark SNI, SHP, IMM, and CHD nodes with red, purple, green, and cyan. We also painting edges that link these key nodes with corresponding colors. As most of them showing the common patterns, we only exhibit two networks in this subsection, i.e., Airline network (Fig. 8) and CaGrQc network (Fig. 9).
Key nodes visualization on CaGrQc network.
Some common patterns can be drawn from Figs 8 and 9. First, most key nodes are located in the center areas of the networks. The center area usually consists of many clustered high-degree nodes. Second, SNI nodes are heavily clustered while IMM nodes are much more scattered. Compared with IMM, the distributions of SHP and CHD nodes are more similar to SNI, except that they are relatively sparser in the center areas. Last, although the distributions of SHP and CHD nodes show the most resemblance, there are a number of isolated nodes in SHP located in the outer layer of the network, especially in Fig. 9b.
In summary, NR nodes are mainly these clustered high-degree nodes in the center of networks. The key to node ranking algorithms is thus sorting these high-degree nodes more reasonably. IM nodes are scattered throughout the network and thus there are fewer edges between them. This also causes the average degree of IM nodes much lower since high-degree nodes tend to link with each other. ND nodes are actually clustered rather than scattered, indicating that network dismantling essentially extracts a “core chain” to break a network into pieces rather than finding the so-called “weak nodes” [14]. SHP nodes are quite special. They are similar to ND nodes in the center of networks while some of them are isolated just like IM nodes. Therefore, SHP nodes could simultaneously handle both IM and ND task.
Dolphins network
In this subsection, we take the Dolphins (DP) network as a specific case study to further analyze these key node sets and their features. Figure 10 shows the network topology as well as the distribution of different key node sets.
Key Nodes in Dolphins Network (
As in Fig. 10a–c, top 20 NR key nodes identified by SHP nodes totally overlap with either SNI, IMM or CHD nodes. In comparison, the top 20 IMM nodes are overall scattered while CHD nodes are more clustered in center areas. Due to the complicated network structures, it is unrealistic to cut connections between two parts by only removing a single node. Thus, it has to remove a group of clustered nodes locally to break down the topology. This phenomenon also explains why CHD nodes act badly at influence spread though they are the most similar with SHP nodes.
Although SHP nodes and SNI nodes overlap the most, they are indeed different if their orderings are taken into consideration. As revealed by Fig. 10d, top 5 SNI nodes locate in a limited scope of the network while top 5 SHP nodes are a lot more scattered. This phenomenon explains why SHP nodes could handle IM task well even they show a high similarity with SNI nodes.
In this subsection, we take a collaboration network consisted of 80 experts in Data Mining field as the second case study. These experts are selected according to their h-index from the AMiner5
The DBLP collaboration network in data mining field.
In Table 9, we show top 10 authors identified by SNI, SHP and IMM respectively. The number in bracket denotes the ranking of an author in terms of degree. There are 80 nodes in this network, and CHD includes 27 of them. Except for Svetha Venkatesh (the 8th author in IMM) and Rakesh Agrawal (the 10th author in SNI), all authors shown in Table 9 are identified by CHD to dismantle this network.
Top 10 authors from DBLP network identified by SNI, SHP and IMM
Intersection between one set and the union of the others
As shown by the rankings, SNI is essentially resorting top-degree nodes. Actually, the top 15 authors in SNI are also the 15 authors with the largest degree in this network. IMM always selects the node that could bring maximal marginal gain, thus some lower-degree nodes rank even higher than those high-degree nodes. For example, Johannes Gehrke, whose degree ranked 32nd, located at 5th in IMM because he has no collaboration with the first four authors. Also, Ee-Peng Lim, whose degree ranked 26th, located at 7th in IMM because he has only collaborated with Philip S. Yu in the first six authors. Although top 10 authors in SHP are also high-degree nodes, their orders are more in accordance with IMM rather than SNI. The sequence of the top 6 authors in SHP is almost the same as their sequence in IMM. Besides, these top 10 authors in SHP are all included in CHD to accomplish the network dismantling task.
As revealed by previous experiments, SHP nodes could be applied to NR, IM and ND tasks simultaneously with good performances. The reason may lies that SHP nodes totally overlap with the rest of key node sets. In other words, SHP may select the core parts from the union of them. Inspired by this assumption, we further check how many SHP nodes actually belong to the union of the other key node sets and how frequently that a key node derived from the intersection of SNI, IMM and CHD nodes is also a SHP node. For comparison, we also check these for SNI, IMM and CHD nodes in case that overlapping is a common phenomenon for all key nodes. Results are shown in Tables 10 and 11.
Intersection of the other three key node sets
Intersection of the other three key node sets
Correlations between key node sets.
In Table 10,
According to the results, we describe the relationship between these four key node sets via a Venn diagram in Fig. 12. Obviously, almost all SHP nodes belong to the shaded part and almost all shaded part with meshes belong to SHP nodes, indicating that SHP nodes are actually a combination of core key nodes from NR, IM and ND tasks.
In this paper, we focus on key nodes and their features in social networks. Particularly, three popular research fields that identify key nodes are taken into considerations, namely influential node ranking (NR), influence maximization (IM) and network dismantling (ND). However, we find that non-expert readers may easily confuse between these three close domains and none of the existing studies aim to provide a set of general key nodes. Besides, though many studies have been devoted to finding key nodes in these three fields, the correlations between their identified nodes remain unknown. Thus, we conduct an in-depth survey of key nodes in social networks, so as to make researchers get a clear idea of different key node-relevant tasks and improve their deep understanding of these key nodes. First, we clarify and formalize the three tasks under a uniform standard, as well as summarize the features of their identified key node sets. We then propose a fair comparison system that could generally evaluate a key node set from different aspects. Through comprehensive experiments, we explore correlations between some representative key node sets and give insightful conclusions to the three tasks behind them.
NR focuses on evaluating an individual’s independent influence while IM and ND tend to achieve collective effects, which indicates they are basically different. NR nodes mainly consist of these clustered high-degree nodes in the center of networks. The key to NR algorithms is thus sorting these high-degree nodes more reasonably. Although NR and IM both make use of influence spread, it is not appropriate to address influence maximization as finding influential spreaders/nodes. Actually, IM pays much more concern on the marginal contributions of a node, leading to that the influence spread of selecting Even IM and ND are similar intuitively and both information dissemination related, i.e. IM aims to promote the dissemination while ND aims to block it, they are completely unable to swap their jobs. IM nodes are scattered through the networks while ND nodes are much more clustered. Actually, ND usually extracts a “core chain” to break a network into pieces rather than finding the so-called “weak nodes”.
Furthermore, we find that key nodes identified by Shapley centrality are practically a combination of core key nodes from NR, IM and ND algorithms. SHP nodes consist of two parts, one is nodes that are relatively clustered in the high-degree areas like NR and ND nodes, and the other one is some high-degree nodes that are rather isolated as IM nodes. Therefore, SHP nodes could be applied to the three tasks simultaneously with good performances.
Footnotes
Acknowledgments
This work was supported by the Fundamental Research Funds for the Central Universities (No. 2022QN1093).
