Abstract
In recent years, the problem of influential spreader identification in complex networks has attracted extensive attention as its fundamental role in social network analysis, rumor controlling, viral marketing and other related fields. Centrality measures that consider different scales of neighborhood are commonly utilized for ranking node influence. The 2-hop neighborhood of the target node is deemed a suitable evaluation metric. However, as the network scale expands, only considering 2-hop neighborhood is overly restrictive. Furthermore, the interconnections among nodes are often disregarded. In this article, a new method named Limited Spreading Domain (LSD) is proposed to identify influential spreaders. LSD defines the target node’s 2-hop neighborhood as basic domain and takes the neighbors who are 3–6 hops away from target node as extended domain. The influence of target node is modeled as diffusion along the paths with limited length in basic domain and extended domain. A series of experiments are conducted in eight real complex networks and results demonstrate that LSD outperforms common centralities in terms of accuracy, stability,distinguishability and scalability.
Introduction
Spreading phenomena are ubiquitous in complex networks, such as the diffusion of viruses within crowds, the propagation of specific information across social media platforms, the traffic flow in cities, etc. Nodes with great influence often play a key role in expanding dissemination or turning public opinion [1]. Therefore it is essential to identify influential spreaders in order to discover opinion leaders, control rumors or infectious diseases, promote products and more. A series of methods have been proposed in the field of complex networks to discover nodes with strong spreading ability [2].
To discover influential nodes, the first step is to determine an appropriate method for estimating a single node’s spreading potential within the network. Centrality is a widely used metric for evaluation. The difference of various centralities, such as degree, semi-local centrality, closeness, primarily stem from the different scale of neighborhoods they consider. Consequently, their performance varies accordingly [3]. In [4], a centrality considering the number of neighbors within a 2-hop distance from the target node is concluded to be a good metric as it balances the efficiency and effectiveness. Ref. [5] develops a method to discover the optimal evaluation radius of the neighborhood for a given network. The aforementioned studies indicate that the performance of the method in estimating nodes’ influence is influenced by the scale of neighborhood considered. However, the 2-hop neighborhood may not always be the optimal choice. For example, the efficacy of centralities is contingent upon the probability of spreading [6]. Semi-local centrality and closeness centrality, which take into account a larger scale of neighborhood, are more suitable for large spreading probabilities [7]. There are other factors that can impact the performance of centralities, such as taking into account the interconnectivity among neighbors [8].
Inspired by above observations, we propose an effective method for estimating nodes’ influence and identifying influential ones in complex networks. Firstly, we hold the view that, in general, the influence of each node is primarily confined to its 2-hop neighbors, and it is imperative to accurately calculate this aspect of influence. We define the target node’s 2-hop neighborhood as basic domain. At the same time, we also consider the scenario where the influence of the target node spreads further through the propagation of certain neighbors within a 2-hop range. Therefore, we consider the nodes within a 3–6 hop range from the target node as extended domain. The influence of target node is comprised of basic domain influence and extended domain influence, which can be adjusted by a tunable parameter. The Susceptible-Infected-Recovered (SIR) model is employed to obtain the simulation results that serve as the ground truth of spreading. Experimental results in eight real complex networks demonstrate that the proposed method is more accurate and stable in estimating nodes’ influence and identifying influential ones. Our approach enables a more precise differentiation of nodes’ impact compared to common centralities. Our approach enables a more precise differentiation of nodes’ influence.
The main contributions of this article are as follows: A novel method for identifying influential nodes based on local structure is proposed, which defines the target node’s 2-hop neighborhood as basic domain and takes the neighbors who are 3-6 hops away from the target node as extended domain. Different evaluation details are adopted for local domain and extended domain to achieve a balance between evaluation accuracy and calculation time. Experiments in eights real complex networks show that the proposed method has better accuracy, discrimination and scalability with acceptable running time. It performs better under inhomogeneous spreading probabilities.
The rest of this article is outlined as follows: Section 2 presents a review of related research on influence evaluation. Section 3 gives a concise overview of some common centralities for comparison. In Section 4 we explicate the details of our approach. Section 5 focuses on the experimental results. Finally, in Section 6, we summarize the core research contents and draw conclusions.
Related work
The research fields of complex networks such as community detection [9–10], information dissemination [1], and identification of influential nodes [2], have been increasingly emphasized in recent years. The spreading ability of a node can be considered as its influence. Centralities, which indicate the topological significance of nodes, are naturally regarded as measures of their spreading ability. Degree is a widely used local centrality measure that quantifies the number of directly connected neighbors to a target node [11]. Although simple to calculate, it may be misleading in some cases due to the limited topology information considered [12]. The research findings suggest that expanding the scale of the target node’s neighborhood can be a feasible approach to improving the accuracy of identifying influential spreaders [13, 14]. Semi-local centrality, which expands the neighborhood scale to 4 hops, is a typical example. Other centralities based on degree consider additional network properties such as clustering coefficient [15], closeness among neighbors [16], structural hole [17], or edge weight [18].
Commonly used global centralities, such as betweenness [15], closeness [19], pagerank [20], and Leaderrank [21], consider the node’s position in the entire network and are effective in many scenarios. However, these measures have high computational complexity and require knowledge of the whole network structure [22]. Kshell centrality is a widely used and simple global centrality measure, but its poor distinguishability due to the limited number of rank values assigned to many nodes has been recognized [8, 23]. To address this issue and enhance its performance, various improvements have been proposed [24–26]. The concept of gravity formula and information entropy has been used more and more frequently in influence evaluation in recent years. Ref. [27] combines information entropy with kshell, which effectively enhancement the distinguishability ability of kshell. Gravity centrality is a novel centrality measure that integrates both local and global information of a node to accurately capture the interplay among nodes [28].
With the expansion of network scale, obtaining a complete network structure that global centralities need becomes increasingly challenging. In such circumstances, methods based on local structures are proposed as an alternative. Furthermore, certain local centralities, such as semi-local centrality, exhibit comparable or even superior performance to global centralities in complex networks.
Recently, it has been suggested that a neighborhood within 2 hops is an appropriate local topology scale for evaluating node influence. However, this conclusion may not always hold true as the spread of influence depends on various factors. When the spreading probability is large, the influence of the target node will propagate far beyond two hops. What’s more, connections among nodes should be considered. However, expanding the scale or considering more factors inevitably leads to an increase in computational complexity, which is not conducive to large-scale networks. Therefore, this article aims to propose a method that can balance accuracy and efficiency.
Preliminaries
In this section, some common centralities are introduced briefly, including degree, semi-local centrality, betweenness, closeness and kshell. As the 2-hop neighborhood of a target node is often crucial in many networks, we also define nearest and next-nearest centralities here. Two novel methods based on gravity and entropy are also introduced here. The complexity of all these methods is listed in Table 1.
The computational complexity of nine commonly used methods
The computational complexity of nine commonly used methods
Consider a complex network G = (V, E) that is undirected and unweighted, consisting of n nodes in V and m edges in E.
(1) Degree centrality (DC) captures the 1-hop neighborhood by quantifying the number of nodes directly connected to the target node [3].
(2) Nearest and next nearest neighbor centrality (NC) includes the 2-hop neighborhood of the target node. The definition is as follows:
(3) Semi-local centrality (SLC) incorporates a 4-hop neighborhood and is proposed as a trade-off between degree and global centralities [13].
(4) Betweenness centrality (BC) is a global metric that measures the importance of a node based on its ability to connect other nodes through the shortest paths. s st represents the shortest paths’ number between any two nodes except the target node. s st (u) means the nunber of shortest paths between all other nodes that node u is included [16].
(5) Closeness centrality (CC) quantifies the distance from the target node to other nodes [16]. It defines as follows:
(6) Kshell centrality (KS) is obtained by kshell decompositon [8]. Initially kshell decomposition method removes all the nodes with degree DC = 1. When the degree of nodes remained is all larger than 1, this step stops and all the removed nodes are assigned with kshell value KS = 1. Then kshell decomposition iteratively removes nodes with DC = 2 and assigns them with KS = 2 in a similar way. Repeately perform this procedure until all the nodes in G are assigned to one of the KS values.
(7) PageRank (PR) is a well-established algorithm for sorting web pages [20], which can be used to solve the node importance sorting problem in complex networks.
Where n is the number of nodes, N v is the number of v’s neighbors, k u is the number of nodes pointed by u.
(8) Gravity Centrality (GC) is proposed based on gravity formula: the gravitational force between objects is related to mass and distance [28]. In complex networks, the mass of node u can be represented as |N (u) |, which means the degree of u. The distance between node u and v can be expressed as d uv , which is the length of the shortest path connecting these nodes. is the average shortest distance of the network.
(9) Improved k-shell (IKS) divides the nodes into different shells according to the node information entropy [29]. The node entropy is defined as:
Where N (v) denotes the 1-hop neighbors of node v.
We already know that 2-hop neighborhood of target node is regarded as a good choice for measureing nodes’ influence in most networks [4], which implies that nodes with a higher number of neighbors within 2 hops exhibit greater potential for information dissemination. However, relying solely on the number of neighbors within 2 hops to measure nodes’ influence is too restrictive in rapidly expanding networks. The target node’s influence will spread far beyond 2 hops with the increase of spreading probability. Expanding the scope of evaluation can enhance ranking accuracy, it may also lead to an increase in computational complexity. Closeness Centrality, which assumes that a node’s influence spreads along the shortest paths to all the nodes it can reach, performs well at large spreading probabilities. However, its computational complexity is unacceptable for large-scale networks. In addition, the connectivity of neighbors within a certain scale should also be taken into consideration, as it directly affects their probability of being activated by the target node. In short, our aim is to propose a method that can effectively expand the scope of evaluation, take into account the interconnections among nodes within this scope, while minimizing computational complexity.
Taking inspiration from the aforementioned opinions, we raise a method named Limited Spreading Domain (LSD for short). As illustrated in Fig. 1, this method defines the neighborhood within 2 hops of the target node as Basic Domain (BD for short). The influence of a target node u ∈ V within basic domain Inf BD (u) is estimated by summing up its activation probabilities to all u’s 1-hop and 2-hop neighbors as follows:

The influence evaluation scope of target node u. The area within the solid circle represents the basic domain, and the area within the dotted circle represents the extended domain.
If w is conncted to u directly, then Inf (pathu→w) = p uw . If w is connected to u by an intermediate node v, then Inf (pathu→w) = p uv p vw . The set of nodes directly connected to u is denoted as N (u), while the set of neighbors of node v excluding u is represented by N (v) ∖ u. p uv is the spreading probability from node u to node v.
In addition to basic domain, we also consider the neighbors of target node in a wider scope, which is referred to as Extended Domain (ED for short). Firstly we consider the scope that the neighbors can be reached in 3 or 4 hops from the target node u. The number of nodes located at a distance of 3 or 4 hops away from the target node is typically large, and accurately calculating their influence on other nodes when activated by the target node requires significant time investment. For the sake of simplicity, we regard their influence in the basic domain as their approximate influence. As a result, the extended domain encompasses the neighborhood within 3–6 hops from the target node. These remote nodes are typically influenced by the secondary spread caused by neighboring nodes of u. The nodes can be reached in 3 or 4 hops (the gray nodes in Fig. 1) are the neighbors or neighbors’ neighbors of node w, w is the 2-hop neighbor of target node u. Therefore node u’s extended domain influence
In summary, the calculation formula for target node u’s influence is:
λ is a tunable parameter. The influence ratio between BD and ED can be adjusted by λ, thereby impacting the evaluation results.
There are specific regulations that can be followed when making a selection for λ. Given the known spreading probability, we derive the baseline parameter
The selection of λ should be explored by varying the order of magnitude based on
Algorithm 1 presents the pseudocode for LSD method, where N (u) refers to all the 1-hop neighbors of node u and NN (w) refers to the set of w’s neighbors within 2 hops. Inf (u) represents node u’s total influence. Finally according to Inf (u), we get the ranking list.
According to Algorithm 1, LSD needs to traverse the neighbors within a radius of two hops for each node to compute its Inf BD (u), and this procedure costs (lines 1–4). is the averge degree of the network. The next step involves traversing nodes within a 2-hop range to identify the neighbors’ neighbors of node u, and the complexity of this part is still . For each node w ∈ N (v) (line 11), the complexity of line 12 and line 13 is O (1) owe to the pre-stored value of NN (w) and Inf BD . Hence the complexity of lines 5–17 is , and the total complexity of LSD amounts to . The complexity of LSD is the same as SLC, but less than BC and CC which is O (n3).
The network characteristics of eight real networks
Experimental environment and datasets
The experimental hardware environment consists of an Intel(R) Core(TM) i7-6567U CPU with 7.58 GB of memory, while the software environment is Matlab R2013a.
In this article, LSD is compared to the centralities introduced in Section 3 in eight real complex networks. The detailed topological characteristics of these networks are summarized in Table 1. The characteristics include node number n, edge number m, average degree , clustering coefficient C, average shortest distance , assortative coefficient r, degree heterogeneity H (defined as) and epidemic threshold
SIR model
Due to the difficulty in tracking the actual process of influence spreading, simulations based on spreading models are generally used to determine the influence of a specific node or set of nodes. The SIR (Susceptible-Infected-Recovered) model is a well-known epidemic model that has also been widely applied to simulate the diffusion of information or influence, particularly in the realm of node influence evaluation. Therefore, the SIR model is utilized in this article, and the outcomes produced by numerous simulations based on SIR typically serve as the benchmark for evaluating different methods.
In the initial stage, most nodes are in Susceptible (S) state except for the target node, which is in Infected (I) state. When the spreading starts, at each timestamp, each node in state I will visit its 1-hop neighbors one by one and identify the status of these neighbors. Upon encountering a neighbor in status S, it will attempt to infect this neighbor with probability p. p represents the probability of successful infection, which is also called spreading probability. The infected nodes then enter Recovered (R) state with a probability μ. The value of μ is explicitly assigned as 1 for the sake of simplicity. The recovered nodes are immune to both infection and transmission. This setting of μ is also more consistent with the characteristics of information or influence dissemination, as this process typically does not recur after message reception or influence exertion. Recovered nodes will neither be infected by other nodes nor infect others. The process terminates when no new infections occur in the network.
In order to reduce uncertainty, the entire process needs to be simulated for many times. In our experiments, we conduct 5000 simulations for each node in the network, and the influence of each node is determined by the average number of nodes in the final state R for all the simulations.
The spreading probability p in SIR model exerts a profound impact on both the spreading speed and scope. It is necessary to pre-assign a p value to each edge within the network. A typical approach is to assign a uniform value manually to all edges. Small values are typically utilized to avoid wide spreading caused by a high probability of diffusion, which renders the origin irrelevant [8]. Refer to general practices we set p ∈ (0, 0.1] in our work. It cannot be overlooked that each network has a specific value, known as the epidemic threshold
Evaluation metric
(1) Kendall coefficient τ
Kendall coefficient τ is usually used to evaluate the similarity of two sorted lists with the same length. It is widely adopted to measure the accuracy of various influence evaluation methods. Consider two lists X and Y, which denote node lists ranked by a specific centrality and SIR simulations respectively. The rank vlues of node i and j in X are x i and x j , and in Y are y i and y j respectively. For pairs (x i , y i ) and (x j , y j ), if (x i - x j ) (y i - y j ) > 0, then they are considered as concordant. if (x i - x j ) (y i - y j ) < 0, they are considered as discordant. The number of concordant and discordant pairs are represented with n c and n d respectively. Then τ is calculated as:
(2) M monotone function
M monotone function [37] is often used to reflect the measure’s distinguishability. The calculation method is as follows:
where R represents the ranking list obtained by a certain measure, r represents a centrality value in the ranking list, n is still the nodes’s number of the network, n r is the occurrence number of element r in the ranking list. The value range of M is between 0 and 1, where M = 1 implies that the network’s nodes can be distinguished totally, and M = 0 implies that the centrality value of each node is the same, and cannot be distinguished in any manner.
(1) Parameter λ
According to the value of

The Kendall coefficient τ values between ranking results produced by LSD with different λ and the ranking results produced by SIR model in eight real networks at various spreading probabilities p. In each of the figures, the epidemic threshold
(2) Effectiveness in ranking the nodes
The effectiveness of LSD is evaluated by comparing its τ values with those of DC, NC, SLC, BC, CC, KS, GC and IKS under various p. The parameter λ in LSD is set to be 100 in all the networks as we have found that λ = 100 provides the optimal result on most spreading probabilites.
Based on the results shown in Fig. 3, it can be concluded that LSD outperforms other centrality measures under all spreading probabilities in Netscience, Blog, Router and Amazon. In Email, Lastfm, PGP, the situation is slightly complicated. When p is small (

The Kendall coefficient τ values between ranking results produced by DC, NC, SLC, BC, CC, KS, GC, PR, IKS, LSD with λ = 100 and the ranking results produced by SIR model in eight real networks at various spreading probabilities p. The epidemic threshold
(3) Identifying influential nodes
In many cases, nodes with great influence have more important values. Therefore, we assess the efficacy of LSD to identify top-L (L varing from 1 to 100) influential nodes. The average number of nodes infected by top-L influential nodes which are measured by DC, NC, SLC, BC, CC, KS, PR, GC, IKS LSD (λ = 100) in eight real networks when

The average number of nodes infected by top-L influential nodes which are measured by DC, NC, SLC, BC, CC, KS, GC, PR, IKS and our method LSD (
(4) Distinguishability
In this part, we aim to compare the distinguishability of various methods. Distinguish-ability of a method mainly depends on whether this method can distinguish the influence of different nodes. If multiple nodes are assigned the same rank value, they will be deemed to have same influence.
Table 3 presents the M values of lists ranked by DC, NC, SLC, BC, CC, KS, PR, GC, IKS, LSD (λ = 100) when
The M values of lists ranked by DC, NC, SLC, BC, CC, KS, GC, PR, IKS and LSD (λ = 100) when
In order to further examine the distinguishability of each measure in more detail, we show the frequency distribution of ranks generated by DC, NC, SLC, BC, CC, KS, PR, GC, IKS and LSD in Fig. 5. Frequency of a rank refers to the number of nodes that are assigned this rank by the centrality measure. The poor performance of KS in each network is not unexpected due to its limited range of rank values. Many nodes are assigned identical KS values, rendering their influence indistinguishable. DC also suffers from this issue, which results in the influence measured by DC is not well distinguished. SLC and LSD have more rank values which implies that they can distinguish nodes’ influence better. Especially in SLC and LSD, the frequency of top-ranked nodes is 1, which indicates these nodes are seperated completely. It should be noted that LSD outperforms SLC in distinguishing nodes at larger rank values, indicating its superior performance under such conditions.

The frequency distribution of ranks generated by DC, NC, SLC, BC, CC, KS, GC, PR, IKS and LSD (λ = 100) when
(5) Effectiveness under inhomogeneous spreading probabilities
In practice, the strength of relationships between nodes varies according to their level of intimacy, thus affecting the spreading probability between them. LSD takes into account the different probabilities when calculating the spreading paths. To verify the performance of LSD under inhomogeneous p, we compared the τ values of LSD with other methods in Table 4. The specific approach involves assigning a randomly generated value ranging from 0 to 0.1 to each edge within the network, which represents the spreading probability between these two nodes in SIR model. Table 4 reveals that LSD exhibits the much larger τ value than all the other methods in all datasets except Delicious. In Delicious, LSD performs slightly worse than SLC. This experiment demonstrates that the LSD method has significant advantage in networks with inhomogeneous spreading probabilities.
The τ values of all the methods in networks with inhomogeneous spreading probabilities
(6) Running Time
In small-scale datasets, there is no significant difference in the running time for each method. Moreover, low complexity algorithms such as DC, KS, GC and IKS exhibit relatively shorter running times across different datasets. Therefore we only compare the running time of SLC, BC, CC, LSD in four larger datasets. Fig. 6 shows the running time of these methods. The computational time of LSD is demonstrated to be comparable to that of SLC in Fig. 6, as they share the same level of time complexity. However, LSD runs slightly longer than SLC. In contrast, BC and CC have higher time complexity than SLC and LSD, resulting in much longer computational time.

The running times of SLC, BC, CC, LSD in Lastfm, Amazon, PGP and Delicious.
Identifying influential spreaders is a fundamental task in the field of complex networks. Centralities with different scale of neighborhood are proposed successively to evaluate the nodes’ capacity of spreading. Among them centrality that considers the number of 2-hop neighborhood is considered to be a good metric as its good measurement effect and moderate complexity. When evaluating influence, it is important to consider not only the number of neighbors but also their connectivity. In many cases, accuracy can be improved by taking a broader scope of neighbors into account. We propose the LSD method to identify influential spreaders by considering the 2-hop neighborhood as the basic domain and selecting neighbors within a 3–6 hop range from the target node as an extended domain. By adjusting λ, the ratio of basic domain influence and extened domain influence could be adjusted to to suit different spreading characteristics at varying probabilities. We conduct experiments on eight real networks and demonstrate the rules of discovering the optimal λ. Our results show that, compared to DC, NC, SLC, CC, BC, KS, PR, GC and IKS, LSD can more accurately rank node influence under most spreading probabilities and identify top influential nodes in these networks. We evaluate the distinguishability of various centralities using both M function and rank frequency distribution, and our findings demonstrate that LSD exhibits significant advantages. In addition, we also demonstrate that LSD has significant advantages under the inhomogeneous spreading probabilities, and the running time is acceptable.
In the future, we will conduct experiments on more networks and explore the effect of different characteristics of network, such as degree distribution, clustering coefficient or community structure, on the centralities with different scale of neighborhood. What’s more, the performance of different centralities under different spreading models will be further discussed.
Footnotes
Acknowledgments
This work was funded by Scientific Research Program of Tianjin Education Commission (2021SK141).
