Abstract
Influential node identification plays an important role in optimizing network structure. Many measures and identification methods are proposed for this purpose. However, the current network system is more complex, the existing methods are difficult to deal with these networks. In this paper, several basic measures are introduced and discussed and we propose an improved influential nodes identification method that adopts the hybrid mechanism of information entropy and weighted degree of edge to improve the accuracy of identification (Hm-shell). Our proposed method is evaluated by comparing with nine algorithms in nine datasets. Theoretical analysis and experimental results on real datasets show that our method outperforms other methods on performance.
Introduction
Complex networks indicate complex systems in the real world which is a structure model. Through analysis of complex networks, some important characteristics and rules will be mined and obtained. So complex networks analysis [1] has been widely used in informetrics, sociology, biology and software application, etc. Identifying the influential nodes is of importance due to effective application in complex networks. If these few important nodes(influential nodes) are deliberately damaged, the network may suffer serious losses, thus affecting people’s normal life. In practical complex networks such as social networks, traffic networks, and power networks, the influence nodes have been playing a vital role as the key nodes in the network. Nowadays, there are several types of metrics: local metric, global metric, location metric, and random-walk metric [2, 3]. The local metric includes typical degree centrality and similar metric [4, 5], et al., Due to neglect of the global structure, the local metric usually is unable to accurately evaluate the influential nodes. A global metric can overcome the fault of local metrics and evaluate the influential nodes from a global view. Typical global metric has betweenness centrality [6], closeness centrality [7], H-index [8], Bi-directional h-index(BDH) [9], Managing consensus based on leadership(MCL) [10], while the metric has high computing complex. Another metric is location metric, e.g typical metric K-shell [11] which is a fast node ranking method. The last metric type is random-walk metric, e.g Page Rank [12], HITS [13] and LeaderRank [14]. Hao W et al. [15] proposed an effective method based on network representation learning. The proposed method takes into accord not only the identification of overlapping communities in complex networks but also the factor of network topology and construction. Wen T et al. [16] proposed a novel fuzzy local dimension to rank the influential nodes in complex networks, where a node with a high fuzzy local dimension has a high influential ability. This proposed method focuses on the influence of the distance from the center node on the local dimension of the center node by a fuzzy set, resulting in a change in influential ability.
Inspired by the above methods, we propose an improved influential nodes identification in the complex network by adopting a hybrid mechanism based on information entropy and weighted degree of edge. Besides monotonicity [21], network efficiency [22] and correctness [21] are used to evaluate the effectiveness of the proposed algorithm. The experiments on nine real networks results indicate that our proposed algorithm performs better than seven other methods.
The contributions of this paper are summarized below: Several basic measures are discussed. Based on the k-shell method, we proposed an improved k-shell method which adopted a hybrid method of information entropy and weighted degree of edge. Our proposed method is evaluated by comparing with nine algorithms in nine datasets.
The paper is structured as follows: Section 2 gives out the introduction and overview of the influential node identification; Section 3 proposes the improved influential nodes identification method. Experimental results are shown in Section 4; section 5 concludes the paper and pointed out that the focus of future work.
Preliminaries
In this section, the basic measure is surveyed and the advantage and disadvantages of measures are discussed including degree centrality, betweenness centrality, closeness centrality, eigenvector centrality, K-shell, PageRank, HITS, LeaderRank.
The degree is the simplest measure method to evaluate the influence of a node on other nodes. Suppose D called degree, then D (i) indicates the degree of node i. D (i) is denoted by Equation (1).
Freeman et al. [6] proposed the betweenness centrality to evaluate a single node.
Where g
jk
refers to the total number of shortest paths, g
jk
(i) indicates the number of shortest paths connecting jk passing through i. To compare with another metric effectively, C
b
(i) usually is normalized as
Where v equals to (n - 1) * (n - 2)/2 which represents the number of pairs of vertices excluding the vertex itself. The higher betweenness of the node is, the more influential a node is. Meanwhile, the node is more important.
Node closeness represents the influence ability of a node on other nodes. The closeness centrality is interpreted as a metric indicating how long it will take to spread information from a node to all other nodes [19]. Closeness indicates the degree of close between a node and other nodes in the network. It is an evaluation metric based on the length of the average shortest path between a node and all other nodes in the graph. The formula to compute this metric is as follows,
Where d
ij
refers to the shortest distance from node i to j. Cc (i) also is normalized as
A central node should be one node connected to powerful nodes. The importance of a node depends not only on the number of its neighbors (that is, the degree of the node) but also on the importance of its neighbours. Paper [20, 21] proposed that eigenvector centrality was an important metric to evaluate the significance of the node. Eigenvector centrality is an evaluation metric of the influential nodes in the complex network. According to the mechanism that connections to high-scoring nodes contribute more to the score of the node than equal connections to low-scoring nodes, it assigns relative scores to all nodes in the network. Google’s PageRank is a modification of the eigenvector centrality. Due to the defects of undirected cyclic graphs [22], another closely related centrality measure Katz centrality is proposed to adopt the adjacency matrix to discover eigenvector centrality.
For a given graph G : = (V, E), |V| is number of vertices, let A = (av,t) be the adjacency matrix, i.e. av,t = 1, if vertex ν is connected to vertex t, and av,t = 0. The centrality score of vertex ν can be indicated by Equation (6).
Where M (v) is a set of the neighbors of ν, and λ is a constant. With a small rearrangement, this can be rewritten in vector notation as the eigenvector equation.
Generally speaking, many different eigenvalues λ exist an eigenvector solution. However, the additional requirement that all the entries in the eigenvector be positive implies (according to the Perron–Frobenius theorem) that only the greatest eigenvalue results in the desired centrality measure [6]. The v th component of the related eigenvector represents the centrality score of the vertex v in the network. Power iteration is a typical eigenvalue algorithm that can discover this dominant eigenvector. In addition, this can be generalized so that the entries in A can be real numbers representing connection strengths, as in a stochastic matrix [22].
The k–shell disintegration for a graph can be carried out iteratively. The degree of a node is defined as the number of arcs incident on a given vertex. Nodes with the same degree have the same k–shell index. Starting with nodes of degree one, we cut all these nodes off and the links from the graph. Nodes with degree one in the reduced graph are assigned k–shell index as one (that is, k = 1) and recursively cut off. Next nodes with degree two are considered and the same is done for k = 2 and so on until all nodes are cut off from the graph.
PageRank is firstly used in the google search engine to rank web pages [23], afterwards, it is used to rank nodes in social networks [24–26]. It should satisfy the following equation,
Where L (j) = ∑ i α ji is the number of neighbors of node j (or the number of outbound links in a directed graph). Different from eigenvector centrality and Katz centrality, PageRank has the scaling factor L (j). In addition, the PageRank vector is a left-hand eigenvector (note the factor a ji has indices reversed).
HITS is an important link analysis algorithm like PageRank to rate web pages. While it is different from PageRank, HITS only can deal with small-scale web pages. Hubs and Authorities are main parameters that originated from a particular insight into the creation of web pages when the Internet was originally forming; that is, some web pages known as hubs were not authoritative in the information that they held, but it can let users direct to other authoritative pages. In other words, a good hub indicated a page pointed to many other pages, and a good authority represented a page was linked by many different hubs. The higher is the authority value of the web page, the higher is hub value. The higher is the hub value that points to the web page, the higher is its authority value. The variables of the two are balanced.
Recently, Lü et al. [14] proposed the LeaderRank algorithm to identify influential spreaders in directed networks, which is a simple variant of PageRank. It introduces a ground node that connects to every user through bidirectional links. Mathematically, this process is equivalent to a random walk on the directed network and is described by the stochastic matrix p with elements
Since this paper is based on an improved K-shell algorithm, to further introduce the K-shell, Figure in the references [17, 27] is presented in this paper. As depicted in Fig. 2, a schematic network is represented in Ref [17]. Although k-shell decomposition efficiently detects the influential nodes, the nodes in the same shell are not distinguishable by the k-shell index. In Ref [27], the nodes between the two outer rings compose shell 1 (ks = 1), while the nodes between the two inner rings compose shell 2 (ks = 2). The nodes within the central ring constitute the core, in this case, ks = 3.

K-shell.

Zachary network structure.
As discussed above, it is necessary to identify the core nodes which have a huge influence. In [17, 28], the k-shell method is applied into the identification of influential nodes. Meanwhile, because it is a coarse-grained identification method resulting in low identification accuracy, to address the problem, we will propose an improved k-shell method to improve the identification accuracy. The traditional K-shell method considers only the value of the node degree, ignoring the node itself and the influence between nodes. So the hybrid mechanism of information entropy and weighted degree of the edge is adopted in this paper. The mechanism both considers inference of node and between nodes. The details are as follows:
In this section, entropy e is introduced to evaluate the important nodes. Assuming to compute the proportion of node i, it can be expressed by Equation (10).
Where p
ij
represents the proportion of the node i in the j index. a
ij
indicates the value of the jth index of the ith object. According to the above p
ij
value, the information entropy e
j
for ith node will be given by Equation (11).
Where k= 1/ln(m), then the information utility value is expressed as u
j
.
According to u
j
value, the weight of node j is computed and it is indicated by Equation (13).
In addition, because the importance of neighbour node will impact the node itself, it should be considered. Let Ni and Nj respectively expressed as neighbour of node i and j. When Ni and Nj equal, which shows node i and j have the same neighbours. It can be defined as Ni ∩ Nj. The whole neighbourhoods of node i and j are represented as Ni ∪ Nj. The influence Coefficient of the edge is introduced in the paper and it is set as Iij which is expressed by Equation (14).
So the evaluation of edge should be added in the k-shell method. Let we (i) as the weighted degree of edge.
Where Iij represents the Influence Coefficient of edge for node i and j. ks(j) indicates the number of layers for node j through the k-shell method. w
ij
shows the sum of the degree of node i and j. w
j
represents the weight of node j.
Where d i and d j are respective show degrees of node i and j.
The pseudo-code of the function is shown in Algorithm 1.
In Algorithm 1, w j denotes the weight of node j. ks(j) is the number of layers for node j. wij indicates the sum of the degree of node i and j. Iij refers to the influence coefficient of the edge between node i and j. By computing w j , ks(j), wij, and Iij of all nodes, we (i) is obtained. According to the we (i) value, the rank_list is generated.
In this section, firstly the evaluation metrics are given out and used to evaluate our proposed algorithm compared with the rank methods.
Evaluation metric
We introduce three evaluation metrics, monotonicity [17], network efficiency [18] and correctness [17].
First, the monotonicity M of the ranking vector R is used to evaluate the resolution of different ranking methods, it is denoted by Equation (17).
Where R refers to the ranking vector, n is the size of ranking vector R, n r is defined as the number of nodes with the same rank r.
Network efficiency is another metric to evaluate network connectivity. Suppose the network efficiency is defined as γ
Where 1/d
ij
indicates efficiency between node i and j, d
ij
represents the shortest path between node i and j. N is the number of nodes. φ is the decline rate of network efficiency, which is denoted by Equation (19),
Where γ0 is the initial efficiency of the network.
Correctness is another evaluation metric. The correctness of the ranking methods is evaluated to adopt the Kendall τ of two rank vectors R1 and R2 as the metric. The rank correlation τ is expressed as follows:
Where R1, R2 are rank vectors, n
c
and n
d
are the number of concordant and discordant pairs respectively. Here, n
t
, t1 and t2 is defined as follows:
In the section, to evaluate our algorithm, it compared with some algorithms, such as DC, BC, BCC, KS, KSC, MDD, GDLF, BDH, and MCL. The experimental tool python is adopted to develop. In this paper, we mainly from the following aspects of the simulation experiments:
The real-world datasets used in the experiments are shown in Table 1.
Properties of the real-world datasets used in the experiments
Properties of the real-world datasets used in the experiments
(1)
Firstly, monotonicity is selected as an important parameter factor to evaluate the effectiveness of the proposed algorithm. The experimental results are presented in Table 2. It can be seen that the KS method is worst about monotonicity in all algorithms. The reason why that too many nodes are assigned the same KS value. DC method only considers the degree of the node. The experimental results are shown in Table 2.
M values for different methods on different datasets
(2)
Network efficiency is an important basis to evaluate network performance. The simulation results of the network efficiency of eight algorithms are shown in Fig. 3. It can be seen from Fig. 3 that with the increasing β, the network efficiency of eight algorithms was decreasing gradually. The algorithm proposed in this paper had the maximum γ value, which indicated that the algorithm in this paper owns the best network efficiency. DC method is the worst in all algorithms. The main reason is that our method uses the hybrid mechanism to improve the k-shell algorithm which can result in too many nodes in the same ks value.

Performance evaluation for network efficiency.
(3)
The correctness is also an important indicator to evaluate the network performance and reliability in a complex network. The comparison of the correctness of the eight algorithms is shown in Fig. 4.

Performance evaluation for the correctness.
It can be seen from Fig. 4 that with the increasing β, the correctness is most increasing gradually except dolphins dataset. The infection rate ranges from 0.1 to 0.5. The correctness of ten algorithms is ranked from the DC algorithm, GDLF algorithm, KSC algorithm, BCC, MDD, KS, BC, BDH, MCL, and then the algorithm proposed in this paper.
The identification of influential nodes in complex networks is a basic research problem. It is very important to identify the high influence nodes in networks. For example, it is often used to prevent cyber attacks, the spread of viruses on the Internet, the spread of infectious diseases among people, the spread of gossip in society. In the future, the effective identification of the influence of nodes will be particularly important, and its role in society will be growing.
Though many measures and identification methods have been proposed to identify the influential nodes, when faced with the current complex network, the existing methods are difficult to deal with and have low identification accuracy. To achieve the optimal identification results, this paper presented an improved k-shell method based on the information entropy and weighted degree of edge to identify the influential nodes. We can conclude some results through some experiments. Firstly, our algorithm can better evaluate the influential node. The experimental results indicated that our algorithm could guarantee the network-connected reliability and improved the network efficiency to some extent with relatively good stability and anti-interference. However, the proposed approach has a limitation in computational complexity. The next step is to apply the quantum artificial bee colony algorithm into the complex network and reduce the computational complexity of the Hm-shell.
Footnotes
Acknowledgments
The authors would like to thank the anonymous reviewers for helpful comments which helped them improve the technical quality of the paper. This paper is supported by Key Scientific and Technological Research Projects in Henan Province (Grand No. 192102210125) and Open Foundation of State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (SKLNST-2020-2-01).
