Abstract
The research of complex networks has become a hot research field, and has been successfully applied in many fields such as social network analysis, protein network analysis, and link prediction. In this paper, the traditional genetic algorithm has strong randomness and weak search ability in the community identification method. A complex network community recognition algorithm based on local optimization genetic algorithm is proposed. The method uses label propagation strategy to produce a certain precision. The initial population; a combined one-way crossover strategy is proposed to avoid destroying the original excellent community structure in the cross process; the local heuristic mutation strategy is used to ensure the accuracy of the identification community, and the overall optimization ability of the algorithm is realized. Finally, the effectiveness of the algorithm is verified by experiments.
Introduction
At present, the research of complex networks has become a hot research field, attracting the attention of researchers from computer, mathematics, biology, physics and sociology. In recent years, scholars in different fields have proposed many kinds of classic community identification methods. For example, the algorithm based on split GN [1] is only applicable to small complex networks, CPM algorithm based on faction penetration [2], LC algorithm based on edge clustering [3], LPA algorithm based on label propagation [4], based on seeds The extended LFM algorithm [5], the BGLL algorithm based on modularity optimization [6] and the FN algorithm based on local modularity optimization [7]. At present, the community recognition method based on modularity Q (Modularity) has become one of the mainstream methods in the field of community identification. Such algorithms actually transform the community recognition problem into a function optimization problem. However, the Q function has the problem of not being able to identify the resolution limit of a very small community structure. The research shows that solving the maximum module degree problem is a NP complete problem. Therefore, genetic algorithm, as the most effective method to solve NP problem, has been widely applied to the field of community identification. For example, the LPA algorithm was proposed by Zhu et al. in 2002. It is a graph-based semi-supervised learning method. The basic idea is to use the tag information of the marked nodes to predict the tag information of the unmarked nodes. The advantage is that the operation efficiency is high, and the time complexity of the algorithm is O (n), where n is the number of edges in the network. The disadvantage is that there is a defect that the randomness is too high, which affects the quality of community identification. The CDGA algorithm proposed by Tasgin et al. [8] proposed a one-way crossover strategy to maximize the optimal community structure in the individual, and combined the neighboring node initialization population and the random mutation strategy to complete the community identification. The GANET algorithm proposed by Pizzuti [9] introduced the graph-based coding method into the field of community identification, and adopted the strategy of uniform crossover, random variation and elite selection to complete the community identification task. He Dongxiao et al. proposed CCGA algorithm, using string encoding as the encoding method of the algorithm, using Markov random walk method to initialize the population, combined with Q function to propose clustering fusion crossover strategy between multiple bodies, and using local The mutation strategy completes the community identification task. The MEME-NET algorithm proposed by Gong et al. [10] further processed the excellent individuals obtained by the genetic algorithm through the hill climbing algorithm to obtain a better community structure. Based on the genetic algorithm, Shang et al. [11] further optimized the excellent individuals through simulated annealing algorithm to obtain a better community structure. Based on the CDGA algorithm, Shi et al. introduced a single-way crossover operation based on string coding strategy to the graph-based coding strategy to design a genetic algorithm that can deal with large-scale networks [12]. The above mentioned algorithms are based on the community identification method of genetic algorithm, but there are still some defects such as slow convergence rate and easy to fall into local optimal solution, which affects the quality of community identification. The analysis is due to the randomness of the algorithmic operation and the use of destructive crossover and mutation strategies. However, genetic algorithms are not suitable for large-scale complex networks. The community identification problem based on genetic algorithm should find a suitable coding method, the coding method should be simple, and it is suitable for the operation of mutation and crossover operator, and need to pay attention to the design of the operator, the initial population and the elimination mechanism. At the same time, the design of mutation and crossover operation should consider ensuring that good genes can be transmitted to the next generation. Consider the diversity of the next generation.
From the above analysis, we can see that it is very important to explore the initial population method with higher accuracy and reasonable crossover and mutation strategy when using genetic algorithm to accomplish community identification task. Based on this, this paper proposes a complex network community recognition algorithm based on local optimization genetic algorithm. First, the use of tag propagation for population initialization improves the accuracy of initializing individual populations. Secondly, in view of the traditional cross-cutting strategy, it is easy to destroy the defects of the existing excellent community structure, improve the single-way crossover strategy, and propose a combined one-way crossover strategy. Finally, a local heuristic mutation strategy is proposed for the problem that the traditional mutation strategy has greater randomness and weak local optimization ability.
Community identification based on partially optimized genetic algorithm
The choice of objective function
The module Q-function is a quantitative index to evaluate the quality of the network community structure. At present, this index has been generally accepted by researchers in the field of community identification. Therefore, Q function is taken as the objective function of POGA algorithm. The main idea of Q function is to evaluate the difference between the size of the community structure and the stochastic network whose node degree is stable. If the Q function value is larger, the closer the internal connections in the community are, the higher the community structure quality is. On the contrary, the lower the quality of community structure, the module function
where
The average conductivity function is to evaluate the loose connection between communities. Its function is defined as follows.
Let
where
When using genetic algorithm to accomplish the task of community identification, the solution of the problem is transformed into the chromosomes or individuals in the genetic algorithm through coding operations. Coding operation is the basis of genetic algorithm to deal with the problem, is an individual phenotype conversion to genotype process, which determines the individual’s genotype through which decoding mode conversion to phenotype, while the subsequent crossover, mutation and other genetic operations The choice also has important implications. In complex networks, an individual’s code is an array or string of bits used to represent some sort of result, also called chromosomes. The position of a gene in a chromosome is called a locus or a gene and it also means that one of the complex networks Node, chromosome corresponds to a division of the complex network, the chromosome solution space corresponding to all possible methods of division, from the resulting solution space corresponding to the chromosome called coding; from one of the mapping of chromosomes into the solution space is called decoding. Currently, the coding methods used in community mining algorithms based on genetic algorithms are based on string-based coding [14] and on the basis of genetic code [15]. In this paper, we use a coding method based on the position of a gene to represent an individual in a population composed of several network communities, that is, an individual code represents the result of dividing an online community.
Initialize the population
Only when each individual in the initial population has a certain degree of accuracy is it possible to give full play to the advantages of genetic algorithm optimization. Therefore, based on the idea of label propagation, an iterative label propagation method is used to generate the initial population. The idea of label propagation is that in a complex network, each node should belong to the same community as most of its neighbors, which can be expressed as Eq. (3):
where
When using genetic algorithm to solve community detection problems, the cross-operation has some particularity. When cross-operation is performed, a group of closely related genes are crossed to the next generation instead of individual genes. The closely related genes represent a community. Therefore, you can’t use the traditional single point and multi-point cross method, but the use of single cross and multi-channel cross. The one-way crossover algorithm proposed by Tasgin et al. [8] in this paper is improved, which is called combined one-way interleaving. The aim of improvement is to make the new individuals after crossover retain the largest community structure characteristics of the intersection of two parents. The definition of the merged one-way crossover operation is defined as follows: Let
Mutation operation
For complex network community identification problems, genetic algorithms often use random variation, lack of necessary information for mutation operations, and lack of effective control over the evolutionary direction of mutation operations. As a result, individual evolutionary speed is slow and the optimization effect is not good. For the above reasons, genetic algorithms are difficult to apply to community identification of large-scale networks. In this paper, aiming at the complex network community identification process, a local dynamical heuristic function is constructed to inspire the evolutionary direction of mutation operation and thus improve the individual evolution speed and recognition accuracy. The POGA algorithm uses the conductivity function
From Definition 3, we can see that the relationship between node community identity change and its neighbor community internal degree. In mutation operations, a node is separated from the current community that contains it and then merges into a nearer neighborhood to the node. Analyzing this process from the viewpoint of network dynamics can be considered as the process of passing the community identifier from the neighbor community to the current node. I.e., the community identifier of node
where
From Definition 4, when
The local heuristic mutation algorithm based on
Algorithm: Local Heuristic Variation Strategy.
Input: Individual to be mutated
Output: Individual
1. for
2. List
3.
4. for
5.
6. if (
7.
8.
9. Change the position of
10. Return
The time complexity of the LHVS algorithm is
In order to preserve excellent individuals in genetic algorithms, a good population selection strategy is indispensable. Reference is made to the literature [13] for population selection. The basic idea is to select the
Procedure POGA
Input:
Output: C //The best online community structure
Begin
1. I
2. tempMaxQ
3. parentMaxQ
4. while tempMaxQ
5. parentMaxQ
6.
7.
8. tempMaxQ
9.
10. I
11. end while
12. C
End
In the POGA algorithm, the label propagation strategy generates a primitive population with a certain degree of precision to give full play to the superiority of the genetic algorithm. Adopt the combined one-way crossover strategy to avoid destroying the original excellent community structure and ensure the community structure to be more Excellent direction to optimize the use of local heuristic mutation strategy to ensure the accuracy of identifying communities, in order to achieve the overall optimization function of the algorithm.
Experimental results
Data set
The artificially generated benchmark networks in complex networks mainly include LFR benchmark, Newman benchmark and Ball-Karrer-Newman model. The real networks identified by the community are Karate, Dolphins, Polbooks and Football. Since the LFR benchmark [16] benchmark network is very similar to the statistical characteristics of the real network, this paper uses this benchmark network as the test data set of POGA. The related parameters are set as follows: the scale of the network is
Comparison of NMI values for each algorithm.
In order to validate the validity of the proposed algorithm, the POGA algorithm is tested on datasets of datasets and is compared with those based on split GN algorithm, FN algorithm based on local modularity optimization, label-based LPA algorithm, Genetic algorithm GANET algorithm, in community identification accuracy evaluation index NMI and community internal connection tightness evaluation index Q aspects of comparative analysis. Because the genetic algorithm needs to set the relevant parameters when it is running, the POGA algorithm also needs to set the general parameters of the genetic algorithm. Based on the reference [18] and the analysis, this chapter gives the parameter setting of the POGA algorithm. The population size
Comparison of Q values of various algorithms.
Figure 1 shows the comparison between POGA algorithm and other comparison algorithms in NMI. It is found that POGA algorithm can obtain higher NMI value and better community identification ability. Figure 2 shows the comparison of each algorithm in the module function Q. It can be seen that when the community results in the network are quite vague, but the module function obtained by the POGA algorithm is still very high, the POGA algorithm can identify the community Closely connected internal community structure. Through the analysis of each algorithm in LFR benchmark datum network, it can be seen that because of the randomness of GANET algorithm and LPA algorithm, the community recognition quality is affected. Therefore, the community structure is identified by these two algorithms not ideal. Although the GN algorithm has a higher quality of community structure identified in a network with a clearer community structure, its recognition performance is relatively weaker in a network with a fuzzy community structure. Although the FN algorithm obtains higher NMI and Q values in identifying communities, it is still lower than the POGA algorithm. Therefore, the proposed POGA algorithm has better NMI and Q than other comparison algorithms.
The complex network community identification method based on local optimization genetic algorithm proposed in this paper selects the modular Q function and the average conductivity function as the objective function, adopts the coding method of gene position adjacency, and uses the label propagation strategy to generate the original population with certain precision. Take advantage of the genetic algorithm to optimize the advantages; adopt the combined one-way crossover strategy to avoid destroying the original excellent community structure in the cross process and ensure the community structure to optimize in the better direction; use the local heuristic mutation strategy to ensure the identification of the community The accuracy rate, thus achieving the overall optimization function of the algorithm. Tested in the benchmark network, the experimental results show that the POGA algorithm is effective and feasible.
Footnotes
Acknowledgments
The study was supported by the Science and Technology Innovation Project of Shanxi Higher Education Institution (No. 201804011).
