Research on complex network community discovery based on adaptive cross mutation operator

Abstract

Community discovery in complex networks has become a core issue in multidisciplinary research and has been successfully applied in many areas, such as social network analysis, protein network analysis, and link prediction. This paper proposes a genetic algorithm based on an adaptive crossover and mutation operator for community mining in complex networks. The population is generated through fitness value calibration and the adaptive crossover and mutation operator, and good individuals are selected from it. An improved adaptive crossover and mutation operator is proposed to ensure the convergence of the genetic algorithm and to accelerate the generation of the optimal solution while maintaining population diversity. Finally, experiments on multiple real networks verify the stability and efficiency of the algorithm.

Keywords

Genetic algorithm, complex network, community discovery

1. Introduction

Many complex systems in the real world can be represented as complex networks according to the way their components are connected. The components of a system can be regarded as the vertices of a complex network, and the connections between components as its edges. Examples include the Internet, formed by many computers connected for data exchange; social networks, formed by people connected through some kind of social interaction [1]; and metabolic networks, formed by chemical reactions connected through the chemical substances they consume [2]. It can be said that mankind lives in a world full of complex networks. Such networks bring great convenience to people’s lives, but they also have serious negative effects, such as the rapid spread of contagious diseases and the rapid spread of rumors on the Internet. Therefore, only by fully understanding the structural characteristics of complex networks can we grasp the operating mechanisms of complex systems. Research on complex networks has long attracted scholars from different disciplines and has produced a large number of academic achievements. As this research has deepened, it has been found that complex networks commonly share basic statistical characteristics: the small-world property [3], that is, a short average path length like that of random networks combined with a high clustering coefficient like that of regular networks; and the scale-free property, that is, a node degree distribution that follows a power law. The community structure studied in this paper is another important characteristic of complex networks. A community can be understood as a set of nodes with similar functions or the same affiliation. Typically, nodes within a community are densely connected, while nodes in different communities are sparsely connected.

Community identification in complex networks, the subject of this paper, aims to reveal the real community structure of a network. It is of great significance for analysing network topology, finding hidden rules in the network and understanding the operating mechanisms of complex systems. Researchers in various fields have studied community identification extensively, and related algorithms continue to emerge [4, 5, 6, 7, 8, 9]. The Tasgin algorithm initializes the population by randomly assigning a community identifier to each node, uses a one-way crossover strategy for crossover and a random strategy for mutation, and selects the modularity function as the objective function for community identification. The community discovery algorithm based on a clustering-fusion genetic algorithm introduces the idea of cluster fusion into the crossover operator, improving on the traditional crossover operator that simply exchanges genes. The genetic-algorithm-based community discovery algorithm for large complex networks encodes the individuals of the population by locus, which reduces the complexity of discovering community structures and greatly shrinks the search space. The genetic-algorithm-based community discovery algorithm for social networks identifies densely connected and sparsely connected node groups by optimizing the fitness function; its mutation operator only considers modifications among nodes that are actually connected. However, existing community discovery algorithms based on genetic algorithms have defects such as heavy computation, high time complexity, slow convergence and weak search ability, so they are not suitable for community mining in large-scale complex networks. In view of these defects, this paper improves the existing genetic algorithms in order to achieve fast and efficient community mining on large-scale complex networks.

2. Genetic algorithm based on adaptive cross mutation operator

To address the high time complexity and slow convergence of community mining methods for complex networks, a genetic algorithm based on an adaptive crossover and mutation operator is proposed. Using the improved adaptive crossover and mutation rates together with fitness value calibration, good individuals can be selected from the population, and the convergence of the genetic algorithm is guaranteed while population diversity is maintained. This also helps to accelerate the generation of the optimal solution and improves the search efficiency of the algorithm.

2.1 Encoding

Because community mining in complex networks cannot be processed directly by a genetic algorithm, the solution of the problem has to be translated into chromosomes, or individuals, through an encoding operation. Encoding is the basis on which a genetic algorithm handles a problem: it converts an individual’s phenotype into a genotype, determines how the genotype is decoded back into a phenotype, and has important implications for the choice of the subsequent crossover, mutation and other genetic operations. In complex networks, an individual’s code is an array or string used to represent a candidate result, also called a chromosome. A position in the chromosome is called a locus, or gene, and corresponds to one node of the complex network. A chromosome corresponds to one division of the complex network, and the chromosome space corresponds to all possible divisions. The mapping from the solution space to chromosomes is called encoding, and the mapping from chromosomes back to the solution space is called decoding. Currently, the encoding methods used in community mining algorithms based on genetic algorithms are string-based encoding [14] and encoding based on gene position (locus) [15]. In this paper, we use the locus-based encoding to represent an individual as a set of network communities; that is, each individual’s code represents one division of the network into communities.
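To make the decoding step concrete, the following is a minimal Python sketch. It assumes the common locus-based adjacency representation, in which gene `genes[i] = j` links node `i` to node `j` and each connected component of these links is one community; the function name `decode` and the integer node indexing are illustrative choices, not part of the original algorithm description.

```python
def decode(genes):
    """Map a locus-based individual to one community label per node.

    Assumption: genes[i] = j means node i is linked to node j, and the
    connected components of these links are the communities.
    """
    parent = list(range(len(genes)))

    def find(x):
        # Follow parent pointers to the component root (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components of node i and of the node stored in its gene.
    for i, j in enumerate(genes):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(genes))]
```

For example, the individual `[1, 0, 1, 4, 3, 4]` links nodes 0, 1, 2 together and nodes 3, 4, 5 together, so it decodes to two communities.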

2.2 Initialize the population

When initializing the population, each gene of an individual takes one of the neighbors of the corresponding node as its allele. This greatly reduces the search space for community partitioning and brings the initial solutions close to the optimal solution space, which speeds up the evolutionary process.
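The initialization described above can be sketched in a few lines of Python. The sketch assumes that the network is given as a list of neighbor lists (`adjacency[i]` holds the neighbors of node `i`); the function name, the `rng` parameter and the handling of isolated nodes (a node without neighbors points to itself) are assumptions made for illustration.

```python
import random


def init_population(adjacency, size, rng=None):
    """Create `size` individuals; each gene is a random neighbor of its node."""
    rng = rng or random.Random()
    population = []
    for _ in range(size):
        individual = [
            # Assumption: an isolated node simply points to itself.
            rng.choice(neighbors) if neighbors else node
            for node, neighbors in enumerate(adjacency)
        ]
        population.append(individual)
    return population
```

Because every gene is restricted to the neighbors of its node, each individual only links nodes that share an edge in the network, which is what shrinks the search space.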

2.3 Fitness function and selection operator

The evolutionary search process of a genetic algorithm has its own advantages. It only needs to evaluate candidate solutions with the fitness function, does not depend on much other information, and determines the subsequent genetic operations from the fitness values alone. The fitness of an individual reflects the quality of the community partition it represents and allows the community structure obtained by the algorithm to be evaluated reasonably. To describe quantitatively how good a network community structure is, this paper uses the widely accepted network modularity function (Q function) as the fitness function of the individuals in the population. Although the modularity function has some drawbacks, if the network is relatively small and only contains non-hierarchical, non-overlapping modular structures, the extreme degeneracy problem is not serious and modularity optimization performs well. If appropriate weights are assigned to the links of a complex network, the resolution limit and the extreme degeneracy caused by modularity optimization can be alleviated. The Q function is defined as the difference between the fraction of edges of the network that fall within communities and the expected fraction of such edges if the connections were made at random. The calculation of the $Q$ function [16] is shown in Eq. (1).

Given an undirected, unweighted network $N(V,E)$ , suppose the node set $V$ is divided into several communities. If each node $i$ in the network has the label $r(i)$ and belongs to the community $c_{r(i)}$ , the Q function can be expressed as:

$\displaystyle Q=\frac{1}{2m}\sum\limits_{ij}\left(A_{ij}-\frac{k_{i}k_{j}}{2m}\right)\delta\left(r(i),r(j)\right)$ (1)

where $A=(A_{ij})_{n\times n}$ denotes the adjacency matrix of network $N$ , with $A_{ij}=1$ if there is an edge between node $i$ and node $j$ and $A_{ij}=0$ otherwise; $\delta(u,v)=1$ if $u=v$ and $\delta(u,v)=0$ otherwise; $k_{i}$ denotes the degree of node $i$ , defined as $k_{i}=\Sigma_{j}{A_{ij}}$ ; and $m=\frac{1}{2}\Sigma_{ij}{A_{ij}}$ is the total number of edges in network $N$ .
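As an illustration of Eq. (1), the following Python sketch computes $Q$ from an edge list and a label per node. It uses the equivalent per-community form $Q=\sum_{c}\left[l_{c}/m-(d_{c}/2m)^{2}\right]$, where $l_{c}$ is the number of edges inside community $c$ and $d_{c}$ is the sum of the degrees of its nodes; the function name and the input format are assumptions made for the example.

```python
def modularity(edges, labels):
    """Modularity Q of Eq. (1) for an undirected, unweighted network.

    edges  : list of (u, v) pairs, each undirected edge listed once
    labels : labels[i] is the community label r(i) of node i
    """
    m = len(edges)
    inside = 0          # number of edges with both ends in one community
    degree_sum = {}     # community label -> sum of node degrees d_c
    for u, v in edges:
        if labels[u] == labels[v]:
            inside += 1
        for node in (u, v):
            degree_sum[labels[node]] = degree_sum.get(labels[node], 0) + 1
    expected = sum(d * d for d in degree_sum.values()) / (4 * m * m)
    return inside / m - expected
```

For two triangles joined by a single edge (7 edges in total), splitting the network into the two triangles gives $Q = 6/7 - 2\,(7/14)^{2} = 5/14 \approx 0.357$, whereas putting all nodes in one community gives $Q=0$.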

Under normal circumstances, the initial population may contain special individuals whose fitness values are very large. Their reproduction needs to be restricted to prevent them from dominating the whole population, misleading its direction of development and causing the algorithm to converge to a local optimum. Near the end of the computation, as the genetic algorithm gradually converges, the fitness values of the individuals in the population become relatively close, so it is difficult for selection to keep improving the population and the search oscillates around the optimal solution. In this case, the individual fitness values should be amplified to improve the selection ability. This adjustment is called fitness value calibration. This paper proposes to calculate the calibrated fitness value using Eq. (2):

$\displaystyle Q^{\prime}=\frac{1}{Q_{\min}+Q_{\max}+\delta}\left(Q+\left|Q_{\min}\right|\right)$ (2)

Here $Q^{\prime}$ is the fitness value after calibration, $Q$ is the original fitness value, $Q_{\max}$ is an upper bound of the fitness value, $Q_{\min}$ is a lower bound of the fitness value, and $\delta$ is a positive real number. If $Q_{\max}$ is unknown, it can be replaced by the maximum fitness in the current generation or over all generations so far. If $Q_{\min}$ is unknown, it can be replaced by the minimum fitness in the current generation or over all generations so far. The purpose of $\delta$ is to prevent the denominator from being zero and to increase the randomness of the genetic algorithm. The term $|{Q_{\min}}|$ ensures that the calibrated fitness value is not negative. As can be seen from Fig. 1, the larger the difference between $Q_{\max}$ and $Q_{\min}$ , the smaller the angle $\alpha$ , that is, the smaller the range of the calibrated fitness values, which narrows the gap to abnormally fit individuals and keeps them from dominating the population. Conversely, when the difference is small, the angle is larger and the calibrated values are spread out, which prevents the algorithm from oscillating near the optimal solution. In this way the fitness values of the population are amplified or reduced as needed, which changes the selection pressure.
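A minimal Python sketch of Eq. (2) is given below. It assumes that $Q_{\min}$ and $Q_{\max}$ are taken from the current generation, which is one of the replacements the text allows; the function name and the default value of `delta` are illustrative assumptions. The sketch applies the formula exactly as written, so it presumes that $Q_{\min}+Q_{\max}+\delta$ is positive.

```python
def calibrate(q_values, delta=1e-3):
    """Apply the fitness calibration of Eq. (2) to one generation.

    Assumption: Q_min and Q_max are the minimum and maximum fitness of
    the current generation, and Q_min + Q_max + delta > 0.
    """
    q_min, q_max = min(q_values), max(q_values)
    scale = q_min + q_max + delta
    return [(q + abs(q_min)) / scale for q in q_values]
```

For instance, with fitness values 0.1, 0.3 and 0.4 and $\delta=0.1$, the denominator is 0.6 and the calibrated values are 1/3, 2/3 and 5/6, so the ranking of the individuals is preserved while their spread changes.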

Figure 1.

Calibration of fitness values.

The selection operator is a global search operator in the genetic algorithm. To keep the best individuals of each generation and accelerate convergence, this paper adopts the $\mu+\lambda$ selection strategy commonly preferred by evolutionary algorithms for combinatorial optimization. The $\mu+\lambda$ strategy selects the $\mu$ individuals with the highest fitness from the union of the parent population (size $\mu$ ) and the sub-population generated by crossover and mutation (size $\lambda$ ), and uses them as the next-generation parent population. The IGACD algorithm proposed in this paper evolves for $T$ generations, with $100\leqslant T\leqslant 200$ , after which the best solution in the population is obtained. By decoding, the components of the best solution’s code are identified, and each component is regarded as a community, which yields a good community division.
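The $\mu+\lambda$ strategy can be sketched as follows in Python. The function name and the `fitness` callback are assumptions made for illustration; any function that returns the (calibrated or raw) fitness of an individual could be passed in.

```python
def mu_plus_lambda(parents, offspring, fitness, mu):
    """Keep the mu fittest individuals of parents and offspring combined."""
    pool = list(parents) + list(offspring)
    # Sort by decreasing fitness and truncate to the parent population size.
    return sorted(pool, key=fitness, reverse=True)[:mu]
```

Because the parents compete with their offspring, the best individual found so far can never be lost, so the best fitness in the population never decreases from one generation to the next.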

2.4 Adaptive crossover and mutation operators

The crossover probability $p_{c}$ and the mutation probability $p_{m}$ are key parameters for the behavior and performance of a genetic algorithm and directly affect its convergence. The larger $p_{c}$ is, the faster new individuals are produced. However, when $p_{c}$ is too large, genetic patterns are more likely to be destroyed, so individual structures with high fitness are quickly lost; when $p_{c}$ is too small, the search becomes slow or even stagnates. For the mutation probability, if $p_{m}$ is too small, new individual structures are not easily produced; if $p_{m}$ is too large, the genetic algorithm degenerates into a purely random search. For different optimization problems, $p_{c}$ and $p_{m}$ have to be determined by repeated experiments, which is tedious, and it is difficult to find the best values for each problem. In this paper, the genetic algorithm adjusts $p_{c}$ and $p_{m}$ dynamically according to individual fitness values; such an algorithm is known as an adaptive genetic algorithm. When the individual fitness values of the population tend to be uniform, $p_{c}$ and $p_{m}$ are increased appropriately, and when the fitness values are relatively dispersed, $p_{c}$ and $p_{m}$ are reduced. Meanwhile, individuals whose fitness is above the population average use lower $p_{c}$ and $p_{m}$ values, while individuals whose fitness is below the population average use higher $p_{c}$ and $p_{m}$ values. Adaptive $p_{c}$ and $p_{m}$ can therefore provide the best values relative to a given solution.

The adaptive genetic algorithm can guarantee the convergence of the genetic algorithm while maintaining population diversity. Eqs (3) and (4) can be used to dynamically adjust the crossover and mutation probabilities of individuals:

$\displaystyle p_{c}=\begin{cases}\dfrac{k_{1}(Q_{\max}-Q^{\prime})}{Q_{\max}-Q_{\min}}, & Q^{\prime}\geqslant Q_{avg}\\ k_{2}, & Q^{\prime}<Q_{avg}\end{cases}$ (3)

$\displaystyle p_{m}=\begin{cases}\dfrac{k_{3}(Q_{\max}-Q)}{Q_{\max}-Q_{avg}}, & Q\geqslant Q_{avg}\\ k_{4}, & Q<Q_{avg}\end{cases}$ (4)

Here $Q_{\max}$ is the largest fitness value in the population, $Q_{avg}$ is the average fitness value of the population, $Q^{\prime}$ is the larger fitness value of the two individuals to be crossed, and $Q$ is the fitness value of the individual to be mutated. As long as $k_{1}$ , $k_{2}$ , $k_{3}$ and $k_{4}$ are set to values in the interval (0, 1), the crossover and mutation probabilities are adjusted adaptively according to the fitness values.
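Eqs (3) and (4) can be written directly as two small Python functions. The function names, the default values chosen for $k_{1}$ to $k_{4}$ and the fallback used when a denominator is zero are assumptions made for this sketch.

```python
def crossover_rate_eq3(q_prime, q_avg, q_max, q_min, k1=0.9, k2=0.9):
    """Adaptive crossover probability of Eq. (3)."""
    # Assumption: fall back to k2 when Q_max == Q_min (zero denominator).
    if q_prime < q_avg or q_max == q_min:
        return k2
    return k1 * (q_max - q_prime) / (q_max - q_min)


def mutation_rate_eq4(q, q_avg, q_max, k3=0.1, k4=0.1):
    """Adaptive mutation probability of Eq. (4)."""
    # Assumption: fall back to k4 when Q_max == Q_avg (zero denominator).
    if q < q_avg or q_max == q_avg:
        return k4
    return k3 * (q_max - q) / (q_max - q_avg)
```

Evaluating both functions at the maximum fitness returns zero, for example `crossover_rate_eq3(0.4, 0.3, 0.4, 0.1)` and `mutation_rate_eq4(0.4, 0.3, 0.4)`, which means that the best individual of a generation is neither crossed nor mutated under these two equations.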

The adjustment in the adaptive genetic algorithm covers three cases. First, when an individual’s fitness value is lower than the average fitness value, the individual performs poorly and is given a larger crossover rate and mutation rate; when its fitness value is higher than the average, the individual performs well and obtains a crossover rate and mutation rate that depend on its fitness value. Second, the closer the fitness value is to the maximum fitness value, the smaller the crossover rate and mutation rate. Third, when the fitness value equals the maximum fitness value, the crossover rate and mutation rate are zero. This adjustment is suitable for the late stage of evolution but unfavorable in the early stage, because the better individuals of the early stage then remain almost unchanged, and these individuals are not necessarily the global optimum, which increases the likelihood of evolving toward a local optimum. To this end, the crossover and mutation probabilities are further improved so that the crossover rate and mutation rate of the individual with the maximum fitness in the population are not zero but are raised to $p_{c2}$ and $p_{m2}$ , respectively. After this improvement, $p_{c}$ and $p_{m}$ are calculated by Eqs (5) and (6):

$\displaystyle p_{c}=\begin{cases}p_{c1}-\dfrac{(p_{c1}-p_{c2})(Q^{\prime}-Q_{avg})}{Q_{\max}-Q_{avg}}, & Q^{\prime}\geqslant Q_{avg}\\ p_{c1}, & Q^{\prime}<Q_{avg}\end{cases}$ (5)

$\displaystyle p_{m}=\begin{cases}p_{m1}-\dfrac{(p_{m1}-p_{m2})(Q_{\max}-Q)}{Q_{\max}-Q_{avg}}, & Q\geqslant Q_{avg}\\ p_{m1}, & Q<Q_{avg}\end{cases}$ (6)
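The interpolation in Eq. (5) can be sketched as a single Python helper. The helper name and its parameters are illustrative. Using the same interpolation for the mutation rate, with $p_{m1}$ and $p_{m2}$ in place of $p_{c1}$ and $p_{c2}$, is an assumption made here so that the fittest individual receives $p_{m2}$, as the text describes; the fallback for $Q_{\max}=Q_{avg}$ is also an assumption.

```python
def adaptive_rate(q, q_avg, q_max, p_high, p_low):
    """Linear interpolation in the form of Eq. (5).

    Returns p_high for below-average fitness, decreases linearly above
    the average, and reaches p_low (not zero) at the maximum fitness.
    """
    # Assumption: use p_high when the population has converged
    # (Q_max == Q_avg), which also avoids a zero denominator.
    if q < q_avg or q_max == q_avg:
        return p_high
    return p_high - (p_high - p_low) * (q - q_avg) / (q_max - q_avg)
```

With, say, $p_{c1}=0.9$ and $p_{c2}=0.6$ (hypothetical values), an average individual is crossed with probability 0.9 and the best individual with probability 0.6, so the best individuals keep evolving in the early stage instead of remaining unchanged.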

The above analysis shows that the adaptive genetic algorithm maintains population diversity while ensuring the convergence of the genetic algorithm.

2.5 Improved genetic algorithm framework

In a genetic algorithm, a good population selection strategy is essential in order to retain good individuals. The good individuals produced by the fitness value calibration and adaptive crossover and mutation proposed in this paper can be selected from the population. Based on the improved adaptive crossover and mutation rates, this paper presents the IGACD algorithm, which is described as follows:

Procedure IGACD

Input: N, L, P, $\mu$, $\lambda$ // N represents a complex network, L is the number of iterations, P is the population size, $\mu$ indicates the size of the parent population, and $\lambda$ is the sub-population size

Output: C // the best network community structure

Begin
    $P\leftarrow$ generate the initial population
    For $i=1:L$
        $P^{(\textit{old})}\leftarrow$ calculate the individuals’ optimal $Q$ value
        $P^{(\textit{new})}\leftarrow\emptyset$
        while $P^{(\textit{old})}>P^{(\textit{new})}$
            $g\leftarrow$ fitness value calibration of the population $P$
            $g\leftarrow$ adaptive cross-mutation of the population $P$
            $P^{(\textit{new})}\leftarrow P^{(\textit{new})}\cup g$
        End
        $P^{(u)}\leftarrow P\cup P^{(\textit{new})}$
        $P\leftarrow$ choose the best $\mu$ individuals from $P^{(u)}$
    End
    $I\leftarrow$ take the individual with the highest fitness in $P$
    $C\leftarrow$ decode $I$
End

In the IGACD algorithm, the initial population is generated first, and the network community structure is then explored by iteratively applying three genetic operators: adaptive crossover, adaptive mutation and $\mu+\lambda$ selection. Fitness value calibration is used to improve the selection ability, and good individuals are produced by adaptive crossover and mutation in order to approach the global optimal solution. Although the $\mu+\lambda$ selection operator does not act directly on a single chromosome, it achieves its global search function by letting chromosomes with high fitness enter the next generation through the “survival of the fittest” selection mechanism. With the adaptive crossover and mutation strategy, the community to which each node belongs is determined by considering the similarity between nodes, the connections within a community and the degree of connection between a node and a community, which ensures the accuracy of community identification.
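The overall loop can be illustrated with the following compact Python sketch. It is not the authors' implementation: the function names, the default parameter values, the uniform crossover, the neighbor-based mutation and the use of a parent's fitness as a proxy when mutating a child are all assumptions. The sketch also assumes a locus-based adjacency encoding, that every node has at least one neighbor and that $\mu\geqslant 2$. Fitness calibration is omitted because the sketch uses truncation selection, which depends only on the ranking of the individuals.

```python
import random


def decode(genes):
    """Label the connected components induced by the links i -> genes[i]."""
    parent = list(range(len(genes)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in enumerate(genes):
        parent[find(j)] = find(i)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(genes))]


def modularity(edges, labels):
    """Eq. (1) in per-community form: l_c / m - (d_c / 2m)^2, summed over c."""
    m = len(edges)
    inside = sum(1 for u, v in edges if labels[u] == labels[v])
    degree_sum = {}
    for u, v in edges:
        for node in (u, v):
            degree_sum[labels[node]] = degree_sum.get(labels[node], 0) + 1
    return inside / m - sum(d * d for d in degree_sum.values()) / (4 * m * m)


def adaptive_rate(q, q_avg, q_max, p_high, p_low):
    """Interpolate from p_high (average fitness) to p_low (maximum fitness)."""
    if q < q_avg or q_max == q_avg:
        return p_high
    return p_high - (p_high - p_low) * (q - q_avg) / (q_max - q_avg)


def igacd_sketch(adjacency, mu=20, lam=20, generations=30,
                 p_c=(0.9, 0.6), p_m=(0.1, 0.01), seed=0):
    """Evolve locus-based individuals with (mu + lambda) selection."""
    rng = random.Random(seed)
    n = len(adjacency)
    edges = [(u, v) for u in range(n) for v in adjacency[u] if u < v]

    def fitness(genes):
        return modularity(edges, decode(genes))

    # Initial population: each gene is a random neighbor of its node.
    pop = [[rng.choice(adjacency[i]) for i in range(n)] for _ in range(mu)]
    for _ in range(generations):
        scores = [fitness(g) for g in pop]
        q_avg, q_max = sum(scores) / mu, max(scores)
        children = []
        while len(children) < lam:
            a, b = rng.sample(range(mu), 2)
            child = list(pop[a])
            # Crossover rate from the larger fitness of the two parents.
            if rng.random() < adaptive_rate(max(scores[a], scores[b]),
                                            q_avg, q_max, *p_c):
                child = [x if rng.random() < 0.5 else y
                         for x, y in zip(pop[a], pop[b])]
            # Mutation rate uses the fitness of parent a as a proxy (assumption).
            rate = adaptive_rate(scores[a], q_avg, q_max, *p_m)
            child = [rng.choice(adjacency[i]) if rng.random() < rate else gene
                     for i, gene in enumerate(child)]
            children.append(child)
        # (mu + lambda) selection over parents and offspring.
        pop = sorted(pop + children, key=fitness, reverse=True)[:mu]
    best = pop[0]
    return decode(best), fitness(best)
```

Calling `igacd_sketch(adjacency)` on a list of neighbor lists returns one community label per node together with the modularity of that division. Both crossover and mutation only ever assign a node one of its own neighbors, so every offspring remains a valid locus-based individual.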

3. Experimental results

To verify the validity of the proposed algorithm, IGACD is tested on real network datasets. Because the real community structures of the Karate, Dolphins, Polbooks and Football networks are known, IGACD is compared on these four datasets with the divisive GN algorithm [10], the FN algorithm based on local modularity optimization [11], the label propagation algorithm LPA [12] and the genetic-algorithm-based GANET algorithm [13]. The comparison covers NMI [16], an index of community identification accuracy, and the modularity Q. Like any genetic algorithm, IGACD requires its general parameters to be set before it runs. Drawing on the literature and on our own analysis, the parameters of the IGACD algorithm are set as follows: the population size P is 100, the crossover rate $p_{c}$ is 70% and the mutation rate $p_{m}$ is 30%.
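For reference, the NMI index used in the comparison can be computed as in the following Python sketch. It assumes the common definition of normalized mutual information, $\mathrm{NMI}(A,B)=2I(A;B)/(H(A)+H(B))$, applied to two label lists of equal length; the function name and the convention of returning 1 when both partitions consist of a single community are assumptions made for the example.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the same nodes."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information I(A;B), scaled by n.
    mutual = sum(c * math.log(c * n / (count_a[a] * count_b[b]))
                 for (a, b), c in joint.items())
    # Sum of the two entropies H(A) + H(B), scaled by n.
    entropy = -sum(c * math.log(c / n) for c in count_a.values())
    entropy -= sum(c * math.log(c / n) for c in count_b.values())
    if entropy == 0:
        return 1.0  # assumption: both partitions are one single community
    return 2 * mutual / entropy
```

An NMI of 1 means that the detected partition matches the known one exactly, regardless of how the labels are named, and an NMI of 0 means that the two partitions are statistically independent.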

Table 1
Real network data set

Network	Nodes	Edges	Communities
Karate	34	78	2
Dolphins	62	159	2
Polbooks	105	441	3
Football	115	613	12

Table 2

Comparison of the NMI accuracy of each algorithm on the real networks

Algorithm	Karate	Dolphins	Polbooks	Football
GN	0.5795	0.5541	0.5585	0.8780
FN	0.6923	0.5727	0.5306	0.7570
LPA	0.8369	0.5915	0.5306	0.8988
GANET	0.5996	0.6979	0.5930	0.6657
IGACD	0.9999	0.6532	0.6166	0.9269

Table 1 lists the real network datasets. Table 2 shows the community recognition accuracy of each algorithm on the four datasets with known community structure. Among the four networks, only on the Dolphins network is the recognition accuracy of IGACD slightly lower than that of the GANET algorithm; on the other networks, IGACD achieves higher community recognition accuracy than all the comparison algorithms. Table 3 shows the Q value, which measures the tightness of intra-community connections, of each algorithm on the same four datasets. Comparing IGACD with the other algorithms, the GN algorithm has to analyze the network globally to obtain the edge betweenness and then completes community identification through repeated splitting, so its efficiency is relatively low. The strong randomness of the LPA algorithm affects the quality of the communities it identifies. The Q values of the community structures identified by the FN algorithm are lower than those of IGACD on all four networks, so overall IGACD is superior to the FN algorithm. Because the genetic-algorithm-based GANET algorithm uses a destructive crossover strategy and a mutation strategy with strong randomness when identifying communities, its overall quality is not high. According to the comparison of the Q values obtained by each algorithm on the four real networks, IGACD has the better community recognition ability. Since the population size and the number of iterations are constants and complex networks are usually sparse, the time complexity of the IGACD algorithm is close to $O(cn)$, where $c$ is a constant and $n$ is the number of nodes.

Table 3

Comparison of the Q values of each algorithm on the four real networks

Network	GN	FN	LPA	GANET	IGACD
Karate	0.4013	0.3806	0.4021	0.3998	0.4172
Dolphins	0.5194	0.4898	0.5112	0.4824	0.5268
Polbooks	0.5168	0.5020	0.5157	0.4905	0.5265
Football	0.5996	0.5774	0.5976	0.5840	0.6046

To further illustrate the effectiveness of the IGACD algorithm, the community structure that it identifies on a real network is now analyzed. Karate is a network built from a two-year observation and analysis of a karate club at a university in the United States. The 34 members of the club are the nodes, and a friendship between two members is an edge connecting the two corresponding nodes. Due to disagreements, the club split into two new clubs centered on the principal (node 34) and the coach (node 1). For the Karate network, the NMI value obtained by IGACD is close to 1, and the corresponding community structure is shown in Fig. 2. It can be seen that the IGACD algorithm accurately decomposes the Karate network into two communities.

Figure 2.

The community structure identified by IGACD for the Karate Network.

The experiments on real networks show that the NMI and Q values obtained by IGACD are good, and that the community structures obtained by IGACD for the Karate, Dolphins, Polbooks and Football networks are reasonable. This verifies the rationality and feasibility of the IGACD algorithm. The size of a community in a large-scale real network is generally much smaller than the size of the entire network, so the IGACD algorithm proposed in this paper is also efficient on large-scale real networks.

4. Conclusions

To address the weak search ability of existing community identification methods based on genetic algorithms, this paper proposes a genetic algorithm based on an adaptive crossover and mutation operator for community mining in complex networks. With the improved adaptive crossover and mutation operator, the convergence of the genetic algorithm is guaranteed while the diversity of the population is maintained. The accuracy with which nodes are assigned to communities is further ensured, which improves the quality of community identification. In each stage of the search for the optimal community structure, the IGACD algorithm strives to reduce randomness and to ensure the quality of the communities. Finally, the algorithm is tested on real networks and compared with several classical algorithms. The experimental results show that the IGACD algorithm is stable, effective and feasible.

Acknowledgments

The study was supported by the Science and Technology Innovation Project of Shanxi Higher Education Institution (No. 201804011).

References

1. Maslov, Kim and Zaliznyak, Detection of topological patterns in complex networks: correlation profile of the internet, Physica A: Statistical Mechanics and its Applications 333(4) (2004), 529–540.

2. Zachary W.W., An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33(4) (1977), 452–473.

3. Shimobayashi and Hall M.N., Making new contacts: the mTOR network in metabolism and signaling crosstalk, Nature Reviews Molecular Cell Biology 15(3) (2014), 155–162.

4. Shang R.H., Bai and Jiao L.C., Community detection based on modularity and an improved genetic algorithm, Physica A: Statistical Mechanics and its Applications 392(5) (2013), 1215–1231.

5. Pizzuti, A multiobjective genetic algorithm to find communities in complex networks, IEEE Transactions on Evolutionary Computation 16(3) (2012), 418–430.

6. Gong M.G., L.J., Zhang Q.F. et al., Community detection in networks by using multiobjective evolutionary algorithm with decomposition, Physica A: Statistical Mechanics and its Applications 391(15) (2012), 4050–4060.

7. Shi, Yan Z.Y. and Cai Y.N., Multi-objective community detection in complex networks, Applied Soft Computing 12(2) (2012), 850–859.

8. Alamsyah, Rahardjo and Kuspriyanto, Community detection methods in social network analysis, Proceedings of the International Conference on Internet Services Technology and Information Engineering, Bogor, Indonesia, 2014, pp. 25–253.

9. J.H. and Havens T.C., Fuzzy community detection in social networks using a genetic algorithm, Proceedings of the IEEE International Conference on Fuzzy Systems, IEEE, Beijing, China, 2014, pp. 2039–204.

10. Girvan and Newman M.E.J., Community structure in social and biological networks, Proceedings of the National Academy of Sciences 99(12) (2002), 7821–7826.

11. Newman M.E.J., Fast algorithm for detecting community structure in networks, Physical Review E 69(6) (2004), 066133.

12. Raghavan U.N., Albert and Kumara, Near linear time algorithm to detect community structures in large-scale networks, Physical Review E 76(3) (2007), 036106.

13. Pizzuti, GA-NET: A genetic algorithm for community detection in social networks, Proceedings of the 10th International Conference on Parallel Problem Solving from Nature, Springer, Dortmund, Germany, 2008, pp. 1081–1090.

14. Newman M.E.J., Scientific collaboration networks. II. Shortest paths, weighted networks and centrality, Physical Review E 64(1) (2001), 016132.

15. Guelzim, Bottani, Bourgine et al., Topological and causal structure of the yeast transcriptional regulatory network, Nature Genetics 31(1) (2002), 60–63.

16. Danon, Diaz-Guilera and Duch, Comparing community structure identification, Journal of Statistical Mechanics: Theory and Experiment 2005(9) (2005), P09008.
