Multi niche crowding genetic algorithm parameter tuning for molecular potential energy surface computation

Abstract

In this paper, we present a meta genetic algorithm (meta-GA) optimization approach to tune the parameters of the multi-niche crowding (MNC) GA, previously used to successfully describe the potential energy surface of different molecular systems at the semi-empirical level of theory. The optimization is performed using a second layer of the same algorithm for a set of molecules of different sizes. Several sets of parameters were found that lead to high performance of the algorithm in terms of the number of minima found. The variation of the parameters for each molecule is discussed. A relationship between the parameter values and the number of degrees of freedom of the molecule has been established.

Keywords

Genetic algorithm, multi-niche crowding, parameter tuning, potential energy surface

1. Introduction

Evolutionary algorithms (EA) are a family of algorithms inspired by the theory of evolution and used to solve a variety of problems. They have been successfully applied to complex optimisation problems where more classical optimisation algorithms failed to produce reliable results. Among EA, genetic algorithms (GA) have been widely used in optimisation problems. In chemistry, GA have been applied to various problems including geometries of transition metal clusters [1], geometries of molecular clusters [2], ligand docking [3], and molecular design [4].

In previous papers [5, 6, 7, 8, 9, 10, 11], we have used the Multi Niche Crowding (MNC) GA [12, 13] to satisfactorily describe the potential energy surface (PES) of a wide set of molecular systems. In these studies, we followed the recommendations of the author of the MNC GA in order to choose the best set of algorithm input parameters. Although the chosen parameter values allowed the global minimum to be located for all of the molecules studied, the niche count corresponding to some of the minima was quite low in some cases. While this situation does not prevent the algorithm from locating the minima, it prompted us to optimize the parameters of the MNC GA so that they are best suited to the problem of PES computation.

In general, for any type of GA, the parameter setting can significantly influence the performance of the algorithm in terms of the quality of the results found. With this consideration in mind, various studies have been conducted in order to find optimal GA parameters. Bartz-Beielstein et al. [14] used a technique called Sequential Parameter Optimization (SPO). Their objective is to design the experimental plan prior to actually doing the experiments. Some other studies focused on evaluating the sensitivity of certain parameters, but were limited to the independent influence of individual parameter values on the fitness [15, 16].

Another approach to the parameter optimization is the technique known as Meta-GA, which consists of using a second GA on top of the first one in order to tune its parameters. In this case, the genome of the meta-GA encodes all the parameters of the GA we would like to optimize. This technique was proposed in some previous studies [17, 18, 19]. However, most of these studies were limited to optimizing one or two parameters and holding all the others constant.

In the special case of PES computations, we are aware of only a few studies that were conducted to assess the best set of GA parameters. Brodmeier and Pretsch [20] examined the influence of population size, scaling function, and mutation and crossover rates on a conformer search. Brain and Addicoat [21] proposed a more general approach for optimizing all GA parameters. However, they only focused on finding the global minimum of the molecules they studied.

In this paper, we focus on optimizing the parameters of the MNC GA previously used to successfully describe the PES of various molecular systems at the semi-empirical level, using the AM1 [22] and PM3 [23] methods. To do so, we have used a second layer of the MNC GA as a meta-GA in order to find the combinations of parameters that lead to the best results in terms of the number of minima found for a set of molecules of different sizes.

2. Overview of MNC GA

The MNC GA [12, 13] is a GA for multimodal search. That is, it can locate different minima in a fitness minimization problem such as the PES scan. We have used it previously to study the PES of different molecules. The algorithm was implemented in a package of programs interfaced with MOPAC [24] in order to optimize the heat of formation (HF) of the molecule studied, which is the fitness function in this case.

As described in detail in Ref. [5], in MNC the crowding concept is used in both the selection and the replacement steps. The algorithm uses what the author calls “crowding selection” instead of the traditional fitness proportionate reproduction (FPR) technique. For a given individual $I_{i}$ of the population, crowding selection chooses its mate $I_{j}$ from a subset of $C_{s}$ individuals (crowding selection size) picked at random from the population. During the replacement phase, a policy called “Worst Among Most Similar” (WAMS) is used to choose the individual to be replaced by the resulting offspring. To this end, $C_{f}$ (crowding factor) groups of $s$ (crowding group size) individuals each are randomly chosen from the population. These groups are called “crowding factor groups”. From each of these groups, the individual that is most similar to the offspring is identified. Among these $C_{f}$ candidates for replacement, the least fit one is replaced by the offspring. The combination of these techniques allows the algorithm to maintain stable subpopulations within different niches, to maintain diversity throughout the search and to converge to different local minima. The MNC algorithm has another very important characteristic: it does not require prior knowledge of the search space.
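The selection and replacement steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the function names, the Euclidean similarity measure, sampling without replacement, and the rule that the mate is the most similar of the $C_{s}$ candidates are all assumptions made for the example. Fitness is treated as a quantity to minimize (such as the HF), so the "worst" individual is the one with the largest value.

```python
import random


def distance(a, b):
    """Euclidean distance between two genomes (sequences of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def crowding_selection(population, i, c_s, rng):
    """Return the index of a mate for population[i].

    c_s candidates are drawn at random; the mate is assumed to be the
    candidate most similar to population[i].
    """
    candidates = [j for j in range(len(population)) if j != i]
    group = rng.sample(candidates, c_s)
    return min(group, key=lambda j: distance(population[i], population[j]))


def wams_replacement(population, fitness, offspring, c_f, s, rng):
    """Return the index to replace under 'Worst Among Most Similar'.

    c_f groups of s individuals are drawn at random. The individual most
    similar to the offspring is kept from each group, and the worst of
    these (largest fitness value, since we minimize) is returned.
    """
    most_similar = []
    for _ in range(c_f):
        group = rng.sample(range(len(population)), s)
        most_similar.append(
            min(group, key=lambda k: distance(offspring, population[k]))
        )
    return max(most_similar, key=lambda k: fitness[k])
```

With a population of 300 individuals, a call such as `wams_replacement(population, fitness, child, c_f=3, s=50, rng=random.Random(0))` would return the index of the individual that the offspring `child` replaces.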

3. Computational procedure

As previously described [5], a real encoding scheme was used to encode the genome of each individual (conformation), since real encoding gives better results for problems in continuous space [25]. A conformation is described by n dihedral angles ($\varphi_{1}$, $\varphi_{2}$, …, $\varphi_{n}$) corresponding to the n degrees of freedom of the molecule. A combination of these angles represents a location on the PES. The crossover operator used is called “interval crossover” [13]. This operator generates only one offspring that is close to its parents. For each pair of parent genes $\varphi_{1}$ and $\varphi_{2}$, the offspring’s gene is selected at random from the interval $[\varphi_{1}-\varepsilon/2,\ \varphi_{2}+\varepsilon/2]$, assuming without loss of generality that $\varphi_{1}<\varphi_{2}$, when a real encoding is used as in our case. $\varepsilon$ is called the interval crossover parameter. Mutation is applied to the offspring generated by the crossover operation with a probability $P_{m}$. The mutation operation swaps a pair of genes selected at random from the genes of the offspring.
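A minimal Python sketch of the two operators described above is given here for illustration. The function names are hypothetical, a uniform draw within the interval is assumed, and the periodicity of dihedral angles (wrapping at 360°) is deliberately ignored to keep the example short.

```python
import random


def interval_crossover(parent_a, parent_b, eps, rng):
    """Produce one offspring close to its parents.

    Each gene is drawn uniformly from [lo - eps/2, hi + eps/2], where
    lo and hi are the smaller and larger of the two parent genes.
    """
    child = []
    for a, b in zip(parent_a, parent_b):
        lo, hi = min(a, b), max(a, b)
        child.append(rng.uniform(lo - eps / 2, hi + eps / 2))
    return child


def swap_mutation(genome, p_m, rng):
    """With probability p_m, swap two randomly chosen genes."""
    mutated = list(genome)
    if len(mutated) >= 2 and rng.random() < p_m:
        i, j = rng.sample(range(len(mutated)), 2)
        mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated
```

Because the interval only extends $\varepsilon/2$ beyond each parent gene, a small $\varepsilon$ keeps the offspring near its parents, while a larger value allows more exploration.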

In summary, the MNC GA is governed by seven different parameters: population size, crowding selection size ( $C_{s}$ ), crowding factor size ( $C_{f}$ ), crowding group size ( $s$ ), interval crossover parameter ( $\varepsilon$ ), crossover probability ( $P_{c}$ ) and mutation probability ( $P_{m}$ ).

In this study we have used a meta-MNC GA to optimize all of these parameters except three: the population size, $P_{c}$ and $C_{f}$. This choice follows the guidance provided by the designer of the algorithm. According to Cedeno [13], changing the crossover probability does not seem to have a significant effect on the overall results of the algorithm. We thus set this parameter to a constant value of 1.0. In the same study, the author reported that smaller population sizes required a larger number of generations to locate all the minima, whereas larger population sizes required fewer generations. For this parameter, we have used a value of 300 for all the molecules studied, which we believe is more than enough to locate all minima in a reasonable number of generations. We fixed the number of generations to 50 in order to reduce the computational burden. As for the parameter $C_{f}$, the range of possible values is quite narrow. The author evaluated the effect of this parameter on different mathematical functions for values between 2 and 6, and found that higher values of $C_{f}$ increase the competition between individuals of different niches. This may have an adverse effect on maintaining stable populations in the niches with lower average fitness. In contrast, lower values lead to a significant increase in the number of individuals not belonging to any niche. The author also stated that a value in the range [2, 4] is suitable for most applications. Accounting for all these findings, we fixed $C_{f}$ to an intermediate value of 3 in order to balance its effect on the low average fitness niches against the number of individuals outside the niches.

The genome of the meta-MNC GA is thus composed of the following parameters: $C_{s}$, $s$, $\varepsilon$ and $P_{m}$. The population has 50 individuals. An individual is basically an instance of the MNC algorithm to be optimized. Two criteria are used to define the fitness function of the meta-MNC GA. The first is the number of niches located by the underlying MNC GA. A niche must contain at least five members (a niche count of five) to be included in the total number of niches found. The individual with the higher number of niches is therefore considered the one with the higher fitness. In case of a tie, a second criterion is used: the average HF over all the niches that qualify for the niche count. The individual with the higher average is then considered the one with the higher fitness. A higher average HF indicates the ability of the algorithm to locate minima with higher energies, which makes it more efficient. Table 1 describes the meta-MNC GA genome along with the range in which each parameter varies. Figure 1 shows the flow of execution of the meta-MNC GA along with the underlying MNC algorithm to be optimized.
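The two-criteria fitness described above amounts to a lexicographic comparison, which can be sketched as a tuple in Python. The function name and the input format (one pair of member count and HF per niche, with the HF taken as that of the minimum representing the niche) are assumptions made for this illustration only.

```python
def meta_fitness(niches, min_count=5):
    """Return (number of qualifying niches, their average HF).

    niches: list of (member_count, heat_of_formation) pairs, one per
    niche found by one run of the underlying MNC GA. Only niches with
    at least min_count members qualify. Python compares tuples element
    by element, so a larger tuple means a fitter parameter set: more
    niches first, then a higher average HF as the tie-breaker.
    """
    qualifying = [hf for count, hf in niches if count >= min_count]
    if not qualifying:
        return (0, float("-inf"))
    return (len(qualifying), sum(qualifying) / len(qualifying))


# Hypothetical runs: both find two qualifying niches, so the average HF decides.
run_a = [(12, -90.0), (7, -85.0), (3, -80.0)]  # third niche has too few members
run_b = [(9, -92.0), (6, -88.0)]
best = max([run_a, run_b], key=meta_fitness)  # run_a: average -87.5 vs -90.0
```

The HF values above are invented for the example and do not correspond to any molecule in this study.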

Table 1
Genome definition of the meta-MNC GA

Parameter	Range	Value
Population size	-	300
Crowding selection size ($C_{s}$)	10–100	-
Crowding factor size ($C_{f}$)	-	3
Crowding group size ($s$)	10–100	-
Interval crossover parameter ($\varepsilon$)	5–30	-
Crossover probability ($P_{c}$)	-	1.0
Mutation probability ($P_{m}$)	0.01–0.2	-
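As an illustration of how one individual of the meta-MNC GA could be drawn from the ranges of Table 1, a short Python sketch follows. The dictionary layout and the function name are hypothetical, and treating $C_{s}$ and $s$ as integers while $\varepsilon$ and $P_{m}$ are real-valued is an assumption of this example.

```python
import random

# Ranges of the four optimized parameters (Table 1).
RANGES = {"c_s": (10, 100), "s": (10, 100), "eps": (5.0, 30.0), "p_m": (0.01, 0.2)}

# Parameters held constant in this study (Table 1).
FIXED = {"population_size": 300, "c_f": 3, "p_c": 1.0}


def random_params(rng):
    """Draw one complete MNC parameter set: random genome plus fixed values."""
    params = dict(FIXED)
    params["c_s"] = rng.randint(*RANGES["c_s"])   # integer group size
    params["s"] = rng.randint(*RANGES["s"])       # integer group size
    params["eps"] = rng.uniform(*RANGES["eps"])   # interval crossover parameter
    params["p_m"] = rng.uniform(*RANGES["p_m"])   # mutation probability
    return params


# An initial meta-population of 50 individuals, as used in this study.
meta_population = [random_params(random.Random(seed)) for seed in range(50)]
```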

Figure 1.

The flow of execution of the meta-MNC GA.

Five molecules of different sizes were used to perform the parameter optimization: glycine (Gly), di-glycine (Gly$_{2}$) in its zwitterionic form, protonated di-glycine (Gly$_{2}^{+}$), Phenylalanine-Glycine (Phe-Gly) and tetra-glycine (Gly$_{4}$). These molecules have 3, 5, 7, 9 and 12 degrees of freedom, respectively. Figure 2 shows an illustration of the molecules studied.

Figure 2.

The five molecules used in this study to optimize the parameters of the MNC GA.

The ultimate goal of this study is, while finding an optimal combination of parameters for each of these molecular systems, to gain insight into how each parameter is related to the number of degrees of freedom and therefore to the size of the molecule being studied.

3.1 K-means clustering

K-means clustering is a well known and widely used unsupervised machine learning algorithm [26]. It allows identifying clusters or groups of similar characteristics in a given data set. We have used the version of the algorithm implemented in the scikit-learn Python library [27] to identify the niches in the final population of the meta-MNC GA.

The K-means clustering algorithm starts with a set of N elements, in this case the final population of the meta-MNC GA, and groups them, based on Euclidean distance, into K distinct clusters of closest members. Each cluster is described by its mean or centroid $\mu_{j}$. The objective of the algorithm is to minimize the within-cluster sum of squares (WCSS), defined as [26]:

$\displaystyle \sum_{i=0}^{n} \min_{\mu_{j}} \left( \lVert x_{i}-\mu_{j} \rVert^{2} \right)$

Given the high variance of possible values across different parameters, scaling the data to the same order of magnitude is required for an optimal computation of the distances. The standard scaler implemented in the scikit-learn library [27] was used for all the parameters.

The elbow method is used to find the optimum number of clusters for a given final population. The WCSS is computed for an increasing number of clusters, and the optimum is taken as the number beyond which adding further clusters no longer reduces the WCSS appreciably. Figure 3 shows an illustration of the elbow method for glycine.
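The quantities involved in this analysis can be made concrete with a small standard-library Python sketch. It is not the code used in this work, which relied on the scikit-learn implementations of K-means and of the standard scaler [27]; the helper names below are hypothetical, and the centroids are computed from hand-picked groupings rather than by running K-means.

```python
from statistics import mean, pstdev


def standard_scale(rows):
    """Column-wise standardization: zero mean, unit (population) variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / sd for v, m, sd in zip(row, mus, sigmas)] for row in rows]


def centroid(points):
    """Mean position of a group of points."""
    return [mean(col) for col in zip(*points)]


def wcss(points, centroids):
    """Sum over all points of the squared distance to the nearest centroid."""
    return sum(
        min(sum((x - c) ** 2 for x, c in zip(p, mu)) for mu in centroids)
        for p in points
    )


# Toy data with two obvious groups, to show the elbow at K = 2.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
wcss_k1 = wcss(points, [centroid(points)])                              # 104.0
wcss_k2 = wcss(points, [centroid(points[:2]), centroid(points[2:])])    # 4.0
wcss_k4 = wcss(points, points)                                          # 0.0
```

Going from one to two clusters removes almost all of the WCSS, while further clusters bring only a marginal gain; this change of slope is the "elbow" that the method looks for.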

Figure 3.

The illustration of the elbow method showing that the optimal number of clusters for glycine is five.

3.2 Principal component analysis (PCA)

PCA [28] is a statistical technique that simplifies a data set by reducing its dimensionality: the original variables are projected onto a new coordinate system whose axes (the principal components) are linear functions of the original variables. We use it here to project the clusters computed previously onto the first two components, which capture most of the variance of the original values. The projections were done using the scikit-learn Python library [27].

Figure 4.

The Gly clusters colored by niche count from low (blue) to high (yellow), showing two clusters of high performance. PC1 and PC2 are the first two principal components, which capture most of the variance in the original set of parameters.

4. Results and discussion

In order to facilitate the interpretation of the results, we have evaluated the performance of the MNC GA by assigning to each combination of parameters (which is basically an individual of the meta-MNC GA) a niche count category, based on the total number of niches obtained. The categories are defined as follows. Combinations that led to a niche count of less than 6 were assigned to the low performance category. Those with a niche count between 6 and 9 were assigned to the medium performance category. Finally, those with a niche count of more than 9 were assigned to the high performance category. Figure 4 shows the clusters identified for Gly, projected in two dimensions using PCA, with two clusters of high performance. PC1 and PC2 are the first two principal components, which capture most of the variance of the set of parameters.
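The category thresholds defined above translate directly into a small helper, sketched here in Python; the function name and the category labels are chosen for illustration only.

```python
def performance_category(niche_count):
    """Map the total number of niches found to a performance category."""
    if niche_count < 6:
        return "low"
    if niche_count <= 9:
        return "medium"
    return "high"
```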

Table 2
The high performance parameter sets obtained for each of the molecules studied

Molecule	Set	$C_{s}$	$s$	$\varepsilon$	$P_{m}$
Gly	1	41	85	20	0.11
Gly	2	23	52	25	0.15
Gly$_{2}$	1	42	56	26	0.15
Gly$_{2}$	2	57	89	19	0.10
Gly$_{2}^{+}$	1	35	77	15	0.07
Gly$_{2}^{+}$	2	42	43	23	0.15
Phe-Gly	1	51	21	14	0.07
Phe-Gly	2	88	82	9	0.06
Phe-Gly	3	69	49	11	0.07
Gly$_{4}$	1	84	13	18	0.07
Gly$_{4}$	2	63	56	11	0.04

Figure 5.

Distribution of the different parameters in the parameter set optimized by the meta-MNC GA highlighted by performance category for each of the molecules studied.

Table 2 lists the high performance combinations obtained for each of the molecules studied. Figure 5 shows the distribution of the values of the parameters by performance category for the molecules used in this study.

Two sets of parameters leading to high performance have been identified for each of the molecules studied, except for Phe-Gly, for which an additional set was obtained. In the second set for Gly, a relatively low value of $C_{s}$ along with higher values of $\varepsilon$ and $P_{m}$ seems to be the best choice for this relatively small molecule. Decreasing the mutation probability $P_{m}$ by 0.04 requires a significant increase in both $C_{s}$ and $s$ in order to reach the same high performance. The same observation can be made for Gly$_{2}$, which has two more degrees of freedom than Gly. In the first set for this molecule, mid-range values of $C_{s}$ and $s$ combined with higher values of $\varepsilon$ and $P_{m}$ gave the best results. An increase of $C_{s}$ has to be combined with a higher value of $s$ and lower values of $\varepsilon$ and $P_{m}$ to reach the high performance category.

However, the same cannot be said for the medium-size molecule Gly$_{2}^{+}$. Indeed, while the values of the parameters in its second set are of the same order of magnitude as those of Gly$_{2}$, the changes in $P_{m}$, $C_{s}$ and $\varepsilon$ between the two sets occur in the same direction. That is, a decrease in those three parameters along with an increase of $s$ seems to lead to the same performance.

The picture is completely different for the bigger molecules Phe-Gly and Gly$_{4}$. For those molecules, the values of $P_{m}$ are the lowest among all the molecules and do not exceed 0.07. The values of $\varepsilon$ also lie toward the lower end of the range used for the optimization. In contrast, the values of $C_{s}$ are quite high: at least 51 in all cases and above 60 in all but one (Table 2). We can also observe that the values of $s$ and $\varepsilon$ change in opposite directions across the parameter sets of each of those two molecules. The same observation holds for the other molecules for these two parameters. According to Cedeno [13], the number of individuals outside the niches increases with higher values of $s$. Our findings suggest that decreasing $\varepsilon$ is required to compensate for this effect in the specific application to PES computation.

Figure 6.

Variation trends of the different parameters in the parameter set optimized by the meta-MNC GA as a function of molecule number of degrees of freedom. Blue lines within highlighted areas were obtained using linear regression to show the general variations of each parameter.

Figure 6 shows the high performance variation trends of the different parameters as a function of the number of degrees of freedom. These graphs reveal that higher values of $s$, $\varepsilon$ and $P_{m}$ seem to be the best choice for smaller molecules. Those values decrease as the number of degrees of freedom increases. In contrast, smaller values of $C_{s}$ are suitable for smaller molecules, while higher values give the best results for bigger molecules. However, one has to keep in mind that those trends give guidance for the choice of the parameters for a given type of molecule only as part of a combination of the set of parameters. For example, a combination of $C_{s}$, $s$, $\varepsilon$ and $P_{m}$ in the respective ranges of [18, 42], [55, 90], [20, 25] and [0.11, 0.18] is suitable for a small molecule like Gly, while a combination in the ranges [60, 90], [15, 62], [5, 18] and [0.02, 0.07] is better suited to bigger molecules like Gly$_{4}$.
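The direction of these trends can be checked with an ordinary least-squares fit over the high performance sets of Table 2, as in the Python sketch below. This is only an illustration: it uses the eleven rows of Table 2 alone, which is presumably a smaller data set than the one behind Figure 6, and the variable and function names are hypothetical.

```python
# Degrees of freedom of each molecule, as given in the text.
DOF = {"Gly": 3, "Gly2": 5, "Gly2+": 7, "Phe-Gly": 9, "Gly4": 12}

# High performance sets from Table 2: (molecule, C_s, s, eps, P_m).
HIGH_PERF = [
    ("Gly", 41, 85, 20, 0.11), ("Gly", 23, 52, 25, 0.15),
    ("Gly2", 42, 56, 26, 0.15), ("Gly2", 57, 89, 19, 0.10),
    ("Gly2+", 35, 77, 15, 0.07), ("Gly2+", 42, 43, 23, 0.15),
    ("Phe-Gly", 51, 21, 14, 0.07), ("Phe-Gly", 88, 82, 9, 0.06),
    ("Phe-Gly", 69, 49, 11, 0.07),
    ("Gly4", 84, 13, 18, 0.07), ("Gly4", 63, 56, 11, 0.04),
]


def ols_slope(xs, ys):
    """Slope of the least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


xs = [DOF[row[0]] for row in HIGH_PERF]
slopes = {
    name: ols_slope(xs, [row[k] for row in HIGH_PERF])
    for k, name in enumerate(("C_s", "s", "eps", "P_m"), start=1)
}
# A positive slope means the parameter grows with the number of degrees of freedom.
```

On these rows the slope is positive for $C_{s}$ and negative for $s$, $\varepsilon$ and $P_{m}$, in line with the trends described above.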

It is worth mentioning that the opposite variation of $C_{s}$ and $P_{m}$ is expected, since both parameters help control the diversity of the population. In fact, Cedeno reported in his study that $C_{s}$ allows a certain degree of exploration of the search space [13]. Lower values of $\varepsilon$ for bigger molecules keep the offspring produced by the crossover operation closer to its parents. This seems to be important for maintaining a significant niche count within the niches, which may be due to the complexity of the PES for those molecules. For the parameter $s$, although the trend shows that it tends to decrease for bigger molecules, higher values were also obtained for those molecules. However, in such cases, smaller values of $\varepsilon$ are required to balance the effect of $s$ on individuals outside the niches.

5. Conclusion

In this paper, a meta-GA approach, in which a second layer of the same algorithm tunes the first, was proposed to optimize the parameters of the MNC GA for PES computations at the semi-empirical level of theory using the AM1 method. The interaction of the different parameters has been discussed for a set of molecules of different sizes. $C_{s}$ and $P_{m}$ both contribute to the exploration of the search space. Therefore, their values need to be balanced in order to control the degree of exploration in the algorithm.

Smaller molecules seem to require a combination of higher values of the parameters $s$, $\varepsilon$ and $P_{m}$, along with a lower value of $C_{s}$. We have observed that the opposite occurs for bigger molecules, for which higher values of $C_{s}$ combined with lower values of $s$, $\varepsilon$ and $P_{m}$ seem to give the best results in terms of algorithm performance.

We are aware that parameter tuning is extremely demanding in terms of time and computing resources. Extending this study to different kinds of molecules may be of great interest. Taking advantage of statistical learning methods may greatly help to shed more light on the variation of those parameters for larger molecular systems.

References

1. Assadollahzadeh, Bunker P.R. and Schwerdtfeger, The low lying isomers of the copper nonamer cluster, Cu9, Chem. Phys. Lett. 451(4–6) (Jan. 2008), 262–269.
2. Llanio-Trujillo J.L., Marques J.M.C. and Pereira F.B., An evolutionary algorithm for the global optimization of molecular clusters: application to water, benzene, and benzene cation, J. Phys. Chem. A 115(11) (Mar. 2011), 2130–2138.
3. Fuhrmann, Rurainski, Lenhof H.-P. and Neumann, A new lamarckian genetic algorithm for flexible ligand-receptor docking, J. Comput. Chem. 31(9) (Jan. 2010), 1911–1918.
4. Pfeffer, Fober, Hüllermeier and Klebe, GARLig: A fully automated tool for subset selection of large fragment spaces via a self-adaptive genetic algorithm, J. Chem. Inf. Model. 50(9) (Sep. 2010), 1644–1659.
5. El Merbouh, Bourjila, Tijar, El Bouzaidi R.D., El Gridani and El Mouhtadi, Conformational space analysis of neutral and protonated glycine using a genetic algorithm for multi-modal search, J. Theor. Comput. Chem. 13(8) (Dec. 2014), 1450067.
6. Bourjila, El Merbouh, Tijar, El Guerdaoui, El Bouzaidi, El Gridani and El Mouhtadi, Polyalanine gas phase acidities determination and conformational space analysis by genetic algorithm assessment, Chemistry International 2(3) (2016), 145–157.
7. Tijar, El Merbouh, Bourjila, El Guerdaoui, El Bouzaidi, El Gridani and El Mouhtadi, Conformational space analysis of neutral and deprotonated forms of benzoic acid, salicylic acid and phthalic acid using a genetic algorithm, Chemistry International 2(4) (2016), 201–221.
8. El Guerdaoui, Tijar, El Merbouh, Bourjila, El Bouzaidi, El Gridani and El Mouhtadi, Conformational analysis of diamide system HCO-L-Phenylalanine-NH2 by genetic algorithm, Chemistry International 2(4) (2016), 279.
9. El Guerdaoui, Tijar, El Merbouh, Bourjila, El Bouzaidi R.D. and El Gridani, Exploring potential energy surfaces of biological molecules using a Multi-Niche Crowding genetic algorithm, J. Comput. Methods Sci. Eng. 17(3) (Aug. 2017), 595–609.
10. El Guerdaoui, Tijar, El Merbouh, Bourjila, El Bouzaidi R.D. and El Gridani, A comprehensive conformational space analysis of N-formyl-l-tryptophanamide system by using a genetic algorithm for multi-modal search, J. Mol. Graph. Model. 75 (Aug. 2017), 137–148.
11. El Guerdaoui et al., An exhaustive conformational analysis of N-formyl-l-tyrosinamide using a genetic algorithm for multimodal search, Comptes Rendus Chim. 20(5) (May 2017), 500–507.
12. Cedeño, Vemuri V.R. and Slezak, Multiniche crowding in genetic algorithms and its application to the assembly of DNA restriction-fragments, Evol. Comput. 2(4) (Dec. 1994), 321–345.
13. Cedeno, The multi-niche crowding genetic algorithm: Analysis and application, Ph.D. thesis, University of California D.
14. Bartz-Beielstein, Lasarczyk C.W.G. and Preuss, Sequential parameter optimization, in Proc. IEEE Congress on Evolutionary Computation 1 (2005), 773–780.
15. S.Y., Chen H.M., S.J. and Chen T.K., Design of accurate classifiers with a compact fuzzy-rule base using an evolutionary scatter partition of feature space, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 34(2) (2004), 1031–1044.
16. Min, Jeung Ko and Seong Ko, A genetic algorithm approach to developing the multi-echelon reverse logistics network for product returns, Omega, Elsevier 34(1) (2006), 56–69.
17. Grefenstette, Optimization of control parameters for genetic algorithms, IEEE Trans. Syst. Man Cybern. 16(1) (Jan. 1986), 122–128.
18. Freisleben and Härtfelder, Optimization of Genetic Algorithms by Genetic Algorithms, in: Artificial Neural Nets and Genetic Algorithms, Albrecht R.F., Reeves C.R. and Steele N.C., Eds. Vienna: Springer Vienna, 1993, pp. 392–399.
19. de Landgraaf W.A., Eiben A.E. and Nannen, Parameter calibration using meta-algorithms, in: 2007 IEEE Congress on Evolutionary Computation, Singapore, 2007, pp. 71–78.
20. Brodmeier and Pretsch, Application of genetic algorithms in molecular modeling, J. Comput. Chem. 15(6) (Jun. 1994), 588–595.
21. Brain Z.E. and Addicoat M.A., Optimization of a genetic algorithm for searching molecular conformer space, J. Chem. Phys. 135(17) (Nov. 2011), 174106.
22. Dewar M.J.S., Zoebisch E.G., Healy E.F. and Stewart J.J.P., Development and use of quantum mechanical molecular models. 76. AM1: a new general purpose quantum mechanical molecular model, J. Am. Chem. Soc. 107(13) (Jun. 1985), 3902–3909.
23. Stewart J.J.P., Optimization of parameters for semiempirical methods I. Method, J. Comput. Chem. 10(2) (Mar. 1989), 209–220.
24. Stewart J.J.P., MOPAC7.0. QCPE Program No. 455. Quantum Chemistry Program Exchange, Department of Chemistry, Indiana Univers.
25. Deb, Multi-objective optimization using evolutionary algorithms, 1st ed. Chichester; New York: John Wiley & Sons, 2001.
26. Lloyd S.P., Least squares quantization in PCM, IEEE Transactions on Information Theory 28(2) (1982), 129–137.
27. Pedregosa et al., Scikit-learn: Machine learning in python, JMLR 12 (2011), 2825–2830.
28. Jolliffe I.T., Principal component analysis, 2nd ed. New York: Springer, 2002.
