Multiple sequence alignment using enhanced bird swarm align algorithm

Abstract

Discovering structural, functional and evolutionary information in biological sequences have been considered as a core research area in Bioinformatics. Multiple Sequence Alignment (MSA) tries to align all sequences in a given query set to provide us ease in annotation of new sequences. Traditional methods to find the optimal alignment are computationally expensive in real time. This research presents an enhanced version of Bird Swarm Algorithm (BSA), based on bio inspired optimization. Enhanced Bird Swarm Align Algorithm (EBSAA) is proposed for multiple sequence alignment problem to determine the optimal alignment among different sequences. Twenty-one different datasets have been used in order to compare performance of EBSAA with Genetic Algorithm (GA) and Particle Swarm Align Algorithm (PSAA). The proposed technique results in better alignment as compared to GA and PSAA in most of the cases.

Keywords

Multiple sequence alignment Particle swarm optimization Bioinformatics Genetic algorithm swarm intelligence bird swarm algorithm

1 Introduction

In Bioinformatics, the sequence alignment [1] is a method of ordering the sequences of protein, DNA and RNA to find similar regions of corresponding chemical bases in order to discover structural, functional and evolutionary relationships between the sequences. The resulting alignment of sequences can be used in constructing phylogenetic trees, to find out the protein families based on the similarities of sequences to predict the new sequences and their behavior of secondary and tertiary structure from which the homology of old sequences with new ones can be seen [2, 3]. Multiple Sequence Alignment (MSA) is a method of aligning more than two sequences to find out best alignment throughout the whole search space. It is demonstrated as NP-hard problem [4]. The conventional methods (i.e., dynamic programming, neighbor joining method) failed in finding the aligned sequences which provide the exact solution demonstrating best fitness among all the solutions of search space. The cause of failure is the computational cost. This research is based on bio inspired technique i.e., Bird Swarm Algorithm, proposed by Xian-Bing Meng, X.Z. Gao, Lihua Lu, Yu Liu and Hengzhen Zhang, [5]. An enhanced version EBSAA is proposed to address the MSA problem and its comparison with evolutionary based Genetic Algorithm GA [6] and swarm based Particle Swarm Algorithm [7]. The main purpose of using EBSAA is to address the limitations of previous work. There is no parameter in GA or PSAA that can control the exploration or exploitation behavior of the population explicitly in MSA. These algorithms try to reach global optimum without exploring enough search space. Due to the nature of MSA problem, it requires more parameters that can explicitly increase or decrease the exploration or exploitation behavior. There are skew and redundant sequences with different length and similarity ratio which need some explicit randomness which is incorporated by EBSAA. It allows us to contribute diversity in the overall flow of algorithm. The major contribution of this research are:

An enhanced version EBSAA is proposed to address the MSA problem and is compared with the evolutionary based Genetic Algorithm GA [6] and swarm based Particle Swarm Algorithm. [7]

The main purpose of using EBSAA is to address the limitations of previous work. There is no parameter in GA or PSAA that can control the exploration or exploitation behavior of the population explicitly in MSA. These algorithms try to reach global optimum without exploring enough search space.

Due to the nature of MSA problem, it requires more parameters that can explicitly increase or decrease the exploration or exploitation behavior. There are skew and redundant sequences with different length and similarity ratio which need some explicit randomness which is incorporated by EBSAA. It allows us to contribute diversity in the overall flow of algorithm.

The rest of the paper is organized as follow. First, a comprehensive background and related work is presented in section II. Then, the GA, PSA and BSA are familiarized in section III. A new version of BSA, Enhanced Bird Swarm Align Algorithm (EBSAA) for MSA with formalized behavior of bird swarm is presented in section IV. Next section V presents experimental results in terms of graphs, charts and tables of different datasets. Conclusions are devised in section VI.

2 Background

Bioinformatics includes the study of different aspects of biology, computer science and mathematics which focused on different techniques of storing, organizing, retrieving and analyzing the biological data. [8] Bioinformatics deals with building software tools and frameworks that can deal with the data and get biological information from it. Aligning multiple sequences is one of the most important areas in bioinformatics and widely used medical research.

2.1 Multiple sequence alignment problem

Deoxyribonucleic acid (DNA), molecules and protein consist of biological sequences based on different chemical basis with different representations. DNA can be viewed as sequence of nucleotides, represented by four English alphabets A, T, C and G called Adenine, Thymine, Cytosine and Guanine respectively. While protein consists of different amino acid sequences that can be represented by 20 alphabets. In molecular biology, study consists of homology of different biological sequences which require these sequences to be aligned. Therefore, these sequences are arranged into a rectangular array with a goal of arranging maximum common residues in same column.

A continuous effort is needed in order to find the biological accuracy of these sequences that result in maximum sequences alignment. Later on, these are used to discover the evolutionary relations among different species, semi-independent realizations of an evolutionary process, predicting new sequence behaviors and to establish functional similarities in proteins. One sequence is transformed into another sequence by using gaps. There is a need to insert, delete or to move the gaps in sequences to find the maximum symbol matching among sequences. Gap insertion is more convenient in order to maximize similarity measure of sequences. For this purpose, a mathematical function called scoring function is used to find the scores based on similarities and dissimilarities of sequences. There are different schemes to assign penalties of gaps, matches and mismatches. Each scheme is based on a different matrix that assigns different score to every pair of symbol (based on mutation probability). Since, the process of finding an optimal solution that represents the maximum similarity matches among sequences incorporating less penalty and more scoring value is termed as Multiple Sequence Alignment MSA problem.

2.2 Substitution matrix scheme

The most widely used substitution matrix schemes are PAM (Percent Accepted Mutation) and BLOSUM (Block Amino Acid Substitution Matrix) [9]. PAM is based on punctual mutation constructed by Dayhoff in 1978 as shown in Fig. 1. PAM1 is a matrix that is constructed, based on 1% or less resembling sequences. There are three types of values i.e., positive, negative and zero value in PAM matrix. Positive value shows the mutation probability while negative value indicates the mutations which are not feasible. Zero means a perfectly random mutation. Since, the probabilities among different symbols to mutate others are different and hence the scoring of these probabilities are higher than others which have lower probability to mutate.

Fig. 1

The PAM250 matrix. It is the mostly used PAM matrix and represents the mutation probabilities of sequences with 20% of equivalence.

In this research, scores used for protein tests are based on the BLOSUM method, obtained by aligned sequences of BLOCKS database as shown in Fig. 2. This method has been proposed by Steven Henikoff y Jorja and G. Henikoff in 1992 [10], and they derived substitution matrices from 2000 blocks of aligned sequence segments characterizing about 500 groups of proteins.

Fig. 2

The BLOSUM62 matrix.

BLOSUM has different versions as BLOSUM45, BLOSUM62 and BLOSUM80 which indicate same percentages of sequences. BLOSUM45 indicates 45% of sequences identity, and similarly 62% for BLOSUM62 and 80% for BLOSUM80 respectively. To calculate mutation probability, these similar sequences are grouped and constructed with the probability expected from random number. A BLOSUM matrix with 62% sequence identity is used which is neither very high i.e., 80% nor very low i.e., 45% sequence similarity.

2.3 Sequence alignment methods

The sequences obtained from proteins and DNAs are not aligned and have different number of chemical bases that are represented by English alphabets. When these alphabets are combined, they result in different lengths of sequences. Therefore, there is a need to insert gaps in the sequences. These gaps are then shuffled in the sequence either one by one or together in the blocks to align the same residues. The goal is to find the maximum number of symbols aligned among all the sequences. In two sequences alignment problem, optimal alignment sequence is quite simple. An exact solution can easily found using dynamic programming paradigm. The problem arises when there are more than two sequences as time taken by dynamic programming grows exponentially in this case. In case of two alignments, there are many algorithms that can efficiently solve the two sequence alignment problem with minimum usage of memory and computation cost. Aligning multiple sequences is computationally complex problem as it has exponential number of combinations in order to find optimal solution [11].

Residues in the sequences must preserve their positions and they do not change their locations in the sequence. For example, a sequence ATGAC may have gaps as AT-G-AC (legal) but residues must not change their position ATAGC (illegal, because G has changed its position in sequence). Thus, different approaches are applied in order to find the most aligned sequences. Most of them are based on progressive alignment strategies while rest of them are either iterative or heuristic based techniques.

The progressive approach uses two most similar sequences to construct the optimal alignment and then progressively approach towards best solution by adding rest of the sequences one by one in the solution. There are possibilities to get trapped into local optima because of the absence of sequence rank that determines which sequence should be most appropriate to add in solution. CLUSTAL W [12], where W stands for weight is the most common tool that has been used and is based on progressive approach. It is a three step process including pairwise alignment construction, building guide tree and progress the alignment, guided by the tree.

Scoring function based approaches help to find optimal solution by ranking using highest rank priority. These techniques helped to resolve local optima problem but are computationally expensive approaches. Some existing sequence alignment methods and techniques with their pros and cons are highlighted in Table 1.

Table 1
Pros and cons of some existing techniques used for solving multiple sequence alignment problem

Multiple sequence alignment techniques

Name Detail Pros Cons

Dynamic Programming Constructs N dimensional matrix with N sequences and find the optimal path by doing score calculations. [13] Optimal solution can be found using DP. It depends on number of sequences. Computational cost and memory usage exponentially increase with number of sequences.

Progressive Alignment (CLUSTAL W) Two sequences are aligned together in this technique. Then for each pair, a guide tree is created based on similarity score. CLUSTAL W is one of a well know Progressive alignment technique. A greedy approach is used which is more efficient than dynamic programming. [14] Alignment of sequences totally depends on early pair alignment. Any mistake happens in early stage affects the rest of the process. Time increase is proportional to the no. of sequences.

CLUSTAL W2 It is also a progressive alignment method that calculates different gap penalty scores by doing progressive alignment many times and choose the best score. [15] There are more chances in getting optimal aligned sequences as the best score indicates the most aligned sequences. For generating, calculating and finding the optimal solution, it takes more time each time.

KALIGN It is also a standard progressive alignment method. The difference is that it does not depends on k-tuple used by CLUSTAL[16]. Instead uses Wu-Manber as an estimate of pairwise distance.[17] It depends on the Wu-Manber as an estimate which improves the performance and accuracy as compared to k-tuple methods. For an optimal solution, time taken was still not practical.

Consistency based Consistency based approach follow the principle of maximizing the agreement of

approach pairwise alignment. It looks each time for the created tree which results in

high accuracy and use in further progressive alignment. DIALIGN [18]

and TCOFFEE [19] are two consistency base approaches.

DIALIGN This consistency base techniques combines the features of both global and local alignments. This method emphasis more on local homologies and similarities among different sequences to discover functional regions among different sequences. Anchored DIALIGN [20], DIALIGN-TX [21], PFAM DB based DIALIGN-PFAM [22] are some of the famous variants of DIALIGN approach. Both local and global alignments are taken into consideration and their combining features results in more efficient solutions. Only using consistency base approach does not guarantee the optimal solution. This method alone does not provide accurate results.

TCOFFEE This consistency-base technique created libraries of local and global pairwise alignments. These pairwise alignments are combined to solve multiple sequence alignment.

Progressive-iterative This approach re-do the multiple sequence alignment in each iteration.

approach This looping process continues until there is no improvement in the final score.

MAFFT [23], MUSCLE [24], PRRP [25], PRIME [26] and MUMMALS/PROMALS [27]

are more significant variants of progressive-iterative approach.

Probabilistic consistency transformation-based strategy In this technique, for every pairwise alignment done on the whole sequences of the database, the probability of the distribution is decided before creation of the tree. This is used from probcon [28], probalign [29], CLUSTAL Omega [30] and many other tools. It gives more accurate results when combined with progressive-iterative approach Computations becomes complex. Usage of memory becomes very high.

Iterative approach based computational intelligence techniques This technique based on simulated annealing and evolutionary computation methods to solve the problem by giving equal priority to all the sequences. Accurate results are achieved using simple computations when both iterative and computational approaches are combined Using only iterative approach solve the problem but very slowly [31] like simulated annealing. So it is combined with computational intelligence to enhance performance.

Multiple sequence alignment techniques
Dynamic Programming	Constructs N dimensional matrix with N sequences and find the optimal path by doing score calculations. [13]	Optimal solution can be found using DP.	It depends on number of sequences. Computational cost and memory usage exponentially increase with number of sequences.
Progressive Alignment (CLUSTAL W)	Two sequences are aligned together in this technique. Then for each pair, a guide tree is created based on similarity score. CLUSTAL W is one of a well know Progressive alignment technique.	A greedy approach is used which is more efficient than dynamic programming. [14]	Alignment of sequences totally depends on early pair alignment. Any mistake happens in early stage affects the rest of the process. Time increase is proportional to the no. of sequences.
CLUSTAL W2	It is also a progressive alignment method that calculates different gap penalty scores by doing progressive alignment many times and choose the best score. [15]	There are more chances in getting optimal aligned sequences as the best score indicates the most aligned sequences.	For generating, calculating and finding the optimal solution, it takes more time each time.
KALIGN	It is also a standard progressive alignment method. The difference is that it does not depends on k-tuple used by CLUSTAL[16]. Instead uses Wu-Manber as an estimate of pairwise distance.[17]	It depends on the Wu-Manber as an estimate which improves the performance and accuracy as compared to k-tuple methods.	For an optimal solution, time taken was still not practical.
Consistency based	Consistency based approach follow the principle of maximizing the agreement of
approach	pairwise alignment. It looks each time for the created tree which results in
high accuracy and use in further progressive alignment. DIALIGN [18]
and TCOFFEE [19] are two consistency base approaches.
DIALIGN	This consistency base techniques combines the features of both global and local alignments. This method emphasis more on local homologies and similarities among different sequences to discover functional regions among different sequences. Anchored DIALIGN [20], DIALIGN-TX [21], PFAM DB based DIALIGN-PFAM [22] are some of the famous variants of DIALIGN approach.	Both local and global alignments are taken into consideration and their combining features results in more efficient solutions.	Only using consistency base approach does not guarantee the optimal solution. This method alone does not provide accurate results.
TCOFFEE	This consistency-base technique created libraries of local and global pairwise alignments. These pairwise alignments are combined to solve multiple sequence alignment.
Progressive-iterative	This approach re-do the multiple sequence alignment in each iteration.
approach	This looping process continues until there is no improvement in the final score.
MAFFT [23], MUSCLE [24], PRRP [25], PRIME [26] and MUMMALS/PROMALS [27]
are more significant variants of progressive-iterative approach.
Probabilistic consistency transformation-based strategy	In this technique, for every pairwise alignment done on the whole sequences of the database, the probability of the distribution is decided before creation of the tree. This is used from probcon [28], probalign [29], CLUSTAL Omega [30] and many other tools.	It gives more accurate results when combined with progressive-iterative approach	Computations becomes complex. Usage of memory becomes very high.
Iterative approach based computational intelligence techniques	This technique based on simulated annealing and evolutionary computation methods to solve the problem by giving equal priority to all the sequences.	Accurate results are achieved using simple computations when both iterative and computational approaches are combined	Using only iterative approach solve the problem but very slowly [31] like simulated annealing. So it is combined with computational intelligence to enhance performance.

Beside above-mentioned techniques, some methods are in combined/hybrid form and gives better results. For instance, Hidden Markov model based alignment techniques are also used which use progressive approach that combines both consistency-based and probabilistic modeling techniques. Due to this combination, the results of this approach improved as compared to other consistent methods. Similarly, MSAprob method produces sequences that are based on pairwise posterior probability matrix which is constructed using pair-HMM [32] and partition function and calculates the matrix using Backward and Forward algorithms. Pruning techniques are also used to reduce the search space by simply prune the least fit solutions. Some divide and conquer algorithms [33] divide the sequences using optimal cut, solve the subproblems by aligning the sequences separately and then join these small parts to produce the final Multiple Aligned Sequences MSA [34]. Carrillo-Lipman bound [35] technique is a familiar way to determine the constraints of resultant optimal multiple alignment.

2.4 Computational intelligence base methods

Evolutionary computation revolutionized the field of computer science by providing efficient and near-optimal solutions for large scale, complex, exponential and multi-dimensional problems. It starts with a random solution of problem in search space which usually encoded in form of string called chromosomes. Each chromosome has its objective function also known as fitness function which is used to evaluate the fitness according to the required criteria. A population is a collection of chromosomes. The chromosomes which are more fit are chosen for next iteration and inserted in population. Low fitness chromosomes are removed from the population in each iteration. Crossover and Mutation are used as genetic operators to explore and exploit the search space. Parent selection method is used to select the new parents in next population formation. This process iteratively continued until termination criteria is reached. The Genetic Algorithm (GA) is based on population size, crossover, mutation rate and elitism. Elitism determines how many chromosomes from the current population are eligible to take part in next population. There are a lot of GA versions which are used to solve MSA problem including SAGA [36], GLOCSA [37], GLOCSA-GGA, VDGA [38], GAPAM [39], RBT-GA [40], MSA-GA [41] and many more.

Nature inspired swarm-based algorithms have also been used excessively in order to solve MSA problem. Most famous algorithm from this family is Particle Swarm Optimization PSO algorithm which is based on the social behavior of the bird flocks or fish schools. Every particle in the swarm try to optimize its own position as well as the position in the whole swarm based on their speed.

A swarm intelligence based particle swarm align algorithm (PSAA) for MSA has been used in several different ways [42 –45] in order to find optimal solution. Ant colony optimization ACO [46], Fish Swarm Algorithm FSO [47] and several more algorithms are used to solve MSA problem [48 –50].

In this article, a new version of Bird Swarm Algorithm (BSA) i.e., Enhanced Bird Swarm Align Algorithm (EBSAA), is proposed to solve the MSA problem. Before going into the details of EBSAA, a brief introduction of BSA is presented.

3 Bird Swarm Algorithm BSA

A generic outline of BSA [5] algorithm is shown in Table 2. Bird Swarm Algorithm (BSA) is proposed by Xian-Bing Meng et al in 2015 and is inspired by the natural behavior of birds. Some rules are observed in the swarm of birds on which the behavior of whole swarm depends. Each bird can have two types of behaviors in swarm, either it can forage or keep vigilance behavior. Both of these are stochastic behaviors in nature.

Table 2
Bird swarm algorithm

Bird Swarm Algorithm (BSA)

Input: N: number of individuals (birds) in the population

M: maximum number of iterations

FQ: frequency of bird’s flight behavior

P: probability of foraging for food

C, S, a1, a2, FL: are five constant parameters

t = 0; Initialize population and define related parameters

Evaluate N individual’s fitness value, and find the best solution

While (t <M)

If (t % FQ ≠ 0)

For i =1: N

If rand(0,1) <P

$x_{i, j}^{t + 1} = x_{i, j}^{t} + (P_{i, j} - - x_{i, j}^{t}) \times C \times rand (0, 1) + (g_{j} - - x_{i, j}^{t}) \times S \times rand (0, 1)$

Else

$x_{i, j}^{t + 1} = x_{i, j}^{t} + A 1 ({Mean}_{j} - - x_{i, j}^{t}) \times rand (0, 1) + A 2 (P_{k, j} - - x_{i, j}^{t}) \times S \times rand (- 1, 1)$

where $A 1 = a 1 \times \exp (- \frac{{pFit}_{i}}{sumFit + \in} \times N)$

and $A 2 = a 2 \times \exp (- \frac{{pFit}_{i} - {pFit}_{k}}{| {pFit}_{i} - {pFit}_{k} | + \in} \times (\frac{N \times {pFit}_{k}}{sumFit + \in}))$

End if

End for

Else

Divide swarm into two parts: producers and scroungers.

For i = 1: N

If i is a producer

$x_{i, j}^{t + 1} = x_{i, j}^{t} + randn (0, 1) \times x_{i, j}^{t}$

Else

$x_{i, j}^{t + 1} = x_{i, j}^{t} + (x_{k, j}^{t} - x_{i, j}^{t}) \times FL \times rand (0, 1)$ where k = [1 , 3,…N], k ≠ i and FL∈ [0,2]

End if

End for

End if Evaluate new solutions

If new solutions are better than their previous ones, update them

Find current best solution

t = t + 1; End while

Output: the individual with the best objective function value in the population

Each bird can prompt its best previous experience and the swarm’s best experience about food during foraging. Social information is shared among whole swarm. This behavior is same as in Particle Swarm Algorithm in which each bird maintains its previous position and try to move towards the global best of swarm to achieve maximum fitness. In vigilance situation, each bird tries to move towards the center of swarm as it is the safest place in the whole swarm. This behavior can be effected by the interference induced by the competition among swarm. More likely, birds with higher reserves (high fitness value) should have higher chance to move towards the center of swarm while those with low fitness values should have lower chance to move towards the center. Birds can move from one place to another after some interval which indicates the switch of search space after several iterations. While flying, birds may behave like producing or scrounging. The birds with highest reserves should become producers while lower reserves should become scroungers. The birds with mixed value between high and low will randomly select between producers or scroungers. Producers will always search for food while scroungers will randomly follow the producer or search the food. When termination criteria is met and solution does not improve further after several iterations, the best bird will become the solution as final output.

4 Proposed method EBSAA

Enhanced Bird Swarm Align Algorithm (EBSAA) is proposed to solve multiple sequence alignment problem. It is an optimization problem where fitness value is considered as score. The main objective of EBSAA is to maximize the scoring function to find the optimal value of the alignment. In the proposed algorithm, a bird represents a sequence alignment.

There are three types of behavior of birds. Birds will forage based on their experience, and update their best value with tendency to move towards global best. Birds may show vigilance behavior and try to move towards the center of the swarm. Birds that have high reserves will get higher chances to move towards center and vice versa. Birds may be producers or scroungers. Producers move towards optimization while scroungers either follow producers or forage. EBSAA does not contain scrounger behavior, instead it just changes the position and initiate foraging in new search space position. The pseudo code of EBSAA is represented in Table 3.

Table 3
Bird swarm algorithm

Enhanced Bird Swarm Align Algorithm (EBSAA)

Input: N: number of individuals (birds) contained by the population

M: maximum number of iterations

FQ: frequency of birds’ flight behaviors

P: probability of foraging for food

gbest: bird with best fitness value

t = 0; Initialize population and define related parameters

Evaluate N individuals’ fitness value, and find best solution

While (t <M)

If (t % FQ ≠ 0)

For i =1: N

If rand(0,1) <P

Measure distance between gbest and every bird

Move every bird towards gbest

Determine the most experience bird

Else

Measure distance of average fit bird with every bird.

Move every bird towards center

Determine the most experienced bird

End if

End for

Else

Change position of birds in search space by adding some random value at current position and start foraging at new space.

End for

End if Evaluate new solutions

If new solutions are better than their previous ones, update them

Find the current best solution

t = t + 1; End while

Output: the individual with the best objective function value in the population

The algorithm terminates when reaching at maximum number of iterations or after achieving the required score that cannot be further improved after several iterations.

Population is a set of birds that are spread over the whole search space. These birds will try to score best while foraging based on their previous experience and the experience of gbest. Birds can exhibit vigilant behavior and in this case they try to move towards the center of the swarm and find safest positions where their survival chances are comparatively high. The chance of moving towards the center of the swarm is based on score. If a score of a bird is better, it indicates higher experience, and results in the higher chance of that bird to move in the center of swarm. Similarly, birds behave as producers based on their high experience (score value) in case of switching the land by flight and by changing the search space.

Implementation details are discussed below to elaborate the steps (initialization, data representation, scoring function calculation etc.) of EBSAA.

4.1 Data representation

A bird swarm consists of number of birds {0 < i< N} and each bird has some experience ‘e’ {0 < e< max} where max is the maximum experience a bird can achieve. A bird ‘i’ with experience e is e_i = max is nominated as the leader (gbest) due to maximum experience. The experience of each bird changes and there is a need to store these experiences to track the best experience {best (e_i)} for each bird in the search space (local best). In the proposed algorithm, a single bird represents a sequence alignment with ‘n’ sequences in it. This alignment is represented as set of vectors ‘V’ where set V is represented as V={v₁ U v₂ ... U v_n}. Each vector represents location of gaps with size ‘s’ and maximum allowed gaps in a single sequence are not greater than ‘s’. The ‘s’ can be different for each sequence. Therefore, a set of n sequences to be aligned corresponds to an n-dimensional search space. According to the problem, it is not necessary to store whole alignment with alphabets because there is no need to use extra memory for it. Only the position of gaps to be inserted are stored as bird with ‘n’ vectors and ‘s’ gaps in them. In this way, a bird represented by set of vectors that represents the locations where the gaps are added. An example can be viewed in this representation: Suppose we have following bird alignment:

The whole population will initialize with this alignment and the gaps that are added randomly in this bird makes each generated bird different from each other due to different randomly generated alignments. Note that, first sequence has length 7, second has length 5 and third one has length 6. Now each sequence will have different number of gaps to get a same length alignment. If we add 2 gaps in second sequence and one gap in third sequence, then all sequence will have same length 7. To add some diversity, we need to add more gaps in the alignment because it will create exploration and more chances to get optimal alignment. If number of gaps is very large, it will take lot of computation to align the sequences that results in more time. The best approach is to take medium number of gaps so that exploration and exploitation both can be achieved easily. In our example, we set the length of each sequence in the alignment to 9 which indicates two gaps in sequence one, four gaps in sequence two and three gaps in sequence three. To get the position of gaps, generate random number each time between 1 and the length (9 in this case) which tells us the position at where the gaps will be inserted.

For two birds, randomly generated gap positions are shown and bird one represents the leader:

These birds indicate that there is a gap after 2^nd and 5^th location in the first sequence of first bird of given alignment. The second sequence contains gaps at 2^nd, 3^rd, 5^th and 6^th position while third sequence has gaps at 1^st, 4^th and 5^th location as shown below:

4.2 Initialization of Bird population

The user defines size of the population i.e., number of birds that we should have in order to initialize the swarm. Moreover, length of an alignment that has a minimum value given by the length of the largest sequence, and a maximum length is given for an instance, as twice the length of the largest sequence.

The solutions are initialized randomly by inserting the gaps in input sequences randomly. A percentage gap parameter with 15% gap penalty is used that determines percentage of gaps that is required to be inserted to pad all the sequences in the same length. If the gap penalty becomes very high, the processing time and fitness of the solution will disturb. A very large percentage of gap penalty results in unnecessary large size of search space. If the gap penalty is very small, the necessary gaps are omitted and results in the disturbance of exploration. This may result in exploitation of search space. This gap percentage is added to the sequence that is the longest in length. These randomly inserted gaps transform all the alignments in equal length. The generated swarm contains birds equal to the given population size.

In initialization phase, each bird in the population is provided with the survival chance that is represented by its fitness. If the fitness value of initially generated bird is high, its chance of being survived are higher. In case of very low fitness, a certain threshold for a bird is maintained in the form of some probability of being selected. Moreover, a best bird (bird with highest fitness) is identified from the beginning which essentially move in the next generation to make sure that the score of the population will not drop in the new generation.

4.3 Scoring function

The score of a bird represents its fitness and generated by calculating the fitness of all combinations of sequences. For example, in Fig. 3, if sequence is S={s₁, s₂, s₃}, then the global alignment score is sum of the score of s₁, s₂ plus s₂, s₃ plus s_1, s₃ as shown in Fig. 4. In case of protein, the combination of alphabets obtained by BLOSUM62 are used to identify the fitness while in case of DNA, appropriate scores are set as match (AA)=1, mismatch (AG)=-1, gap at either position with alphabet (-A or A-)=-2. If both are gaps, it will be zero.

4.4 Distance measure

In EBSAA, there is a need to measure the distance while bird is foraging or vigilance in order to move towards center. To maximize the experience of each bird in the swarm, its distance is calculated with the global best. The distance is measured as the proportion of gaps that does not match in sequences. 100% mismatch indicates the maximum distance between a bird individual and its leader global best which interprets that the foraging bird has zero experience of food search. Distance measure can be formulated as:

Distance = matchedgaps/totalgaps (I)

Moreover, two gaps will be matched when they are at same corresponding positions in the individual birds. Otherwise, it will be considered as mismatch. For example, consider the birds of Fig. 6.

Fig. 3

An alignment with 3 sequences.

Fig. 4

Random locations of gaps form a bird. Bird one represents the leader bird.

Fig. 5

Two individuals after random gap insertion.

Fig. 6

Gaps at identical positions are bold.

In this example, the highlighted digits at corresponding position shows that they match, which means that the birds are close to each other. If all the places match to each other, it means that the birds are identical and their distance is zero. How far these birds are from the global best can be calculated by equation (I). There are total of 18 gaps as shown in Fig. 5 in this example i.e., 9 in each bird.

The matches are highlighted i.e., 5 + 5 = 10. As per formula, out of total 18 gaps, 10 of them are matched to each other. The distance that is calculated by mismatches gives us d = 18 - 10 / 18 = 0.44. A bird with distance of 0.44 from global best indicates that it has tendency to move towards global best and currently away by 0.44 units. The important thing to note is the value of 4 that is present in both sequences but not considered as distance because it does not lie in the same position of same vector (sequence).

4.5 Average value for vigilance behavior

In this behavior, each bird wants to move towards the center of the swarm. Therefore, an average score of the swarm is calculated. The bird with score equals to the average score of swarm or nearest to it, is considered as a leader. For example, if there are three birds and bird a = 2, bird b = 6 and bird c = 9, then the average value is 5.66. The distance of each bird from 5.66 is 3.66, 0.44 and 3.44 respectively which results in bird ‘b’ is selected as a centered point and now the distance of every other particle is calculated with ‘b’. Although, it does not have highest score but it is more centralized in the bird swarm. Each particle will move towards it to make it more secure from outside. This behavior performs more exploitation in search space in order to find more average solutions.

The optimal solution with highest score is not selected in vigilance behavior because the highest score may exploit the population that results in the same behavior as in PSAA. In contrast, the lowest score disturbs the fitness of the bird in swarm. The most reliable and safest place for a bird in the swarm is not the edge (highest or lowest) value. It is the center that is more secure from the outside. The average value of the vigilance behavior is therefore used for maintaining a balance between exploration and exploitation.

4.6 Flight behavior and producers

A parameter is used in order to set the frequency (FQ) of swarm. This indicates the flight behavior of the swarm. If flight rate is high, bird swarm will stay less at same position and switch its positions from one place to another. This flight behavior helps to explore more search space and reduces the chance of trapping in local minima. When move to the next place, the birds start foraging food again based on their previous experience. Switching the search space is formulated as the addition of some random value multiply by original current position of bird swarm in original experience of bird in order to explore more search space. Then, normalize the values in order to make it between 0 and 1 which is the range of bird swarm. The birds that have experience, start foraging in order to gain more experience. The flight behavior can come several times usually after 3 or 5 iterations which indicates its importance in order to explore more solutions and gap combinations to get more exploration.

4.7 Operators

In traditional bird swarm algorithm, each bird can either move towards (g-best), or to the center or to switch the search space. In first case, it moves towards the (g-best) and does not explore the search space to get more exploration. In this case, it can miss the solution which may have best fitness value but cannot explore due to less exploration capacity. In traditional Particle Swarm Align Algorithm PSAA, a particle moves towards the global best only.

In EBSAA, moving in the center, makes search space congested and attracts each bird in the center. This results in some exploitation of solutions because they started converging towards the center of swarm. To handle this exploitation, producer behavior is added which shifts the swarm to other places in search space. After number of iterations, search space is explored but results in another problem i.e., there are lesser chances for newly generated solutions to explore the whole search space due to the nature of the solutions. The generated solutions improve themselves in a small number of iterations and even they explore their best possible combination. It demands to change the solution by shuffling the gaps. To overcome this problem, a similar operator of crossover from Genetic Algorithm (GA) is used. The EBSAA uses one-point crossover in which a random point is selected for cross over and two new children are generated. One-point crossover helps to mix the best part of different alignments. Due to the different sizes of alignments, the insertion of gaps in the sequence is more comfortable and efficient by using single point crossover as shown in Fig. 7. Due to the diversity of the problem, the gaps are inserted at start, in middle and at the end of the sequences. Furthermore, splitting of alignments creates new gaps. In case of using any other crossover operator, the management of gaps becomes very tedious and chances of getting well aligned children becomes very low. In that scenario, there is a need of using crossover again and again until it gets new alignments with better fitness. During this process, creation of new gaps needs to maintain the alignments of the sequences. To overcome this problem, a single point crossover is used that selects a random cut point several times to ensure that the fitness of generated children becomes probably high.

A segment form leader is replaced with the segment of particle. This replacement is achieved by removing from the particle, the gaps that are in the segment, and then adding the gaps from the segment of the leader.

For example, if crossover point in above example is 6, it means that the new particles as shown in Fig. 8 will be generated as:

There are equal chances of either increasing or decreasing the length of sequences. For example, the fifth position of leader particle has no chemical base which indicates that if this whole position will be deleted, it will have no impact on the sequence. So it will be deleted and new resultant sequences as shown in Fig. 9 are given as:

Since the distance between the bird one and bird two is 0.44 and the length of each bird individual now becomes 8. So the crossover point will be in between 1 and 0.44*8 = 3. This clearly indicates that due to high value of score, now the need is to explore more solutions for other combinations. So it is crossover of the major part of bird (greater length of bird individual) with minor (lesser part of best bird individual) of (gbest). Same is the case when crossover will happen with the average part. If the crossover point is too small, the algorithm will start to explore more solutions in work space.

Finally, if the distance between (gbest) and the bird will be greater than 0.5, the large part of (gbest) will be used in crossover. Same is the case with average value. The producers will switch the work space by adding some random value in the current place. In the result of crossover, after several iterations, there are chances for the algorithm to trap in local minima and to tackle this EBSAA uses gap insertion, gap deletion, gap column deletion and gap shuffling in order to inject diversity in the solution. If the sequence length increases from the original one, insert gaps at the end of the sequence to make all sequences align. Otherwise, delete some gaps to align the solution.

5 Experiment results

The proposed algorithm EBSAA is implemented in order to test its performance on twenty-one different datasets with one test dataset.

5.1 Datasets

The datasets consist of different protein and DNA alignments that are discussed in the detail in Table 5 are represented in encoded form. From the total of 21 datasets, five of them consist of DNA sequences including Mickey Mouse (test sequence), MYH16 [51], beta globin [52], HIV [53] and BRCA1 [54] sequences. Rest of them consists of protein sequences including samples from BaliBase [55] benchmark, GAG and PAL sets of HIV/SIV (Viral CDS) [56], PDGH (Platelet-Derived Growth Harmone) sequences and SKEW (test) sequences and all of these have different length and similarity percentage [57].

Table 4
Parameter settings for GA, PSAA and EBSAA

Population size/unchanged 20/20, 100/20, 100/50, 400/50,

Rounds 250/100, 250/1000, 250/10000

Gap penalty 15%

No of rounds Unlimited rounds

Stoppind criterion Unchanged rounds reached

Foraging probability 0.9

(EBSAA)

Flight behavior (EBSAA) 3

Population size/unchanged	20/20, 100/20, 100/50, 400/50,
Rounds	250/100, 250/1000, 250/10000
Gap penalty	15%
No of rounds	Unlimited rounds
Stoppind criterion	Unchanged rounds reached
Foraging probability	0.9
(EBSAA)
Flight behavior (EBSAA)	3

Table 5

Results of 21 different datasets. Each value is the average of 10 runs

Fitness of EBSAA with MSAGA and PSAA
The results in this table are taken as average of 10 runs?
Sequences detail					Population size/Unchanged rounds
No	Sequence name	Protein/DNA	No of Seqs.	Longest sequence length	Algorithm	20/20	100/20	100/50	400/50	250/100	250/1000	250/10000
1	Amino Acid Test 1	Protein	7	153	MSAGA	–384	58	1190	1241	579	1420	1947
					PSAA	473	773	1192	1592	109	1467	2022
					EBSAA	–614	920	1213	1310	436	1700	2669
2	Amino Acid Test 2	Protein	10	217	MSAGA	13832	30811	29431	39139	30341	37938	43420
					PSAA	35515	39062	41358	41785	13697	42398	43329
					EBSAA	28565	35715	39797	41443	18827	40707	43456
3	DNA Mickey	DNA	7	10	MSAGA	–82	–75	–73	–72	–73	–71	–71
	Mouse Sequence				PSAA	–78	–74	–75	–73	–76	–72	–71
					EBSAA	–83	–75	–75	–78	–73	–73	–71
4	DNA MYH16	DNA	7	88	MSAGA	2254	2302	2322	2323	2323	2324	2324
	Sequences				PSAA	2245	2310	2315	2315	2323	2322	2324
					EBSAA	2245	2320	2314	2323	2298	2323	2324
5	DNA Beta Globin	DNA	5	160	MSAGA	–534	–15	39	173	–132	262	353
	Sequences				PSAA	–164	–116	–106	14	–193	107	302
					EBSAA	–506	–250	–140	0	–348	10	354
6	DNA HIV	DNA	6	643	MSAGA	–7705	–7139	–6006	–5219	–7404	–6193	–3372
	Sequences				PSAA	–4243	–4217	–4059	–3949	–7260	–4285	–3524
					EBSAA	–63122	–4735	–4376	–4060	–7225	–4624	–3491
7	DNA BRCA1	DNA	9	714	MSAGA	–20290	–17529	–15520	–10034	–19276	–13874	5661
	Sequences				PSAA	–8233	–7688	–7048	–6889	–19471	–10050	1854
					EBSAA	–16192	–10630	–8789	–8189	–19624	–11533	–5344
8	Bali Base 11001	Protein	4	48	MSAGA	53	53	53	53	53	53	82
	Sequences				PSAA	53	53	53	53	53	53	86
					EBSAA	–32	23	35	53	–6	54	86
9	Bali Base 30020	Protein	8	41	MSAGA	4059	4059	4059	4059	4059	4059	4059
	Sequences				PSAA	4059	4059	4059	4059	4059	4059	4059
					EBSAA	4059	4059	4059	4059	4059	4059	4059
10	Bali Base 50001	Protein	8	42	MSAGA	2315	2505	2512	2515	2515	2517	2517
	Sequences				PSAA	2221	2197	2434	2514	2514	2516	2517
					EBSAA	1953	2069	2434	2408	2369	2515	2517
11	HIV/SIV GAG04	Protein	4	47	MSAGA	713	713	732	725	717	717	810
	Sequences				PSAA	715	736	731	721	717	734	795
					EBSAA	659	725	730	729	717	717	795
12	HIV/SIV GAG08	Protein	8	45	MSAGA	1636	1665	1698	1700	1709	1762	1880
	Sequences				PSAA	1673	1702	1739	1748	1716	1756	1841
					EBSAA	1505	1696	1745	1739	1726	1782	1898
13	HIV/SIV POL04	Protein	4	35	MSAGA	–80	–47	–54	–26	–20	14 –	22
	Sequences				PSAA	–61	–42	–30	–26	–18	17 –	17
					EBSAA	–73	–53	–35	–29	–25	22	22
14	HIV/SIV POL08	Protein	8	35	MSAGA	40	85	58	100	105	127	230
	Sequences				PSAA	45	105	108	129	129	184	208
					EBSAA	39	83	94	143	126	166	230
15	PDGH FIRST07	Protein	7	50	MSAGA	–660	–508	–472	–424	–368	–304	–254
	Sequences				PSAA	–457	–632	–434	–433	605	–396	–280
					EBSAA	–834	–481	–418	–410	–465	–445	–280
16	PDGH LAST06	Protein	6	50	MSAGA	1545	1769	1800	1816	1809	1826	1826
	Sequences				PSAA	1758	1774	1785	1793	1802	1816	1825
					EBSAA	1360	1725	174	1747	1784	1799	1826
17	PDGH MEDIUM04	Protein	4	50	MSAGA	–226	–190	–178	–173	–178	–165	–152
	Sequences				PSAA	–213	–186	–182	–179	–177	–169	–154
					EBSAA	–217	–200	–180	–175	–170	–168	–154
18	INTENSE SKEW	Protein	6	49	MSAGA	–955	791	–758	–723	–800	–708	–708
	Sequences				PSAA	–812	–803	–747	–773	–900	–740	–708
					EBSAA	–972	–858	–775	–809	–787	–714	–708
19	LIGHT SKEW	Protein	6	47	MSAGA	–699	–661	–652	–643	–651	–630	–599
	Sequences				PSAA	–677	–664	–656	–621	–934	–613	–599
					EBSAA	–712	–528	–644	–634	–630	–622	–594
20	MODERAGE SKEW	Protein	6	48	MSAGA	–776	–742	–700	–704	–764	–693	–687
	Sequences				PSAA	–775	–720	–612	–713	–721	–688	–687
					EBSAA	–842	–802	–744	–799	–724	–704	–687
21	UNIOFORM SKEW	Protein	6	46	MSAGA	–692	–681	–676	–665	–660	–658	–645
	Sequences				PSAA	–678	–672	–659	–654	–660	–653	–635
					EBSAA	–676	–671	–670	–661	–656	–655	–634

All the datasets of protein are publicly available at [55] which have properties that are bit complex for computation of MSA and Phylogenetic Tree and intended to benchmark the accuracy and performance for new MSA techniques. The sequences taken from [55] datasets are grouped into normal equidistant (BaliBase11001), highly divergent (BaliBase30020), related subsets, internal insertions (BaliBase50001) and terminal extension scenarios to meet the criteria of real-world scenario.

5.2 Parameter settings

GA, PSAA and EBSAA are compared on same datasets. The parameter settings for the three algorithms are given in Table 4. The foraging probability and the foraging behavior are used by EBSAA only. These values are obtained after excessive experimentation.

5.3 Experimental details

All the datasets are run on GA, PSAA and EBSAA as detailed in Table 4. There are five DNA sequences and 16 protein sequence. In DNA sequence only 4 based (A, T, C, G) are used while in protein sequence, 21 amino acid bases are used. Number of sequences for each dataset are in range of 4-10. Less than that results in maximum fitness value even on static algorithms. Some of the datasets are tested to check the performance on the skewed sequences. Due to diversity of the results, average of 10 runs has been considered. Above four algorithms are compared, in order to test the performances to results. Experimental findings show that in some of the cases the results of all these algorithms are equal. In some case, EBSAA with optimized parameters shows better results. The machine used in this experiment is core i7 with 2.6 GHz clock speed, 8GB DDR3 RAM and 64 bit windows 8.1 operating system.

6 Discussion and analysis

For small population size with small number of unchanged rounds, PSAA and EBSAA could not explore the search space well and hence the performance obtained could not result in optimal fitness value as in shown in Fig. 10. In Fig. 11, the population size increases to a particular extant, but the algorithm terminates very soon and the results are bit improved due the diversity addition by increasing population size. When number of rounds are increases, the probability of exploration increases as shown in Fig. 12 but the significant improvements can be seen in Fig. 13 where the population size increased by 4 times. In this case, EBSAA experienced more exploration and tried to visit a bigger area of search space. The high value of foraging probability controlled the over-exploration and provided exploitation in that case.

Fig. 7

One-point crossover at 6^th location.

Fig. 8

Generated children from crossover.

Fig. 9

Child one after deletion of column gap.

Fig. 10

Fitness value of the algorithms for datasets with pop size = 20 and stop after 20 unchanged rounds.

Fig. 11

Fitness value of the algorithms for datasets with pop size = 100 and stop after 20 unchanged rounds.

Fig. 12

Fitness value of the algorithms for datasets with pop size = 100 and stop after 50 unchanged rounds.

Fig. 13

Fitness value of the algorithms for datasets with pop size = 400 and stop after 50 unchanged rounds.

After experimenting medium population size (250 in Fig. 14) with 100 unchanged rounds, the exploration behavior reflected back and the performance of EBSAA disturbs. In this case, PSAA exploits and reach to the maximum very soon. After increasing unchanged round to 1000 and 10000 (in Figs. 15 and 16), EBSAA explores the more search space and start performing better than GA and PSAA.

Fig. 14

Fitness value of the algorithms for datasets with pop size = 250 and stop after 100 unchanged rounds.

Fig. 15

Fitness value of the algorithms for datasets with pop size = 250 and stop after 1000 unchanged rounds.

Fig. 16

Fitness value of the algorithms for datasets with pop size = 250 and stop after 10000 unchanged rounds.

For small length of sequences, MSAGA, PSAA and EBSAA produce almost same solution due to smaller search space. When unchanged rounds are very large, all three algorithms successfully explore the best solution. For medium length DNA and protein sequences, the improvement in EBSAA is far significant than PSAA and MSAGA. The external factors i.e., foraging probability and flight behavior of a bird in EBSAA have played a significant role and provide more exploration. For lesser number of unchanged rounds, the exploration expands in the search space and a less fit bird is obtained as a final solution. When unchanged rounds are increased, the chance of getting more fit solutions increase and better solutions are achieved.

The datasets with large length exhibit significant changes when population size and unchanged rounds are increased. Amino Acid Test 1 & 2 sequences improved the results with more exploration. EBSAA explored and improved fitness values when number of unchanged rounds become very large and results in more fit solutions. Same is the case with DNA HIV and BRCA1 sequences in which the sequences are very long. For 10000 unchanged rounds, a significant change can be observed in the fitness score due to more exploration. DNA beta globin sequences showed significant changes in score when population size and unchanged rounds are increased.

7 Conclusion

In this study, a swarm intelligence based Enhanced Bird Swarm Align Algorithm (EBSAA) is presented for MSA problem. Genetic algorithm (GA) and Particle Swarm Align Algorithm (PSAA) are compared in terms of their performance on different datasets. Average of 10 runs is used due to diversity of results. For small dataset problems, GA performs better than PSA and EBSAA. For medium size datasets, performance of PSAA competes with EBSAA but EBSAA gives overall better average output. In most of the cases, the maximum value is achieved by EBSAA. Finally, when number of iterations are limitless, the performance of EBSAA outperforms GA and PSAA based approaches.

References

Chvátal

Vacláv

and Sankoff

David

, “Longest Common Subsequences of Two Random Sequences” Applied probability trust, Vol 12 No 2, June 1975.

Goldberg,

David E.

, Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Pub.Co. 1989.

Mount,

David E.

, Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press. 2001.

Kimand

Jin

, Sakti Pramanik “An Efficient Method for Multiple Sequence Alignment” AAAI, 1994.

Mengab

Xian-Bing

, Gaoc

X.Z.

, Lude

Lihua

, Liub

& Zhanga

Hengzhen

, “A new bio-inspired optimisation algorithm: Bird Swarm Algorithm” Journal of Experimental & Theoretical Artificial Intelligence, June 2015.

Radenbaugh

Amie

, “Applications of genetic algorithms in bioinformatics, “San Jose State University, SJSU scholar work, 2008.

Rodriguez

Pedro F.

, Niño

Luis F.

and Alonso

Oscar M.

, “Multiple sequence alignment using swarm intelligence”, International Journal of Computational Intelligence Research 3(2) (2007).

Cohen

, Bioinformatics—an introduction for computer scientists, ACM Comput. Surv 36(2) (2004), 122—158. 10.1145/1031120.1031122

Atwood,

Teresa.

, Introduction to Bioinformatics. 2nd.Ed. Prentice Hall. Bogotá, Colombia. 115–119. 2002.

10.

Henikoff

and Henikoff

J.G.

, Amino acid substitution matrices from protein blocks, Proceedings of the National Academy of Sciences 89(22) (1992), 10915–10919.

11.

Suresh

and Vijayalakshmi

, A novel approach based on approximation and heuristic methods using multiple sequence alignments.

12.

Thompson

J.D.

, Higgins

D.G.

and Gibson

T.J.

, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequenceweighting, positions-specific gap penalties and weight matrix choice.

13.

Suresh

and Vijayalakshmi

, A novel approach based on approximation and heuristic methods using multiple sequence alignments, Indian J Appl Sci 3 (2013), 36–40.

14.

Lalwani

, Kumar

and Gupta

, A study on inertia weight schemes with modified particle swarm optimization algorithm for multiple sequence alignment. In 2013 Sixth International Conference on Contemporary Computing (IC3) (2013, August). (pp. 283-288). IEEE.

15.

Larkin

M.A.

, Blackshields

, Brown

N.P.

, Chenna

, McGettigan

P.A.

, McWilliam

and Higgins

D.G.

, Clustal W and Clustal X version 2.0. Bioinformatics 23(21) (2007), 2947–2948.

16.

Lassmann

and Sonnhammer

E.L.

, Kalign–an accurate and fast multiple sequence alignment algorithm, BMC Bioinformatics 6(1) (2005), 1–9.

17.

and Manber

, Fast text searching: allowing errors, Communications of the ACM 35(10) (1992), 83–91.

18.

Morgenstern

, Dress

and Werner

, Multiple DNA and protein sequence alignment based on segment-to-segment comparison, Proceedings of the National Academy of Sciences 93(22) (1996), 12098–12103.

19.

Notredame

, Higgins

D.G.

and Heringa

, T-Coffee: A novel method for fast and accurate multiple sequence alignment, Journal of Molecular Biology 302(1) (2000), 205–217.

20.

Morgenstern

, Prohaska

S.J.

, Pöhler

and Stadler

P.F.

, Multiple sequence alignment with user-defined anchor points, Algorithms for Molecular Biology 1(1) (2006), 1–12.

21.

Subramanian

A.R.

, Kaufmann

and Morgenstern

, DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment, Algorithms for Molecular Biology 3(1) (2008), 1–11.

22.

Al Ait

, Yamak

and Morgenstern

, DIALIGN at GOBICS—multiple sequence alignment using various sources of external information, Nucleic Acids Research 41(W1) (2013), W3–W7.

23.

Katoh

, Misawa

, Kuma

K.I.

and Miyata

, MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Research 30(14) (2002), 3059–3066.

24.

Edgar

R.C.

, MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research 32(5) (2004), 1792–1797.

25.

Yamada

, Gotoh

and Yamana

, Improvement in accuracy of multiple sequence alignment using novel group-to-group sequence alignment algorithm with piecewise linear gap cost, BMC Bioinformatics 7(1) (2006), 1–17.

26.

Gotoh

, Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments, Journal of Molecular Biology 264(4) (1996), 823–838.

27.

Wang

and Li

K.B.

, An adaptive and iterative algorithm for refining multiple sequence alignment, Computational Biology and Chemistry 28(2) (2004), 141–148.

28.

C.B.

, Mahabhashyam

M.S.

, Brudno

and Batzoglou

, ProbCons: Probabilistic consistency-based multiple sequence alignment, Genome Research 15(2) (2005), 330–340.

29.

Roshan

and Livesay

D.R.

, Probalign: multiple sequence alignment using partition function posterior probabilities, Bioinformatics 22(22) (2006), 2715–2721.

30.

Sievers

, Wilm

, Dineen

, Gibson

T.J.

, Karplus

, Li

and Higgins

D.G.

, Fast, scalable generation of high-quality proteinmultiple sequence alignments using Clustal Omega, MolecularSystems Biology 7(1) (2011), 539.

31.

Kim

, Pramanik

and Chung

M.J.

, Multiple sequence alignment using simulated annealing, Bioinformatics 10(4) (1994), 419–426.

32.

Durbin

, Eddy

S.R.

, Krogh

and Mitchison

, Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press (1998).

33.

Stoye

, Multiple sequence alignment with the divide-and-conquer method, Gene 211(2) (1998), GC45–GC56.

34.

Konagurthu

A.S.

and Stuckey

P.J.

, Optimal sum-of-pairs multiple sequence alignment using incremental Carrillo and Lipman bounds, Journal of Computational Biology 13(3) (2006), 668–685.

35.

Carrillo

and Lipman

, The multiple sequence alignment problem in biology, SIAM Journal on Applied Mathematics 48(5) (1988), 1073–1082.

36.

Notredame

and Higgins

D.G.

, SAGA: sequence alignment by genetic algorithm, Nucleic Acids Research 24(8) (1996), 1515–1524.

37.

Arenas-Díaz

E.D.

, Ochoterena

and Rodríguez-Vázquez

, Multiple sequence alignment using a genetic algorithm and GLOCSA, Journal of Artificial Evolution and Applications, 2009.

38.

Naznin

, Sarker

and Essam

, Vertical decomposition with genetic algorithm for multiple sequence alignment, BMC Bioinformatics 12(1) (2011), 1–26.

39.

Naznin

, Sarker

and Essam

, Progressive alignment method using genetic algorithm for multiple sequence alignment, IEEE Transactions on Evolutionary Computation 16(5) (2012), 615–631.

40.

Taheri

and Zomaya

A.Y.

, RBT-GA: a novel metaheuristic for solving the multiple sequence alignment problem, Bmc Genomics 10(1) (2009), 1–11.

41.

Gondro

and Kinghorn

B.P.

, A simple genetic algorithm for multiple sequence alignment, Genetics and Molecular Research 6(4) (2007), 964–982.

42.

Long

H.X.

, Xu

W.B.

, Sun

and Ji

W.J.

, Multiple sequence alignment based on a binary particle swarm optimization algorithm. In 2009 Fifth International Conference on Natural Computation (Vol. 3, pp. 265–269). IEEE. (2009, August).

43.

Long

H.X.

, Xu

W.B.

and Sun

, Binary particle swarm optimization algorithm with mutation for multiple sequence alignment. In Biology Forum/Rivista di Biologia (Vol. 102, No. 1) (2009, January).

44.

Chen

C.C.

, Two-layer particle swarm optimization for unconstrained optimization problems, Applied Soft Computing 11(1) (2011), 295–304.

45.

Lalwani

, Kumar

and Gupta

, A novel two-level particle swarm optimization approach for efficient multiple sequence alignment, Memetic Computing 7(2) (2015), 119–133.

46.

Chen

, Pan

, Chen

, Liu

and Chen

, Multiple sequence alignment by ant colony optimization and divide-and-conquer. In International Conference on Computational Science (pp. 646–653). Springer, Berlin, Heidelberg (2006, May).

47.

Yang

W.H.

, An improved artificial fish swarm algorithm and its application in multiple sequence alignment, Journal of Computational and Theoretical Nanoscience 11(3) (2014), 888–892.

48.

Lee

Z.J.

, Su

S.F.

, Chuang

C.C.

and Liu

K.H.

, Genetic algorithm with ant colony optimization (GA-ACO) for multiple sequence alignment, Applied Soft Computing 8(1) (2008), 55–78.

49.

Kaya

, Sarhan

and Alhajj

, Multiple sequence alignment with affine gap by using multi-objective genetic algorithm, Computer Methods and Programs in Biomedicine 114(1) (2014), 38–49.

50.

Öztürk

and Aslan

, A new artificial bee colonyalgorithm to solve the multiple sequence alignment problem, International Journal of Data Mining and Bioinformatics 14(4) (2016), 332–353.

51.

McGuigan

, Phillips

P.C.

and Postlethwait

J.H.

, Evolution of sarcomeric myosin heavy chain genes: evidence from fish, Molecular Biology and Evolution 21(6) (2004), 1042–1056.

52.

Kosche

K.A.

, Dobkin

and Bank

, DNA sequences regulating human β globin gene expression, Nucleic Acids Research 13(21) (1985), 7781–7793.

53.

Mushahwar

I.K.

, Human immunodeficiency viruses: molecular virology,pathogenesis, diagnosis and treatment, Perspectives in MedicalVirology 13 (2006), 75–87.

54.

Lenoir

G.M.

, Lynch

, Watson

, Conway

, Lynch

, Narod

and Feunteun

, Familial breast-ovarian cancer locus on chromosome 17q12-q23, The Lancet 338(8759) (1991), 82–83.

55.

Thompson

J.D.

, Plewniak

and Poch

, BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs, Bioinformatics (Oxford, England) 15(1) (1999), 87–88. http://www-bio3d-igbmc.u-strasbg.fr/balibase

56.

Kuiken

, Foley

, Hahn

, Marx

, McCutchan

, Mellors

and Korber

, HIV sequence compendium 2001. Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM. (2001).

57.

Heller

, MSG: A Gap-Oriented Genetic Algorithm for Multiple Sequence Alignment (2009).

Multiple sequence alignment techniques
Name	Detail	Pros	Cons
Dynamic Programming	Constructs N dimensional matrix with N sequences and find the optimal path by doing score calculations. [13]	Optimal solution can be found using DP.	It depends on number of sequences. Computational cost and memory usage exponentially increase with number of sequences.
Progressive Alignment (CLUSTAL W)	Two sequences are aligned together in this technique. Then for each pair, a guide tree is created based on similarity score. CLUSTAL W is one of a well know Progressive alignment technique.	A greedy approach is used which is more efficient than dynamic programming. [14]	Alignment of sequences totally depends on early pair alignment. Any mistake happens in early stage affects the rest of the process. Time increase is proportional to the no. of sequences.
CLUSTAL W2	It is also a progressive alignment method that calculates different gap penalty scores by doing progressive alignment many times and choose the best score. [15]	There are more chances in getting optimal aligned sequences as the best score indicates the most aligned sequences.	For generating, calculating and finding the optimal solution, it takes more time each time.
KALIGN	It is also a standard progressive alignment method. The difference is that it does not depends on k-tuple used by CLUSTAL[16]. Instead uses Wu-Manber as an estimate of pairwise distance.[17]	It depends on the Wu-Manber as an estimate which improves the performance and accuracy as compared to k-tuple methods.	For an optimal solution, time taken was still not practical.
Consistency based	Consistency based approach follow the principle of maximizing the agreement of
approach	pairwise alignment. It looks each time for the created tree which results in
high accuracy and use in further progressive alignment. DIALIGN [18]
and TCOFFEE [19] are two consistency base approaches.
DIALIGN	This consistency base techniques combines the features of both global and local alignments. This method emphasis more on local homologies and similarities among different sequences to discover functional regions among different sequences. Anchored DIALIGN [20], DIALIGN-TX [21], PFAM DB based DIALIGN-PFAM [22] are some of the famous variants of DIALIGN approach.	Both local and global alignments are taken into consideration and their combining features results in more efficient solutions.	Only using consistency base approach does not guarantee the optimal solution. This method alone does not provide accurate results.
TCOFFEE	This consistency-base technique created libraries of local and global pairwise alignments. These pairwise alignments are combined to solve multiple sequence alignment.
Progressive-iterative	This approach re-do the multiple sequence alignment in each iteration.
approach	This looping process continues until there is no improvement in the final score.
MAFFT [23], MUSCLE [24], PRRP [25], PRIME [26] and MUMMALS/PROMALS [27]
are more significant variants of progressive-iterative approach.
Probabilistic consistency transformation-based strategy	In this technique, for every pairwise alignment done on the whole sequences of the database, the probability of the distribution is decided before creation of the tree. This is used from probcon [28], probalign [29], CLUSTAL Omega [30] and many other tools.	It gives more accurate results when combined with progressive-iterative approach	Computations becomes complex. Usage of memory becomes very high.
Iterative approach based computational intelligence techniques	This technique based on simulated annealing and evolutionary computation methods to solve the problem by giving equal priority to all the sequences.	Accurate results are achieved using simple computations when both iterative and computational approaches are combined	Using only iterative approach solve the problem but very slowly [31] like simulated annealing. So it is combined with computational intelligence to enhance performance.

Multiple sequence alignment using enhanced bird swarm align algorithm

Abstract

Keywords

1 Introduction

2 Background

2.1 Multiple sequence alignment problem

2.2 Substitution matrix scheme

3 Bird Swarm Algorithm BSA

4.2 Initialization of Bird population

4.3 Scoring function

4.4 Distance measure

4.6 Flight behavior and producers

4.7 Operators

5 Experiment results

5.1 Datasets

Table 4 Parameter settings for GA, PSAA and EBSAA Population size/unchanged 20/20, 100/20, 100/50, 400/50, Rounds 250/100, 250/1000, 250/10000 Gap penalty 15% No of rounds Unlimited rounds Stoppind criterion Unchanged rounds reached Foraging probability 0.9 (EBSAA) Flight behavior (EBSAA) 3

5.3 Experimental details

6 Discussion and analysis

References

Table 4
Parameter settings for GA, PSAA and EBSAA

Population size/unchanged 20/20, 100/20, 100/50, 400/50,

Rounds 250/100, 250/1000, 250/10000

Gap penalty 15%

No of rounds Unlimited rounds

Stoppind criterion Unchanged rounds reached

Foraging probability 0.9

(EBSAA)

Flight behavior (EBSAA) 3