A Parallel Multiobjective Metaheuristic for Multiple Sequence Alignment

Abstract

The simultaneous alignment of three or more nucleotide or amino acid sequences is known as multiple sequence alignment (MSA), a nondeterministic polynomial time (NP)-hard optimization problem. The time complexity of finding an optimal alignment grows exponentially with the number of sequences to align. In this work, we deal with a multiobjective version of the MSA problem wherein the goal is to simultaneously optimize the accuracy and conservation of the alignment. A parallel version of the hybrid multiobjective memetic metaheuristic for MSA is proposed. To evaluate our proposal, we have selected a pool of data sets with different numbers of sequences (up to 1000 sequences) and studied its parallel performance against other well-known parallel approaches published in the literature, such as MSAProbs, tree-based consistency objective function for alignment evaluation (T-Coffee), Clustal Ω, and multiple alignment using fast Fourier transform (MAFFT). The comparative study reveals that our parallel aligner obtains better results than MSAProbs, T-Coffee, Clustal Ω, and MAFFT. In addition, with 32 cores the parallel version is around 25 times faster than the sequential version, an efficiency of around 80%.

1. Background

Multiple sequence alignment (MSA) is the process of aligning three or more nucleotide or amino acid sequences at the same time (Bacon and Anderson, 1986). The main aim of MSA is to discover common ancestors among biological sequences. The discovery of biological relationships among several sequences is vital for inferring phylogenetic relationships among groups of organisms (Doolittle, 1981; Feng and Doolittle, 1987). Another important goal of MSA is the determination of biological significance among the given sequences (Notredame, 2002); therefore, we prioritize the conservation within regions throughout the alignment process. Finally, a proper alignment helps us to detect which regions of a gene are susceptible to mutation and which can have one residue replaced by another without changing the function.

The MSA problem is a nondeterministic polynomial time (NP)-complete optimization problem (Dogan and Otu, 2014); therefore, different heuristic approaches have been developed for solving this problem in a reasonable amount of time. In the literature, we find three main groups: progressive methods, consistency-based methods, and iterative refinement methods.

The most representative progressive methods are Clustal W (Thompson et al., 1994), Clustal Ω (Sievers et al., 2011), PRANK (Loytynoja and Goldman, 2005), and Kalign (Lassmann et al., 2009). These methods start by calculating a distance matrix from every pair of the given sequences; then, a guide tree is built by using a hierarchical clustering algorithm, such as the unweighted pair group method with arithmetic mean (UPGMA); finally, the alignment is obtained by following the guide tree. The main disadvantage of progressive methods lies in the chance of introducing an inaccurate gap at an early stage, which is then propagated to the final alignment.

The second group includes those methods based on consistency. These approaches construct a database of local and global alignments between each pair of sequences that helps to build an accurate multiple alignment among all the given sequences. Among the most important consistency-based tools are tree-based consistency objective function for alignment evaluation (T-Coffee) (Notredame et al., 2000), PROBabilistic CONSistency-based MSA (ProbCons) (Do et al., 2005), and MSAProbs (Liu et al., 2010).

Finally, we find the iterative refinement tools. These tools start by performing a progressive alignment, and then they iterate with the aim of correcting any inaccurate gap inserted in the progressive construction stage. The refinement process is repeated until no further improvements are found or until a predefined number of iterations is reached. Among the most widely used iterative refinement methods, we find multiple sequence comparison by log-expectation (MUSCLE) (Edgar, 2004) and multiple alignment using fast Fourier transform (MAFFT) (Katoh et al., 2002).

Multiobjective optimization in conjunction with evolutionary computation is a powerful methodology to optimize real-world problems in the bioinformatics (Gonzalez-Alvarez et al., 2015) and telecommunication fields (Rubio-Largo et al., 2013a; Rubio-Largo et al., 2013b). In Rubio-Largo et al. (2016), we proposed a hybrid multiobjective memetic metaheuristic for multiple sequence alignment (H4MSA) that optimizes the quality and consistency of the final alignment simultaneously. The results revealed that H4MSA is a very promising tool.

Given a set of k nonaligned sequences wherein the length of the largest sequence is L, the time and space complexity for solving the MSA problem is O(k2^k L^k) (Waterman et al., 1976); therefore, finding an optimal alignment of a large number of sequences (∼1000 sequences) becomes computationally intractable. Some of the aforementioned tools allow parallelism: Clustal Ω, T-Coffee, MSAProbs, and MAFFT; however, it is still a challenge to develop accurate methods that provide higher parallel efficiencies for aligning very large sets of sequences.

The main disadvantage of H4MSA is its prohibitive running time when the number of sequences increases. The main contribution of this article is an efficient parallel version of H4MSA that aligns large sets of sequences accurately in a reasonable amount of time.

1.1. MSA problem

Given a set of sequences S: {s₁, s₂, …, s_k} of lengths |s₁|, |s₂|, …, |s_k| defined over an alphabet Σ, for example, Σ_aminoacids = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}.

An MSA of S is defined as S′: {s′₁, s′₂, …, s′_k}, where the length of the k sequences is exactly the same. Note that S′ is defined over the same alphabet as S (Σ) with an additional gap symbol (−); so, S′ is defined over the alphabet Σ ∪ {−}.

In this way, a multiple alignment is obtained by adding gaps to the sequences of S so that their lengths become the same. It can be seen as a matrix representation where the rows represent sequences and the columns represent aligned symbols. Each column of an alignment must contain at least one symbol of Σ; in other words, a column with all gaps is not allowed. An example of MSA may be as follows:

• Unaligned sequences (input):

s₁: TSREETKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMA (43)

s₂: RKTALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMH (45)

s₃: FFMEKTAKYAKLHPEMSNSDLTKILSKKWKELPEKKKMKYIQDFQREKQEF (51)

s₄: VVAESTLKESSAINKILGRRWHALSREEQAKYYELARKERQLHMQL (46)

s₅: DYKKETESDIDKHLSDIWTVNKGSWVALGFSDGQEA (36)

• Aligned sequences (output):

s′₁: TSREETK---KKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMA---------- (56)

s′₂: RK---TA---LENPR--MRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMH--- (56)

s′₃: FFMEKTAKYAKLHPE--MSNSDLTKILSKKWKELPEKKKMKYIQDFQREKQEF--- (56)

s′₄: VVAESTL---K-------ESSAINKILGRRWHALSREEQAKYYELARKERQLHMQL (56)

s′₅: DYKKETE-------------SDIDKHLSDIWTVNKGSWVALGFSDGQEA------- (56)

          *              *   *     *
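The definition above can be checked mechanically. The following Python sketch is an illustration only, not part of H4MSA; the function name `is_valid_msa` is our own. It tests the three conditions stated in this section: all aligned rows have the same length, removing the gap symbols from each row gives back the input sequence, and no column consists only of gaps. The aligned rows of the example are built piece by piece so that the gap runs are easy to count.

```python
GAP = "-"

def is_valid_msa(unaligned, aligned):
    """Return True if `aligned` is a valid MSA of `unaligned` under the definition above."""
    if not aligned or len(unaligned) != len(aligned):
        return False
    width = len(aligned[0])
    # Every aligned row must have exactly the same length.
    if any(len(row) != width for row in aligned):
        return False
    # Removing the gap symbols must give back the original sequence.
    if any(row.replace(GAP, "") != seq for seq, row in zip(unaligned, aligned)):
        return False
    # A column made only of gaps is not allowed.
    return all(any(row[c] != GAP for row in aligned) for c in range(width))

S = [
    "TSREETKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMA",
    "RKTALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMH",
    "FFMEKTAKYAKLHPEMSNSDLTKILSKKWKELPEKKKMKYIQDFQREKQEF",
    "VVAESTLKESSAINKILGRRWHALSREEQAKYYELARKERQLHMQL",
    "DYKKETESDIDKHLSDIWTVNKGSWVALGFSDGQEA",
]
S_ALIGNED = [
    "TSREETK" + GAP * 3 + "KKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMA" + GAP * 10,
    "RK" + GAP * 3 + "TA" + GAP * 3 + "LENPR" + GAP * 2
    + "MRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMH" + GAP * 3,
    "FFMEKTAKYAKLHPE" + GAP * 2 + "MSNSDLTKILSKKWKELPEKKKMKYIQDFQREKQEF" + GAP * 3,
    "VVAESTL" + GAP * 3 + "K" + GAP * 7 + "ESSAINKILGRRWHALSREEQAKYYELARKERQLHMQL",
    "DYKKETE" + GAP * 13 + "SDIDKHLSDIWTVNKGSWVALGFSDGQEA" + GAP * 7,
]

print(len(S_ALIGNED[0]), is_valid_msa(S, S_ALIGNED))  # 56 True
```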

To find an accurate alignment, we propose the use of multiobjective optimization. Therefore, we search for the best solution (alignment) that simultaneously maximizes the weighted sum-of-pairs (WSP) function with affine gap penalties (f₁) (Gupta et al., 1995) and the totally conserved columns score (f₂) (Edgar, 2004; Thompson et al., 2005).

On the one hand, the WSP with affine gaps (f₁), which is to be maximized, is defined by the following equations:

$$ WSP(S') = \sum_{l=1}^{AL} SP(l) - \sum_{i=1}^{k} AGP(s'_i). \tag{1} $$

$$ SP(l) = \sum_{i=1}^{k-1} \sum_{j=i}^{k} W_{i,j} \times \delta(s'_{i,l}, s'_{j,l}). \tag{2} $$

$$ W_{i,j} = 1 - \frac{LD(s_i, s_j)}{\max(|s_i|, |s_j|)}. \tag{3} $$

$$ AGP(s'_i) = (g_o \times \#gaps) + (g_e \times \#spaces). \tag{4} $$

In Equation 1, AL is the alignment length and SP(l) is the sum-of-pairs score of the lth column, defined in Equation 2. In Equation 2, δ is the substitution matrix used, either point accepted mutation (PAM) or block substitution matrix (BLOSUM), and W_{i,j} is the weight between sequences s_i and s_j. This weight is computed with Equation 3, where LD is the Levenshtein distance between two nonaligned sequences, i.e., the minimum number of insertions, deletions, or substitutions required to change one sequence into the other. In Equation 1, AGP(s′_i) is the affine gap penalty score of sequence s′_i (see Eq. 4). In Equation 4, g_o is the weight to open the gap and g_e is the weight to extend the gap with one more space. In this work, we have used the BLOSUM62 substitution matrix, g_o = 6, and g_e = 0.85.
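To make Equations 1 to 4 concrete, the following Python sketch evaluates f₁ on a toy alignment. It is an illustration under stated assumptions, not the H4MSA implementation: `toy_delta` is a hypothetical stand-in for BLOSUM62 (+1 match, −1 mismatch, 0 when a gap is involved); #gaps is taken as the number of gap runs and #spaces as the number of gap symbols; and the pair sum runs over distinct pairs i < j, the usual sum-of-pairs convention, so the self-pairs that a literal reading of the inner index j = i would add are left out.

```python
GAP = "-"

def levenshtein(a, b):
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def weight(si, sj):
    """Equation 3, computed on the unaligned sequences."""
    return 1.0 - levenshtein(si, sj) / max(len(si), len(sj))

def affine_gap_penalty(row, g_open=6.0, g_ext=0.85):
    """Equation 4, assuming #gaps = gap runs and #spaces = gap symbols."""
    runs = sum(1 for i, c in enumerate(row)
               if c == GAP and (i == 0 or row[i - 1] != GAP))
    return g_open * runs + g_ext * row.count(GAP)

def toy_delta(a, b):
    """Hypothetical substitution score used only for this example."""
    if a == GAP or b == GAP:
        return 0
    return 1 if a == b else -1

def wsp(unaligned, aligned, delta=toy_delta, g_open=6.0, g_ext=0.85):
    """Equation 1: weighted sum-of-pairs minus the affine gap penalties."""
    k = len(aligned)
    total = 0.0
    for i in range(k - 1):
        for j in range(i + 1, k):  # assumption: distinct pairs only
            w = weight(unaligned[i], unaligned[j])
            # Summing over columns inside the pair loop equals summing SP(l) over l.
            total += w * sum(delta(a, b) for a, b in zip(aligned[i], aligned[j]))
    return total - sum(affine_gap_penalty(r, g_open, g_ext) for r in aligned)

# LD("ACGT", "AGT") = 1, so W = 0.75; column scores add up to 3; one gap run of one space.
print(round(wsp(["ACGT", "AGT"], ["ACGT", "A-GT"]), 2))  # -4.6
```

In this toy case the pair contributes 0.75 × 3 = 2.25 and the single gap costs 6 + 0.85 = 6.85, which gives 2.25 − 6.85 = −4.6.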

On the other hand, the totally conserved columns score (f₂) is the number of columns that are completely aligned with exactly the same compound. This objective function needs to be maximized to ensure more conserved regions within the alignment. In the aforementioned example, the symbol “*” indicates a column completely aligned with exactly the same compound; therefore, f₂ = 4.
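The value f₂ = 4 can be reproduced with a few lines of Python. The sketch below is illustrative only (the function name is our own); it lists the 1-based indices of the columns in which every row holds the same residue, using the example alignment with gaps written as "-".

```python
GAP = "-"

def totally_conserved_columns(aligned):
    """Return the 1-based indices of columns where all rows hold the same residue."""
    return [c + 1 for c, col in enumerate(zip(*aligned))
            if col[0] != GAP and all(x == col[0] for x in col)]

S_ALIGNED = [
    "TSREETK" + GAP * 3 + "KKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMA" + GAP * 10,
    "RK" + GAP * 3 + "TA" + GAP * 3 + "LENPR" + GAP * 2
    + "MRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMH" + GAP * 3,
    "FFMEKTAKYAKLHPE" + GAP * 2 + "MSNSDLTKILSKKWKELPEKKKMKYIQDFQREKQEF" + GAP * 3,
    "VVAESTL" + GAP * 3 + "K" + GAP * 7 + "ESSAINKILGRRWHALSREEQAKYYELARKERQLHMQL",
    "DYKKETE" + GAP * 13 + "SDIDKHLSDIWTVNKGSWVALGFSDGQEA" + GAP * 7,
]

columns = totally_conserved_columns(S_ALIGNED)
print(columns, len(columns))  # [6, 21, 25, 31] 4
```

The four columns hold T, S, K, and W, respectively, in all five sequences.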

The maximum number of columns (alignment length) was limited to ⌈1.5 · max(|s₁|, |s₂|, …, |s_k|)⌉. The choice of 1.5 as a scaling factor allowed the alignment to be 50% longer than the longest sequence in the set. This choice was based on the observation that solutions to common alignment problems rarely contained more than 50% gaps.
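As a quick check of this bound on the example of this section, the sketch below (illustrative; the function name is our own) computes the limit for sequence lengths 43, 45, 51, 46, and 36. The longest sequence has 51 residues, so the limit is ⌈76.5⌉ = 77 columns, comfortably above the 56 columns of the example alignment.

```python
import math

def max_alignment_length(lengths, factor=1.5):
    """Upper bound on the number of columns: ceil(factor * longest sequence)."""
    return math.ceil(factor * max(lengths))

print(max_alignment_length([43, 45, 51, 46, 36]))  # 77
```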

The rest of the article is organized as follows. A detailed description of the parallel H4MSA algorithm is presented in Section 2. Section 3 compares the alignment accuracy and parallel efficiency of H4MSA with other parallel approaches published in the literature. Section 4 discusses the obtained results, whereas Section 5 concludes the article and suggests future research avenues.

2. Methods

This section is divided into two parts. The first one describes the H4MSA algorithm and studies the points at which the algorithm spends most of its time. The second part presents a detailed description of the parallelization of the metaheuristic.

2.1. H4MSA

The shuffled frog-leaping optimization algorithm (SFLA) is a memetic metaheuristic developed by Eusuff et al. (2006). The search procedure begins with a random population of frogs (solutions) in a swamp.

The population of frogs is divided into isolated communities (memeplexes) that evolve independently, allowing different directions within the search space to be explored. Within each community, the frogs evolve by sharing their ideas with their neighbor frogs. In this way, those frogs with better ideas will share more ideas than those frogs with poor ideas.

In addition, the best frogs of each community will share their ideas with other frogs in different communities. After a number of iterations, the communities of frogs are forced to mix and new communities are formed through a shuffling process to accelerate the convergence of the algorithm.
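The division into communities and the later shuffling can be sketched as follows. This is a minimal illustration, not the H4MSA code: it assumes that frogs are dealt to memeplexes in round-robin order of rank, and it ranks frogs with a single scalar `fitness` function (a hypothetical simplification, since H4MSA ranks frogs on two objectives).

```python
def partition_into_memeplexes(ranked_frogs, m):
    """Deal frogs, already sorted best-first, into m memeplexes in round-robin order."""
    memeplexes = [[] for _ in range(m)]
    for rank, frog in enumerate(ranked_frogs):
        memeplexes[rank % m].append(frog)
    return memeplexes

def shuffle(memeplexes, fitness):
    """Mix all communities, re-rank the frogs best-first, and form new communities."""
    everyone = [frog for memeplex in memeplexes for frog in memeplex]
    everyone.sort(key=fitness, reverse=True)
    return partition_into_memeplexes(everyone, len(memeplexes))

# Frogs are represented here by their (scalar) fitness values.
print(partition_into_memeplexes([9, 8, 7, 6, 5, 4], 3))     # [[9, 6], [8, 5], [7, 4]]
print(shuffle([[1, 9], [4, 8], [7, 2]], fitness=lambda f: f))  # [[9, 4], [8, 2], [7, 1]]
```

Dealing by rank gives every community a similar mix of good and poor frogs, so no memeplex starts with all of the best solutions.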

The chromosome representation of a solution in H4MSA is different from the traditional binary representation (1 indicates gap symbol and 0 indicates a residue). For example, the binary representation of the alignment shown in Section 1.1 is:

s′₁: 00000001110000000000000000000000000000000000001111111111

s′₂: 00111001110000011000000000000000000000000000000000000111

s′₃: 00000000000000011000000000000000000000000000000000000111

s′₄: 00000001110111111100000000000000000000000000000000000000

s′₅: 00000001111111111111000000000000000000000000000001111111

However, in H4MSA, a solution only stores the number of groups of gaps followed by the information of each group: position of the first gap and number of successive gaps (negative value). The chromosome representation of the example alignment will be

s′₂: 4, (3, −2), (8, −2), (16, −1), (54, −2)
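A possible encoder and decoder for this representation are sketched below. The sketch is our own illustration and rests on two assumptions read off the leading groups of the s′₂ example, (3, −2), (8, −2), and (16, −1): positions are 1-based, and the second field is the negated number of gaps that follow the first gap of the group (so a group of three gaps is stored as −2). The function names are hypothetical. The first aligned row of the example in Section 1.1 is used as input.

```python
GAP = "-"

def encode_gap_groups(row):
    """Return [number of groups, (first gap position, -(gaps after the first)), ...]."""
    groups = []
    i = 0
    while i < len(row):
        if row[i] == GAP:
            j = i
            while j < len(row) and row[j] == GAP:
                j += 1
            groups.append((i + 1, -(j - i - 1)))  # 1-based start, negated extra gaps
            i = j
        else:
            i += 1
    return [len(groups)] + groups

def decode_gap_groups(chromosome, residues):
    """Rebuild the aligned row from the gap groups and the unaligned residues."""
    row, used = [], 0
    for start, neg_extra in chromosome[1:]:
        n_res = start - 1 - len(row)            # residues that precede this group
        row.extend(residues[used:used + n_res])
        used += n_res
        row.extend(GAP * (1 - neg_extra))       # the first gap plus the extra ones
    row.extend(residues[used:])
    return "".join(row)

s1 = "TSREETKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMA"
s1_aligned = "TSREETK" + GAP * 3 + "KKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMA" + GAP * 10

chromosome = encode_gap_groups(s1_aligned)
print(chromosome)                                       # [2, (8, -2), (47, -9)]
print(decode_gap_groups(chromosome, s1) == s1_aligned)  # True
```

Compared with the 56-symbol binary string, the row is described by five integers, which is the saving this representation is after when alignments contain long gap runs.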

Given a set of k unaligned sequences (S), m memeplexes (number of communities) with n frogs per memeplex, a fixed number of evolutionary steps (N), and a stopping criterion, the procedure of H4MSA is as follows:

1. Generate and evaluate m × n random alignments/frogs.

2. Sort the m × n frogs by alignment quality (f₁) and conservation (f₂). In this work, we have used the Fast Nondominated Sorting procedure (Deb et al., 2000).

3. Divide the m × n frogs into m memeplexes, such that the best frog goes to the first memeplex (Y₁), the second best frog to the second memeplex (Y₂), the mth best frog to the mth memeplex (Y_m), the (m + 1)th best frog to Y₁, and so on.

4. For each memeplex (Y_i)

(a) Select the local worst (X_lw) and local best (X_lb) frogs of the memeplex.

(b) X_lw learns from X_lb; that is to say, X_lw replaces a portion of its alignment with information obtained from X_lb, generating a new frog (X_new). For example, we select the following portion of the local best alignment:

and the following portion of the local worst (X_lw):

and the resultant new frog (X_new) is constructed by taking into account both portions, filling with new gaps (represented by dots “.” in the example) until obtaining identical length:

(c) Apply the following mutation process to X_new:

(i) Move a block: randomly select a block of gaps/compounds and move it one position toward the left or right.

(ii) Merge two groups: randomly select one of the sequences, choose a group of gaps/compounds, and merge it with the closest group.

(iii) Divide a group: randomly select one of the sequences, choose a group of gaps/compounds, and divide it into two new groups of approximately the same size.

(iv) Compact alignment: delete those columns with all gaps.

(d) Evaluate $X_{new}$; if $X_{new}$ is better than $X_{lw}$, go to Step 4(j).

(e) Select the global best frog ($X_{gb}$).

(f) $X_{lw}$ learns from $X_{gb}$, generating a new frog ($X_{new}$).

(g) Apply the mutation process to $X_{new}$.

(h) Evaluate $X_{new}$; if $X_{new}$ is better than $X_{lw}$, go to Step 4(j).

(i) Apply a local search to $X_{lw}$ and evaluate the new alignment produced. The local search uses the fast and accurate Kalign2 (Lassmann et al., 2009) to realign a portion of the input alignment, in four steps: (i) choose a random position in the alignment and a random size in the range 5–25% of the alignment length, (ii) remove all gaps in the selected portion, (iii) realign the portion with Kalign2, and (iv) replace the old portion with the new one. In the following we can see an example:

(j) Replace $X_{lw}$ by $X_{new}$.

(k) If the maximum number of evolutionary steps (N) has not been reached, then go to Step 4(a). Otherwise, continue with the next memeplex.

5. Merge the frogs from the m memeplexes.

6. If the stopping criterion is satisfied, output the set of nondominated solutions; otherwise, go to Step 2.
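The control flow of Step 4 for a single memeplex can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the callables `learn`, `mutate`, `better`, and `local_search` are hypothetical placeholders for the learning, mutation, comparison, and Kalign2-based procedures described above, the evaluation steps are assumed to be folded into `better`, and the global best frog is assumed to be passed in by the caller.

```python
def evolve_memeplex(memeplex, global_best, n_steps,
                    learn, mutate, better, local_search):
    """Evolve one memeplex in place, following Steps 4(a)-4(k).

    All callables are hypothetical placeholders (assumptions):
      learn(worst, guide) -> new frog built from both alignments
      mutate(frog)        -> mutated frog
      better(a, b)        -> True if frog a is preferred over frog b
      local_search(frog)  -> frog after realigning a random portion
    """
    for _ in range(n_steps):  # (k) repeat for N evolutionary steps
        # (a) locate the local worst and local best frogs
        worst_idx = best_idx = 0
        for i, frog in enumerate(memeplex):
            if better(memeplex[worst_idx], frog):
                worst_idx = i
            if better(frog, memeplex[best_idx]):
                best_idx = i
        worst, best = memeplex[worst_idx], memeplex[best_idx]

        # (b)-(d) learn from the local best, mutate, and compare
        candidate = mutate(learn(worst, best))
        if not better(candidate, worst):
            # (e)-(h) otherwise learn from the global best
            candidate = mutate(learn(worst, global_best))
            if not better(candidate, worst):
                # (i) last resort: local search on the worst frog
                candidate = local_search(worst)

        # (j) the new frog replaces the local worst
        memeplex[worst_idx] = candidate
    return memeplex
```

The three-level fallback (local best, then global best, then local search) mirrors the jumps to Step 4(j) in the description: the first candidate that improves on $X_{lw}$ is accepted, and the local search result is used when neither learning step helps.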

As we can see, the output of H4MSA is a set of nondominated solutions, that is, a set of solutions that represents a trade-off between alignment quality (f₁) and consistency (f₂).
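To make the notion of a nondominated set concrete, the following minimal sketch filters a list of objective vectors $(f_1, f_2)$. It assumes that both objectives are maximized, which is not stated explicitly in this section, and it uses a simple quadratic filter rather than the fast nondominated sorting procedure used by H4MSA. The function names are illustrative.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and
    strictly better in at least one (maximization is assumed)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def nondominated(points):
    """Return the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]


# Toy (f1, f2) values, invented for illustration only.
scores = [(1, 5), (2, 4), (2, 3), (0, 0), (3, 1)]
front = nondominated(scores)  # [(1, 5), (2, 4), (3, 1)]
```

In the toy data, (2, 3) is discarded because (2, 4) is equal in the first objective and better in the second, while the three remaining points each win in one objective and lose in the other.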

2.2. Parallel scheme

In this work, we propose a parallel scheme of H4MSA for a shared memory architecture. After studying the computational requirements of H4MSA, we can see that a large amount of time is spent in the initial generation of the population and in the evolving process of each memeplex; therefore, we focus on parallelizing these tasks. In Figure 1, we present a flowchart of the parallel scheme of H4MSA.

FIG. 1.

Parallel flowchart of H4MSA. FNDS, fast non-dominated sorting; H4MSA, hybrid multiobjective memetic metaheuristic for the multiple sequence alignment.

As we can see, in the generation of the initial population (Step 1), the $m \times n$ random alignments are divided among the available threads. If the number of threads is M, each thread generates and evaluates $\frac{m \times n}{M}$ random alignments (frogs). We recommend M = m, that is, m threads each generating n random frogs in parallel; this setting gives the best parallel performance of H4MSA.
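The division of the initial population among threads amounts to simple arithmetic, sketched below. The helper `split_work` is hypothetical, and the way it spreads a remainder over the first threads is an assumption; the text only covers the case in which $m \times n$ is divisible by M.

```python
def split_work(total, n_threads):
    """Number of frogs each thread generates when `total` frogs are
    divided as evenly as possible (remainder handling is an assumption)."""
    base, extra = divmod(total, n_threads)
    return [base + (1 if i < extra else 0) for i in range(n_threads)]


# Recommended setting M = m: with m = 4 memeplexes of n = 32 frogs,
# each of the 4 threads generates 32 frogs.
per_thread = split_work(4 * 32, 4)  # [32, 32, 32, 32]
```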

The second and third steps of H4MSA (the fast nondominated sorting procedure and the shuffle procedure) are carried out by a single thread. The reason is that sorting and division consume a negligible share of the runtime compared with the other tasks of H4MSA: about 0.1% of the runtime of a single generation.

In Step 4, H4MSA performs m independent evolutionary processes of N iterations each. The tasks carried out within each evolutionary process (learning, mutation, and local search) are computationally expensive; therefore, each of the m threads evolves a memeplex in parallel. The workload may differ from one memeplex to another; therefore, instead of a static distribution in which each of the m threads performs $\frac{N}{m}$ consecutive iterations of the evolutionary process, this parallel version distributes the N iterations among the threads dynamically at runtime.

Since H4MSA has been implemented with OpenMP, we have parallelized the evolution of the memeplexes by using the dynamic schedule. This schedule keeps the loop iterations in an internal work queue. When a thread finishes its current iteration, it retrieves the next one from the top of the queue. The number of iterations performed by each thread is therefore decided during execution, so threads may perform different numbers of iterations. Figure 2 shows the advantage of a dynamic distribution when the workload of the iterations differs: a static schedule achieves an efficiency of 51.51%, whereas a dynamic schedule reaches 94.45%.
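The effect of the two scheduling policies can be illustrated with a small simulation. This is a Python model of the scheduling behavior, not OpenMP code and not the authors' implementation. It assumes that static scheduling assigns contiguous blocks of iterations to threads and that dynamic scheduling hands out one iteration at a time to the first idle thread. The iteration costs are invented for illustration and are not the workloads of Figure 2.

```python
import heapq


def static_makespan(costs, n_threads):
    """Finish time when iterations are split into contiguous blocks."""
    chunk = -(-len(costs) // n_threads)  # ceiling division
    return max(sum(costs[i:i + chunk])
               for i in range(0, len(costs), chunk))


def dynamic_makespan(costs, n_threads):
    """Finish time when each idle thread takes the next iteration."""
    finish = [0] * n_threads  # time at which each thread becomes idle
    heapq.heapify(finish)
    for cost in costs:
        earliest = heapq.heappop(finish)  # first thread to become idle
        heapq.heappush(finish, earliest + cost)
    return max(finish)


def efficiency(costs, n_threads, makespan):
    """Parallel efficiency (%) = sequential time / (threads * makespan)."""
    return 100.0 * sum(costs) / (n_threads * makespan)


# Unbalanced toy workload: expensive iterations first, cheap ones last.
costs = [8, 7, 6, 5, 1, 1, 1, 1]
t_static = static_makespan(costs, 2)    # 26: one thread gets 8+7+6+5
t_dynamic = dynamic_makespan(costs, 2)  # 15: work is balanced at runtime
```

With this workload, the static split leaves one thread idle for most of the run, whereas the dynamic policy keeps both threads busy until the end, which is the behavior the paragraph above attributes to the dynamic schedule.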

FIG. 2.

Examples of sequential execution (a), parallel with static scheduling (b), and parallel with dynamic scheduling (c).

3. Results

In this section, we present a comparative study between our parallel approach (H4MSA) and other multithreaded MSA approaches published in the literature.

We have compared the multithreaded version of H4MSA with those MSA approaches that allow being run in multicore environments:

• MSAProbs (version 0.9.7) (Liu et al., 2010). It is a parallel and accurate approach for MSA. The -num_threads flag specifies the number of threads.

• T-Coffee (version 11.00.8) (Notredame et al., 2000). It is a widely used MSA approach in the field. The steps of T-Coffee are multithreaded by using the -multi_core flag, specifying the number of cores to use by the -n_core flag.

• Clustal Ω (version 1.2.1) (Sievers et al., 2011). It is the latest addition to the Clustal family. It offers a significant increase in scalability over previous versions, allowing a large number of sequences to be aligned. It also makes use of multiple processors through the --threads flag.

• MAFFT (version 7.215) (Katoh et al., 2002). It is a method for rapid MSA based on the fast Fourier transform. From version 6.8, MAFFT switches to its multicore version when the number of threads is specified with the --thread flag.

The aforementioned approaches were run by using the default parameter configuration.

H4MSA has three main parameters: the number of memeplexes (m), the number of frogs in each memeplex (n), and the number of evolutionary steps (N). In this comparative study, we have used m = M (as many memeplexes as threads), n = 32, and N = 10. The stopping criterion is based on the number of fitness evaluations: 50,000 evaluations.

The data sets used in these experiments were taken from the HOMFAM benchmark suite (Sievers et al., 2011). As we can see in Table 1, we have chosen data sets with different numbers of sequences, from 88 to 1056, to evaluate how the approaches perform as the number of sequences increases.

Table 1.

Selected HOMFAM Data Sets

Data set	No. of sequences	Maximum length	Minimum length
seatoxin	88	50	34
hip	162	73	54
cyt3	379	127	52
rnasemam	492	140	62
TNF	551	154	33
profilin	682	161	32
ricin	740	241	44
trfl	830	367	18
ltn	1056	267	24

To draw conclusions with a certain level of statistical confidence, 30 independent runs were performed for each approach, and the average runtime was used. The architecture selected for these experiments was a cluster with two 16-core AMD Opteron™ 6376 processors (32 cores in total) at 2.3 GHz, 12 MB of L3 cache, 48 GB of DDR3 RAM, and Scientific Linux 6.1.

To measure the performance of the parallel approaches, we have used two well-known metrics: speedup and efficiency. Amdahl's Law states that the performance improvement gained from a faster mode of execution is limited by the fraction of time in which that mode can be used; it therefore bounds the speedup that a particular feature can deliver. More formally, let T_M be the runtime of an algorithm using M threads and T₁ the runtime of the sequential version; the speedup, T₁/T_M, reports how much faster the algorithm runs with M threads than sequentially. The efficiency is the speedup obtained with M threads divided by M.
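Written out, the two metrics and the bound given by Amdahl's Law take the following standard forms. The symbols $S_M$ and $E_M$ are shorthand for the speedup and efficiency with M threads, and $p$, the fraction of the sequential runtime that can be parallelized, is introduced here for illustration only.

```latex
S_M = \frac{T_1}{T_M}, \qquad
E_M = \frac{S_M}{M} \times 100\%, \qquad
S_M \le \frac{1}{(1 - p) + \dfrac{p}{M}}
```

For instance, a speedup of 25 on 32 threads corresponds to an efficiency of 25/32 ≈ 78.1%.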

To determine the accuracy of each MSA method, we have evaluated the level of conservation with the BLOSUM62 substitution matrix. Therefore, the alignment obtained by each approach for each data set was scored by using the +evaluate blosum62mt action provided by T-Coffee. Note that a higher score implies better alignment accuracy.

4. Discussion

This section discusses the results obtained and draws the main conclusions regarding the proposed system.

On the one hand, Table 2 presents the conservation score obtained by H4MSA, MSAProbs, T-Coffee, Clustal Ω, and MAFFT, and Figure 3 compares the five approaches visually in terms of conservation score. As we can see in Figure 3, H4MSA obtains a higher average conservation score than the other four approaches in all the data sets tested. In addition, if we focus on the largest data set (ltn, 1056 sequences), we observe an average conservation improvement of around 14.98%. Therefore, we can conclude that H4MSA is able to obtain accurate alignments.

FIG. 3.

Comparison among H4MSA, MSAProbs, T-Coffee, Clustal Ω, and MAFFT in terms of average conservation score for the selected data sets: seatoxin (a), hip (b), cyt3 (c), rnasemam (d), TNF (e), profilin (f), ricin (g), trfl (h), and ltn (i). Note that the number of sequences appears between brackets for each data set. MAFFT, multiple alignment using fast Fourier transform; T-Coffee, tree-based consistency objective function for alignment evaluation.

Table 2.

Average Conservation Score in 30 Independent Runs

Data set	H4MSA	MSAProbs	T-Coffee	Clustal Ω	MAFFT
seatoxin	704	688	686	681	689
hip	605	590	595	592	594
cyt3	485	347	357	367	389
rnasemam	572	547	550	546	550
TNF	527	462	466	463	467
profilin	604	543	554	541	551
ricin	576	472	493	477	485
trfl	588	537	542	544	546
ltn	584	485	501	499	501

H4MSA, hybrid multiobjective memetic metaheuristic for the multiple sequence alignment; MAFFT, multiple alignment using fast Fourier transform; MSA, multiple sequence alignment; T-Coffee, tree-based consistency objective function for alignment evaluation.

Bold represents the aligner with the highest average conservation score.

To ensure that the differences among the approaches in our comparative study are statistically significant, we perform a Wilcoxon signed-rank test between each pair of methods at a significance level of 1% (p-value < 0.01). As Table 3 shows, the differences in conservation score between H4MSA and every other method are statistically significant in the nine benchmarks tested.

Table 3.

Sets of Algorithms Where the Differences of Conservation Score Are Statistically Not Significant (p-value ≥ 0.01)

seatoxin	(MSAProbs, T-Coffee, MAFFT)
hip	(T-Coffee, MAFFT)
cyt3	—
rnasemam	(MSAProbs, Clustal Ω) and (T-Coffee, MAFFT)
TNF	(MSAProbs, Clustal Ω) and (T-Coffee, MAFFT)
profilin	—
ricin	—
trfl	(Clustal Ω, MAFFT)
ltn	(T-Coffee, Clustal Ω, MAFFT)

On the other hand, we compare the parallel performance of the approaches under study. Table 4 presents the parallel speedup and efficiency obtained with different numbers of cores (M = 2, 4, 8, 16, and 32). The parallel performance of Clustal Ω, T-Coffee, and MAFFT degrades sharply as the number of cores increases: with 32 cores, their efficiency falls below 15%. MSAProbs performs well with 2, 4, and 8 cores (efficiency above 80%), but its efficiency drops to 68.28% with 16 cores and 54.33% with 32. The parallel efficiency of H4MSA remains above 75% in all cases. Figure 4 compares the five parallel approaches in terms of runtime, speedup, and efficiency.

FIG. 4.

Runtime, average speedup, and average efficiency (%) obtained by each MSA tool. H4MSA (a), MSAProbs (b), T-Coffee (c), Clustal Ω (d), and MAFFT (e). MSA, multiple sequence alignment.

Table 4.

Speedup and Efficiency (%) Obtained by the Multithreaded Multiple Sequence Alignment Approaches

	H4MSA		MSAProbs		T-Coffee		Clustal Ω		MAFFT
No. of cores	S	E	S	E	S	E	S	E	S	E
2	1.96	98.21	1.91	95.46	1.65	82.26	1.10	54.79	1.61	80.44
4	3.71	92.82	3.55	88.66	2.42	60.52	1.18	29.49	2.53	63.32
8	6.87	85.88	6.43	80.36	3.09	38.59	1.23	15.39	3.58	44.76
16	12.82	80.13	10.92	68.28	3.67	22.96	1.29	8.04	4.03	25.18
32	24.98	78.12	17.38	54.33	4.58	14.30	1.32	4.14	4.30	13.42

S: Average speedup in the nine data sets of HOMFAM.

E: Average efficiency (%) in the nine data sets of HOMFAM.

In Figure 4, if we compare the sequential runtimes, the approaches rank from fastest to slowest as follows: Clustal Ω, MAFFT, H4MSA, MSAProbs, and T-Coffee. However, thanks to its parallel efficiency, H4MSA becomes the second fastest approach when the number of cores is set to 32.

All in all, we can conclude that H4MSA is not only an accurate alignment method; its parallel performance also allows it to handle data sets with hundreds of sequences in a reasonable amount of time.

5. Conclusions

This work presents a parallelization of H4MSA. H4MSA is based on the SFLA, which provides the benefits of information mixture of the “shuffled complex evolution” technique. The parallel version of H4MSA has been compared with the parallel versions of MSAProbs, T-Coffee, Clustal Ω, and MAFFT on data sets with between 88 and 1056 sequences. We can conclude that the alignment accuracy and parallel performance of H4MSA are significantly better than those of other approaches published in the literature.

As future work, with the aim of solving larger data sets, we intend to develop a parallel version of H4MSA for shared- and distributed-memory architectures. In addition, we intend to develop other parallel swarm intelligence metaheuristics, for example, Rubio-Largo et al. (2012).

Acknowledgments

This work was partially funded by the AEI (State Research Agency, Spain) and the ERDF (European Regional Development Fund, EU), under the contract TIN2016-76259-P (PROTEIN project). Álvaro Rubio-Largo is supported by the postdoctoral fellowship SFRH/BPD/100872/2014 granted by Fundação para a Ciência e a Tecnologia (FCT), Portugal. Mauro Castelli and Leonardo Vanneschi are supported by project PERSEIDS (PTDC/EMS-SIS/0642/2014) and BiolSI RD unit (UID/MULTI/04046/2013), funded by FCT/MCTES/PIDDAC, Portugal.

Author Disclosure Statement

No competing financial interests exist.

References

Bacon, D.J., and Anderson, W.F. 1986. Multiple sequence alignment. J. Mol. Biol., 191, 153–161.

Deb, Pratap, Agarwal, et al. 2000. A fast elitist multi-objective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput., 6, 182–197.

Do, C.B., Mahabhashyam, M.S., Brudno, et al. 2005. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res., 15, 330–340.

Dogan, and Otu, H.H. 2014. Objective functions, 45–58. In Russell, D.J., ed. Multiple Sequence Alignment Methods, Methods in Molecular Biology. Vol. 1079.

Doolittle. 1981. Similar amino acid sequences: Chance or common ancestry? Science, 214, 149–159.

Edgar, R.C. 2004. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.

Eusuff, Lansey, and Pasha. 2006. Shuffled frog leaping algorithm: A memetic meta-heuristic for discrete optimization. Eng. Optim., 38, 129–154.

Feng, and Doolittle. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25, 351–360.

Gonzalez-Alvarez, D.L., Vega-Rodriguez, M.A., and Rubio-Largo. 2015. Finding patterns in protein sequences by using a hybrid multiobjective teaching learning based optimization algorithm. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 12, 656–666.

Gupta, S.K., Kececioglu, J.D., and Schaffer, A.A. 1995. Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. J. Comput. Biol., 2, 459–472.

Katoh, Misawa, Kuma, et al. 2002. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res., 30, 3059–3066.

Lassmann, Frings, and Sonnhammer, E.L.L. 2009. Kalign2: High-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic Acids Res., 37, 858–865.

Liu, Schmidt, and Maskell, D.L. 2010. MSAProbs: Multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics, 26, 1958–1964.

Loytynoja, and Goldman. 2005. An algorithm for progressive multiple alignment of sequences with insertions. Proc. Natl. Acad. Sci. U. S. A., 102, 10557–10562.

Notredame. 2002. Recent progresses in multiple sequence alignment: A survey. Pharmacogenomics, 3, 131–144.

Notredame, Higgins, D.G., and Heringa. 2000. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217.

Rubio-Largo, Vega-Rodríguez, M.A., Gómez-Pulido, J.A., et al. 2012. A parallel multiobjective artificial bee colony algorithm for dealing with the traffic grooming problem, 46–53. IEEE 14th International Conference on High Performance Computing and Communication. Liverpool, UK.

Rubio-Largo, Vega-Rodriguez, M.A., Gomez-Pulido, J.A., et al. 2013a. A multiobjective approach based on artificial bee colony for the static routing and wavelength assignment problem. Soft Comput., 17, 199–211.

Rubio-Largo, Vega-Rodriguez, M.A., and Gonzalez-Alvarez, D.L. 2013b. Applying MOEAs to solve the static routing and wavelength assignment problem in optical WDM networks. Eng. Appl. Artif. Intell., 26, 1602–1619.

Rubio-Largo, Vega-Rodriguez, M.A., and Gonzalez-Alvarez, D.L. 2016. A hybrid multiobjective memetic metaheuristic for multiple sequence alignment. IEEE Trans. Evol. Comput., 20, 499–514.

Sievers, Wilm, Dineen, et al. 2011. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol., 7, 539.

Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.

Thompson, J.D., Koehl, and Poch. 2005. BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark. Proteins, 61, 127–136.

Waterman, Smith, and Beyer. 1976. Some biological sequence metrics. Adv. Math., 20, 367–387.