An accurate algorithm for multiple sequence alignment in MapReduce

Abstract

Multiple sequence alignment is an important research topic in computational biology and is widely used in DNA and protein analysis. On the one hand, as the number and length of sequences grow in studies of copy-number variants (CNVs) and single nucleotide polymorphisms (SNPs), multiple sequence alignment becomes complicated and difficult; on the other hand, the accuracy of the alignment directly influences the results of DNA or protein analysis. In this paper, we propose a novel algorithm for multiple sequence alignment based on center star alignment and the MapReduce framework. The algorithm adopts an improved star alignment strategy to work accurately and exploits the data-analysis strengths of MapReduce when assembling the center sequence and matching the maximum common substrings of two sequences. Experimental results show that the proposed algorithm is more accurate than the existing algorithms we compared against and aligns multiple sequences relatively quickly.

Keywords

Multiple sequence alignment, center star alignment, MapReduce

1. Introduction

At present, the main approaches to multiple DNA sequence alignment are center star alignment, random search, intelligent algorithms, methods based on graph theory, dynamic programming, and so on [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. Center star alignment runs in polynomial time and is fast, but its alignment quality is not good [1, 2]. Random search and intelligent algorithms handle small data sets well [3, 4, 5, 6, 7, 8]. Methods based on graph theory require a large amount of memory and running time [9, 10]. The main software tools for multiple sequence alignment at present are the MAFFT series [11], the T-Coffee series [12] and the Clustal series [13, 14]. Although they align DNA sequences with poor homology comparatively well, their own limitations prevent them from processing large amounts of data when the number of sequences is large. In the past five years, parallel algorithms and other new algorithms have appeared that aim to improve efficiency and accuracy; however, the data sets they handle still do not reach a large scale [15, 16, 17, 18, 19]. With the development of next generation sequencing and the rise of personal genomics, finding single nucleotide polymorphisms (SNPs) and copy-number variants (CNVs) has become an important research field in computational molecular biology [2]. Researchers find SNPs and CNVs by comparing several similar DNA sequences. Present methods and software can seldom process hundreds of DNA sequences effectively [20, 21]. Because the sequences compared when finding SNPs are mostly homologous and highly similar, this paper presents a multiple sequence alignment algorithm based on improved center star alignment in MapReduce. The algorithm adopts an improved star alignment strategy [22] to work accurately and exploits the data-analysis strengths of MapReduce when assembling the center sequence and matching the maximum common substrings of two sequences.
The experiments show that the method presented in this paper is more accurate than the other methods compared, and it provides a foundation for finding molecular markers such as SNPs and CNVs.

2. Center star alignment

The star alignment algorithm is a fast heuristic for the multiple sequence alignment problem. It aligns one fixed sequence, the center of the star, with every other sequence. The sequences are merged toward the center using a technique called “once a space, always a space”: while the center is aligned with the other sequences, spaces are added to the center as needed to fit each alignment, and a space that has been added is never removed. The process is repeated until every other sequence has been aligned with the center sequence. Suppose there are $n$ input sequences $\Psi=\{s_{1}, s_{2}, \ldots, s_{n}\}$; the algorithm is as follows:

a. Find the sequence $S_{c}$ that minimizes $\sum_{i\neq c}\mathrm{Score}(S_{i},S_{c})$, and set $A=\{S_{c}\}$.
b. For $S_{i}\in\Psi-\{S_{c}\}$, align $S_{i}$ with $S_{c}$ by adding spaces to each of them, and add the aligned $S_{i}$ to $A$. According to the spaces added to $S_{c}$, adjust the sequences already in $A$. Then $A=\{S_{i}', S_{c}'\}$.
c. Repeat step b until every sequence in $\Psi-\{S_{c}\}$ has been added to $A$.
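
Step a can be illustrated with a minimal Java sketch. It assumes that Score is a distance, so that smaller means more similar, and uses plain unit-cost edit distance as a stand-in; the class and method names (`CenterSelection`, `editDistance`, `centerIndex`) are hypothetical and are not taken from the implementation described in this paper.

```java
import java.util.Arrays;
import java.util.List;

class CenterSelection {
    // Unit-cost edit distance; a stand-in for Score (an assumption of this sketch).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j], cur[j - 1]) + 1);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Index c that minimizes the sum of distances to all other sequences (step a).
    static int centerIndex(List<String> seqs) {
        int best = -1;
        long bestSum = Long.MAX_VALUE;
        for (int c = 0; c < seqs.size(); c++) {
            long sum = 0;
            for (int i = 0; i < seqs.size(); i++) {
                if (i != c) sum += editDistance(seqs.get(i), seqs.get(c));
            }
            if (sum < bestSum) {
                bestSum = sum;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> seqs = Arrays.asList("ACGT", "ACGA", "TCGA");
        System.out.println("center index: " + centerIndex(seqs));
    }
}
```

This selection needs all pairwise comparisons, which is the cost that the improved method in Section 3 avoids by assembling a center sequence from k-mers instead.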

3. Improved center star alignment based on MapReduce (I-CSA-M)

The traditional star alignment algorithm aligns every pair of sequences, computes the similarity of each pair, and then selects the sequence most similar to all the others as the center star sequence. The center star sequence is chosen this way in order to improve the quality of the alignment.

We find that a more accurate center sequence improves the accuracy of the alignment. By looking for similar segments in the sequences and splicing these fragments together, we obtain the first center star sequence. We then align the first center star sequence with each sequence and record the spaces inserted into it, which gives the second center star sequence. Finally, we align the second center star sequence with each sequence to obtain the final alignment.

Because the MapReduce framework is well suited to this kind of data analysis, we implement the new algorithm on this platform and expect good results.

The improved center star alignment in MapReduce proceeds in three steps:

(1) construction of the center sequence in MapReduce;
(2) construction of the new center sequence in MapReduce;
(3) computation of the final alignment in MapReduce.

The symbols used are listed in Table 1.

Table 1
The symbols used in this section

Symbol	Meaning
$s_{1}$, $s_{2}$, $\ldots$, $s_{n}$	$n$ input sequences
$s_{c}^{1}$	The first center star sequence
$s_{c}^{2}$	The second center star sequence
$s_{ci}^{1}$	$s_{c}^{1}$ with spaces inserted after alignment with $s_{i}$
$s_{i}^{1}$	$s_{i}$ with spaces inserted after alignment with $s_{c}^{1}$
$s_{ci}^{2}$	$s_{c}^{2}$ with spaces inserted after alignment with $s_{i}$

3.1 Construction of the center sequence $s_{c}$


3.1.1 Cut the sequences into k-mers

There are $n$ input sequences: $s_{1}$, $s_{2}$, $\ldots$, $s_{n}$. The Map function cuts each sequence into k-mers (every two adjacent k-mers share $k-1$ bases [23]) and emits them as key-value pairs, where the key is the k-mer and the value is a randomly generated number. The length $k$ of a k-mer is between 10 and 15; in our experiments, we set $k$ to 10. As a small example, suppose there are 3 sequences of length 20 and $k$ is set to 6.
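
The cutting step can be sketched in plain Java, outside Hadoop. The class name `KmerCutter` and the sample sequence are hypothetical; the sketch only shows that a sequence of length 20 yields 15 overlapping k-mers when $k=6$, and it omits the random value attached to each key.

```java
import java.util.ArrayList;
import java.util.List;

class KmerCutter {
    // All overlapping k-mers of seq, in order; adjacent k-mers share k-1 bases.
    static List<String> cut(String seq, int k) {
        List<String> kmers = new ArrayList<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            kmers.add(seq.substring(i, i + k));
        }
        return kmers;
    }

    public static void main(String[] args) {
        // Hypothetical 20-base sequence; a length-L sequence gives L - k + 1 k-mers.
        List<String> kmers = cut("ACGTACGTACGTACGTACGT", 6);
        System.out.println(kmers.size() + " k-mers, first = " + kmers.get(0));
    }
}
```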

The input sequences are shown in Fig. 1.

Figure 1. The input sequences.

The output k-mers of the map function are shown in Fig. 2. As Fig. 2 shows, the first 15 k-mers are generated from sequence 1, the next 15 from sequence 2, and the last 15 from sequence 3.

Figure 2. The k-mers.

These k-mers are then sent to the reduce function. The reduce function counts the occurrences of each k-mer, keeps only the copy that comes from the sequence with the smallest number, assigns a sequential number to each remaining k-mer, and outputs a new set of key-value pairs. The key consists of the assigned number and the k-mer; the value consists of three parts: the number of the sequence from which the k-mer comes, the position of the k-mer in that sequence, and the count of the k-mer. The output of the reduce function is shown in Fig. 3.
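
The effect of this reduce step can be sketched on a single machine as follows. This is an illustration under assumptions rather than the Hadoop code: the class `KmerCounter` and its `Record` type are hypothetical, sequence numbers are taken to be 1-based, positions 0-based, and k-mers are numbered in order of first appearance.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class KmerCounter {
    // One output record, mirroring <<id, kmer>, <sourceSeq, position, count>>.
    static final class Record {
        final int id;
        final String kmer;
        final int sourceSeq;
        final int position;
        int count = 1;

        Record(int id, String kmer, int sourceSeq, int position) {
            this.id = id;
            this.kmer = kmer;
            this.sourceSeq = sourceSeq;
            this.position = position;
        }
    }

    static List<Record> reduce(List<String> seqs, int k) {
        Map<String, Record> byKmer = new LinkedHashMap<>();
        // Sequences are scanned in increasing order, so the first copy seen
        // is the one from the sequence with the smallest number.
        for (int s = 0; s < seqs.size(); s++) {
            String seq = seqs.get(s);
            for (int p = 0; p + k <= seq.length(); p++) {
                String kmer = seq.substring(p, p + k);
                Record r = byKmer.get(kmer);
                if (r == null) {
                    byKmer.put(kmer, new Record(byKmer.size(), kmer, s + 1, p));
                } else {
                    r.count++; // repeated k-mer: only the count changes
                }
            }
        }
        return new ArrayList<>(byKmer.values());
    }

    public static void main(String[] args) {
        List<String> seqs = new ArrayList<>();
        seqs.add("ACGTA");
        seqs.add("CGTAC");
        for (Record r : reduce(seqs, 3)) {
            System.out.println("<<" + r.id + "," + r.kmer + ">,<" + r.sourceSeq
                    + "," + r.position + "," + r.count + ">>");
        }
    }
}
```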

Figure 3. The k-mers with weights.

Figure 4. The source code of the Combine function.

3.1.2 Assemble k-mers according to their repeat counts

This part is implemented by middle.java, which takes the output of kmers.java as its input and performs the following steps in turn. The Map function removes the first character and then the last character of each k-mer; for example, “ARSLPL” yields the two output elements <RSLPL, l 31> and <ARSLP, r 31>. The Combine function collects the values that share the same key into <key, value-list> form, sorts by key, analyzes each value-list, and outputs all possible pairs, for example <0,23>, <0,24>, <0,25>, <1,27>, <2,28>, <3,12>, <4,6>, <5,14>, <6,16>, <6,17>, <7,35>, <8,36>, <9,5>, <10,7>, and so on. The Reduce stage resolves the possible one-to-many cases by choosing according to the count of each k-mer. The source code of the Combine and Reduce functions is shown in Figs 4 and 5.
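
The pairing idea can be sketched as follows. The sketch assumes one reading of the tags: an element tagged `l` marks a k-mer that sits to the left of a join on that key, an element tagged `r` marks a k-mer that sits to the right, and each output pair is <left number, right number>. The class name `OverlapPairs` is hypothetical, and the code is not the source shown in Figs 4 and 5.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class OverlapPairs {
    // kmers.get(id) is the k-mer numbered id; returns pairs {left, right} such that
    // k-mer `left` without its first character equals k-mer `right` without its last.
    static List<int[]> pairs(List<String> kmers) {
        Map<String, List<Integer>> left = new TreeMap<>();  // keys tagged "l", sorted by key
        Map<String, List<Integer>> right = new HashMap<>(); // keys tagged "r"
        for (int id = 0; id < kmers.size(); id++) {
            String m = kmers.get(id);
            left.computeIfAbsent(m.substring(1), x -> new ArrayList<>()).add(id);
            right.computeIfAbsent(m.substring(0, m.length() - 1), x -> new ArrayList<>()).add(id);
        }
        List<int[]> out = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : left.entrySet()) {
            List<Integer> rs = right.get(e.getKey());
            if (rs == null) continue;
            for (int a : e.getValue()) {
                for (int b : rs) {
                    if (a != b) out.add(new int[] {a, b}); // skip a k-mer overlapping itself
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> kmers = new ArrayList<>();
        kmers.add("ARSLPL");
        kmers.add("RSLPLQ");
        kmers.add("SLPLQT");
        for (int[] p : pairs(kmers)) {
            System.out.println("<" + p[0] + "," + p[1] + ">");
        }
    }
}
```

A k-mer that appears on the left of several pairs corresponds to the one-to-many case that the Reduce stage resolves.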

Figure 5. The source code of the Reduce function.

Figure 6. The k-mers with the same repeat count.

Some k-mers have the same repeat count; for example, the 3 k-mers in Fig. 6 have the same weight. In this case we cannot tell which k-mer should be assembled, so we choose the last one. The numbers of the k-mers spliced before and after the starting k-mer are stored in order in a queue array, and $s_{c}^{1}$ is built from the k-mers in that array. The resulting center sequence $s_{c}^{1}$ is shown in Fig. 7.

3.2 Construction of the new center sequence

After obtaining the center sequence $s_{c}^{1}$, we align it with each of the original sequences $s_{1}$, $s_{2}$, $\ldots$, $s_{n}$, which gives $n$ alignments align$_{1}$ to align$_{n}$. Inserting the spaces from each $s_{ci}^{1}$ into $s_{c}^{1}$ then gives $s_{c}^{2}$, the new center sequence. The algorithms are as follows:

Algorithm 1: CenterWithOthers

input: $s_{c}^{1}$, $s_{1}$, $s_{2}$, $\ldots$, $s_{n}$

output: align$_{1}$, align$_{2}$, $\ldots$, align$_{n}$, where align$_{i}$ is the alignment of $s_{ci}^{1}$ and $s_{i}^{1}$ ($i=1,\ldots,n$)

begin

for $i=1$ to $n$

align$_{i}=$ AlignTwoSequences($s_{c}^{1}$, $s_{i}$)

end

Algorithm 2: AlignTwoSequences

input: $s_{c}^{1}$, $s_{i}$

output: $s_{ci}^{1}$ and $s_{i}^{1}$

begin

using the map and reduce functions, complete the following steps:

get the longest common substring of $s_{c}^{1}$ and $s_{i}$

insert spaces

end

Aligning the center sequence $s_{c}^{1}$ with the 3 input sequences gives the 3 aligned results in Fig. 8 and the $s_{c}^{2}$ in Fig. 9.

Figure 7. The first center star sequence $s_{c}^{1}$.

Figure 8. The aligned results.

Figure 9. The second center star sequence $s_{c}^{2}$.

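The map function is as follows, where “middle” holds the first center star sequence $s_{c}^{1}$ and “reads” holds the sequence it is compared with.

    String out = new String();
    for (int i = 0; i < middle.length(); i++) {
        int j = 2;  // shortest substring length tried, as in Method C of Section 3.4.2
        // Extend the substring that starts at i while it still occurs in reads.
        for (j = 2; j < middle.length() - (i - 1); j++) {
            String search = middle.substring(i, i + j);
            int indexof = reads.indexOf(search);
            if (indexof != -1) {
                out = middle.substring(i, i + j);
            } else {
                break;
            }
        }
        if (!out.equals("")) {
            CountM = CountM + 1;  // number of substrings emitted (increment assumed to be 1)
            // Emit the longest match found at i, keyed by the substring.
            context.write(new Text(out), new Text(String.valueOf(i)));
            out = new String();
            i = i + j - 2;  // assumed: resume just after the match (the loop adds 1 more)
        }
    }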

Algorithm 3: NewCenterSequence

input: $s_{c}^{1}$, $s_{c1}^{1}$, $s_{c2}^{1}$, $\ldots$, $s_{cn}^{1}$

output: $s_{c}^{2}$

begin

for $i=1$ to $n$

insert the spaces from $s_{ci}^{1}$ into $s_{c}^{1}$

$s_{c}^{2}=s_{c}^{1}$

end
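
One way to realize Algorithm 3 is sketched below. It assumes that `-` stands for a space, that every $s_{ci}^{1}$ equals $s_{c}^{1}$ once its spaces are removed, and that where several alignments insert spaces at the same place the longest run is kept (the usual “once a space, always a space” merge). The class name `NewCenter` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class NewCenter {
    static final char GAP = '-'; // stands for a space (an assumption of this sketch)

    // center: the ungapped first center; gappedCenters: the center as it appears
    // in each pairwise alignment. Returns the merged second center.
    static String merge(String center, List<String> gappedCenters) {
        // need[p] = spaces required before base p; need[length] = trailing spaces.
        int[] need = new int[center.length() + 1];
        for (String g : gappedCenters) {
            int p = 0;
            int run = 0;
            for (int x = 0; x < g.length(); x++) {
                if (g.charAt(x) == GAP) {
                    run++;
                } else {
                    need[p] = Math.max(need[p], run);
                    run = 0;
                    p++;
                }
            }
            need[p] = Math.max(need[p], run);
        }
        StringBuilder sb = new StringBuilder();
        for (int p = 0; p <= center.length(); p++) {
            for (int r = 0; r < need[p]; r++) sb.append(GAP);
            if (p < center.length()) sb.append(center.charAt(p));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> gapped = Arrays.asList("A-CGT", "ACG--T", "ACGT-");
        System.out.println(merge("ACGT", gapped));
    }
}
```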

Algorithm 4: Construct $s_{c}^{1}$

input: $s_{1}$, $s_{2}$, $\ldots$, $s_{n}$

output: $s_{c}^{1}$

begin

Map function divides each $s_{i}$ into k-mers

Reduce function deletes the repeated k-mers, keeps the one from the source sequence with the smallest number, and outputs <<n, m>, <p, q, r>>

Search for the k-mer with the biggest r, assemble to the left and to the right respectively, then get $s_{c}^{1}$

end

Note:

n: the number of the k-mer

m: the k-mer

p: the number of the sequence from which the k-mer comes

q: the position of the k-mer in its source sequence

r: the number of times the k-mer repeats

The flowchart of the improved center star alignment algorithm is shown in Fig. 10.

Figure 10. The flowchart of the improved center star alignment based on MapReduce (I-CSA-M).

3.3 Get the final alignment

Aligning $s_{c}^{2}$ with $s_{1}$, $s_{2}$, $\ldots$, $s_{n}$ one by one gives the final alignment, shown in Fig. 11.

3.4 Pairwise alignment

3.4.1 Pairwise alignment based on dynamic programming method

Figure 11. The final alignment.

Suppose sequences $S_{1}$ and $S_{2}$ are to be aligned:

(1) If $S_{1}=S_{2}=$ null, do nothing. If exactly one of $S_{1}$ and $S_{2}$ is null, append $\bigl||S_{1}|-|S_{2}|\bigr|$ spaces to the end of the shorter of the two. If neither $S_{1}$ nor $S_{2}$ is empty, search for the longest common substring $\delta$ of $S_{1}$ and $S_{2}$. If $\delta=$ null, append $\bigl||S_{1}|-|S_{2}|\bigr|$ spaces to the end of the shorter of the two. If $\delta\neq$ null, write $S_{1}=\alpha_{1}\delta\beta_{1}$ and $S_{2}=\alpha_{2}\delta\beta_{2}$, then go to step (2).

(2) Align the two occurrences of the longest common substring $\delta$, then go to step (3).

(3) Call this algorithm recursively to align the left substrings $\alpha_{1}$ and $\alpha_{2}$ of $S_{1}$ and $S_{2}$, and then the right substrings $\beta_{1}$ and $\beta_{2}$.
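
Steps (1) to (3) can be sketched sequentially in Java as follows. The sketch makes several assumptions: `-` stands for a space, a common substring of length 1 counts as a match, ties between equally long substrings are broken by taking the first one found, and the longest common substring is found with a quadratic-memory table instead of the map and reduce functions of Section 3.4.2. The class name `RecursiveAligner` is hypothetical.

```java
class RecursiveAligner {
    static final char GAP = '-'; // stands for a space (an assumption of this sketch)

    // Returns {start in a, start in b, length} of the longest common substring,
    // with length 0 when the strings share no character.
    static int[] longestCommonSubstring(String a, String b) {
        int[] best = {0, 0, 0};
        int[][] len = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    if (len[i][j] > best[2]) {
                        best = new int[] {i - len[i][j], j - len[i][j], len[i][j]};
                    }
                }
            }
        }
        return best;
    }

    // Appends spaces to the end of s until it is width characters long.
    static String pad(String s, int width) {
        StringBuilder sb = new StringBuilder(s);
        while (sb.length() < width) sb.append(GAP);
        return sb.toString();
    }

    static String[] align(String s1, String s2) {
        int[] m = longestCommonSubstring(s1, s2);
        if (m[2] == 0) {
            // Step (1): an empty input or no common substring; pad the shorter one.
            int width = Math.max(s1.length(), s2.length());
            return new String[] {pad(s1, width), pad(s2, width)};
        }
        // Steps (2) and (3): keep delta aligned, recurse on the left and right parts.
        String delta = s1.substring(m[0], m[0] + m[2]);
        String[] left = align(s1.substring(0, m[0]), s2.substring(0, m[1]));
        String[] right = align(s1.substring(m[0] + m[2]), s2.substring(m[1] + m[2]));
        return new String[] {left[0] + delta + right[0], left[1] + delta + right[1]};
    }

    public static void main(String[] args) {
        String[] r = align("ACGT", "AGT");
        System.out.println(r[0]);
        System.out.println(r[1]);
    }
}
```

For `ACGT` and `AGT` the longest common substring is `GT`; the left parts `AC` and `A` are aligned recursively, which pads the shorter one with one space.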

3.4.2 Three ways to find the maximum common substring

Method A: The map function searches for and outputs all substrings of $S_{1}$ and $S_{2}$ whose length is at least 2. The reduce function searches the output substrings of $S_{1}$ and $S_{2}$ and retrieves the longest matching substring.

Method B: The map function takes the substring of $S_{1}$ of length 2 starting at position 0 and searches for it in $S_{2}$. If it is found, the substring is recorded. The map function then takes the substring of length 3 starting at position 0 and checks whether it exists in $S_{2}$. This is repeated, and the longest substring found in this contiguous region is recorded and output. The map function then takes the substring of length 2 starting at position 1 and repeats the process, and so on until the end of $S_{1}$.

Method C: The map function takes the substring of $S_{1}$ of length 2 starting at position 0 and searches for it in $S_{2}$. If it is found, the substring is recorded. The map function then takes the substring of length 3 starting at position 0 and checks whether it exists in $S_{2}$. This is repeated, and the longest substring found in this contiguous region is recorded and output. If the length of this longest substring is 5, the next substring is taken from position 5 of $S_{1}$; that is, the scan resumes immediately after the match. The process repeats until the end of $S_{1}$.
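
The difference between Methods B and C is only where the scan resumes after a match, which the following single-machine sketch makes explicit. The class `SubstringScanner`, the `skipMatched` flag and the `position:substring` output format are hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SubstringScanner {
    // Scans s1 from left to right. At each start position, grows a substring from
    // length 2 while it still occurs in s2 and records the longest one found.
    // skipMatched = false: advance by one position (Method B).
    // skipMatched = true: resume right after the recorded match (Method C).
    static List<String> scan(String s1, String s2, boolean skipMatched) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i + 2 <= s1.length()) {
            int len = 0;
            for (int j = 2; i + j <= s1.length(); j++) {
                if (s2.contains(s1.substring(i, i + j))) {
                    len = j;
                } else {
                    break;
                }
            }
            if (len > 0) hits.add(i + ":" + s1.substring(i, i + len));
            i += (skipMatched && len > 0) ? len : 1;
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println("Method B: " + scan("ACGTTG", "TTGACG", false));
        System.out.println("Method C: " + scan("ACGTTG", "TTGACG", true));
    }
}
```

On this toy input Method C emits two substrings where Method B emits four, because the substrings nested inside an already recorded match are skipped.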

4. Experiment

4.1 Test data

The hardware environment of the experiment is a Lenovo SureServer R680 G7 with 4 processors, each with 8 cores, a CPU frequency of 2.0 GHz, and 1024 GB of memory. The software environment is the RHEL 6.4 Linux operating system, Hadoop 1.2.1 and JDK 1.6.0.

The test data used in Tables 2 and 3 and Figs 12 and 13 are from http://mtsnp.tmig.or.jp/mtsnp/search_mtDNA_sequence_e.html [1]. We selected three groups of mitochondrial gene data; each group has 96 sequences. The three groups come from normal people, people with Parkinson’s disease and people with Alzheimer’s disease, respectively.

Figure 12. Comparison of SP value.

4.2 Verification of algorithm accuracy

We applied the proposed algorithm, the algorithm presented in reference [1] and the software ClustalW to the three sets of data. The alignment results are shown in Table 2 and plotted in Fig. 12.

Table 2
Comparison of SP Value

Instance	Method	SP value
Normal people	ClustalW	182780
	[1]	181692
	I-CSA-M	165795
People with Parkinson’s disease	ClustalW	185447
	[1]	184985
	I-CSA-M	169156
People with Alzheimer’s disease	ClustalW	194142
	[1]	192963
	I-CSA-M	172745

To measure the quality of a multiple sequence alignment, we use the measure proposed in reference [1]. As described there, the SP (sum of pairs) value is the sum of the comparison scores over all pairs of sequences. The comparison score of a pair of sequences is computed position by position: if the two characters in corresponding positions are the same, the score is 0; if not, the score is 1. Therefore, the lower the SP value, the better the alignment. From Fig. 12 and Table 2 we conclude that, because of the improvement to the center star alignment algorithm, our algorithm is more accurate than [1] and ClustalW. As Table 2 shows, for each of the three data sets the SP value of our algorithm is lower than those of the other two methods by more than 15,000; that is, our alignment matches more than 15,000 additional pairs of bases.
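
The SP value as defined above can be computed with a few lines of Java. This is a minimal sketch assuming that all aligned rows have the same length and that `-` stands for a space; under the stated rule two spaces in the same column count as equal. The class name `SumOfPairs` and the toy alignment are hypothetical.

```java
class SumOfPairs {
    // Sum over all pairs of rows of the number of columns where the two
    // characters differ (0 for equal characters, 1 otherwise).
    static long spValue(String[] rows) {
        long sp = 0;
        for (int a = 0; a < rows.length; a++) {
            for (int b = a + 1; b < rows.length; b++) {
                for (int col = 0; col < rows[a].length(); col++) {
                    if (rows[a].charAt(col) != rows[b].charAt(col)) sp++;
                }
            }
        }
        return sp;
    }

    public static void main(String[] args) {
        String[] rows = {"AC-T", "ACGT", "A-GT"};
        System.out.println("SP = " + spValue(rows));
    }
}
```

For the toy alignment the three pairs contribute 1, 2 and 1 mismatching columns, so the SP value is 4.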

4.3 Verification of algorithm efficiency

The running times of our algorithm, the algorithm of reference [1] and ClustalW are shown in Table 3 and Fig. 13.

Table 3
Comparison of running time (s)

Instance	Method	Running time (s)
Normal people	ClustalW	More than 15 hours
	[1]	13.797
	I-CSA-M	404.733
People with Parkinson’s disease	ClustalW	More than 15 hours
	[1]	26.359
	I-CSA-M	405.592
People with Alzheimer’s disease	ClustalW	More than 15 hours
	[1]	26.406
	I-CSA-M	407.960

Figure 13. Comparison of running time (s).

From Fig. 13 and Table 3 we conclude that, because the center star sequence is computed twice, our algorithm is slower than [1] but faster than ClustalW. In addition, the short running time of reference [1] comes at the cost of accuracy: it uses a k-band method to align two sequences approximately, which saves time but loses accuracy.

5. Conclusion & discussion

In this study, we developed a multiple sequence alignment algorithm based on improved center star alignment in MapReduce. The algorithm takes full advantage of the MapReduce parallel programming framework. The map and reduce functions we designed can concurrently collect identical k-mers when constructing the center sequence $s_{c}$, and can concurrently search for the maximum common substring during pairwise alignment. Running in parallel shortened the running time substantially compared with the well-known software ClustalW [1]. In addition, by computing the center sequence twice, the improved center star alignment algorithm makes the alignment more accurate. On the same data sets as reference [1], our algorithm achieves the smallest SP value [1].

The advantage of our algorithm is that it implements one of the most significant research topics in biology, multiple sequence alignment, on a big data platform. It offers a good approach to the exponential growth of biological data and makes possible many tasks that a stand-alone machine cannot handle.

One shortcoming of our study is that the data sets are relatively small, while starting up the Hadoop system takes a long time. The start-up time of Hadoop may account for most of the roughly 404 s running time. We expect the time advantage of our algorithm to become more apparent as the amount of data increases. Next, we will collect suitable larger-scale data sets to verify this conjecture.

Another shortcoming is that we tested only mitochondrial data, while most existing parallel software uses protein data as experimental data. Next, we will select representative benchmark protein sequences as experimental data to assess the accuracy and efficiency of our algorithm.

In conclusion, the algorithm adopts an improved center star alignment strategy to work accurately and exploits the data-analysis strengths of MapReduce when assembling the center sequence and matching the maximum common substrings of two sequences. This algorithm is an innovative solution for DNA and protein multiple sequence alignment and analysis. The following conclusions can be drawn from our study: (1) assembling the center sequence can make full use of the map and reduce functions; (2) searching for and matching the maximum common substring can make full use of the map and reduce functions; (3) MapReduce is an ideal solution for multiple sequence alignment algorithms.

Acknowledgments

This research was financially supported by Chinese Natural Science Foundations (61363016, 61063004), Key Project of Inner Mongolia Advanced Science Research (NJZZ14100), Inner Mongolia Colleges and Universities Education Department Science Research (NJZC059), Natural Science Foundation of Inner Mongolia Autonomous Region of China (NO. 2015MS0605, NO. 2015MS0626, NO. 2017MS0605 and NO. 2015MS0627) and Ministry of Education Scientific Research Foundation for Study Abroad Personnel [2014] 1685.

References

1. Zou, Shan and Jiang, A novel center star multiple sequence alignment algorithm based on affine gap penalty and K-band, Physics Procedia (33) (2012), 322–327.

2. Zou, Guo, Wang and Zhang, An algorithm for DNA multiple sequence alignment based on center star method and keyword tree, Acta Electronica Sinica (37) (2009), 1746–1750.

3. Jonathan M.K., Peter and Darryn, A simulated annealing algorithm for finding consensus sequences, Bioinformatics (18) (2002), 1494–1499.

4. Lee Z.J., S.F., Chuang C.C. and Liu K.H., Genetic algorithm with ant colony optimization (GA-ACO) for multiple sequence alignment, Applied Soft Computing (8) (2008), 55–78.

5. Kaya, Sarhan and Alhajj, Multiple sequence alignment with affine gap by using multi-objective genetic algorithm, Comput Methods Programs Biomed (114) (2014), 38–49.

6. Schwartz A.S. and Pachter, Multiple alignment by sequence annealing, Bioinformatics (23) (2007), e24–e29.

7. Orobitg, Cores, Guirado, Roig and Notredame, Improving multiple sequence alignment biological accuracy through genetic algorithms, The Journal of Supercomputing (65) (2013), 1076–1088.

8. Narimani and Hamid, A new genetic algorithm for multiple sequence alignment, International Journal of Computational Intelligence and Applications (11) (2012), 1.

9. Löytynoja, Vilella A.J. and Goldman, Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm, Bioinformatics (28) (2012), 1684–1691.

10. Chen, Liao, Zhu and Xiang, Multiple sequence alignment algorithm based on a dispersion graph and ant colony algorithm, Journal of Computational Chemistry (30) (2009), 2031–2038.

11. Katoh and Standley D.M., MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability, Molecular Biology and Evolution 30(4) (2013), 772–780.

12. Notredame, Higgins D.G. and Heringa, T-Coffee: A novel method for fast and accurate multiple sequence alignment, Journal of Molecular Biology (302) (2000), 205–217.

13. Higgins, Bleasby and Fuchs, CLUSTAL V: Using clustal for multiple sequence alignment, Comput Appl Biosci (8) (1992), 189–191.

14. Thompson J.D., Gibson T.J. and Higgins, CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting position-specific gap penalties and weight matrix choice, Nucleic Acids Research (22) (1994), 4673–4680.

15. Lalwani, Kumar and Gupta, A novel two-level particle swarm optimization approach for efficient multiple sequence alignment, Memetic Computing (7) (2015), 119–133.

16. Blazewicz, Frohmberg, Kierzynka and Wojciechowski, MSA – A GPU-based, fast and accurate algorithm for multiple sequence alignment, Journal of Parallel and Distributed Computing (73) (2013), 32–41.

17. Sievers, Dineen, Wilm and Higgins D.G., Making automated multiple alignments of very large numbers of protein sequences, Bioinformatics (29) (2013), 989–995.

18. Boyce, Sievers and Higgins D.G., Simple chained guide trees give high-quality protein multiple sequence alignments, in: Proceedings of the National Academy of Sciences of the United States of America (111) (2014), 10556–10561.

19. Shu, Elofsson and Kalign, Improved multiple sequence alignments using position specific gap penalties in Kalign2, Bioinformatics (2011), 27.

20. Zou, Guo and Wang, HAlign: Fast multiple similar DNA/RNA sequence alignment based on the centre star strategy, Bioinformatics 31(15) (2015), 2475–2481.

21. Chen, Wang and Tang, CMSA: a heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment, BMC Bioinformatics 18(1) (2017), 315.

22. Zhou, Research on parallel algorithm of multiple DNA sequence alignment based on de Bruijn graph, Nature Biotechnology (2010).

23. Compeau P.E., Pevzner P.A. and Tesler, How to apply de Bruijn graphs to genome assembly, Nature Biotechnology (29) (2011), 987–991.
