Abstract
Multiple sequence alignment is one of the important research topics in computational biology and is widely used in the field of DNA and protein analysis. On the one hand, when the number and length of the sequences are increased when developing copy-number variant (CNV) and Single Nucleotide Polymorphisms (SNP), the multiple sequence alignment becomes very complicated and difficult; on the other hand, the accuracy of the sequence alignment directly influences the results of DNA or protein analysis. In this paper, a novel algorithm for multiple sequence alignment based on center star alignment and MapReduce framework is proposed. The algorithm adapts improved star align strategy so as to work accurately and makes full use of the specialties of data analysis in MapReduce when assembling center sequence and matching the maximum sub strings of two sequences. Experimental results show that the proposed algorithm has better accuracy than other existing algorithms and can relatively quickly align multiple sequences.
Introduction
At present, the main means of multiple DNA sequence alignment are center star alignment, random search, intelligent algorithm, methods based on graph theory, dynamic programming and so on [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. Center star alignment is polynomial time-spending, which runs quickly, but the performance of comparison is not good [1, 2]. Random search and intelligent algorithm can deal with small data set well [3, 4, 5, 6, 7, 8]. Methods based on graph theory require a large amount of memory and running time [9, 10]. The present software tools of Multiple Sequence Alignment are mainly the MAFFT series [11], T-coffee series [12] and the Clustal series [13, 14]. Though they can better align the DNA sequences with poor homology, they cannot process vast amount of data with their own limitations when the number of sequences is large. In recent five years, there spring up some parallel algorithms and new algorithms in order to improve the efficiency and the accuracy, however, the data sets still not reach to some scale [15, 16, 17, 18, 19]. With the development of next generation sequencing and the rise of personal genomics, finding single nucleotide polymorphisms (SNP) and copy-number variant (CNV) has become one of the important research fields of computational molecular biology [2]. Researchers manage to find out SNP and CNV by comparing several similar DNA sequences. The present methods or software can seldom process hundreds of DNA sequences effectively [20, 21]. Based on the situation that sequences in comparison are mostly homologous and highly similar when finding SNP, this paper presents a method of multiple sequence alignment algorithms based on improved center star alignment in MapReduce. The algorithm adapts improved star align strategy [22] so as to work accurately and makes full use of the specialties of data analysis in MapReduce when assembling center sequence and matching the maximum sub strings of two sequences. The experiment has proven that the method presented in this paper is more accurate than other methods and it provides a foundation of finding the Molecular markers such as SNP, CNV.
Center star alignment
Star alignment algorithm is a fast heuristic method for solving the multiple sequence alignment problem. It is set up by aligning a fixed sequence and all other sequences, where the fixed sequence is center of the star. Using a technique called “once a space, always been to space” comparing the sequences towards the center. That is, in the process of optimization of center and other sequences, it will continue to add a space to center the sequence in order to fit the alignment, and never moved out of the space has been added. The process is repeated until all the other sequences and central sequence optimization has finished. Suppose there are n input sequences
Find the sc, which makes the For repeat b until there are
The traditional star alignment algorithm needs align every two sequences, compute the similarity of every two sequences, and then find out the most similar sequence, that is the “Star center” (Center Star) sequence. Finding the center star sequence is in order to improve the matching effect.
By researching we find that the center sequence can be more accurate, so as to improve the accuracy of alignment. By looking for similar segments of each sequence, splicing these fragments, we get the first center star sequence. Then aligning the first center star sequence with each sequence and record the spaces which be added into the first center star sequence, we get the second center star sequence. Last, we align the second center star sequence with each sequence and we get the final alignment.
Because of a good data analysis of MapReduce framework, we implement the new algorithm on this platform and expect good results.
The processing of improved center star alignment in MapReduce follows three steps:
construction of the center sequence in MapReduce construction of the new center sequence in MapReduce get the final alignment in MapReduce.
The symbols used are listed in Table 1.
The symbols used in this chapter
Cut the sequences into k-mers
There are n input sequences:
The input sequences are in Fig. 1.
The input sequences.
The output k-mers of the map function are in Fig. 2. As Fig. 2 shows, the first 15 k-mers are generated by sequence 1, the second 15 k-mers are generated by sequence 2, and the last 15 k-mers are generated by sequence 3.
The k-mers.
These k-mers are then sent to the reduce function. The reduce function counts the number of the same k-mers, leaves the one which has the smallest number, generates a number to each value k-mer in order, and then outputs a new set of key-value pairs, where the key is the number and the k-mer, and the value is consisted of three parts: the number of the sequence from which the k-mer comes, the position of the k-mer in this sequence and the count number of the k-mers. The output of the reduce function are as Fig. 3.
The k-mers with weights.
The soure code of Combine function.
This part of is implemented by the middle.java. Middle.java takes the output of kmers.java as its input and do the steps in turn: the Map function removes first character and the last character of each of the k-mer, for example “ARSLPL”, and form two output elements < RSLPL,l 31> and < ARSLP,r 31>; Combine function collects corresponding value for the same key of the output value, forms the formation of <key, value-list>, sort by key, analyze the value-list and outputs all possible pairs, for example, <0,23>, <0,24>, <0,25>, <1,27>, <2,28>, <3,12>, <4,6>, <5,14>, <6,16>, <6,17>, <7,35>, <8,36>, <9,5>, <10,7>, and so on; Reduce the stage for the possible one-to-many cases according to the number of k-mers to choose. The source code of Combine and Reduce function are as shown in Figs 4 and 5.
The soure code of Reduce function.
The k-mers with same repeated times.
For the k-mers with the same repeated times, for example, there are 3 k-mers as Fig. 6 and the weights are the same. We could not make sure which k-mer to assemble, then we choose the last k-mer. The k-mer splicing before and after the order number stored in a queue array, according to the array of the sequence made of
After obtaining the central sequence
Compare the center sequence
The first center star sequence 
The aligned results.
The second center star sequence 
The map function is as follows, “middle” presents the first center star
The flowchart of the improved center star alignment algorithm is as Fig. 10.
The flowchart of the improved center star alignment based on MapReduce (I-CSA-M).
Align
Pairwise alignment
Pairwise alignment based on dynamic programming method
Sequence
The final alignment.
If Align the largest same substring Recursively calls this algorithm, respectively align the left substring
Method A: map function will search and output all the substrings in
Method B: Map function intercepts substrings in
Method C: Map function intercepts substrings in
Experiment
Test data
The hardware environment of experiment as follows: Lenovo sureserver of R680 G7, the server include 4 processors and each has 8 nuclears, the frequency of CPU is 2. 0GHZ, the memory capacity of server is 1024 G; The software environment of experiment as follows: The Linux operating system of RHEL6.4, Hadoop-1.2.1 and JDK 1.6.0.
The test data of Tables 1 and 2, Figs 12 and 13 we used are from
Comparison of SP value.
We used the proposed algorithm, the algorithm presented in reference [1] and the software ClustalW to three sets of data. The splicing results are shown in Table 2, and graphed in Fig. 12.
Comparison of SP Value
Comparison of SP Value
In order to measure the effect of multiple sequence alignment, reference [1] proposed the measure method. As is shown in reference [1], the SP (sum of pairs) value means the sum of scores of the comparison between each pair of sequences. The sum of scores of the comparison between each pair of sequences means that if the two characters in the corresponding position are the same, then the score is 0, and if not, the score is 1. So, the fact is that the lower the value of SP, the better effectiveness of comparison is. So, from Fig. 11 and Table 1 we conclude that because of the improvement of center star alignment algorithm, the accuracy of our algorithm is better than [1] and ClustralW. As can be seen from Table 2, for the three sets of data, the results of our algorithm have more than 10,000 pairs matched bases than other two methods.
The running time of our algorithm [1], and the ClustalW are shown in Table 3 and Fig. 13.
Comparison of running time (s)
Comparison of running time (s)
Comparison of running time(s).
From Fig. 13 and Table 3 we conclude that because of the two times computation of center star sequence, the running time of our algorithm is not fast than [1], but fast than ClustralW. In addition, the calculation time of reference [1] is obtained by the cost of the accuracy. The reference [1] used a kind of k-band method to align two sequences approximately, so it saved time but lost accuracy.
In this study, we developed a multiple sequence alignment algorithm based on improved center star alignment in MapReduce. The algorithm takes full advantage of the MapReduce parallel programming framework. The map and reduce function we designed can not only collect the same k-mers when constructing the center sequence
The advantage of our algorithm is implementing the most significant research content of biology – multiple sequence alignment in the big data platform. It provides a good solution to the exponential growth of biological information data. It makes many things which stand-alone could not do possible.
The shortage of our algorithm is that the data set is relatively small, while the start-up of Hadoop system consumes a long time. Maybe in 404s’ running time, the start-up time of Hadoop system occupies most. When the amount of data continues to increase, the time advantage of our algorithm will become more and more obvious. Next, we will actively collect appropriate more large-scale data sets to verify our conjecture.
Another shortage of our algorithm is only calculating the mitochondrial data, while most of the existing parallel software is based on protein data as experimental data. Next, we will select a representative benchmark protein sequences as experimental data to assess the accuracy and efficiency of our algorithm.
In conclusion, The algorithm adapts improved center star align strategy so as to work accurately and makes full use of the specialties of data analysis in MapReduce when assembling center sequence and matching the maximum sub strings of two sequences. This algorithm is an innovative solution for DNA and protein multiple sequence alignment and analysis. The following conclusions can be drawn from our study: (1) Assembling of center sequence can make full use of map function and reduce function; (2) Searching and matching maximum sub string can make full use of map function and reduce function; (3) MapReduce is an ideal solution for multiple sequence alignment algorithm.
Footnotes
Acknowledgments
This research was financially supported by Chinese Natural Science Foundations (61363016, 61063004), Key Project of Inner Mongolia Advanced Science Research (NJZZ14100), Inner Mongolia Colleges and Universities Education Department Science Research (NJZC059), Natural Science Foundation of Inner Mongolia Autonomous Region of China (NO. 2015MS0605, NO. 2015MS0626, NO. 2017MS0605 and NO. 2015MS0627) and Ministry of Education Scientific Research Foundation for Study Abroad Personel [2014] 1685.
