Abstract
The DNA fragment assembly (DFA) problem is among the most critical problems in computational biology. Being NP-hard, it can be efficiently solved via meta-heuristic algorithms, such as the gravitation search algorithm (GSA). GSA is a state-of-the-art swarm-based algorithm particularly suitable for solving NP-hard combinatorial optimization problems. This paper proposes a new memetic GSA algorithm called MGSA. MGSA is a type of overlap-layout-consensus model that is based on tabu search for population initialization. In order to increase the diversity of MGSA, we adapted two operator time-varying maximum velocities in the GSA procedure. Finally we also adapted the simulated annealing-based variable neighborhood search (SA-VNS) to find superior precise solutions. The proposed MGSA algorithm was verified with 19 DNA fragments based on seeking to maximize the overlap score measurements. In comparing the performances of the proposed MGSA and state-of-the-art algorithms, the simulation results demonstrate that the MGSA can achieve the best overlap scores.
Keywords
Introduction
A single-stranded molecule of deoxyribonucleic acid (DNA) comprises a sequence of four nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). The DNA Fragment Assembly (DFA) problem constitutes the task of rebuilding a DNA sequence from a set of DNA fragments. Complete DNA sequences are frequently longer than 3000 base pairs (bps). Important industrial applications of DFA are DNA cryptography [1] and gel electrophoresis [2]. The DFA problem has been proven as NP-hard [3] and inductive, as with the asymmetric traveling salesman problem [4].
DFA algorithms can be divided into two categories: overlap layout consensus (OLC) and de Bruijn graph (DBG) models. Since its proposal by Sanger et al., [5], the OLC model has become increasingly important and complex, and has been recently applied to sequencing newly discovered genomes [6]. For the DFA problem, the OLC usually adopts a greedy approach, such as PHRAP [7], GAP [8], TIGR Assembler [9], CAP3 [10], STROLL [11], Celera Assembler [12], or ARACHNE [13]. Be that as it may, although the greedy strategy successfully solves small or medium sequences by merging pairs of sequences with high overlap scores, it is less adept at assembling larger DNA sequences.
Recently, meta-heuristic algorithms [14] increasingly common popular for solving large NP-hard combinatorial optimization problems. Consequently, they have been incorporated into OLC algorithms to find the best overlapping or re-assembly of all DNA fragments. Such algorithms include hierarchy clustering [15], simulated annealing (SA) [16, 17], tabu search (TS) [18], genetic algorithms (GA) [19–25], ant colony optimization (ACO) [26–28], artificial bee colony algorithms (ABC) [29], particle swarm optimization (PSO) algorithms [30–34], artificial immune system (AIS) [35], bee algorithm [29], cuckoo algorithm [36, 37], firefly algorithm [38], grid algorithms [39], hybrid algorithms [17, 41], iterative algorithms [42, 43], and LHK algorithm [44]. The second broad DFA category embraces DBG models [45]. Differing from the OLC models, DBG models are based on K-mer graphs, searching DBGs and compressing redundant sequences [46, 47]. For the DFA problem, DBG models are solved by the greedy-algorithm EULER [48], parallel strategies [49, 50], maximum matching [51], dynamic hashing [52], exspander algorithm [53], generic algorithm [54], MaSuRCA assembler [55], MapReduce approach [56], and IWP methods [57]. The OLC and DBG models are compared in detail in [47]. However, this paper goal is to find the best overlapping among all fragments and ignore the issue of computation time. Recently, some new techniques [58–60] have been proposed that offer the compression or selection feature to avoid over-fitting.
In contrast to the OLC and DBG models, the gravitation search algorithm (GSA) [61] approaches to the DFA problem have received little attention. In other words, difference with the other swarm-based algorithms such particle swarm optimization [62]. The GSA is memory-less and only the current position of the agents plays a role in the updating procedure, which can help reduce the premature convergence problem. Thus, this paper proposes a memetic GSA (MGSA) approach based the OLC model that solves DFA problems by maximizing the overlapping score measurement. GSA is rendered suitable for DFA problems by introducing the smallest position value (SPV) rule [63], which converts a continuous number to a DNA sequence. The proposed MGSA initializes the population based tabu search [64]. In order to increase the diversity of MGSA, we adapted two operator time-varying maximum velocities in the GSA procedure. Lastly, the current best solution is fine-tuned by coupling the SA method [65] to a local search method called variable neighborhood search (VNS) [14]. The proposed method was experimentally validated with 19 different DNA fragments (or instances). The overlap scores generated by MGSA were effectively improved, compared with those of existing algorithms. The remainder of this paper is organized as follows. Section 2 provides background information and discusses related studies. The proposed MGSA algorithm is presented and evaluated in Sections 3 and 4, respectively. Brief conclusions and ideas for future studies are presented in Section 5.
Background knowledge and related studies
This section provides the background of this research and discusses the related literature. The overview focuses on five main topics: Memetic Algorithm The DNA Fragment Assembly Problem (DFA) Gravitation Search Algorithm (GSA) Tabu Search (TS) Simulated Annealing (SA) Particle Swarm Optimization (PSO)
Memetic algorithm
This section introduces a memetic algorithm for the evolutionary computation in this paper.
The term memetic was coined by Dawkins [66], in which the word “meme” denotes tunes, ideas, catchphrases, clothing fashions, and styles of making pots or building arches. Just as genes propagate in the gene pool by leaping from body to body via sperms or eggs, memes propagate in the meme pool by leaping from brain to brain via a process that, in a broad sense, can be called imitation.
Therefore, the memetic algorithm [67–69] attempts to mimic cultural evolution through stochastic global search meta-heuristics algorithms. The main concept underlying the memetic algorithm is hybridization of a class of meta-heuristic and local search algorithms in order to reduce the time taken to obtain good quality solutions to a variety of NP-hard optimization problems.
The DNA fragment assembly problem
DNA sequences over one hundred bps cannot be sequenced accurately and rapidly, and must first be randomly divided into several fragments. The original DNA sequence is then reconstructed from these fragments; a technique called shotgun sequencing. The OLC model for DFA is a three-step approach, as described below. Overlap (O): Find a common sequence among all possible pairs of fragments. Layout (L): Determine the order of fragments from their overlap scores. The overlap scores are based on sequence alignment approaches [70], including global alignment, semi-alignment and local alignment. Consensus (C): Derive the consensus sequence from the layout step.
The DFA approach to DNA sequencing is illustrated in Fig. 1. First the DNA strand to be sequenced is randomly cut into four fragments. The algorithm then finds the overlaps and determines the layout of each fragment, and finally re-constructs the original DNA sequence [33].
The quality of the consensus is computed using Equation 1, where n denotes the number of fragments, while the coverage measures the mean redundancy of the fragment sequences. The higher the coverage, the higher the overlap scores and the smaller the number of sequence gaps.
The gravitation search algorithm is a recently proposed meta-heuristic algorithm for modeling swarm populations. Pioneered by Rashedi et al. [61] in 2009, the algorithm is inspired by Newton’s laws of gravity and mass motion. Differ from the PSO approach [62], interactions among individual agents (mass aggregates) are governed by gravitational forces and the laws of motion. Since gravity guides the agents toward heavier masses, mass is regarded as a performance measure, with heavier masses indicating better solutions.
Given A agents whose positions and velocities are denoted by X and V, respectively, the solution of a particle i in iteration t is defined as Equations (2) and (3).
Each alternate agent represents one solution. Initially, the GSA assigns a random solution to each agent.
To evaluate the current population fitness, the mass of each agent is calculated using Equations (4)–(7).
The acceleration of each agent is determined from the gravitational law applied to each agent:
Subsequently, the searching strategy of each agent in the next iteration is governed by Equations (9) and (10).
To control the search accuracy, the gravitational coefficient G is not constant; instead, it is a decreasing function of time.
The remaining important GSA notations are listed below: : Fitness value of agent i in iteration t. best
t
: Best fitness value among all agents in iteration t. worst
t
: Worst fitness value among all agents in iteration t. : Mass of an individual agent i. : Average mass of agent i in iteration t. : Euclidean distance between agents i and c. : Force acting on each agent i in dimension j in iteration t. Kbest: the K fittest agents (agents with highest gravitational mass). ɛ: A small constant, introduced to avoid division by zero. G
t
: Gravitational coefficient in iteration t. : Acceleration of each agent i in dimension j in iteration t. α: Shrinking constant.
Tabu search (TS) is a meta-heuristic algorithm used for combinatorial optimization problems and was first proposed by Glover et al. [64]. The motivation for TS comes from the visited solutions and repeated visits of local search approaches, and is used to solve large, complex combinatorial problems. The most important strategy is to create a short-memory structure by TS, to record forbidden moves that do not improve the solution. This type of short-memory structure is called a tabu list.
Simulated annealing
The simulated annealing (SA) algorithm is a meta-heuristic method proposed by Kirkpatrick et al. [65]. The early concept of SA was derived from the annealing of a solid, as proposed by [71]. The SA algorithm is widely applied to solve complex combinatorial problems by using a perturbation searching strategy.
Particle swarm optimization
Particle swarm optimization (PSO) is a population-based swarm intelligence algorithm that was first presented by Kennedy and Eberhard [62]. The main concept of PSO is that each particle updates its current solution to reference with its own history experiences as well as the experiences of others, as shown in Equations 12 and 13.
This paper presents an MGSA algorithm for solving DFA problems. The solution of one agent is initially obtained by tabu search algorithm; the remaining agents are assigned random values. The best global solution is then computed by SPV rules and the evolutionary GSA algorithm; this process iterates until the stopping criterion is reached. The SA-VNS local search improves the quality of the best global solution generated by the GSA algorithm.
Objective function
The DFA problem is solved by discovering the consensus sequence and objective that maximizes the overlapping scores of all fragments. The DFA problem is formally described below.
Given a n of fragments f = {f0, f1, f2, …, f n }, the overlap score between two fragments i and j is wi,j.
In this study, we adopt the objective function proposed by Parsons et al. [72], given by:
The swarm based meta-heuristic GSA algorithm was originally designed for continuous problems. However, since the encoding format of each agent cannot be permuted, it is not directly compatible with the DFA problem. To enable a direct mapping of the agent position to the DNA sequence, we adopt a heuristic approach called SPV [63].
Initial population
In general, the population is randomly initialized according to Equation (15).
To guarantee that the initial population obtains a diverse and high-quality solution, one of the agents adopts the TS algorithm, which maps sequence to position values following the SPV rule. In Table ??, dimension 3 has a rank value of 1, so its position becomes – 0.79. The rank value of dimension 1 is 2, so the position is reassigned as – 0.39. By similarly assigning the remaining values based on their ranking order, the SPV rule transforms the DNA sequence into the position values [– 0.39, 0.22, – 0.79, 0.13, 0.15].
Each agent moves in a direction governed by its gravitational mass. The gravitational masses and accelerations are formulated by Equations (11) and (8), respectively. According to Equation (8), Kbest is the fittest K agent with the largest gravitational mass. Furthermore, in the original GSA algorithm, K is initialized to the number of agents A, and decreases linearly to 1. In this paper, we imported a new operator to update the value of K by following Equation (16) [74]:
During successive iterations, each agent searches for a solution space and updates its own solution according to Equation (9). This paper assumes a time-varying maximum velocity that adopted from a previous study [75]:
Once the position values of each agent have been updated by Equation (10), the new DNA sequence is rearranged using the SPV rule, as shown in Table ??.
VNS [14] is a local search strategy that improves the quality and ensures suitable development of a meta-heuristic method; further, it can also ensure adequate diversity of the meta-heuristic solutions. VNS uses four common neighborhood perturbation operators, namely swap, insertion, inversion, and displacement, as described below. For a survey of VNS applications, the reader is referred to [14]. Swap: Randomly choose two different positions from the permutation sequence and swap them. Insertion: Randomly choose two different positions from the permutation sequence and insert the front sequence ahead of the back one. Inversion: Invert the subsequence between two random positions in the permutation sequence. Displacement: Randomly select one subsequence and one position, and insert the subsequence before the chosen position.
In this paper, the GSA procedure is performed before the SA algorithm is implemented in the VNS local search. This ordering increases the local searching ability and avoids premature convergence. Importing the SA into the VNS process maintains appropriate balance between exploration and exploitation at different temperatures.
The proposed MGSA algorithm
A flowchart of the MGSA algorithm is presented in Fig. 3. Initially, the solution of one agent is found by TS algorithms, while the remaining agents are assigned random values. The global solution is then repeatedly optimized by the SPV rules and the evolutionary GSA algorithm until the stopping criterion is reached. Local searching by SA-VNS improves the quality of the best global solution generated by GSA.
Time complexity of MGSA
The MGSA algorithm can be simply divided into three parts: the initial population part; the GSA procedure part; and, SA-VNS local search part. The complete time complexity of the MGSA algorithm is analysed in the following.
Parameter setting
Assume there are A agents, n fragments, I number of iteration. In this paper, we used the quick sort to design the SPV-Rule; thus, the worst case SPV-Rule is O (n2).
Semi-global alignment
Assume the sizes of sequence S1 and S2 is equal to j and k, respectively. Accordingly, the complete time complexity of semi-global alignment is O ((n - 1) * jk)) = O (njk).
Initial population
First, one agent is obtained from the TS within 20 iterations, and the remaining solution agents are generated randomly and by the SPV rule. The time complexity of TS is O (20 * (n - 1 (jk)) = O (njk).Thus, the time complexity of the initial population is equal to O (njk) + O (A - 1 (n2)) = O (An2).
GSA procedure
Second, F, M, a, and the SPV-rule need to be found once in the GSA procedure. The F needs to calculate the force of each two agents, for which the complexity is O (A (An)) = O (A2n), the M needs to process the semi-global alignment of all agents once, then find the best and worst fitness values once; and, for the one normalization, the complexity is O (A (njk)) + O (A) + O (A) + O (A) = O (A (njk)), where the a is O (A). Thus, the complete time complexity of the GSA procedure is O (IA2n) + O (IA (njk)) + O (IA) + O (IAn2) = O (IA (njk)).
SA-VNS local search
Assume the SA-VNS has a T iteration, and the time complexity of SA-VNS is O (T (njk)).
According to the three parts of the time complexity outlined above, the time complexity of MGSA is O (An2) + O (IA (njk)) + O (T (njk)) = O (IA (njk)). Further more, the time complexity of the original GSA is as same as the O (IA (njk)).
Experimental results
This section evaluates the performance of the proposed MGSA algorithm. The test items are 19 benchmark DNA fragments from the NCBI (http://www.ncbi.nlm.nih.gov/), provided in shotgun format with fragment overlaps. The DFA benchmarks are generated by GenFrag [76], which splits the DNA sequence into different instances and provides a set of overlapping fragments as shown in Table 1. The experimentation was performed on a computer with an Intel Quad core Q9400 CPU operating at 2.66GHz and with 2GB of memory running Microsoft Windows 7.
Parameter settings
In order to verify the GSA, the parameter settings for all experiments were the same, as shown in 2.
All experimental results were run in 30 independent runs, with each run executed in 1000 iterations. The performance was assessed based on the average of the overlap score and computation time, for which the symbol t avg denotes the average computation time in seconds, and where the match score is 2, the mismatch score is 0, and the gap penalty is 2 [73].
Comparison of MGSA with SGSA, GSA,DSAPSO, and TPSOSV
To validate the effectiveness of the MGSA algorithm, in this paper we also propose a variant of GSA, named SGSA. In addition to the SGSA, the original GSA and two PSO-based algorithms are compared in this section. The following abbreviations represent the five algorithms considered:
Comparison of the overlap scores
The computational results are shown in Table 3. As can be seen, the majority of the MGSA overlap scores are better than SGSA, except for the benchmark TNFRSF19(4), TNFRSF19(7), X60189(4) and X60189(7). The experimental results indicate that the GSA algorithm with the initialization TS algorithm and the SA-VNS local search is more robust than the SGSA. On the other hand, results indicate that the MGSA outperforms the standard GSA, DSAPSO, and TPSOSV in average overlap of all instances is 10% , 75% and 62% , and especially outperform at almost 80% in instances within the BX842596(4), BX842596(7), M15421(5), M15421(6), M15421(7) and NC001807(4), and NC001807(7). Thus, from the above simulation results, it is concluded that MGSA is more effective than SGSA, the original GSA, and the other two PSO-based algorithms. It should be noted that the SGSA also outperforms the DSAPSO and TPSOSV.
Comparison for the computation time
Although the MGSA can outperform the other compared algorithms in overlap scores, the trade-off is computation time. Table 6 shows that the computational time of the MGSA is longer than the SGSA, the original GSA, DSAPSO, and TPSOSV by about 37% , 48% , 107% , and 32% , respectively. Further, based on Table 3, we can find that the SGSA is longer than the TPSOSV by only 2% computation time, but can achieve almost 40% overlap score. Thus, from the above simulation results, it is concluded that the GSA-based algorithm can outperform than the PSO-based algorithm in both overlap score and computation time.
Comparison of results obtained by MGSA and PH-PALS
In the previous subsection, the MGSA algorithm was demonstrated to provide the best solutions. Therefore, the MGSA was selected for a comparison with another state-of-the-art algorithm, named PH-PALS [17] also with 19 benchmarks. The comparison results are described in Table 5, the results of which indicate the MGSA can get the better overlap scores than the PH-PALS algorithm in most of benchmarks.
Statistical verification
Table 6 reports the two-sided Wilcoxon rank sum tests [77] of the MGSA, SGSA, GSA, DSAPSO, TPSOSV, and PH-PALS algorithms at the α = 0.001 significance level. Significant differences are indicated by +. Table 6 reveals that the superior performance of the MGSA is statistically significant.
Conclusions and future studies
This paper proposed a memetic Gravitation Search Algorithm (MGSA) algorithm for solving DFA problems. This algorithm converts continuous position values into job sequences by an SPV rule, initializes the population by a tabu search, and adopts simulated annealing using VNS as the local search method. These adaptations provide a balance between exploitation and exploration. The simulation results demonstrate that the pro- posed MGSA optimizes the overlap score of 19 bench- mark instances. One drawback of the MGSA is its higher computational time costs than the existing algorithms. In future studies, we will consider two approaches for reducing the computational time: DNA sequence compression, fuzzy entropy, and adapting our MGSA approach to suit the de-Bruijn-graph (DBG) model.
Footnotes
Acknowledgments
This work was supported in part by the National Science Council, Taiwan, R.O.C., under grants MOST 103-2221-E-006-145-MY3, and MOST 103-2221-E-006-181.
