Abstract
Abstract
Finding short patterns with residue variation in a set of sequences is still an open problem in genetics, since motif-finding techniques on DNA and protein sequences are inconclusive on real data sets and their performance varies on different species. Hence, finding new algorithms and evolving established methods are vital to further understanding of genome properties and the mechanisms of protein development. In this work, we present an approach to finding functional motifs in DNA sequences in connection to Gibbs sampling method. Starting points in the search space are partly determined via graphical representation of input sequences opposed to completely random initial points with the standard Gibbs sampling. Our algorithm is evaluated on synthetic as well as on real data sets by using several statistics, such as sensitivity, positive predictive value, specificity, performance, and correlation coefficient. Additionally, a comparison between our algorithm and the basic standard Gibbs sampling algorithm is made to show improvement in accuracy, repeatability, and performance.
1. Introduction
A
TFs recognize sequence sites in DNA that give favorable binding energy, meaning a sequence-specific pattern, usually 8–20 base pairs long. Therefore this pattern, otherwise called a “motif,” is relatively well conserved in composition and is inscribed in the DNA as a variation of a sequence of four unique bases: adenine (A), thymine (T), cytosine (C), and guanine (G). Part of the DNA codes for assembly of amino acids, forming a polypeptide chain. The genetic code is read in a sequence of three bases called triplets on DNA, codons on mRNA, and anticodons on transfer RNA. There are 43 = 64 triplets, and this is the smallest combination of four bases that could encode all 20 amino acids. The pioneers for the triplet code were Francis Crick and Sydney Brenner (Crick et al., 1961), who made an experiment that examined the effect of frame shift mutations on protein synthesis. Since then there was enough compelling evidence to infer that the genetic code is, in fact, a tree-based code (Smith, 2008).
To find the positions of motifs in DNA, various motif-finding algorithms were developed, and we can classify them into three categories: those that use promoter sequences from coregulated genes from a single genome, those that use orthologous promoter sequences of a single gene from multiple species (phylogenetic footprinting), and those that use both (Das and Dai, 2007). Further classification based on algorithm design places algorithms broadly into word-based algorithms (van Helden et al., 1998), probabilistic algorithms, and machine-learning techniques (Liu et al., 2006).
The probabilistic approach searches for the solution locally and represents the motif model by position weight matrix. One of the first implementations of this probabilistic scheme was with a greedy algorithm by Hertz et al. (1990), which evolved into a CONSENSUS algorithm (Hertz and Stormo, 1999). They introduced a method that evaluates an alignment through information content score based on large-deviation statistics. However, most probabilistic approaches use statistical techniques such as expectation–maximization (EM) methods and Gibbs sampling strategy. The former were used for identifying multiple motifs in biopolymer sequences with the MEME algorithm developed by Bailey and Elkan (1995).
Gibbs sampling, like the EM method, is an iterative procedure, where the results of every step depend solely on the result of the previous one. Unlike the EM method, the selection for the next step is not deterministic but rather based on random sampling, i.e., random numbers. Gibbs sampling for motif finding was first introduced by Lawrence et al. (1993), and since their publication various modifications to the algorithm were made, for instance, modeling for spaced dyads and motifs with palindromic patterns (Liu et al., 2001) using a probability distribution to estimate the number of copies of the motif per sequence and incorporation of higher order Markov-chain background model (Thijs et al., 2002), sequence weighting (Chen and Jiang, 2006), direct optimization of information content during sampling (Defrance and van Helden, 2009), among others.
In the subject of motif-finding algorithms, a term motif represents the conserved pattern, which is repetitive in the DNA strand. Thus a simple formulation of the problem is the following: in a given finite set of sequences, find an unknown motif that is over-represented. Statistically we can classify the motif-finding problem as a problem of missing data. The position of the motif element in a sequence, i.e., the elements of the alignment vector, can be treated as missing data (Gupta and Liu, 2006; Das and Dai, 2007; Liu et al., 2002).
In this article, we look for a solution differently by adding a preprocessing step before starting the iteration of the Gibbs sampler. Although the algorithm is based on Gibbs sampling strategy (Lawrence et al., 1993), we use graph representation of the sequences to help us project the local extremes in algorithm's search space. The sampling strategy is explained in more detail in the following section 2 and its application to multialignment problems in DNA sequences in section 2.1. In section 2.2, we describe the addition of graph representation and mention the points of difference from the original algorithm. In section 3, we detail the synthetic data sets and real data sets, which were gathered by Tompa et al. (2005), on which we tested our algorithm. The results are gathered in section 4, where we use statistics such as performance and correlation coefficient, sensitivity, and positive predictive value on nucleotide and site level to show the performance of our algorithm, which we named GraphGibbs. We conclude this article with section 5, where we summarize our findings and give some insight to future work.
2. The Method
The base of our algorithm is the original Gibbs sampling strategy for multiple alignment, as was introduced by Lawrence et al. (1993). We refer to it as the basic algorithm in the following text for simplicity, and we describe it briefly in the next subsection. The modifications we made are outlined in section 2.2.2.
Gibbs sampling is a particular case of Metropolis-Hastings algorithm, which, in turn, is an instance of a Markov chain Monte Carlo method (MCMC). The goal of the algorithm is to best approximate the distribution of unknown parameter, otherwise known as the target distribution. In our context, the unknown parameter is the alignment vector. The basic idea of how to approximate the target distribution is to construct a chain of states until convergence is reached. In construction, we invoke the Markov property, where the next state the chain visits, although determined by a probabilistic choice, depends only on the current state (first-order) or, in general, on the last m states (m-order). The starting point of the chain is usually arbitrary configuration of the unknown parameter (Resnik and Hardisty, 2010, Liu and Logvinenko, 2007).
2.1. Basic algorithm
Let K denote the number of residues in the alphabet of the sequences. The input of the basic algorithm is a set
The algorithm is initialized by choosing random starting positions within the various sequences, thus choosing a random starting motif alignment
In the update step, one of the sequences, Sz, is chosen, either at random or in specified order. Then the elements of position weight matrix
where ci,j is the count of residue j at position i in the motif, bj is the pseudocount of residue j, and
In the sampling step, every segment x of width W within sequence Sz is considered as a possible instance of the motif. According to position weight matrix
The most probable alignment is chosen with the following score:
where ci,j and qi,j are calculated from the complete alignment.
The authors Lawrence et al. (1993) also addressed the possibility in which a user parameter is a range of plausible motif widths, and let the algorithm decide the optimal one. They studied a statistic-named information per parameter
where p is the number of free parameters,
2.2. Addition and adaptation
In this section, we describe our addition to the basic algorithm. It is a preprocessing step of the data gathered solely from DNA sequences that partly defines the initial point of iteration process. We apply a graph model constructed by given DNA sequences and analyze its adjacency matrix to determine motif candidates that are over-represented in the set of sequences. We also overview the modifications made to the basic algorithm.
2.2.1. Addition
Because Gibbs sampling conducts the search locally through parameter space, the starting point in the search space is not so trivial as it affects the speed of convergence of the sampler. The downside of searching only parts of the space, although through a probabilistic choice, is that the algorithm can get stuck in a local optimum or plateau. One solution would then be to make periodical shifts in the search space. Another improvement is the approximation of a possible starting point for the Gibbs sampler, instead of the random initialization starting values for unknown parameters. We consider both approaches in our algorithm.
Since the DNA code works with triplets as its base blocks, the intuition is to code the given set of sequences into triplets and use Gibbs sampling strategy on such coded sequences. The reading of triplets in the original sequences was done by moving one residue to the right. For instance, the sequence of AATCGTGGC is coded as AAT, ATC, TCG, CGT, GTG, TGG, GGC. Since we have a DNA alphabet of four residues, the number of triplets is 43 = 64.
We take the input sequences and concatenate them into one supersequence R. Then we code the sequence R as described above and project it onto a directed multigraph G of order 64 (West, 2001). The vertices in G represent the triplets and are labeled by numbers 1 : 64, one number per triplet. The edge between two vertices in G reflect the adjacency between two consecutive triplets in R. Therefore, the number of edges between two vertices match the number of times the corresponding triplets are sequential in R. Each edge also has a direction, since we read the sequence R from left to right as the order of the triplets is important.
Once the sequence R is read and graph G constructed, we analyze the adjacency matrix of G. First we normalize it and then we look for the highest value in each row that can appear in more than one column. The reason is that the higher the number of directed multiple edges is between two vertices, the more frequent is the occurrence of sequence formed by these vertices inside the sequence R. We form (0, 1)-matrix, where the highest value of normalized adjacency matrix is represented by 1, and all other values are replaced by 0.
We sort the vertices in nonincreasing order per highest normalized values and select the front 10
1
vertices
We search for all walks of desired length with a modified breadth-first search algorithm (Cormen et al., 2001). While doing a systematic search of discoverable vertices from the source vertex vs, the breadth-first algorithm also produces a breadth-first tree of size ℓ with root vs that contains all reachable vertices. The modification makes the vertices discoverable more than once, opposed to at most once as in the basic breadth-first search algorithm. This way loops in the graph are also considered. Alongside the tree construction we also count the in-degree of each of the vertices. Then we consider the vertices of the highest in-degree together with their neighbor as a possible motif candidate.
From this motif candidate we build the motif in the following way: first we count the number of occurrences of the candidate in the input sequences. If their occurrence was at least N/2 times, then we would add the residues to each end of the candidate, until the number of motif occurrences is lower than N/2. In the case of undefined width W, the preferential feature is the motif width versus the number of occurrences. Therefore, the longer motif with less motif sites would be considered for further analysis, mainly because the shorter motifs are clustered inside the longer one and thus are not discarded. Once the motif is built we store the motifs positions in the alignment vector or matrix.
To help visualize the process, a representation of a (0, 1)-matrix is drawn in Fig. 1. The vertices are labeled with numbers 1 : 64, where each number represents a triplet. In this example, the motif width is W = 12. Then the length of the walk is ℓ = 3. One of the walks with source vertex 58 is drawn in blue color.

Graph representation of (0, 1)-matrix with one walk emphasized with blue color (vs = 58).
This forms part one of our algorithm. When the vector a or matrix (in case of multiple sites per sequence) of the constructed motifs is not full, the second part of the algorithm, the Gibbs sampler, is executed. The first adaptation of the basic algorithm is the use of the site vector/matrix a to be the starting point of the Gibbs sampling algorithm. The algorithm pseudocode is given in Algorithm 1.
Pseudocode of the GraphGibbs algorithm.
2.2.2. Adaptation
The resulting alignment, determined by the parameter I, is displayed with triplets and with residues of the original DNA alphabet. In each case, the algorithm displays two motif consensuses. For one consensus, the triplet/residue with the highest probability (given by
Another slight change is the way a new value of az in the temporary exempt sequence Sz is allocated in the sampling step. In our version, the value of az of the alignment vector is the start of the segment x that has the highest weight Ax, as opposed to random choosing. In the case of multiple motif sites per sequence, the sites of segments with the highest weights are chosen as the new sites in the alignment vector. The assumption is that after some steps of the Gibbs sampler the pattern should emerge by itself and would be reflected in the position weight matrix correspondingly. The weights of the segment mirror this evolution of the matrix, thus the most probable segments are chosen toward the final alignment.
3. Data
We tested our algorithm on several different cases of data. We considered the real data as well as synthetic data with known motifs. The real data sets used to test the algorithm are a part of the freely accessible benchmark data ensemble constructed by Tompa et al. (2005), which was analyzed with 13 different motif-finding algorithms according to their article. They build the data sets from real sequences gathered from TRANSFAC database. There were four species considered: yeast, fly, mouse, and human, resulting in 56 data sets. From yeast they constructed 8 data sets, 12 from mouse, 6 from fly, and 26 from human.
We also generated our own data sets, since they provide a well-controlled environment for evaluation of an algorithm. Because the motif sites are known, the evaluation statistics, sensitivity, and positive predictive value, among others, can be accurately computed, in contrast to real data sets, where some effective binding sites, even if predicted, could have escaped experimental detection and so are not considered as viable solutions. This makes predicted motifs harder to classify, as we cannot know whether the pattern is a part of the polynucleotide tract we can filter out, or it represents a possible functional motif. Even if predicted motifs are harder to classify, our algorithm will, under loose restraints, return all possible patterns (either consistent or nonconsistent) it finds in a set. Although, a reader should note that those patterns are statistically over-represented in the set and, if nothing more, could bear a closer examination of their viability. However, that distinction we leave to the user.
The synthetic data can be further divided into two groups. In each group the motifs are of known widths 9 and 12. In one group, these motifs are consistent, meaning the same pattern is repeated through all N sequences. In the other group, the pattern is inconsistent, allowing for different degrees of residue variation d(v). The highest degree of variation has value 3, applied in sets with motif width 12. In summary, we generated data sets per five different parameters, which are: the motif width
4. Results
Each of the data sets was analyzed by the algorithm 10 times with all parameters fixed at values for the widest solution scope. In other words, the algorithm decides the number of motifs, their widths, and the number of occurrences per sequence by itself. These specifications were chosen to make the evaluation of algorithm performance equivalent across all data sets, since all data sets were run under the same conditions. Here we wanted to display how many of known motifs our algorithm can find in a set and how good these findings are. We condensed the results into several groups for clearer representation. In the results that follow we presented our algorithm's performance on these different collections to show how good the algorithm can detect known motifs and how distinctive that detection is.
4.1. Statistics
For each set, we have a list of known motifs with their positions in the sequences and a list of predictive motifs with their respective positions. We measured the algorithm performance using evaluation statistics on nucleotide and site levels. At the site level, we look for the overlap between the known and predicted motif positions. Accepted overlap is at least a quarter of known motif width. At this level we merely count the number of overlaps. At the nucleotide level, on the other hand, we determine how big the overlap actually is. Then we can define the common scores of true and false positives, along true and false negatives at nucleotide level as follows (Tompa et al., 2005):
• nTP—the number of nucleotide positions in both known and predicted sites, • nFN—the number of nucleotide positions in known, but not in predicted, sites, • nFP—the number of nucleotide positions not in known, but in predicted, sites and • nTN—the number of nucleotide positions in neither known nor predicted sites.
Analogously, at site level we can define:
• sTP—the number of known sites overlapped by predicted sites, • sFN—the number of known sites not overlapped by predicted sites, and • sFP—the number of predicted sited not overlapped by known sites.
With these scores so defined, we can compute the following statistics:
i. Sensitivity:
ii. Positive predictive value:
At nucleotide level we can further compute:
iii. Specificity:
iv. Performance coefficient:
v. Correlation coefficient:
At site level one can also look at the average performance, that is:
4.2. Synthetic data
Results from the generated data sets were grouped according to motif width W, number of sequences in the data set N, and by motif consistency, where c represents a consistent motif and nc a nonconsistent motif. We labeled the group in a format of W_N_c/nc, for simplicity.
In each data set, we averaged the scores nTP, nFN, nFP, nTN, sTP, sFN, and sFP by number of repetition of a data set and calculated the statistics described in the previous subsection on both site and nucleotide levels. The results were then sorted in their respective group, where a weighted average for each of the seven scores was taken to calculate the statistics on both levels. In Fig. 2 below, the values of these statistics for each group are shown by bar chart on nucleotide level and by lines on the site level. Alongside eight distinctive groups we also show the values of statistics averaged across all analyzed data sets.

Combined representative statistics over eight groups of synthetic data.
The algorithm has shown good performance on sets with consistent motifs as expected, since the first part of the algorithm specifically searches for consistent, statistically over-represented motifs as candidates for the second part of the algorithm. Nearly all scores on both levels are at value 1, showing perfect correlation and high predictive power of the algorithm on sets containing consistent over-represented motif.
Naturally, more variability can be seen in groups with different degrees of variation. However, the nucleotide positive predictive values of all four groups, i.e., 9_10_nc, 9_20_nc, 12_10_nc, and 12_20_nc, are above 0.5 in two cases closer to 0.8. Therefore, there are not a lot of false positives among the predictions. Nucleotide sensitivity has slightly lower values comparing to nPPV but no lower than 0.5 in all four groups. Hence, the algorithm correctly predicted at least half of known motifs in all of the data sets.
Both of the correlation coefficients, nPC and nCC, are positive as well, giving a positive correlation between predicted and known sites. This is reflected on site level as well, since sSn and sPPV are both above the 0.6 threshold. The latter of the two has slightly higher values across all sets, meaning that the algorithm results in fewer false positives than false negatives.
4.3. Real data
To analyze the performance of the algorithm on real data, we chose the sets of type real of all four species from the data ensemble gathered by Tompa et al. (2005). The results from analyzed data sets were grouped by species, and the respective scores and statistics were calculated as with synthetic data. This way we gained four groups, and their statistics were computed from the weighted averages of the scores nTP, nFN, nFP, nTN, sTP, sFN, and sFP. Their values are drawn in Fig. 3 in the forms of a bar chart for nucleotide level and lines for site level.

Combined representative statistics over real data sets grouped by species.
The sensitivity on both levels has higher values compared to positive predictive values. This implies that the algorithm makes more false negative predictions; that is, it does not identify all of the known nucleotide positions. These results, however, are a bit skewed because of the nature of the algorithm. With GraphGibbs algorithm the candidate motif tends to be of shorter width, since consistency in motifs is not occurring randomly in the set and drops radically when the width of the candidate motif increases.
Analogously the positive predictive values on both levels are a consequence of the algorithm setting, since the number of expected sites is uniform for all sequences. Thus, the algorithm predicts more sites compared to the actual number of occurrences per sequence of known motifs. Here one should also take into consideration that with real data the known motifs are experimentally determined and some motifs still need to be classified. Therefore, the number of false positives may be overstated, since the result can show an unknown functional motif. The interpretation we leave to the user.
Two statistics that reflect the algorithm's performance with more moderation are the correlation coefficient nCC and the average site performance sASP. The best results in the real data sets were for fly and yeast species. The former is more surprising, since the predictions on this set seems to be poorer compared to other sets (Tompa et al., 2005).
4.4. Additional statistical analysis
In Fig. 4 we show the specificity on nucleotide level, which is calculated from a weighted average of nucleotide scores of all data sets per group. It shows that true negatives are properly defined in most groups. There is an understandable dip with real data sets, however, it is not considerable. Here it is shown again that algorithm performance is best on the fly species. Notable are the higher values of human and mouse sets compared to yeast sets, although by a very small margin.

Specificity averaged across all data sets in the group.
To validate our algorithm, we compared the values of parameter I calculated from known and predicted alignment in selected generated sets. We took 4 and 6 sets of widths 9 and 12, respectively. In all of the sets the motifs are not consistent, L = 1000 and E = 1. The values of parameter I, averaged across 10 runs in each of 10 selected sets from predicted alignments of GraphGibbs and the basic algorithm, are shown in Table 1 alongside the value of I from known alignment. The average index of point of convergence is also added to the table to compare how fast the algorithm converges to the final alignment from partially determined initial alignment (GraphGibbs) versus completely random initial alignment (basic algorithm). The difference is small, however it does show that with GraphGibbs the convergence is faster.
RSD, relative standard deviation; boldface indicates significant difference.
To examine the accuracy (Zupan, 2009) of both algorithms, a paired two sample t-test for means of parameter I was performed on two groups differentiated by motif width. In the case of the GraphGibbs algorithm, the t-test shows no significant difference between the known and predicted alignment. The corresponding p-value (two tail) shows no statistically significant difference for neither first group, W = 9 (p = 0.14), nor the second group, W = 12 (p = 0.06). Meanwhile, for basic algorithm the t-test showed no significant difference where W = 9 (p = 0.09), but there is a significant difference (p < 0.05) in the second group. To emphasize significant differences, a boldface font was used in Table 1.
Since each set was run 10 times with both algorithms, we calculated the relative standard deviation (%RSD) of parameter I on both groups to measure repeatability (Zupan, 2009). The values for both groups are lower in the case of GraphGibbs opposed to values gained from the basic algorithm. The dispersion in the second group is higher due to the lower accuracy and higher degree of residue variation. In this group the difference between values of %RSD is more noticeable as well, because of the random sampling of the basic algorithm.
The time component of the two algorithms was compared in regards to site sensitivity. Both properties were averaged across 10 runs. For each algorithm the results were further divided by N, since larger sets are expected to have longer execution time. Figure 5 shows that basic algorithms have shorter execution time compared to GraphGibbs, however, the site sensitivity of the basic algorithm is very low. In contrast, GraphGibbs has higher average site sensitivity, although it takes longer to execute. This is also expected, since GraphGibbs builds upon the basic algorithm and has a wider search criteria.

Site sensitivity in relation to execution time of GraphGibbs and the basic algorithm on 10 sets. The sets are further grouped by N.
5. Conclusions
In this article, we explore another way of finding relevant motifs in DNA sequences. We build upon a Gibbs sampling method to help determine the regions of search space, which are more likely to contain a global extreme, consequently increasing accuracy of the predicted motifs.
The statistics have shown that GraphGibbs algorithm does enhance the basic algorithm of Gibbs sampling in terms of site sensitivity and faster convergence of the sampling method. It's performance on generated data sets is good even on sets with nonconsistent motifs at various degrees of residue variation. GraphGibbs also detected motifs in real data sets, although the sPPV was lower on most data sets compared to sSn. As discussed above, this is in part a consequence of algorithm setting, since the expected number of motif occurrences is uniform for all sequences. Hence the alignment N × E matrix is full and results in more false positives. In addition, the motifs in real sets are determined experimentally and some functional motifs of interest may have escaped detection and classification. Thus, the resulting values of sPPV may be perceived lower than in cases of fully analyzed real sequences.
To help improve the accuracy of our algorithm, we are focusing on reducing the number of false positives by considering nonuniform distribution of expected numbers of sites through the set. This means that we have to look at the number of motif occurrences individually per sequence. We are exploring a probability distribution to help estimate the number of copies of a motif per sequence as was introduced by Thijs et al. (2002). Another modification in progress is a motif-ranking procedure to rank all motifs found by the algorithm. Additionally, we want to mitigate the limitation on motif width of predicted motifs, a consequence of part one of the algorithm, where consistent motifs are being detected and those are usually short. This way we would lower the number of false negatives on nucleotide level.
Since the GraphGibbs algorithm is essentially trying to find patterns composed with a specific alphabet, the algorithm could be applied to pattern recognition problems with structured patterns outside of biology and genetics, with slight adaptation to the new alphabet and inherent properties of the data set.
Footnotes
Acknowledgments
The author thanks Damjana Kokol Bukovšek, Matjaž Omladič, Alenka Stepančič, and Gregor Šega for their guidance and help with the development and analysis of the algorithm. The author also thanks Arctur d.o.o. for technical support and access to the supercomputer Arctur-1 for algorithm execution. This work was supported in part by the European Union, European Social Fund. The real data set gathered by Tompa et al. (
) can be accessed online.
Author Disclosure Statement
No competing financial interests exist.
1
The choice of number of vertices is arbitrary, however, we found that usually around ten vertices stand out with higher degree compared to the rest.
