Abstract
Abstract
Spaced seeds have been recently shown to not only detect more alignments, but also to give a more accurate measure of phylogenetic distances, and to provide a lower misclassification rate when used with Support Vector Machines (SVMs). We confirm by independent experiments these two results, and propose in this article to use a coverage criterion to measure the seed efficiency in both cases in order to design better seed patterns. We show first how this coverage criterion can be directly measured by a full automaton-based approach. We then illustrate how this criterion performs when compared with two other criteria frequently used, namely the single-hit and multiple-hit criteria, through correlation coefficients with the correct classification/the true distance. At the end, for alignment-free distances, we propose an extension by adopting the coverage criterion, show how it performs, and indicate how it can be efficiently computed.
1. Introduction
T
The organization of the article is as follows. Section 2 gives notation and definitions related to spaced seeds. Section 3 defines the coverage of spaced seeds and proposes the tools used to measure it. Section 4 shows how coverage can be used in two biologically oriented applications : first, when spaced seeds are included within SVM kernels (subsection 4.1), and second, when spaced seeds are applied to measure phylogenetic distances (subsection 4.2). In this last case, we also propose a new distance based on the coverage (subsection 4.2.2) and the substantial improvement achieved. Section 5 provides, at the end, some concluding remarks.
2. Notation
We suppose here that strings are indexed starting from position number 1. For a given string u, we will use the following notation: u[i] gives the i-th symbol of u, |u| is the length of u, and |u| a is the number of symbol letters a that u contains. Also, d (u) is the prefix of length d of the string u, and (u) d is the suffix of length d of the string u. For two strings u and v, u · v is the concatenated string.
Alignments without gaps (indels) can be modeled by a succession of mismatch or match symbols, and thus be represented as a string x in a binary alphabet {0, 1}. A spaced seed can be represented as a string π, but in a different binary alphabet {*, 1}; 1 indicates a position on the seed π where a match must occur in the alignment x (it is thus called a must match symbol), whereas * indicates a position where a match or a mismatch is allowed (it is thus called a joker symbol). The weight of a seed π (denoted by w) is defined as the number of must match symbols (w = |π|1), whereas the span/length of a seed π (denoted by k) is its full length (k = |π|).
A spaced seed π of length k hits an alignment x of length n starting at position
The usual requirement for a seed, when used to detect alignments x, is to have at least one hit (Keich et al., 2004) in x, the so called single-hit criterion. Several methods are also based on multiple hits, as they require more than one hit to trigger an alignment extension (Burkhardt et al., 1999; Rasmussen et al., 2006; David et al., 2011). In the next section, we extend the way to define criteria based on seed hits by measuring coverage provided by these hits.
3. Definition and Computation of the Seed Coverage
3.1. Definition of the coverage
The coverage of a seed π on an alignment x is defined by the number of 1's in the alignment x that are covered by at least one must-match symbol of one of the seed's hits (Benson and Mak, 2008; Martin, 2013; Martin and Noé, 2014).
For example, the seed π = 11 * 1 has three hits on the string alignment x = 101111001011111. The coverage provided by these hits (denoted by • symbols below) is 8.
The coverage concept can be generalized to multiple seed patterns. For example, the set of seeds {π1, π2} = {11 * 1, 1 * 1 * 1} has six hits on the string alignment x. The coverage provided by these hits is 11.

3.2. Coverage automaton
Given a seed π or a set of seeds 
For a set of seeds
For example, for the set of seeds {π1, π2} = {11 * 1, 1 * 1 * 1}, the state reached when reading the first string alignment x = 1011110011110 (used in the previous example), is represented by the pair 
The new state may be represented by the pair
From the point of view of the automaton definition, two finite state machines are possible: Mealy or Moore. Accordingly, the automaton must provide the coverage increment, either on each transition (for the Mealy automaton) or on each state (for the Moore automaton). For example, on the set of seeds {π1, π2} = {11 * 1, 1 * 1 * 1}, these two representations are illustrated in Figures 1 and 2; due to size, we present here the minimal version for both automata by merging equivalent states. For readability, when some hits occur, we have represented the states

Minimized Mealy coverage automaton (count on transitions) for the seeds {π1, π2} = {11 * 1, 1 * 1 * 1}.

Minimized Moore coverage automaton (count on states) for the seeds {π1, π2} = {11 * 1, 1 * 1 * 1}.
The Mealy automaton is obviously more compact when considering the number of states. On the other hand, it requires one to store an additional value per transition [and also needs more specific algorithms; for example, the Hopcroft minimization algorithm (Hopcroft, 1971) must be adapted to the Mealy case].
Each representation has been independently implemented by one of the authors; the one based on count on transition (Mealy) is implemented in Matlab (see Martin, 2013; Martin and Noé, 2014), and code has also been tested on Octave (Octave community, 2014), whereas the other, mainly for compatibility issues, is based on count on states (Moore), and is generalized for subset seeds (slight extension of spaced seeds) with multiple seeds in mind (Kucherov et al., 2007). The “Mealy” Matlab code is available upon request from the second author, and the “Moore” code is included in the C++ Iedera program (Kucherov et al., 2014) starting from development version 1.06 α7.
Several minimizations of the states (considering both q and c) can be considered during the construction of these automata, but the details are out of scope of this article (see Kucherov et al., 2007; Martin and Noé, 2014). In practice, we use at least two methods to detect coverage strings c that are equivalent, together with the optimization of Kucherov et al. (2007) on strings q to save some memory space before completing the full automaton. Note that this last automaton, once entirely built, can always be fully reduced to its minimal form, for example, by applying the classical Hopcroft minimization algorithm (Hopcroft, 1971).
Independently, we also mention that it seems difficult, for this special coverage problem, to find an equivalent classical regular expression to help build the automata. Even classical tools (such as grep) have for example equivalent parameters to simulate a single or multiple hit, but no parameter is provided for this coverage problem.
3.3. Computation
Given a generative model for the string x, it is possible to compute the distribution of the coverage values according to a Markov process (Martin, 2013; Martin and Noé, 2014) or any model that can be represented by a nondeterministic probabilistic automaton (Kucherov et al., 2006; Nuel, 2008; Marschall et al., 2012; Martin and Noé, 2014). We don't consider directly in this article this computation, as the model used in our tests is pure Bernoulli; the computation can thus be performed directly with a simple dynamic programming algorithm on the coverage automaton of section 3.2. We refer to the aforementioned articles for more details on more complex probabilistic models.
Independently, we also mention that the work proposed here is applied on the lossy seed framework, in the sense that we consider a probability to hit (or cover) an alignment sequence x generated by a model χ. However, this work is not strictly limited to probabilities and can be easily extended, for example, to the lossless seed framework. In that case, the set of alignments is fixed, for example, by giving a fixed length together with a fixed number of errors; the problem is then to always hit (or cover) any of the alignments on this set (so without loss). This last computation can be done easily, simply by replacing the semi-ring used for probabilities by a less conventional tropical semi-ring (Simon, 1988; Pin, 1998; Mohri, 2009) used for match/mismatch scores or mismatch costs. Note also that the simple fact of counting the number of alignments, in alignment classes that have a given percentage of identity (as done in Benson and Mak, 2008) or a given coverage for a set of seeds, or any combination of these elements, is also possible, by use of a counting semi-ring adapted for this task (Huang, 2006).
4. Experiments
In this section, we consider two biological sequence oriented applications that have recently been proposed to use spaced seeds; SVM classifiers based on spaced string kernels (Onodera and Shibuya, 2013) and alignment-free distance estimators using spaced k-mers (Boden et al., 2013; Leimeister et al., 2014; Horwege et al., 2014). We show that the coverage sensitivity can be used in both cases to improve the estimators, and thus also be applied to the selection of the best seed patterns on such domains.
4.1. Coverage sensitivity and spaced seed string kernels
String kernels (Lodhi et al., 2002) are a classical model used for text classification with SVM. They have frequently been applied to biological sequence classification, as k-spectrum kernels (Leslie et al., 2002), mismatch k-spectrum kernels (Leslie et al., 2004), string alignment kernels (Saigo et al., 2004), and profile-based string kernels (Kuang et al., 2005) to cite a few examples.
The k-spectrum kernel and its derivative is mostly used with contiguous seeds; surprisingly, no spaced seeds were designed to comply with this approach. However, it has been experimentally shown (Onodera and Shibuya, 2013) that spaced seeds help decrease the zero/one misclassification rate in practice, even for the simplest kernels. The main reason of this lack might be the intrinsic difficulty to find a correct estimation criterion for spaced seed patterns, but on the other hand, not so much effort has been made to increase the diversity of criteria used. Most of the proposed algorithms to estimate spaced seed sensitivity concentrate on the single-hit criterion (“at least one hit for a seed/set of seeds”). This criterion makes sense for classical “hit and extend” alignment methods used in bioinformatics, but seems to be too restrictive for spectrum kernels that are supposed to filter the information content based of “several concordant clues.”
The multihit criterion (“at least t hits for a seed/set of seeds”) seems at first more appropriate for this task, but again has never been tried in this field of research. One possible drawback is that it does not distinguish highly overlapped hits of seeds from disjoint ones. Finally, and for the latter reason, we also decided to apply the new coverage criterion (“at least t covered 1-symbols in the alignment, each covered by at least one 1-symbol of a seed hit”) in comparison with the two others. In the two following subsections, we try to correlate these three criteria with the SVM zero/one misclassification rate.
4.1.1. SVM-benchmark and protocol
The benchmark used for this test consists of 2208 families extracted from the noncoding RNA database RFAM v11.0 (Burge et al., 2012). It represents up to 65908 sequences per family.
We decided to split each family by randomly picking 50% of its sequences for the SVM learning process and keeping the 50% remaining for the classifier to measure the zero/one misclassification rate. We use the SVMmulticlass (Joachims, 2002) package Version 2.20 (dated August 14, 2008) with the linear kernel. In each case, single or double seeds of weight 3 and span from 3 to 7 were used as a k-spectrum string kernel.
4.1.2. Seed sensitivity
In parallel, for each single or double seed we compute its “sensitivity,” either using the single-hit criterion, the multihit criterion, or the coverage criterion. Note that, for the two last criteria, we have the possibility to change the threshold t required to consider a success. We arbitrarily choose to measure these seeds on an i.i.d. alignment model of length 32 (the probability to have a 1-symbol in the alignment has been fixed at 0.7), although experiments show that this does not have much influence on the final results (data not shown).
Examples of comparative plots are given in Figure 3 for multihit and coverage-hit; a slight correlation can be seen at first sight. But we can also see that some seeds with repetitive and highly correlated patterns (e.g., 1*1*1), usually bad in theory, are in practice more efficient for the SVM classifier.

Zero/one misclassification rate vs theoretical sensitivity.
4.1.3. Correlation between zero/one misclassification and the three criteria
Finally, to determine if one of the three estimators was better suited to correlate with this SVM classifier task, we computed the sample Pearson correlation coefficient for each of the three criteria, for each set of seeds. This gives the best correlation between the theoretical seed sensitivity estimated by one of the three estimators with the experimentally measured sensitivity of the SVM classifier of each set of seeds.
For both the multiple hit and coverage criteria, we allowed the threshold parameter t for seed sensitivity to vary (x-axis). These results are illustrated in Figure 4 for single and double seeds.

Correlation coefficient between zero/one misclassification rate and theoretical sensitivity.
Surprisingly, correlation results for the multihit criterion are not good when the number of hits required is too large. This must be taken into account when using this criterion because the multihit criterion gives the correct results for double seeds when the number of hits is, for example, at 2.
The single-hit criterion gives good results for each set. However combining single and double seeds into one set and doing the same experiment makes it the worse of the three estimators (data not shown). A carefully chosen value for the coverage criterion (here between 14 and 16) helps to reach the highest correlation of the three for double seeds. On single seeds, this is difficult to conclude, due to the few seeds of weight 3 that have been tested. Note that we also tried the same experiment for seeds of weight 4 but the dimension used here (44) makes the classifier more random without a preselection of dimensions (data not shown).
We can first notice that the correlation of the single-hit criterion is more stable than the correlation of the coverage criterion that varies more for lower thresholds. It also seems that the optimal coverage threshold is at some point a surprisingly quite regular and convex function that might be estimated when enough data is available.
4.2. Coverage sensitivity and alignment-free distance for sequence comparison
Estimating alignment-free distance is a common method used for sequence comparison in multiple alignment tools guided tree estimation (Edgar, 2004) and related phylogenetic tree estimation (Qi et al., 2004; Liu et al., 2008). Several distances are based on fixed size k-mers (Vinga and Almeida, 2003; Simsa et al., 2009) with possible mismatches allowed (Apostolico et al., 2014), or with variable length k-mers—local decoding (Didier et al., 2012), irredundant common subwords (Comin and Verzotto, 2012), etc. They are applied on assembled genomes (Haubold et al., 2005; Chor et al., 2009), protein classification (Stropea and Moriyama, 2007), and even on unassembled genomic data to estimate phylogenies (Maurer-Stroh et al., 2013). We refer to a recent special issue on alignment-free methods for more details (Vinga, 2014).
Interestingly, it's only in the last year that the use of spaced seeds has been proposed (Boden et al., 2013; Leimeister et al., 2014; Horwege et al., 2014), with recent applications for specific next generation sequencing (NGS) tasks, such as multiclonal clusterization (Giraud et al., 2014). Here again, the lack of seed criteria used in the literature didn't help the selection of good seeds for these tasks.
In subsection 4.2.1, we recall that the “classical” distance can be estimated by multihit sensitivity computation, which helps in selecting good spaced seeds. In subsections 4.2.2 and 4.2.3, we also show that coverage sensitivity can be used in a more elaborate distance; this distance can be computed using the longest increasing subsequence (LIS) of the positions of the common hits between gapped k-mers. As LIS can be computed in t · log(t) time, where t is the number of hits, it is thus a reasonable estimator in practice.
4.2.1. Multihit experimental support
One common method used to estimate alignment-free distances is based on k-mer frequency; 4 k counts can be first made and used as simple feature frequency profiles (where counts are normalized to relative frequencies for any of the 4 k k-mers), or more elaborate composition vectors (where normalization is done with the help of a background model). Distances can then be estimated by several models (Vinga and Almeida, 2003) to provide phylogenetic applications with an initial distance matrix. As some of these phylogenetic methods, such as unweighted pair group method with arithmetic mean (UPGMA) (Michener and Sokal, 1957), start by considering small distances, it's important to have the best estimator here and keep track of common k-mers (or spaced k-mers) and their common locations in the two sequences. One estimator that can help in that task is the number of seed hits obtained; we will call it the multihit value.
For our experiment, we use a set of seeds (627 seeds of weight 3 or 4, span up to 7, single seed or double seed patterns), a percentage of identity varying from 20% to 100% by steps of 5% each time, and we generate (for each percentage of identity) 1000 alignments of length 32. We then measure the multihit value of each alignment and compare it to the true alignment distance.
It can be shown first (Fig. 5, x-axis only) that the correlation coefficient is high (>0.9 for seeds of weight 3, less otherwise). Provided that we expect to pay a little additional cost, it is possible to improve this result, as shown in the next section.

True distance correlation with multihit (x) vs. coverage (y) distance.
4.2.2. Coverage experimental support
The distance we propose to measure is based on the number of covered 1-symbols in the alignment, each covered by at least one 1-symbol of a seed hit. We will call it the coverage value. To show how this distance better estimates the true distance (we assume here that the Hamming distance is the true distance), we are repeating the same experiment with both the multihit value and the coverage value on the set.
We use the same protocol here—the same set of seeds (627 seeds of weight 3 or 4, span up to 7, single seed or double seed patterns), the same percentage of identity varying from 20% to 100% by steps of 5% each time, and generating for each percentage of identity the same 1000 alignments of length 32 each time, we measure the multi-hit and the coverage values for each simulated alignment. Then, we compared the correlation coefficient for each of these two measures with the true percentage of identity used to simulate the alignment.
The correlation coefficient for all the seeds was 0.88 for the multi-hit value and 0.96 for the coverage value. We tried to refine this first experiment by separately measuring single seed patterns and double seed patterns and running the same test. For single seed patterns, the correlation was 0.89 and 0.94 respectively, whereas for multiple seed patterns, it was 0.89 and 0.96 respectively. We also tried to measure this correlation for each of the 627 seeds; Figure 5 plots these two correlations (pair of coordinates).
Note first that, as all the points for this plot are in the upper left region, the true distance is better estimated by the coverage value than by the multihit value. We can also notice that double seeds outperform single seeds in both cases, so that multiple seed patterns can help in estimating the distance more accurately than single seed patterns. The gain is even better for the coverage value than for the multihit value.
From the point of view of the seed weight and the number of seed patterns used, we can see that using two patterns of weight 4 gives the same correlation coefficient as using one single pattern of weight 3, but only for the coverage value, not for the multihit value. This encouraging result may help defend the idea that more patterns of larger weight will help measure a correct distance. Note that this conclusion is quite similar to the one provided 10 years ago for detecting alignments (Li et al., 2004), which was recently and independently observed in Leimeister et al. (2014); Horwege et al. (2014), but here, as the distance estimation problem is quite different from alignment detection, the seeds designed will probably be completely different from those previously seen.
From the point of view of the seed patterns, we can see in Figure 5 that, for single seeds, selection done for both values gives the same optimal seed pattern 11*1 (or its mirror) for weight 3, and the same optimal seed pattern 1*1**11 (or its mirror) for weight 4. The choice for the optimal double seed patterns differs between the multihit or coverage values, and this difference is even more marked for seeds of weight 4.
However, computing the coverage is more difficult than simply counting common k-mers. We justify in the next part that, given two easily measurable assumptions on the sequences and the k-mer weight, this task can be done efficiently.
4.2.3. Coverage algorithmic point of view
In this part, we briefly describe how coverage can be computed efficiently. Given two sequences s1 and s2 of equivalent length, we want to search for the spaced k-mers that are common to s1 and s2. But, more than establishing a frequency profile for these common k-mer codes, the main idea here is to find a set of common k-mers that have the same order of position occurrences on s1 and s2. To do so, one solution is to keep occurrences of any of the 4 k possible k-mers in a reverse list of positions (given one k-mer code, we have two lists of positions where this k-mer occurs, on s1 or, respectively, on s2). Keeping the common k-mers of both s1 and s2, sorting their list of pairs of occurrence positions according to the positions of one of the two sequences (for example, positions along s1), then applying an LIS (or a windowed LIS if the two sequences are not of similar lengths) on s2, provided that spurious k-mers (those occurring randomly) are not frequent, will give a better approximation for the number of true hits and thus can be used to compute the coverage.
Note first that the LIS can be computed in t· log(t) time (Schensted, 1961), where t is the number of hits (e.g., pairs of positions for a common k-mer). This value t, provided that k is well chosen to correctly filter spurious k-mers and there is no composition bias on both sequences, must be either close to |s1| and |s2| if the s1 and s2 sequences are similar (and without self-repetitive bias/low complexity regions), or reasonably low if the sequences are nonsimilar, but can be otherwise high for low complexity/self-repeating/redundant regions that similarity search tools usually want to avoid.
Note also that, once the common and ordered hits are collected by the LIS procedure, it is possible to compute the coverage using:
• either a masking process using shift–or for collecting the coverage symbols, and then computing the coverage increment (which implies an additional CPU cost if no population count instruction is available), • or an automaton (an example is provided in Fig. 6 for the hits of the seeds {π1, π2} = {11 * 1, 1 * 1 * 1}) that keeps the last overlapping suffix of the previously encountered hits for any of the seeds. This automaton has an alphabet of size 2#seeds, since we record whether or not there is a seed hit for each seed. Otherwise, a very similar definition to the coverage automaton holds. Once this automaton is built, it is possible to compute the coverage increment in constant time.

Mealy coverage increment automaton for hits of the seeds {π1, π2} = {11 * 1, 1 * 1 * 1}.
In both cases, gaps (indels) must be taken into consideration because they break—from a dot-plot point of view, diagonals—thus reinitializing the automaton or the coverage mask.
5. Concluding Remarks
We have presented how the coverage criterion (Benson and Mak, 2008; Martin, 2013) can help in measuring the seed efficiency in two recent problems: a classifier based on spaced k-mers (Onodera and Shibuya, 2013) and a k-mer alignment-free distance estimation (Boden et al., 2013 Leimeister et al., 2014; Horwege et al., 2014). We have also shown how to extend the second one to be even more sensitive.
The Moore (or Mealy) automaton obtained to measure the coverage criterion is by itself of interest for several reasons; its size seems to be bounded by polynom(w, r) ×
For example, the coverage automaton size for the PatternHunter 1 seed 111*1**1*1**11*111 is:
where the current sizes for Moore and Mealy automata are respectively obtained by the Iedera tool (Kucherov et al., 2014, version 1.06 α7) or by the Matlab code (Martin and Noé, 2014) before minimization by the gap-system functionally recursive group (FR) package (Bartholdi, 2012). These sizes can be compared with those of the mere multihit automaton:
Although the coverage automaton is more than 10 times larger than the equivalent multihit automaton, it is still usable for dynamic programming computation.
This is even true for multiple spaced seeds. For example, the coverage automaton size for the PatternHunter 2 seeds of weight 11: 111*1**1*1**11*111, 1111**11**1*1****1*11, 11*1****11***1*1*1111, 111*111*1***1111 (called first four in Li et al., 2004) is
to be compared again with the mere multihit automaton current size (and its minimal size):
Although more than 20 times larger than the equivalent multihit automaton, the coverage automaton for multiple seeds is again still usable for dynamic programming computation.
It would also be interesting (but out of the scope of this article) to consider SVM kernels or k-mer distances with subset seed (Kucherov et al., 2007; Yang and Zhang, 2008; Gambin et al., 2011; Frith and Noé, 2014) or more general vector seed (Brejová et al., 2005) techniques. Several string kernels, such as the mismatch string kernel (Leslie et al., 2004), use this general concept but generate full neighborhoods (all the words at a given distance from a given k-mer). Moreover, optimal resolution (best seed weight) (Simsa et al., 2009) remains an open problem for spaced seeds in both SVM kernels or k-mer distance problems. Note also, if one wants to avoid this optimal resolution question, seed design and increasing weight can be combined (as done in Csűrös, 2004; Kiełbasa et al., 2011), but may not be always directly compatible with the aforementioned cited works on variable k-mers.
A last idea to explore is also to merge the definition of clumps (Stefanov et al., 2007; Bassino et al., 2008; Martin and Coleman, 2011; Marschall et al., 2012; Régnier et al., 2014) with coverage, for example, by giving more significance (than a linear weight function) to coverage provided by clumps of hits than coverage provided by isolated hits.
6. Appendix
6.1. Seed coverage automaton size
We consider in this part the size of the seed automaton. Given a seed of weight w and r jokers, we are particularly interested in a bound for the size of the coverage automaton, as this can provide a limit on memory needed for future analyses.
In this section, we first solve the problem in the special case of a seed of the form 1* r 1, before going to a more general case of a seed of weight w and r jokers, for which we show a more general (but less satisfying) upper bound.
6.1.1. Seed 1*r1 coverage automaton size
The 1* r 1 seed family has already been shown to reach the multihit automaton size bound (Kucherov et al., 2006). As a nightmare for the classical seed design tools, such seeds are good candidates to start with.
The multihit automaton size is, in the general case, of maximal size (w + 1)2 r (Buhler et al., 2005; Kucherov et al., 2006). Moreover, for seeds of the form 1* r 1, this size cannot be reduced further (Kucherov et al., 2006); thus, 1* r 1 always have multihit automata of size 3 × 2 r (illustrated in Fig. 7a where not all the transitions are shown).

Moore multihit (
The coverage automaton size for seeds of the form 1* r 1 is respectively 4 × 3 r for the Moore automaton, and 3 × 3 r for the Mealy automaton.
1. the very first 1 symbol under a must match symbol can be covered or not (two possibilities: 1 or
2. the next l-1 symbols under joker symbols can be independently chosen as 0 or 1 (three possibilities: 0, 1, or
3. the very last symbol under the last joker symbol can be independently chosen as 0 or 1 (two possibilities). Note that this 1, as new, cannot be covered by a previous hit.
For a given prefix q with
Such seeds 1* r 1 have thus a Moore coverage automata of size 4 × 3 r .
Note that this size cannot be reduced. In other words, given any pair of states
• if qa and qb are different, then it is always possible to find one string u such that only one of the two walks reaches a final state (as done in Kucherov et al., 2006).
• otherwise, the coverages ca and cb must be different; it is then always possible to find one string u going to two final states that have a different coverage increment for the Moore automaton.
We concentrate now on the Mealy automaton. The main difference with the Moore automaton is that suffixes of full length k (that are final states of the Moore automaton) are not represented because coverage values are set on transitions, and not on states (see Martin and Noé, 2014).
For the Mealy automaton of Martin and Noé (2014), and seeds of the form 1* r 1 (of length k = r + 2 and weight 2), there are 2 × 3 l proper prefixes q of length l + 1 (0 ≤ l + 1 < k):
1. the first symbol must be 1 or
2. the next l symbols can be 0, 1, or
Adding the initial state gives:
This bound is reached for the same reasons of nonreducibility (applied on transition labels on Mealy, and not on final state labels as in Moore). ■
6.2. Coverage automaton size in the general case
Now consider a seed of span k with r jokers and of weight w (w + r = k). Following the previous section, 6.1, a similar reasoning gives a bound on the automaton size of 2 w × 3 r for the Moore automaton and of (2 w −1) × 3 r for the Mealy automaton.
• This is first true for j = 0, because the empty state (also called the initial state)
• If we suppose that it is true for a given i
1. When |q| = |c| ≤ i, by taking the set of the
2. Otherwise, when |q| = |c| = j, by considering and adding to this set the states
(a) If the last symbol π[j] of the seed prefix j (π) is a must match, this symbol can only be compatible with a 1 on q[j] (and this 1 cannot be covered by c[j], as the last one being added).
(b) If the last symbol π[j] of the seed prefix j (π) is a joker, this symbol can be compatible either with a 0 or a 1 on q[j] (which cannot be covered by c[j] too).
Considering now the prefix i (π) preceding π[j], we can see that:
– The wi must match symbols of
i
(π) are compatible with a 1 or a
– the remaining ri jokers of
i
(π) are compatible with a 0, a 1, or a
Combining each of the cases (a) and (b) with the preceding prefix
i
(π) gives
At the end, because (a) wj = wi + 1 and rj = ri, or (b) wj = wi and rj = ri + 1 otherwise, we can see that summing the number of states when |q| = |c| ≤ i and when |q| = |c| = j gives the expected result
It must then be noticed that, even for final states, new symbols that have just been covered
We concentrate now on the Mealy automaton. Again, the main difference with the Moore automaton is that suffixes of full length k are not represented because coverage values are set on transitions, but one thing to consider is that the last symbol can be covered on the Mealy automaton (see Martin and Noé, 2014). By a similar reasoning, there are thus at most
Note that if we suppose that the seed starts with a must match symbol, then this bound can be reduced a little more:
■
We notice in practice a much smaller size, and we suspect this bound more likely to be a polynom(w,r) × 3 r value, instead of exponential both in 2 w and 3 r . In the special case of symmetric seeds, we already have a very simple proof of this polynom(w, r) × 3 r bound. This is interesting, because experimentally, seeds of the form 11 u (*1 u ) r 1 have been shown to give large coverage automata size.
Note even if not satisfying, the general result still improves on the only available “bound” proposed to date (in Benson and Mak, 2008) that can be estimated to be of order w2 w 4 r .
Footnotes
Acknowledgments
D.E.K. Martin was supported in this research by the National Science Foundation under Grant DMS-1107084. L. Noé was supported by a CNRS Mastodons grant, and benefited from a half-time course buyout from the French Institute for Research in Computer Science and Automation (Inria).
Author Disclosure Statement
No competing financial interests exist.
