Isometric words are those words whose occurrence as a factor in a transformation of a word u in a word v can be avoided, while preserving the minimal length of the transformation. Such minimal length refers to a distance between u and v. In the literature, isometric words have been considered with respect to the Hamming distance and the Lee distance; the former especially for binary words, while the latter for k-ary words, with . Ham- and Lee- isometric words have been characterized in terms of their overlaps with errors. In this paper, we give algorithms to decide whether a word f, of length n, is Ham- or Lee-isometric and provide evidence of the possible non-isometricity by returning a pair of words of minimal length whose transformation cannot avoid the factor f. Such a pair of words is called a pair of witnesses and the minimal length of the witnesses is called the index of f. The algorithms run in time with a preprocessing of time and space to construct a data structure that allows answering queries on the suffix tree of f in constant time. The correctness of the algorithms lies on some theoretical results on the index and the witnesses of a word that are here presented. The investigation on the index is completed by the characterization of words with minimum/maximum index. All the results are shown referring to both Hamming and Lee distance.
The notion of isometric word has been introduced in the framework of the research on hypercubes and, more in general, on k-ary n-cubes. The k-ary n-cube is one of the most attractive interconnection network for parallel computer systems. With the aim of providing a class of subgraphs of the hypercube having a considerably smaller size, still maintaining some metric properties, Hsu introduced the Fibonacci cubes [13]. They are they subgraphs of the hypercube restricted to vertices associated with binary words that do not contain 11 as a factor. Fibonacci cubes are isometric subgraphs of . They received a lot of attention afterwards (see [15] for a survey) and they have been then extended to define the generalized Fibonacci cube [14], as the subgraph of the hypercube restricted to the vertices associated with binary words avoiding the word f as a factor, i.e. f-free binary words. In this framework, a binary word f is said isometric when, for any , can be isometrically embedded into , and non-isometric, otherwise [16]. For example, the word 11 is isometric.
Observe that, in the binary case, the distance between two vertices in the hypercube coincides with their Hamming distance. Hence, isometric binary words can be characterized ignoring hypercubes and adopting a point of view closer to combinatorics on words. A binary word f is isometric if and only if for any integer and any pair of words u and v of length d which do not contain the factor f, u can be transformed in v by exchanging one by one the bits on which they differ and generating only words which do not contain f. Differently saying, this transformation is composed by single steps transforming a word in another at Hamming distance 1. We will call it an f-free Ham-transformation and the resulting isometric words, Ham-isometric words. When moving from binary to k-ary alphabets, with , the hypercubes are replaced by the k-ary n-cubes where the vertices are k-ary words of length n. In this case, the distance between two vertices is no more captured by the Hamming distance, but by the Lee distance. Hence, in an analogous way, f-free Lee-transformations and Lee-isometric k-ary words have been introduced; see [4], and [3] on quaternary words. Recently, isometric words have been introduced and investigated in [1,2] referring to an edit distance based on swap and mismatch errors. Also, binary non-isometric words have been considered in the two-dimensional setting, and bad pictures have been investigated [7]. See [9] for such results in a common framework.
Deciding whether a word is Ham-isometric (Lee-isometric, resp.) can be efficiently done using a characterization of Ham-non-isometric (Lee-non-isometric, resp.) words as the ones showing a particular overlap with errors, called 2-Ham-error overlap (2-Lee-error overlap, resp.). A 2-Ham-error overlap (2-Lee-error overlap, resp.) of a word f is a prefix of f whose Hamming (Lee, resp.) distance from the suffix of same length is exactly 2. The evidence that a given word f is Ham-non-isometric (Lee-non-isometric, resp.) can be given by exhibiting a pair of f-free words whose Ham-transformations (Lee-transformations, resp.) cannot avoid the factor f. Such pair of words is called a pair of Ham-witnesses (Lee-witnesses, resp.) and the minimal length of the witnesses is called the Ham-index (Lee-index, resp.) of f. The main goal of this paper is in efficiently deciding whether a word is isometric or not, computing its index and providing a pair of witnesses of minimal length. All the results and considerations will be given for k-ary words, for any , and referring to both Hamming and Lee distance. With the sake of readability, when a sentence or statement holds true in general for both distances, we will sometimes omit to specify the distance (Hamming or Lee distance).
The notion of Lee-isometric word equals that of Ham-isometric one for the first values of the cardinality k of the alphabet, . Actually, the Lee and Hamming distances coincide in these cases. Observe that Ham-isometric words exist on any k-ary alphabet, with . On the contrary, when , no Lee-isometric word exists and any Lee-non-isometric word has Lee-index equal to its length [4]. Therefore, the computation of the Ham-index is proposed for any , while the results on the Lee-index are stated only for . A “standard” construction of Ham-witnesses for a Ham-non-isometric word was given in [18] for binary words, and generalized to k-ary words in [3,4]. A pair of Ham-witnesses is constructed in correspondence to any 2-Ham-error overlap. To complete the scenario, we provide a “standard” construction of Lee-witnesses for a Lee-non-isometric quaternary word. In this case, a new situation must be handled; it is when the Lee distance 2 is realized by an error in a unique position, rather than by two errors in two distinct positions. In this case, a pair of Lee-witnesses can be directly computed from the word, without going through its binary representation, as it was in [3]. As a main result, we show that such “standard” witnesses are minimal in length, that is the index of f can be computed as their minimal length, taken over all 2-error overlaps of f. This result is shown in Theorem 2 referring to the Hamming distance and in Theorem 3 referring to the Lee distance. The construction of such standard witnesses allows to restate some lower and upper bounds on the Ham-index (Lee-index, resp.) of f, already given in [3,4]. The investigation on the index of a non-isometric word is completed by the characterization of words which attain the bounds, i.e., which exhibit a minimum/maximum index.
The above mentioned results (see Section 3) have been then applied to design algorithms to efficiently decide whether a word is isometric, to compute its index and to provide some witnesses, if the word is non-isometric. More precisely, we will first show Algorithm1, that computes the Ham-index of a k-ary word, with , and yields a pair of Ham-witnesses of minimal length, when the word is Ham-non-isometric. An analogous Algorithm2 is then provided when and the Lee-distance is concerned. Note that an algorithm for computing the index of a binary word and a pair of its witnesses is given in [18] referring to the Hamming distance; it runs in time on a binary word of length n. The algorithms designed in this paper run in time and space. Some implementing ideas – namely the Kangaroo method – take inspiration from algorithms provided in a recent paper [8] to check whether a word is Ham-isometric and Lee-isometric in time and space. These algorithms check if a word f has at least one q-error overlap, for some integer q, while our algorithms continue finding and analysing all 2- error overlaps to provide the additional information, as said before, while preserving time and space complexity.
The efficiency of Algorithm 1 and Algorithm 2 relies on the Kangaroo method [12,17]. This method is generally used in designing algorithms for efficient pattern matching with mismatches. It allows finding the mismatches in a given alignment of two words by “jumping” from one error to the next, as here explained. Recall that a 2-Ham-error overlap of is a suffix of f that differs from the prefix of the same length in exactly 2 positions; while a 2-Lee-error overlap of is a suffix of f at Lee distance 2 from the prefix of the same length. In both cases, it can be detected comparing each suffix of f, starting at position i, for , with the string itself, that can be considered as the suffix starting at position 0. The key observation is that the first position where the suffixes (possibly) differ is the position just after the longest common prefix of the two suffixes. Similarly for the second error position. Suffix trees are a well-known data structure to handle the suffixes of a given word. In particular, the longest common prefix of two suffixes can be obtained as the Lowest Common Ancestor () of the two corresponding leaves in the tree. A query on the suffix tree can be answered in constant time and space, if the suffix tree is replaced by some equivalent arrays (the Suffix Array, its inverse, and the Longest Common Prefix Array [10]) and a constant time algorithm to answer Range Minimum Queries () is available [11].
The algorithm for computing the Lee-index is obtained adapting the one for the Ham-index to the new distance, without changing its complexity. Hopefully, these algorithms could be modified, to compute the index and related witnesses of a word, referring to other distances, besides Hamming and Lee ones.
A preliminary and partial version of this paper is [5].
Isometric words
This section is devoted to establishing the definitions and notation about Ham-isometric and Lee-isometric words.
Let Σ be an alphabet and . Throughout the paper, Σ will be identified with , the ring of integers modulo k. A word (or string) of length n is , where are symbols in Σ. The set of words over Σ of length n is denoted . Let denote the symbol of f in position i, i.e. . Then, , for , is a factor of f. A word is said f-free if it does not contain f as a factor. The prefix of f of length l is ; while the suffix of f of length l is . When then is referred to as an overlap of f of length l.
Note that, throughout the paper, a word is indexed starting from 1, that is , according to the literature on isometric words, unless for Section 5, devoted to the algorithms, where f is indexed starting from 0, that is , according to usual programming languages.
Let , and , be two words of the same length. Then, the Hamming distance between u and v is the number of positions at which u and v differ. The Lee distance between u and v is
where Σ is identified with .
In the sequel, Σ will denote a generic alphabet of cardinality k, while Δ will denote the quaternary alphabet , referred to as the genetic alphabet. Symbols A and T (C and G, resp.) will be called complementary symbols, in analogy to the Watson-Crick complementary bases they represent. The alphabet Δ will be identified with , in such a way that A, C, T, and G will be identified with 0, 1, 2, and 3, respectively. Therefore, pairs of complementary symbols have Lee distance 2, whereas pairs of distinct non-complementary symbols have Lee distance 1.
Let us now define Ham-isometric words on a generic k-ary alphabet with . The definitions are based on the process of transforming a word in another one of equal length, changing one symbol at a time.
Let Σ be a k-ary alphabet, , and .
A Ham-transformation of length h from u to v is a sequence of words such that , , and for any , .
The Ham-transformation is f-free if for any , the word is f-free.
The word f is Ham-isometric if for all and f-free words u, , there is an f-free Ham-transformation from u to v of length equal to . A word is Ham-non-isometric if it is not Ham-isometric.
A pair of words u, v is referred to as a pair of Ham-witnesses for f, if u and v are f-free words and there does not exist an f-free Ham-transformation from u to v of length equal to .
If f is Ham-non-isometric then its Ham-index, denoted by , is the shortest length d of u, v such that is a pair of Ham-witnesses for f. The Ham-index of a Ham-isometric word is defined ∞.
Let , , be words in . The distance , since u and v differ in positions 2 and 3. Hence, there are only two possible Ham-transformations from u to v of length 2. They are the following ones, , , , and , , . Neither transformation is f-free. Therefore f is Ham-non-isometric and is a pair of Ham-witnesses for .
Analogous definitions can be given by considering the Lee distance (cf. [3,4]). They can be essentially obtained by substituting “Ham” with “Lee” in Definition 1. Nevertheless, the definitions are stated here below for the sake of completeness.
Let Σ be a k-ary alphabet, , and .
A Lee-transformation of length h from u to v is a sequence of words such that , , and for any , .
The Lee-transformation is f-free if for any , the word is f-free.
The word f is Lee-isometric if for all and f-free words u, , there is an f-free Lee-transformation from u to v of length equal to . A word is Lee-non-isometric if it is not Lee-isometric.
A pair of words u, v is referred to as a pair of Lee-witnesses for f, if u and v are f-free words and there does not exist an f-free Lee-transformation from u to v of length equal to .
If f is Lee-non-isometric then its Lee-index, denoted by , is the shortest length d of u, v such that is a pair of Lee-witnesses for f. The Lee-index of a Lee-isometric word is defined ∞.
Note that, for any and , there exists an f-free Ham-transformation (Lee-transformation, resp.) from u to v if and only if there exists an f-free Ham-transformation (Lee-transformation, resp.) from v to u.
Let Δ be the quaternary genetic alphabet, , , and . Observe that , since u and v differ in their third position only and . The sequences , , and , , are two Lee-transformations from u to v of length equal to ; they are not f-free. Actually, no f-free Lee-transformation exists from u to v. This shows that is Lee-non-isometric and that is a pair of Lee-witnesses for .
Let and let be two f-free words. Consider any f-free Ham- (Lee-, resp.) transformation from u to v of length equal to (, resp.). Then, only symbols in the positions where u and v differ are modified in this transformation. Moreover, at each step of the Ham- (Lee-, resp.) transformation, a symbol x can be replaced by y only if (, resp.). Furthermore, in a Lee-transformation, each position i such that is replaced exactly h times.
Let Σ be a k-ary alphabet and consider the case . In this case, for any words of the same length, and any Lee-transformation from u to v is itself a Ham-transformation, and vice versa, thus preserving length and f-freeness. Hence, a word f is Ham-non-isometric if and only if f is Lee-non-isometric and .
For the next definition refer to Fig. 1 showing a q-Ham-error overlap of f of length l and shift r with , , i and j error positions, such that and .
Let Σ be a k-ary alphabet, , and q be an integer, . The word f has a q-Ham-error (q-Lee-error, resp.) overlap of length l, , if (, resp.). Its shift is ; its error positions are the m positions in , , where differs from .
The word f and its 2-Ham-error overlap of length l and shift r.
Let in ; the reader can refer to Fig. 2. The word f has two 2-Ham-error overlaps. The first one has length , shift , number of error positions and error positions , . The second one has length , shift , number of error positions and error positions , .
The word and its 2-Ham-error overlaps in Example 3.
Using the notations in the previous definition, if f has a q-Ham-error overlap of length l, then . Otherwise, if f has a q-Lee-error overlap of length l, then . In particular, when and , then or . The case holds if and differ in exactly one position and the error is given by a pair of complementary symbols. For example, has a 2-Lee-error overlap of length . Indeed, and .
If then and differ in two different positions i and j and the errors are given by pairs of non-complementary symbols. For example, has a 2-Lee-error overlap of length . Indeed, and .
Next theorem, proved in the binary case in [16] and in [3,4] for the general case, provides a characterization of Ham- and Lee-isometric words, which is fundamental to test whether a word is Ham- or Lee-isometric. Furthermore, it allows us to restrict the investigation on Lee-isometric words to k-ary alphabets with .
f is Ham-isometric if and only if it has no 2-Ham-error overlap.
f is Lee-isometric if and only if it has no 2-Lee-error overlap, when.
f is never Lee-isometric, when.
The proof of Theorem 1 is constructive. Indeed, if f has a 2-Ham-error overlap, the proof explicitly provides a pair of Ham-witnesses for f, while if and f has a 2-Lee-error overlap, the proof suggests a pair of Lee-witnesses for f going through its binary representation.
Computing the Ham- and the Lee-index
The index of a non-isometric word has been introduced as the threshold on the length of words that witness the “non-isometricity” of the word. In this section, we are going to show some results on how computing the index and some “minimal” pairs of witnesses for a non-isometric word. The investigation is carried on isometricity defined in terms of both the Hamming distance and the Lee distance, and on k-ary alphabets with any . The results prove the correctness of the algorithms in Section 5.
The computation of the index is based on the construction of some pairs of witnesses for f, as given in [3,4] to show Theorem 1. The construction generalizes the one given in [18] for binary words with Hamming distance. It will turn out that the index is the minimal length of such witnesses. Let us revisit the construction, while highlighting some valuable features and adding some new results. First, let us state the following definitions. Refer to Figs 3 and 4 for the construction of , , , and , where are symbols such that and .
Let Σ be a k-ary alphabet and have a 2-Ham-error (2-Lee-error, resp.) overlap of shift r and error positions i, j, with . Then,
is the word obtained from f replacing by .
if r is even and , then is the word obtained from f replacing with , and by .
The words , , , and corresponding to the 2-Ham-error (2-Lee-error, resp.) overlap of f of shift r are
and .
if r is even, and .
The words and corresponding to the 2-Ham-error overlap of shift r in Fig. 1.
Let Σ be a k-ary alphabet and have a 2-Lee-error overlap of shift r and error positions i, j, with . Then,
is the word obtained from f replacing with
is the word obtained from f replacing with .
The words and corresponding to the 2-Lee-error overlap of f of shift r are
and
The words and where , and .
The words introduced in the previous definitions will be useful to determine some pairs of witnesses for a non-isometric word f (cf. Propositions 2 and 3). In fact, if f has a 2-Ham-error overlap or a 2-Lee-error overlap with two (distinct) error positions, then the “standard” pair of witnesses for f is . However, may not be f-free; in this case, it becomes necessary to consider the pair . More exactly, it turns out that contains f as a factor if and only if the following holds [3]. On the contrary, the pair (, ) is used when the 2-Lee-error overlap has two coincident error positions.
Let Σ be a k-ary alphabet and have a 2-Ham-error overlap (2-Lee-error overlap, resp.) of shift r and error positions i, j, with . The 2-Ham-error overlap (2-Lee-error overlap, resp.) satisfies if
Let Σ be a k-ary alphabet andhave a 2-Ham-error overlap (2-Lee-error overlap, resp.) of shift r and error positions i, j, with.
The following statements are equivalent
is not f-free
the 2-Ham-error overlap (2-Lee-error overlap, resp.) of shift r satisfies
for some word, ρ suffix of w and σ prefix of w, symbols, with(, resp.), and integers.
The proof is given only for 2-Ham-error overlaps. The case of 2-Lee-error overlaps is analogous; just replace the Hamming distance with the Lee distance.
(1. ⇒ 2.)
The proof goes analogously to the proof of Claim 2 of Lemma 2.2 in [18]. Assume that is not f-free and denote the copy of f that occurs in as . Suppose occurs at position and let . Let , , and with and .
Suppose . One has . And , this last equality holding since .
Suppose, now, . One has , this last equality holding since . And .
Hence and , that is , is even and . Therefore, , for and .
(2. ⇒ 3.)
The proof goes analogously to the proof of Lemma 2.1 in [19]. Suppose that the 2-Ham-error overlap of length l, shift and error positions i, j, with satisfies . Let and . Since , then . Let us divide and by . Assume that and , where , and . Then , , and by the equalities in it holds:
Denote by w the factor of f from position to position (if then w is empty), and . Then and .
(3. ⇒ 1.)
Suppose that for some word , ρ suffix of w and σ prefix of w, symbols , with and integers . The structure of f implies that f has a 2-Ham-error overlap of shift with error positions and .
Consider the corresponding word . From Definition 4, and, hence, . Therefore, f occurs in starting from position and is not f-free. □
The following lemma states an upper bound on the number of 2-error overlaps that satisfy for a non-isometric word. It will be useful in some subsequent proofs.
Let Σ be a k-ary alphabet withand letbe a Ham-non-isometric word (Lee-non-isometric, resp.). Then f has at most one 2-Ham-error (2-Lee-error, resp.) overlap that satisfies.
Assume by contradiction that f has two different 2-Ham-error (2-Lee-error, resp.) overlaps that satisfy . Then, by Proposition 1, there exist , (), , (, resp.) and (, resp.), such that f has two different representation forms:
where (, resp.) () is a suffix (a prefix, resp.) of .
Some considerations analogous to the ones used in the proof of Lemma 4.2 in [19], show that necessarily and . Hence, it holds , , , and and are the same representation form of f. □
Consider a Ham-non-isometric word . Then, f has a 2-Ham-error overlap. Actually, it may have several 2-Ham-error overlaps. Next proposition shows that for any 2-Ham-error overlap of f there is a corresponding pair of Ham-witnesses for f.
Let f be a Ham-non-isometric k-ary word. Consider a 2-Ham-error overlap of f, of shift r, and error positions i, j, with.
If the 2-Ham-error overlap does not satisfythenis a pair of witnesses for f
If the 2-Ham-error overlap satisfiesandthenis a pair of witnesses for f
If the 2-Ham-error overlap satisfiesandthen f has a 2-Ham-error overlap of shift, which does not satisfyandis a pair of witnesses for f.
Let us explicit the value of the shift , as stated in item 3 of Proposition 2. Referring to the proof of 2. ⇒ 3. in Proposition 1, is given by .
Let us show some examples of the construction in Proposition 2 for a ternary and a binary alphabet, respectively.
Consider the ternary alphabet . Let . The word f has a 2-Ham-error overlap of shift . Hence, is Ham-non-isometric from Theorem 1. Following Proposition 2, let us exhibit a pair of Ham-witnesses for f. We have that disagrees from in positions and , and , , and . Then, and . The words and are f-free words and . Moreover, there is no f-free Ham-transformation from to of length . Indeed, if is replaced by then f occurs in at position 2; if is replaced by then f occurs at position 1.
Consider the binary alphabet . Let . The word f has a 2-Ham-error overlap of shift . Hence, is Ham-non-isometric from Theorem 1. Note that the 2-Ham-error overlap of f satisfies . Indeed, we have , , , and . Therefore, following Proposition 2, item 2., set and consider the two words and . The words and are f-free words and . Moreover, there is no f-free Ham-transformation from to of length . Indeed, if is replaced by then f occurs in at position 3; if is replaced by then f occurs at position 1 and if is replaced by then f occurs at position 4.
Let us now consider Lee-non-isometric words.
Consider the case . According to Remark 2, for any words of same length, and any Lee-transformation from u to v is itself a Ham-transformation, and vice versa, thus preserving length and f-freeness. Hence, Proposition 2 allows also to construct some pairs of Lee-witnesses for a Lee-non-isometric binary or ternary word f. Namely, it is possible to associate a pair of Lee-witnesses for f with any 2-Lee-error overlap of f. Note that, in this case, any 2-Lee-error overlap of f is given by two distinct error positions. This is no more true when . The case is considered in what follows.
Consider the case of and the quaternary alphabet . The construction of a pair of Lee-witnesses for a Lee-non-isometric quaternary word f is obtained in [3] referring to the binary representation of f. Actually, there is an isomorphism between quaternary words and binary words of even length. It is given by the map which associates to A, C, T, G, the binary words 00, 01, 11, 10, respectively. Hence, a word will be possibly denoted as to stress its belonging to the quaternary alphabet, while its binary representation in will be denoted as . The correspondence preserves the Lee distance.
Let be a Lee-non-isometric word, r be the shift of a 2-Lee-error overlap of and m the number of its error positions. Recall that or ; see Remark 3. Note that has a 2-Lee-error overlap of shift . A pair of Lee-witnesses for is obtained in [3,4] considering a pair of Lee-witnesses for , constructed as in Proposition 2, and representing it in quaternary. Recall that , , , and . For example, if the 2-Lee-error overlap of shift of fills case 1 of Proposition 2 then is a pair of witnesses for . Thus, the quaternary representation of this pair of witnesses for , that is , is a pair of Lee-witnesses for .
Next proposition shows that, indeed, the pair of Lee-witnesses for as just obtained from a 2-Lee-error overlap can be directly constructed from , without going through its binary representation. Therefore, let us simply denote by f the quaternary word. Example 6 shows both constructions.
Let f be a Lee-non-isometric k-ary word,. Consider a 2-Lee-error overlap of f of shift r and m error positions i, j, with(iff).
If, thenis a pair of Lee-witnesses for f.
Ifthen
if the 2-Lee-error overlap does not satisfy, thenis a pair of Lee-witnesses for f
if the 2-Lee-error overlap satisfiesand, thenis a pair of Lee-witnesses for f
if the 2-Lee-error overlap satisfiesand, then f has a 2-Lee-error overlap of shift, which does not satisfyandis a pair of Lee-witnesses for f.
Consider , the binary representation of f. Since f has a 2-Lee-error overlap of shift r, then has a 2-Lee-error overlap of shift with error positions s, z and .
In the case , we have that and that and are two complementary symbols. Moreover, we have and . Suppose and ; the other cases can be treated analogously. We have , , , . Therefore, the quaternary representation of (, resp.) coincides with the word obtained from f by replacing the symbol A in position i with the symbol G (C, resp.), that is (), resp.). Moreover, , represented in the quaternary alphabet, coincides with . Hence, and . In [3], it is proved that the quaternary representation of is a pair of Lee-witnesses for f. Hence, the thesis follows.
In the case , we have that , and ( and , resp.) differ because of two non-complementary symbols.
If the 2-Lee-error overlap does not satisfy then the quaternary representation of (, ) is a pair of Lee-witnesses for f; see [4]. Suppose , , and ; the other cases can be treated analogously. We have and , , , , . Therefore, (, resp.), represented in the quaternary alphabet, coincides with the word obtained from f by replacing T (G, resp.) in position i (j, resp.) with C (A, resp.), that is (, resp.). Moreover, , represented in the quaternary alphabet, coincides with and the thesis follows.
If the 2-Lee-error overlap of f of shift r satisfies and then the 2-Lee-error overlap of of shift satisfies . The proof is given for and ; the other cases can be treated analogously. Then, and . Since the 2-Lee-error overlap of f of shift r satisfies , then r is even, . Therefore, is even and it is equal to the half of , the shift of the 2-Lee-error overlap of . Moreover, from and , it follows and . At last, , implies . In this case, from [4], , , represented in the quaternary alphabet, is a pair of Lee-witnesses for f. Recall that and . Then, will be obtained from by replacing with . Therefore, its quaternary representation coincides with the word obtained from f by replacing A in position i with C, that is . Analogous considerations show that and and the thesis follows.
This case can be treated as case a. Indeed, if the 2-Lee-error overlap of f of shift r satisfies and , it can be shown that f has another 2-Lee-error overlap, of shift , that does not satisfy . In particular, where are such that and , for some and . □
Let and, hence, . Then, has a 2-Lee-error overlap of length and shift . In this case and is the unique error position with and , that is the error is caused by two complementary symbols. A pair of witnesses for can be constructed from a pair of witnesses for , as follows. Consider the 2-Lee-error overlap of of even length and shift . It is caused by two different error positions, and , since , while . Starting from this 2-Lee-error overlap of , the pair of Lee-witnesses for can be obtained. The pair is composed of and . In this case, the 2-Lee-error overlap of does not satisfy , since is different from . Therefore, coming back to the quaternary representation, is a pair of Lee-witnesses for .
On the other hand, this pair of witnesses for can be directly constructed from , without going through the binary representation. Following Proposition 3 in the case , , . Observe that and . Hence, and . Finally, is the same pair of Lee-witnesses for , as previously computed using the binary representation.
Let us come back to the computation of the index of a non-isometric word f. Propositions 2 and 3 show how to construct some pairs of witnesses for f, referring to the Hamming distance and to the Lee distance, respectively. Next Theorems 2 and 3 prove that such pairs of witnesses are in some sense the minimal ones and that the index can be obtained as the minimum length of such witnesses.
First, observe that Proposition 2 (Proposition 3, resp.) allows to restate in the next corollary some bounds on the Ham-index (Lee-index, resp.) of a Ham-non-isometric (Lee-non-isometric, resp.) word f, already given in [20] for the binary case and in [3] for the general case. Just observe that , , and can hold only in case of a 2-Lee-error overlap of a quaternary word with an error caused by two complementary symbols.
Let Σ be a k-ary alphabet and let. If f is a Ham-non-isometric word, then the Ham-index of f satisfies. Ifand f is a Lee-non-isometric word, then the Lee-index of f satisfies.
Let Σ be a k-ary alphabet,be a Ham-non-isometric word andbe a pair of Ham-witnesses for f of minimalamong all Ham-witnesses for f of minimum length. Let V be the set of all the positions where u and v differ and, for any, letbe the word obtained from u by replacingwith.
Then, for any, f is a factor of.
Moreover, denoted bythe interval of positions covered by f in, there exist, such that bothandcontain s and t.
Let be a pair of Ham-witnesses for f of minimal length . Let and suppose are chosen so that h is minimal among all pairs of Ham-witnesses for f of length d. Let , with , be the set of all the positions where u and v differ. Let us show that for any , f is a factor of . In fact, if were f-free then the pair would be a pair of Ham-witnesses for f of length d and contradicts the minimality of h.
Observe that contains i, because u is f-free. Moreover, it must contain at least another position in V, so that v could be f-free; let denote the smallest of such positions in . Note that can be smaller or greater than i. However, for the first and the last error positions, one necessarily has that whereas . Let be the smallest position in V, such that ; then , with .
Since contains and , with , then also contains because ; recall that and are consecutive error positions, no other error position can lay in between. Similarly, . Finally, and both contain and and therefore they can play the role of and in the statement. □
Let f be a Ham-non-isometric k-ary word andbe a pair of Ham-witnesses for f of minimal distanceamong all Ham-witnesses for f of minimal length. Then, or, is equal toor to, for some r that is the shift of a 2-Ham-error overlap of f.
Let f be a Ham-non-isometric k-ary word and . Let be a pair of Ham-witnesses of minimal distance among all Ham-witnesses of minimal length. By definition, . Let V be the set of all error positions between u and v. For any position let be the word obtained from u replacing with . Lemma 2 states that f is a factor of ; let be the interval of positions covered by f in . The minimality of the length of u implies that the union of all intervals , for , gives . Let be the interval that contains position 1 and be the interval that contains position n. Note that positions , are not supposed to be in increasing order.
If and both contain positions and of u, then f has a 2-Ham-error overlap of shift and error positions and . Then, , and , or vice versa, and the claim is proved.
Otherwise, . From Lemma 2, there exist , such that and both contain s and t. This provides a 2-Ham-error overlap of f; let z be its shift. Consider the pair . Note that in this case and then position 1, or n, does not belong neither to nor to ; suppose it is n. Therefore, the length of is strictly less than . The minimality of the length of u implies that cannot be a pair of Ham-witnesses for f. This can happen only if is not f-free. Hence, as stated in item 3. of Proposition 1, i.e. f shows a sort of “periodic decomposition”; furthermore, such a decomposition of f is unique (see Lemma 1). Consider now , , and and suppose that occurs before . Interval necessarily overlaps (and ) because (see Corollary 1) and they overlap each other with the largest possible intersection since is minimal. Looking in detail at the periodic decomposition of f stated in Proposition 1, such largest intersection has length and finally and . □
Let f be a Ham-non-isometric k-ary word. Then, the Ham-index of f,, is the minimum length of wordsor, as appropriate, where r is taken over the shifts of all 2-Ham-error overlaps of f.
Let f be a Lee-non-isometric k-ary word,, andbe a pair of Lee-witnesses for f of minimal distanceamong all Lee-witnesses for f of minimal length. Then, or, is equal to,, or to.
If , then the Lee-index equals the Ham-index (Remark 2) and Theorem 2 applies. Let and f be a Lee-non-isometric k-ary word. Let be the pair of Lee-witnesses for f of minimal distance among all Lee-witnesses for f of minimal length. Then, . Let V be the set of all error positions. If all errors are due to non-complementary symbols, then the proof follows as in the Ham-index case (Theorem 2).
Suppose now that there exists such that and are complementary symbols. Then, can be replaced in two different ways in a Lee-transformation from u to v, since there are two symbols at Lee-distance 1 from . The minimality of the distance of u and v implies that the word obtained from u replacing with any of the two symbols at Lee-distance 1, contains f as a factor. Then, position i is contained in two overlapping occurrences of f. Hence, f has a 2-Lee-error overlap with a unique error position; let r be its shift. Consider the words and . Proposition 3 states that is a pair of Lee-witnesses for f. Finally, and , or vice versa. □
Let f be a Lee-non-isometric k-ary word,. Then, the Lee-index of f,, is the minimum length of words,, or, as appropriate, where r is taken over the shifts of all 2-Lee-error overlaps of f.
Corollaries 2 and 3 suggest to compute the index of a word considering all its 2-error overlaps, obtaining for each one the corresponding pair of witnesses as in Propositions 2 and 3, and then computing the minimal length of the so obtained witnesses. The algorithms in next section will look for all possible 2-error overlaps in increasing order of their shift. Nevertheless, note that it is not possible to stop at the first found 2-error overlap. The 2-error overlap which corresponds to the witnesses of minimal length can be a subsequent one, as shown in Example 7.
Consider the ternary alphabet . Recall that in this case the Hamming and the Lee distance coincide (cf. Remark 2). For any , let and , with . It can be observed that, for any , f has two 2-error overlaps, one of length , for which and one of length , for which . Then, for any , f is non-isometric applying Theorem 1. The first pair of witnesses for f, in increasing order of shift, can be constructed starting from the 2-error overlap of length and shift . This overlap satisfies and the corresponding pair of witnesses for f is , with and . The length of this pair of witnesses is . The second pair of witnesses for f can be constructed starting from the 2-error overlap of length . This overlap does not satisfies and the corresponding pair of Lee-witnesses for f is , with and . The length of this pair of witnesses is . Thus, the 2-error overlap that corresponds to the witnesses of minimal length is the second one in increasing order of shift, and then it provides the index .
Let be a non-isometric word. Consider a pair of witnesses for f and suppose that is a pair of minimum length among all witnesses for f. This does not imply that , is a pair of witnesses for f of minimum distance. Also the converse holds. In fact, minimality in length does not coincide with minimality in distance, as shown in the following example.
Let Δ be the quaternary alphabet and . The word f has a 2-Ham-error overlap of shift and a 2-Ham-error overlap of shift .
Consider the 2-Ham-error overlap of shift ; this 2-Ham-error does not satisfy and, following Proposition 2, we have that is a pair of Ham-witnesses for f. Note that .
Now, consider the 2-Ham-error overlap of shift ; we have that disagrees from in positions and , with , , , and . Therefore, this 2-Ham-error overlap satisfies and, following Proposition 2 in the case , we have that is a pair of Ham-witnesses for f. Note that and .
Since f has no other 2-Ham-error overlap it follows that is a pair of Ham-witnesses of minimum length 8 for f, but it is not of minimum Hamming distance, since . It can be shown that the same result holds for when the Lee distance is considered.
Furthermore, in [19], it is shown the example of a binary word for which there is a unique pair of Ham-witnesses of minimum length, whose distance is not minimum, while the pair of Ham-witnesses of minimum distance is not of minimum length. The word is . It has two 2-Ham-error overlaps, the one of shift 6 that satisfies and the one of shift 10 that does not satisfy . Then, , while , and thus . On the other hand, .
Characterization of words of minimum/maximum index
Corollary 1 provides some bounds on the index of non-isometric words. Indeed, the Ham-index of a Ham-non-isometric word f satisfies and the Lee-index of a Lee-non-isometric word f satisfies the same bounds . In this section, we show that the bounds are tight and explicitly characterize the words of minimum and maximum index.
Let us start with the characterization of the k-ary Ham-non-isometric words f with minimum Ham-index equal to .
Let Σ be a k-ary alphabet withand letbe a Ham-non-isometric word. Then,if and only if, for some,with.
Suppose . Theorem 2 implies that or , for some 2-Ham-error overlap of shift r. Recall that and thus it cannot be since r is an integer. Therefore, holds. Since , and f has a 2-Ham-error overlap of length with error positions i, j, . This implies that while for some symbols x, y with , and that while for some symbol z with . Therefore, .
Suppose now that , for some , with and . Then, f has a 2-Ham-error overlap of length and shift . This 2-Ham-error overlap does not satisfy , because is not even. From Proposition 2, is a pair of Ham-witnesses for f. Note that and since, from Corollary 1, . □
The following proposition characterizes Ham-non-isometric words with maximum Ham-index.
Let Σ be a k-ary alphabet withand letbe a Ham-non-isometric word. Then,if and only if
f has exactly one 2-Ham-error overlap and it satisfies
and
, for someand,.
Suppose, first, that . From Theorem 2, or , for some 2-Ham-error overlap of shift r. In fact cannot hold since and, then, . Recall that any pair of Ham-witnesses corresponds to a 2-Ham-error overlap that does not satisfy . Therefore, f has no such 2-Ham-error overlaps, that is f has only 2-Ham-error overlaps that satisfy . From Lemma 1, f has exactly one 2-Ham-error overlap that satisfies and the corresponding pair of Ham-witnesses, , is such that . Let i, j be the error positions of this 2-Ham-error overlap. From Proposition 1, , with , , , , and ρ (σ, resp.) a suffix (a prefix, resp.) of w. Moreover, if (i.e. ) then f would have a 2-Ham-error overlap that does not satisfy (cf. Remark 4), against the hypotheses. Therefore, and . Moreover, and then . Therefore, and this implies and , i.e. .
Suppose, now, that , for some and , and that f has only 2-Ham-error overlaps that satisfy . From Lemma 1, f has exactly one 2-Ham-error overlap that satisfies . Since , this is the 2-Ham-error overlap with shift and error positions and . From Proposition 2, the pair is a pair of Ham-witnesses for f. Note that . Since f has no other 2-Ham-error overlap, from Theorem 2, . □
Let us now consider the Lee-index. Recall that the Lee-index is equal to the Ham-index, when (Remark 2). On the other hand, the Lee-index of a word is always equal to its length, when . Hence, consider the case and . The following proposition characterizes the Lee-non-isometric words with minimum Lee-index.
Letbe a Lee-non-isometric word. Then,if and only if
, for some,with
or
, for some,with.
Suppose . Propositions 3 states that or or , for some 2-Lee-error overlap of shift r and error positions i, j, with ( iff ).
Recall that and thus it cannot be , since r is an integer.
If , then the 2-Lee-error overlap of shift r has two different error positions (i.e. ). Moreover, since then and f has a 2-Lee-error overlap of length with two different error positions . This implies that while for some symbols x, y with , and that while for some symbol z with . Therefore, .
On the contrary, if , then the 2-Lee-error overlap of shift r has only one error position (i.e. ). Moreover, since , then and f has a 2-Lee-error overlap of length with only one error position i. This implies , for some with .
Suppose now that , for some , with . Then, f has a 2-Lee-error overlap of length and shift . This 2-Lee-error overlap does not satisfy , because is not even. From Proposition 3, case , is a pair of Lee-witnesses for f. Note that and , since, from Corollary 1, .
Suppose now that , for some , with . Then f has a 2-Lee-error overlap of length and shift , with only one error position. From Proposition 3, case , is a pair of Lee-witnesses for f. Note that and , since, from Corollary 1, . □
Next proposition characterizes Lee-non-isometric words over Δ which have maximum Lee-index.
Letbe a Lee-non-isometric word. Thenif and only if
f has exactly one 2-Lee-error overlap and it satisfies
and
, for someand,or, for someand, with.
Suppose, first, that . Theorem 3 implies that or or , for some 2-Lee-error overlap of shift r, and error positions i, j, with ( iff ).
In fact, cannot hold since and, then, .
If then the 2-Lee-error overlap of shift r has only one error position (i.e. ). Moreover, since , if then and f has a 2-Lee-error overlap of length 1 with only one error position. This implies , for some and , with
If, instead, , then the 2-Lee-error overlap of shift r has two different error positions (i.e. ) and it satisfies . A reasoning analogous to the one used in the proof of Proposition 5 shows that , for some and , .
Note that the case rules out the case and vice versa. In fact if , then the first and the last symbol of f have Lee-distance equal to 2 whereas, if , then they have Lee-distance equal to 1.
Moreover, in both cases, f has exactly one 2-Lee-error overlap, either the one that allows the construction of or the one that allows the construction of . If f would have another different 2-Lee-error overlap, this would give a pair of Lee-witnesses for f of length less than , contradicting .
Suppose now that f has exactly one 2-Lee-error overlap and that , for some and , with . Then, the unique 2-Lee-error overlap of f is the one length 1 and shift . From Proposition 3, case , is a pair of Lee-witnesses for f. Note that and, since f has no other 2-Lee-error overlap, it follows (cf. Theorem 3).
If , for some and , , then f has the 2-Lee-error overlap of shift and error positions and which satisfies . Since f has exactly one 2-Lee-error overlap, . Note that and finally . □
Algorithms for computing index and witnesses
The results given in Section 3 are now applied to design an algorithm that computes the Ham-index of a word and yields a pair of Ham-witnesses of minimal length, when the word is Ham-non-isometric. An analogous algorithm is then provided when the Lee-distance is concerned instead of the Hamming one. Recall that if f is isometric then its index is ∞; otherwise, it is the minimal length of two words u, v, such that is a pair of witnesses for f.
Assume that Σ is a k-ary alphabet with and is a word of symbols in Σ, where n is the length of f and ’s are its symbols. Note that in this section the positions of f start from 0, and not from 1, according to usual notation in programming languages. Following the definitions in Section 2, the suffix of f starting at position i, is denoted by , where the subscript refers to its length.
The algorithm for computing Ham-index and Ham-witnesses
Let us sketch an algorithm which inputs a k-ary non-empty word f of length n and outputs an integer I, that is the Ham-index of f, and, when the word is Ham-non-isometric, a pair of Ham-witnesses for f of length I. Its pseudo-code is given in Algorithm 1 and an example follows.
Description
Algorithm 1 starts with a call to function TwoErrorOverlaps that looks for all 2-Ham-error overlaps of f, using the Kangaroo method [12,17]. The Kangaroo method finds the mismatches in a given alignment of two words by “jumping” from one error to the next, as here explained.
Recall that a 2-Ham-error overlap of f is a suffix of f that differs from the prefix of the same length in exactly 2 positions. For each position i, with , TwoErrorOverlaps compares the suffix starting at i with . Note that the first position where the suffixes and (possibly) differ is the position just after the longest common prefix of the two suffixes. More precisely, , where for any , a call to returns the length of the longest common prefix of suffixes and . Then, the second position where the suffixes and (possibly) differ can be obtained as the position just after the longest common prefix of the suffixes and .
The function TwoErrorOverlaps uses a variable l which gives the length of the current overlap of and and a variable d which contains their current Hamming distance. Variable d is increased when a mismatch has been found, until and , meaning that has been entirely compared with and exactly 2 mismatches have been found. More precisely, TwoErrorOverlaps with input f finds all 2-Ham-error overlaps of f and stores their lengths in the list . At the same time, for each 2-Ham-error overlap found, the corresponding error positions are kept in the list , which is, therefore, a list of lists. The function works similarly as the first algorithm in [8], which instead stops when a 2-Ham-error overlap is detected.
If no 2-Ham-error overlap of f is found, then Algorithm 1 sets the Ham-index I to ∞ and returns I and the pair .
Otherwise, f is Ham-non-isometric and I is initialized as (in Line 1), since the Ham-index of f is upper bounded by in Corollary 1. Then, for each 2-Ham-error overlap, the algorithm constructs a pair of Ham-witnesses calling the function WitnessesConstructor and possibly update i and . According to Proposition 2 and Theorem 2, the construction depends on whether the is satisfied (see Definition 6). If it is not satisfied, the pair may be constructed according to case 1. of Proposition 2. When is satisfied, if , the pair will be constructed as in case 2. Otherwise, it is case 3. and there is nothing else to do because it is proved that f always has another 2-Ham-error overlap of shift bigger than r, and thus the pair will be constructed in a subsequent call of the function.
The function CondPlus inputs the word f, the integers r, i and j, where i and j are the error positions in the 2-Ham-error overlap of shift r, and outputs if is verified, otherwise. Note that can hold only if r is even. If r is even, then the function checks the other conditions in , , and . In particular, is iff . This check is done testing if , rather than by comparing all the symbols.
Algorithm 1 computes step by step I as the minimal length of all the Ham-witnesses considered, and it stores in a pair of Ham-witnesses of such length. At the end, I is the Ham-index of f, according to Theorem 2, and Algorithm 1 outputs I and a corresponding pair of Ham-witnesses.
Computing the Ham-index and Ham-witnesses for f
Analysis
The efficiency of Algorithm 1 mostly depends on the computation of , i.e., the length of the longest common prefix of the suffixes of f starting at position i and j, respectively. Recall that gives the next error position in an alignment of the two suffixes, following the Kangaroo method. Let us sketch how can be computed in constant time, using some additional data structure for f and its suffixes taking space and time to be constructed. This allows to check, for a given position i in f, whether f has a 2-Ham-error overlap of length in time .
Suffix trees are a well-known data structure to manage the suffixes of a given word. The suffix tree of a word has n leaves labelled 0, 1,…, . For , the leaf i represents the suffix which is the label of the path from the root to leaf i. Actually, for , , where is the depth of the lowest common ancestor of leaves i and j in the suffix tree of f. To improve the efficiency of the algorithm, the information about the suffixes of f will not be handled by a suffix tree, but by some linear space data structures, namely, the Suffix Array (), its inverse () and the Longest Common Prefix () array (see [10] for the definitions). J. Fischer and V. Heun have shown that (or , equivalently) can be computed in constant time, using such data structures and a constant time algorithm to ask Range Minimum Queries () [11]. In fact, the following formula holds:
Let Σ be a k-ary alphabet,, andbe a non-empty word of length n.
Algorithm
1
with input f computes the Ham-index of f and (possibly) a pair of Ham-witnesses of minimal length for f intime with additionalspace.
Let us analyse Algorithm 1, starting from its functions.
TwoErrorOverlaps inputs the word f and outputs all lengths of its 2-Ham-error overlaps in the list and all corresponding error positions in the list . Applying the Kangaroo method, a call to can be done in time, with additional space. For a fixed i, TwoErrorOverlaps makes at most two calls to and hence, it can check whether f has a 2-Ham-error overlap of length in time . Finally, TwoErrorOverlaps runs in time and space.
CondPlus inputs the word f, the integers r, i and j, where i and j are the error positions in the 2-Ham-error overlap of shift r. It verifies if r is even and if conditions , , and hold true. Note that conditions , , and can all be checked in time. In particular, , i.e., , is verified by a single call to , that can be executed in time, with the aforementioned data structures for f. Hence, the overall time complexity of CondPlus is .
WitnessesConstructor inputs the word f, the length l of a 2-Ham-error overlap of f, its corresponding error positions in the list and outputs a pair of Ham-witnesses for f. According to Proposition 2, the function constructs the pair in two different ways, following that CondPlus returns or with . Since CondPlus runs in time and all the other instructions can be executed in constant time, the time complexity of WitnessesConstructor is .
Let us now complete the analysis of Algorithm 1. The main algorithm contains one call to TwoErrorOverlaps and at most calls to WitnessesConstructor, since the 2-Ham-error overlaps of f are at most . Therefore, the overall time complexity of Algorithm 1 can be obtained as the sum of the cost of TwoErrorOverlaps and times the cost of WitnessesConstructor. Thus, it is .
The space complexity of Algorithm 1 is due to the data structures needed to run in constant time, both in TwoErrorOverlaps and in CondPlus; thus it is . □
Let us run Algorithm 1 to compute the Ham-index and a pair of Ham-witnesses for (as in Example 3). It starts calling the function TwoErrorOverlaps with input . This function finds two 2-Ham-error overlaps. The first one is of length 4, where the errors are in positions 0, 2; in fact, and . The second one is of length 3, where the errors are still in positions 0, 2; in fact, and . Thus, TwoErrorOverlaps outputs and . Then, coming back to Line 1 of the main algorithm, because is not empty, the algorithm initializes the output variable . For to 1 the algorithm calls the function WitnessesConstructor.
The first call takes as input and sets , , , and after calling CondPlus. Thus, the function WitnessesConstructor computes and outputs and , obtained as and . Since , the algorithm updates and . The second call to WitnessesConstructor takes as input and sets , , , and after calling CondPlus. Indeed, r is even and the conditions , , and are . Thus, the function outputs and , obtained as and . Since is greater than the algorithm does not update neither I nor . Therefore, the main algorithm outputs the Ham-index and the pair of Ham-witnesses .
The algorithm for computing Lee-index and Lee-witnesses
This section shows an algorithm which inputs a quaternary non-empty word f of length n; it outputs an integer I, the Lee-index of f, and a pair of Lee-witnesses for f of length I, when the word is Lee-non-isometric. Recall that in the case , the Lee-distance and the Lee-index coincide with the Hamming ones; hence Algorithm 1 applies. Furthermore, for , every word f is Lee-non-isometric and its Lee-index is equal to its length. Algorithm 2 is a slight variation of Algorithm 1 made to work with the Lee distance, instead of the Hamming one, while keeping the same complexity.
Description
Algorithm 2 is a slight variation of Algorithm 1. It is presented in its complete form for the sake of completeness. Let us focus on the differences existing between the two algorithms.
Algorithm 2 still starts with a call to function TwoErrorOverlaps to compute all 2-Lee-error overlaps of f, using the Kangaroo method. Function TwoErrorOverlaps compares the suffix starting at i with ; this is done in Line 2 for each position i, with , and not for as in Algorithm 1, since the length of a 2-Lee-error overlap can be 1, too, differently from the 2-Ham-error overlaps whose length is at least 2. Then, in Line 2 the considered distance is the Lee one, instead of the Hamming one.
Subsequently, if some 2-Lee-error overlap of f was found, Algorithm 2 constructs a pair of Lee-witnesses calling the function WitnessesConstructor for each 2-Lee-error overlap. According to Proposition 3 and Theorem 3, the function constructs the pair in two different ways, following that the 2-Lee-error overlap is caused by two complementary symbols (i.e., and the list has only one element) or by two non-complementary symbols (i.e., and the list has two elements). The case was not possible with the Hamming distance, therefore, Lines 2-2 must be added. If , the function constructs u as by appending to the prefix of f of length the word obtained as specified in Definition 5. Similarly, it constructs v as , this time appending . The case goes as for the Hamming distance and no change need to be done.
Computing the Lee-index and Lee-witnesses for f
Analysis
Let Σ be a k-ary alphabet,, andbe a non-empty word of length n.
Algorithm
2
with input f computes the Lee-index of f and a pair of Lee-witnesses of minimal length for f intime with additionalspace.
Algorithm 2 is a slight variation of Algorithm 1 and it keeps the same time and space complexity. Indeed, the changes made to Algorithm 1 concern: one more repetition of the loop in Line 2, the evaluation of the Lee distance, instead of the Hamming one in Line 2, and the computation of a further pair of Lee-witnesses when in Lines 2-2. All these modifications add only some extra constant time. □
Let us run Algorithm 2 on . The call to function TwoErrorOverlaps finds two 2-Lee-error overlaps. The first one is of length 4, where the errors are in positions 1, 3 and are caused by non-complementary symbols; in fact, and . The second one is of length 2, and has a unique error position 1; in fact, and , since G and C are complementary symbols. Thus, TwoErrorOverlaps outputs and .
Then, coming back to Line 3 of the main algorithm, because is not empty, the algorithm initializes the output variable . For to 1 the algorithm calls the function WitnessesConstructor.
The first call takes as input and sets , , . Because , then and after calling CondPlus. Thus, the function WitnessesConstructor computes and outputs and , obtained as and . Since , the algorithm updates and .
The second call to WitnessesConstructor takes as input and sets , , . Because , then the function outputs and , obtained as and . Since is greater than the algorithm does not update neither I nor .
Hence, Algorithm 2 outputs the Lee-index and the pair of Lee-witnesses .
Conclusion
The goal of this paper was to investigate the notion of k-ary isometric word, with , when the isometricity is defined in terms of the Hamming distance and of the Lee distance. Deciding whether a word is isometric can be efficiently done using a characterization of non-isometric words as the ones showing a particular overlap with errors. The evidence that a given word f is non-isometric can be given by exhibiting a pair of f-free words whose transformations cannot avoid the factor f. Such pair of words is called a pair of Ham-witnesses (Lee-witnesses, resp.) and the minimal length of the witnesses is called the Ham-index (Lee-index, resp.) of f. With the aim of laying the foundations for an efficient computation of the index and the pairs of witnesses of non-isometric k-ary words, first, the necessary theoretical results have been provided. The results allowed to design Algorithm 1 and Algorithm 2 which take a word and, in linear time and space, outputs its index and a minimal pair of witnesses when the word is non-isometric.
Lee-non-isometric words in starting with A
Word (f)
2eo-length (l)
2eo-shift (r)
m
Type
Witness1 (u)
Witness2 (v)
Index ((f))
AAT
2
1
1
FP
AAGT
AACT
4
1
2
1
FP
AAGAT
AACAT
ACA
2
1
2
AB
ACCA
AAAA
4
ACT
2
1
2
AB
ACCT
AATT
4
1
2
1
FP
ACGCT
ACCCT
ATT
2
1
1
FP
AGTT
ACTT
4
1
2
1
FP
ATGTT
ATCTT
AGA
2
1
2
AB
AGGA
AAAA
4
Algorithms 1 and Algorithm 2 have been implemented in C. Let us conclude by showing in the following tables some data obtained running Algorithm 2. The tables list the Lee-non-isometric words f in that starts with A, of length 3 in Table 1, and of length 4 in Table 2. For each word f, the second and the third columns report all the 2-Lee-error overlaps by their length, 2eo-length (l), and shift, 2eo-shift (r), respectively. For each 2-Lee-error overlap, m gives the number of error positions (1 or 2), while Type denotes which construction of the Lee-witnesses has been used, according to Proposition 3; namely, AB refers to case 2.a. when a pair is constructed, EG refers to case 2.b. when a pair is constructed, and FP refers to case 1. when a pair is constructed. The corresponding pair of Lee-witnesses is given by as reported in the sixth and seventh columns. The last column contains the Lee-index of word f, that is the minimum length of its witnesses.
Lee-non-isometric words in starting with A
Word (f)
2eo-length (l)
2eo-shift (r)
m
Type
Witness1 (u)
Witness2 (v)
Index ((f))
AAAT
3
1
1
FP
AAAGT
AAACT
5
2
2
1
FP
AAAGAT
AAACAT
1
3
1
FP
AAAGAAT
AAACAAT
AACA
3
1
2
AB
AACCA
AAAAA
5
AACC
2
2
2
EG
AACACCC
AAACACC
7
AACT
3
1
2
AB
AACCT
AAATT
5
1
3
1
FP
AACGACT
AACCACT
AACG
2
2
2
AB
AACACG
AAAGCG
6
AATA
2
2
1
FP
AAGATA
AACATA
6
AATT
3
1
1
FP
AAGTT
AACTT
5
1
3
1
FP
AATGATT
AATCATT
AAGA
3
1
2
AB
AAGGA
AAAAA
5
AAGC
2
2
2
AB
AAGAGC
AAACGC
6
AAGT
3
1
2
AB
AAGGT
AAATT
5
1
3
1
FP
AAGGAGT
AAGCAGT
AAGG
2
2
2
EG
AAGAGGG
AAAGAGG
7
ACAA
3
1
2
AB
ACCAA
AAAAA
5
ACAT
1
3
1
FP
ACAGCAT
ACACCAT
7
ACAG
2
2
1
FP
ACATAG
ACAAAG
6
ACCA
3
1
2
AB
ACCCA
AACAA
5
2
2
2
AB
ACCCCA
ACAACA
ACCT
3
1
2
AB
ACCCT
AACTT
5
2
2
2
AB
ACCCCT
ACATCT
1
3
1
FP
ACCGCCT
ACCCCCT
ACTC
2
2
1
FP
ACGCTC
ACCCTC
6
ACTT
3
1
2
AB
ACCTT
AATTT
5
1
3
1
FP
ACTGCTT
ACTCCTT
ACGA
2
2
2
AB
ACGCGA
ACAAGA
6
ACGT
2
2
2
AB
ACGCGT
ACATGT
6
1
3
1
FP
ACGGCGT
ACGCCGT
ATAA
2
2
1
FP
ATACAA
ATAGAA
6
ATAT
1
3
1
FP
ATAGTAT
ATACTAT
7
ATCC
2
2
2
AB
ATCTCC
ATACCC
6
ATCT
1
3
1
FP
ATCGTCT
ATCCTCT
7
ATCG
2
2
2
AB
ATCTCG
ATAGCG
6
ATTT
3
1
1
FP
AGTTT
ACTTT
5
2
2
1
FP
ATGTTT
ATCTTT
1
3
1
FP
ATTGTTT
ATTCTTT
ATGC
2
2
2
AB
ATGTGC
ATACGC
6
ATGT
1
3
1
FP
ATGGTGT
ATGCTGT
7
ATGG
2
2
2
AB
ATGTGG
ATAGGG
6
AGAA
3
1
2
AB
AGGAA
AAAAA
5
AGAC
2
2
1
FP
AGAAAC
AGATAC
6
AGAT
1
3
1
FP
AGAGGAT
AGACGAT
7
AGCA
2
2
2
AB
AGCGCA
AGAACA
6
(Continued)
Word (f)
2eo-length (l)
2eo-shift (r)
m
Type
Witness1 (u)
Witness2 (v)
Index ((f))
AGCT
2
2
2
AB
AGCGCT
AGATCT
6
1
3
1
FP
AGCGGCT
AGCCGCT
AGTT
3
1
2
AB
AGGTT
AATTT
5
1
3
1
FP
AGTGGTT
AGTCGTT
AGTG
2
2
1
FP
AGGGTG
AGCGTG
6
AGGA
3
1
2
AB
AGGGA
AAGAA
5
2
2
2
AB
AGGGGA
AGAAGA
AGGT
3
1
2
AB
AGGGT
AAGTT
5
2
2
2
AB
AGGGGT
AGATGT
1
3
1
FP
AGGGGGT
AGGCGGT
For example, referring to Table 2, there are three 2-Lee-error overlaps for ; the first one has length 3 and shift 1, the second one has length 2 and shift 2, and the third one has length 1 and shift 3. Each one has a single error position () due to the occurrence of complementary symbols A and T. The corresponding pairs of Lee-witnesses are all of type FP; they are constructed as following case 1. of Proposition 3. They are , , and . The Lee-index is given by the length of the shortest witnesses, .
Summing up, there are 38 Lee-non-isometric words f in that starts with A and have length 4. From these, 15 words have minimum Lee-index , whereas 7 words have maximum Lee-index ; the remaining 16 words have Lee-index 6. Moreover, there are 25 words with a single 2-Lee-error overlap, 9 words with two 2-Lee-error overlaps, and 4 with three 2-Lee-error overlaps. Overall, there are 152 Lee-non-isometric words of length 4 among all 256 words in . The asymptotic behaviour of the density of Ham-non-isometric binary words was determined in [16]. Recently, the density of Ham- and Lee-non-isometric k-ary words has been evaluated in [6].
Footnotes
Acknowledgements
The authors acknowledge support by INdAM-GNCS Project 2023, FARB Project ORSA229894 of University of Salerno, FFR and PNRR MUR Project ITSERR CUP B53C22001770006 of University of Palermo, TEAMS Project and PNRR MUR Project PE0000013-FAIR of University of Catania.
References
1.
M.Anselmo, G.Castiglione, M.Flores, D.Giammarresi, M.Madonia and S.Mantaci, Hypercubes and isometric words based on swap and mismatch distance, in: Descriptional Complexity of Formal Systems. DCFS 2023, Lect. Notes Comput. Sci., Vol. 13918, Springer, 2023, pp. 21–35. doi:10.1007/978-3-031-34326-1_2.
2.
M.Anselmo, G.Castiglione, M.Flores, D.Giammarresi, M.Madonia and S.Mantaci, Isometric words based on swap and mismatch distance, in: Developments in Language Theory. DLT23, Lect. Notes Comput. Sci., Vol. 13911, Springer, 2023, pp. 23–35. doi:10.1007/978-3-031-33264-7_3.
3.
M.Anselmo, M.Flores and M.Madonia, Quaternary n-cubes and isometric words, in: Combinatorics on Words, T.Lecroq and S.Puzynina, eds, Lect. Notes Comput. Sci., Vol. 12842, Springer International Publishing, 2021, pp. 27–39. doi:10.1007/978-3-030-85088-3_3.
4.
M.Anselmo, M.Flores and M.Madonia, On k-ary n-cubes and isometric words, Theor. Comput. Sci.938(6–7) (2022), 50–64. doi:10.1016/j.tcs.2022.10.007.
5.
M.Anselmo, M.Flores and M.Madonia, Fun slot machines and transformations of words avoiding factors, in: FUN with Algorithms 2022, P.Fraigniaud and Y.Uno, eds, LIPIcs, Vol. 226, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022, pp. 4–1415.
6.
M.Anselmo, M.Flores and M.Madonia, Density of Ham- and Lee- non-isometric k-ary words, in: ICTCS’23 Italian Conference on Theoretical Computer Science, G.Castiglione and M.Sciortino, eds, CEUR Workshop Proceedings, Vol. 3587, CEUR-WS.org, 2023, pp. 116–128.
7.
M.Anselmo, D.Giammarresi, M.Madonia and C.Selmi, Bad pictures: Some structural properties related to overlaps, in: DCFS 2020, G.Jirásková and G.Pighizzini, eds, Lect. Notes Comput. Sci., Vol. 12442, Springer, 2020, pp. 13–25. doi:10.1007/978-3-030-62536-8_2.
8.
M.-P.Béal and M.Crochemore, Checking whether a word is Hamming-isometric in linear time, Theor. Comput. Sci.933(6–7) (2022), 55–59. doi:10.1016/j.tcs.2022.08.032.
9.
G.Castiglione, M.Flores and D.Giammarresi, Isometric words and edit distance: Main notions and new variations, in: Cellular Automata and Discrete Complex Systems. AUTOMATA 2023, Lect. Notes Comput. Sci., Vol. 14152, Springer, 2023, pp. 6–13.
10.
M.Crochemore and W.Rytter, Jewels of Stringology, World Scientific, 2002. ISBN 978-981-02-4782-9.
11.
J.Fischer and V.Heun, Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE, in: Combinatorial Pattern Matching, 17th Annual Symposium, M.Lewenstein and G.Valiente, eds, Lect. Notes Comput. Sci., Vol. 4009, Springer, 2006, pp. 36–48. doi:10.1007/11780441_5.
12.
Z.Galil and R.Giancarlo, Improved string matching with k mismatches, SIGACT News17(4) (1986), 52–54. doi:10.1145/8307.8309.
13.
W.Hsu, Fibonacci cubes-a new interconnection topology, IEEE Transactions on Parallel and Distributed Systems4(1) (1993), 3–12. doi:10.1109/71.205649.
S.Klavžar, Structure of Fibonacci cubes: A survey, J. Comb. Optim.25(4) (2013), 505–522. doi:10.1007/s10878-011-9433-z.
16.
S.Klavžar and S.V.Shpectorov, Asymptotic number of isometric generalized Fibonacci cubes, Eur. J. Comb.33(2) (2012), 220–226. doi:10.1016/j.ejc.2011.10.001.
17.
G.M.Landau and U.Vishkin, Efficient string matching in the presence of errors, in: 26th Annual Symposium on Foundations of Computer Science, IEEE Computer Society, 1985, pp. 126–136. doi:10.1109/SFCS.1985.22.
18.
J.Wei, The structures of bad words, Eur. J. Comb.59 (2017), 204–214. doi:10.1016/j.ejc.2016.05.003.
19.
J.Wei, Y.Yang and X.Zhu, A characterization of non-isometric binary words, Eur. J. Comb.78 (2019), 121–133. doi:10.1016/j.ejc.2019.02.001.
20.
J.Wei and H.Zhang, Proofs of two conjectures on generalized Fibonacci cubes, Eur. J. Comb.51 (2016), 419–432, ISSN 0195-6698. doi:10.1016/j.ejc.2015.07.018.