Phylogeny Construction with Rigid Gapped Motifs

Abstract

Patterns with gaps have traditionally been used as signatures of protein families or as features in binary classification. Current alignment-free algorithms construct phylogenies by comparing the repertoire and frequency of ungapped blocks in genomes and proteomes. In this article, we measure the quality of phylogenies reconstructed by comparing suitably defined sets of gapped motifs that occur in mitochondrial proteomes. We study the dependence between the quality of reconstructed phylogenies and the density, number of solid characters, and statistical significance of gapped motifs. We consider maximal motifs, as well as some of their compact generators. The average performance of suitably defined sets of gapped motifs is comparable to that of popular string-based alignment-free methods. Extremely long and sparse motifs produce phylogenies of the same or better quality than those produced by short and dense motifs. The best phylogenies are produced by motifs with 3 or 4 solid characters, while increasing the number of solid characters degrades phylogenies. Discarding motifs with low statistical significance degrades performance as well. In maximal motifs, moving from the smallest basis to bases with higher redundancy leads to better phylogenies.

1. Introduction

Current alignment-free algorithms build phylogenies by comparing the repertoire and frequency of substrings in genomes and proteomes. Substrings are meant in this context as short, solid blocks that do not allow any form of flexibility in their sequence. One family of methods avoids considering the potentially quadratic number of all distinct substrings by bounding their maximum length and observing convergence when length 5 or larger is reached (Apostolico et al., 2010b; Qi et al., 2004; Sims et al., 2009; Vinga, 2007; Vinga and Almeida, 2003). A recent alternative points to the linear set of maximal substrings (i.e., strings that cannot be extended in any direction without reducing their number of occurrences, often called support) as sufficient to grasp essential phylogenetic information (Apostolico, 2010).

The purpose of this article is threefold. First, we want to measure for the first time the quality of phylogenies reconstructed by explicitly comparing the composition of structures that do allow mismatches. By “composition” of a string s, we mean here the set of all structures of a given type that occur in s. We turn to rigid gapped motifs in particular, and we compare their phylogenies both to a reference taxonomy and to those generated by popular string-based alignment-free methods. Second, we want to study the relationship between classification power and number of gaps. By “classification power” of a given set of motifs, we mean here the distance between phylogenies reconstructed from the composition of such motifs and corresponding reference phylogenies. We are specifically interested in testing whether extremely sparse motifs carry any phylogenetic signal. To accomplish this goal, we use motifs whose length and sparsity have never been considered before. Even worse than substrings, the number of rigid gapped motifs can grow exponentially in the length of a string. Our third objective is measuring the footprints on classification quality of systematic ways to limit this explosion. We experiment with global and local bounds on the density of motifs, with motifs with high z-score, as well as with maximal motifs, i.e., motifs that cannot be made more specific without losing support. Unfortunately, even maximal motifs grow too fast to be practical: we thus consider bases that are capable of generating the whole set of maximal motifs but grow quadratically or linearly in the length of the string. As a byproduct of this analysis, we report previously unseen regularities in the distribution of maximal motifs and their bases with respect to density and length in mitochondrial proteomes and in corresponding random strings.

This article is organized as follows: Section 2 overviews the few existing methods that construct phylogenies from gapped patterns, tracing their roots to early large-scale compositional analyses. Section 3 describes the types of gapped motifs we study in our experiments, as well as our dataset and measures. Experiments with more than 3600 trees built on approximately 4.4 billion motifs are detailed in Section 4 and summarized in Section 5, where we point the reader to some natural extensions of our methodology.

2. Gapped Patterns In Phylogeny: State Of The Art

Patterns with gaps are a successful formalism to represent structural and functional information in biological sequences: for example, most of the signatures in biologically significant databases like PROSITE (Sigrist et al., 2010) contain gaps with fixed lengths (Hart et al., 2000; Jonassen et al., 1995), and algorithms for the automatic extraction of gapped motifs in many flavors have flourished (Jonassen et al., 1995; Pisanti et al., 2006). Due to their ability to recapitulate all motifs that occur in a string, maximal motifs have attracted a fertile line of research (Califano, 2000; Grossi et al., 2011; Rigoutsos and Floratos, 1998). Like patterns in manually curated databases, however, maximal motifs extracted by unsupervised algorithms have mostly been applied to build signatures of protein families, with a range of complexity that goes from a single motif to sets of motifs that occur with variable order, multiplicity and position (Darzentas and Rigoutsos, 2005; Liu and Califano, 2001; Liu et al., 2003; Sosinsky et al., 2007).

As mentioned, by composition we mean the set of all structures of a given type that occur in a dataset. The transition from patterns seen as signatures to comprehensive compositional studies probably started with the unsupervised extraction of all maximal motifs with given density bounds and support from the GenPept database (Rigoutsos et al., 1999a), with the aim of building a dictionary of all maximal motifs that occur in any known protein sequence. Correlating such building blocks to structure and function provides a way to understand protein organization, thereby enabling a pipeline for the unsupervised functional and structural annotation of proteomes (Rigoutsos et al., 2002). Motifs in the dictionary have been shown to contain information at multiple levels of abstraction: some motifs are specific to a protein family, others are specific to a phylogenetic taxon, and yet others cross protein families and phylogenetic groups, suggesting themselves as universally reused gapped modules that resonate with solid ones identified earlier (Han and Baker, 1995; Han et al., 1997). The very idea of relating motif composition to phylogeny probably surfaces for the first time in this dictionary, albeit still seen from a signature viewpoint: the authors of Rigoutsos et al. (1999a) ask for the set of motifs that characterizes a specific clade, that are shared among a given set of clades, and that occur in all known clades, and they provide examples of motifs that are archaea-specific, bacteria-specific, shared between archaea and bacteria and between archaea and eukaryotes. A systematic study of the classification power of gapped motifs is however deferred. The dictionary of motifs was subsequently recompiled using the proteomes of 4 archaea and 13 bacteria (Rigoutsos et al., 1999b): once again, motifs are used for functional annotation and as signatures of protein families, and the composition of motifs is not compared across the two clades.

The notion of composition vector based on normalized counts of occurrences of gapped patterns looms already in the few other large-scale compositional studies on gapped patterns. These studies systematically collected all occurrences of PROSITE's regular expressions in the translated intergenic regions of the fly, yeast, and human genomes (Zhang et al., 2002), and in a set of 42 proteomes (Nicodème et al., 2002), respectively, exploiting existing theoretical tools to compute expectation and variance (Atteson, 1998). Both studies revealed subsets of patterns to be overrepresented and other subsets to be underrepresented, showing functional preference in proteomes, and discovering, in intergenic regions, relics of ancient proteins that have been deactivated by accumulated mutations. Such compositional preferences, however, were not used to build phylogenies. Composition vectors based on various notions of gapped patterns have then been extensively built by the string kernel community as a prerequisite to classify biosequences using the SVM machinery. 
For a small sampler, we mention here vectors containing the raw frequency of motifs in the eBLOCKS database (Ben-Hur and Brutlag, 2003), i.e., rigid gapped patterns with substitution groups extracted in an unsupervised way from the SwissProt database using the eMOTIF heuristic; vectors containing the frequency of all maximal rigid gapped motifs with high density occurring in the dataset (Dong et al., 2006); vectors indexed by all possible strings in (Σ ∪ {•})* with k solid characters and at most m gaps (Leslie and Kuang, 2003); vectors indexed by k-mers, but containing the number of occurrences of each k-mer as a subsequence with prescribed number and length of gaps (Leslie and Kuang, 2003; Lodhi et al., 2002; Rousu and Jaakkola, 2005); and vectors indexed by all possible pairs of spaced k-mers (Lingner and Meinicke, 2006). These studies, however, applied composition vectors to the task of discriminating between the biological sequences belonging to a class (e.g., a node in the SCOP tree or a group of enzymes) and those not belonging to a class, rather than to reconstructing hierarchical clusters or entire phylogenies. Perhaps the efforts in this line of research that came closest to the reconstruction of phylogenies were the use of the normalized frequency of short, dense, gapped maximal motifs to detect horizontal gene transfer events (Tsirigos and Rigoutsos, 2005) and to classify variable-length DNA fragments coming from several metagenomes (McHardy et al., 2007). 
In this latter work, a hierarchy of multiclass support vector machines was used to discriminate the members of a phylogenetic taxon from those not belonging to that taxon, at the domain, phylum, class, order and genus level.

To date, few phylogenies inferred from gapped motifs exist, but none of them compares the motif composition of biosequences explicitly. To compute the distance between two sequences x and y, the authors of Höhl et al. (2006) concatenate the realizations of all rigid, gapped, maximal motifs that occur exactly once in both x and y, forming two new sequences x′, y′ of the same length. A conventional maximum-likelihood estimate based on a model of character evolution is then used to compute the distance between x′ and y′. This methodology effectively uses rigid gapped motifs as anchors for local alignments, comparing the characters that fill corresponding gaps in two sequences rather than the repertoire of motifs in the sequences. Moreover, motifs with multiple occurrences in the same sequence are systematically discarded. Motifs with flexible gaps are used to classify mitochondrial genomes in Apostolico et al. (2006), but inside the algorithmic information framework of the Normalized Compression Distance (Cilibrasi and Vitanyi, 2005): the distance between two strings depends here on their mutual compressibility with a greedy offline compressor that iteratively shrinks the pair using the motif that yields the best gain, possibly in lossy mode (Apostolico and Parida, 2003). The motif composition of the two strings is thus compared only implicitly.

As mentioned, the present article also investigates how classification quality varies when moving from all motifs in a family to compact subsets capable of generating the whole family. In rigid gapped motifs, the notion of using a basis to characterize and compare strings without resorting to alignments originated with the very definition of such bases (Pisanti et al., 2005). However, few alignment-free methods study similar issues of minimality. SVD is the typical dimension-reduction and denoising step after the construction of composition vectors (Dong et al., 2006; Stuart et al., 2002a, b); however, the features of the resulting orthonormal basis have no clear interpretation as substrings or patterns. Elsewhere (Comin and Parida, 2008; Comin and Verzotto, 2010, 2011) it is conjectured that moving from distances based on common rigid gapped motifs, i.e., on rigid gapped motifs with at least one realization in two strings, to a non-redundant subset with no mutual dependency and capable of generating all common rigid gapped motifs, should improve the performance of kernel methods by removing redundancy. Competitive results are reported in the remote homology detection of proteins, but the distortion on distance that this approach should be capable of avoiding is quantified neither empirically nor formally.

Another objective of the present paper is studying what happens to phylogenies when increasingly sparse motifs are used in composition vectors. Sparsity has been systematically penalized in string kernels, typically by weighting gaps with exponentially decreasing functions of their length or number (Leslie and Kuang, 2003; Lodhi et al., 2002; Rousu and Jaakkola, 2005). On the other hand, applications of motif discovery have been extremely liberal with gaps, with rare exceptions (Jonassen et al., 1995), showing that sparse structures do carry biological information. Table 1 summarizes the densities used in a sampler of papers that extract maximal and elementary motifs (as defined in Section 3) from biological sequences. Experiments with gapped string kernels have used comparable or even higher densities than those listed in Table 1 (Leslie and Kuang, 2003).

Table 1.

Diachronic Summary of Papers that Extract Elementary and Maximal Gapped Motifs from Biological Sequences

Reference	Year	k	h	n	o	Dataset
Rigoutsos and Floratos (1998)	1998	3	35	7		Core histone families H3, H4.
		3	35	10		Leghemoglobin family.
Rigoutsos et al. (1999a)	1999	6	15	2	•	All NCBI proteins.
Rigoutsos et al. (1999b)	1999	6	15	2	•	Translated ORFs from 13 Bacteria and 4 Archaea genomes.
Califano (2000)	2000	4	12	variable	•	Histone I protein family.
		4	30			GPCR protein superfamily.
Liu and Califano (2001)	2001	4	8	variable	•	GPCR protein superfamily.
		4	12
		6	12
		8	12
Rigoutsos et al. (2002)	2002	6	15	2	•	All SwissProt proteins.
Liu et al. (2003)	2003	4	6	variable	•	Mammalian odor receptor proteins.
Darzentas and Rigoutsos (2005)	2005	4	8	2	•	Cupredoxin and multicopper oxidase protein families.
Dong et al. (2006)	2006	3	6	10		SCOP families.
Höhl et al. (2006)	2006	4	16	2	•	Artificial polypeptides. Benchmark protein alignments.
Sosinsky et al. (2007)	2007	6	8	7		DNA upstream and downstream orthologous genes in Drosophila species.
McHardy et al. (2007)	2007	2	3	1		Metagenomic DNA: Sargasso sea, EBPR-sludge.
		4	6
		5	6
Edwards et al. (2007)	2007	2	3	variable	•	Enriched eukaryotic linear motif datasets (Gould, 2009)
Apostolico et al. (2010a)	2010	2	3	variable	•	PROSITE families
			4
			6
			7
			12

k solid characters can span a window of length at most h. n: minimum number of occurrences of a motif. o: homology allowed. Articles highlighted in gray detail the distribution of the number of motifs on density, length and support. Cursory hints at length and support can also be found elsewhere (Liu et al., 2003; Rigoutsos et al., 2002; Tsirigos and Rigoutsos, 2008). Edwards et al. (2007) considers only patterns with maximum length 10. McHardy et al. (2007) extracts patterns with length exactly h. Apostolico et al. (2010a) considers flexible motifs. Grossi et al. (2011), not included in the table, extracts maximal motifs with global density 0.65 and 0.8.

3. Methods

Let Σ be a reference alphabet and let • ∉ Σ be a don't care. Using standard notation, we define the following partial order among elements of Σ ∪ {•}: a ⪳ b iff either a ∈ Σ and b = •, or a ∈ Σ and b = a. We also define the binary operator ⨁ on Σ as follows: a ⨁ b = a if a = b, and a ⨁ b = • otherwise. In the present paper we will use the term gapped pattern (or just pattern) to denote any string in Σ(Σ ∪ {•})*•(Σ ∪ {•})*Σ, i.e., any string in Σ(Σ ∪ {•})*Σ that contains at least one don't care. We will say that pattern v is a subpattern of pattern w if there is an index i ∈ {0, …, |w| − |v|} such that w[i + j] = v[j] for all j ∈ {0, …, |v| − 1}. We will use the term gapped motif (or just motif) to denote a gapped pattern that occurs at least two times in a string. With ℒ_s(w), we will denote the set of occurrences of pattern w in s. Given an integer d, we will write ℒ_s(w) + d to mean the set {ℓ + d : ℓ ∈ ℒ_s(w)}.
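
The notation above can be made concrete with a minimal Python sketch. It is only an illustration under stated assumptions: the character "." stands in for the don't care •, strings are assumed not to contain "." themselves, and the function names (leq, oplus, occurrences, is_subpattern, shift) are hypothetical, not taken from any implementation used in this article.

```python
DONT_CARE = "."  # assumption: "." stands in for the don't care symbol


def leq(a, b):
    """Partial order as stated above: a is solid and b is a don't care or b equals a."""
    return a != DONT_CARE and (b == DONT_CARE or b == a)


def oplus(a, b):
    """Binary operator: keep the character if both agree, else a don't care."""
    return a if a == b else DONT_CARE


def occurrences(w, s):
    """Set of start positions i such that every solid character of w matches s[i + j]."""
    return {i for i in range(len(s) - len(w) + 1)
            if all(c == DONT_CARE or c == s[i + j] for j, c in enumerate(w))}


def is_subpattern(v, w):
    """True if v appears verbatim (don't cares included) inside w."""
    return any(w[i:i + len(v)] == v for i in range(len(w) - len(v) + 1))


def shift(locs, d):
    """Occurrence list shifted by the integer d."""
    return {l + d for l in locs}
```

For instance, on the toy string "ABCADC" the pattern "A.C" occurs at positions 0 and 3, so it is a motif of that string.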

The most natural way to introduce gaps in standard k-mers is probably the notion of elementary motif.

Definition 1

(Elementary motif [Rigoutsos and Floratos, 1998]). A rigid gapped pattern w is a (k, h, n)-elementary motif of a string s if it has k solid characters, if it has length at most h, and if |ℒ_s(w)| ≥ n ≥ 2.
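
As an illustration of Definition 1, the following brute-force Python sketch enumerates the (k, h, n)-elementary motifs of a short string. It is a naive enumeration over all solid/don't-care masks, not the extraction procedure used in the experiments; "." stands in for •, the name elementary_motifs is hypothetical, and k ≥ 2 is assumed. Since a gapped pattern must contain at least one don't care, only lengths from k + 1 to h are considered.

```python
from collections import Counter
from itertools import combinations

DONT_CARE = "."  # assumption: "." stands in for the don't care symbol


def elementary_motifs(s, k, h, n=2):
    """Map each (k, h, n)-elementary motif of s to its number of occurrences."""
    counts = Counter()
    for length in range(k + 1, h + 1):           # at least one don't care
        # First and last positions are solid; choose the k - 2 interior solid positions.
        for interior in combinations(range(1, length - 1), k - 2):
            solid = (0, *interior, length - 1)
            for i in range(len(s) - length + 1):
                w = [DONT_CARE] * length
                for j in solid:
                    w[j] = s[i + j]
                counts["".join(w)] += 1           # one pattern per (mask, position)
    return {w: c for w, c in counts.items() if c >= n}
```

Each pattern determines its own mask, so the counter value of a pattern equals its number of occurrences. The number of masks grows as a binomial coefficient in h and k, which is one way to see why large values of k and h are impractical.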

Elementary motifs have strong ties to molecular biology: for example, self-contained “functional microdomains,” believed to mediate 15-40% of all protein-protein interactions in intracellular signaling, are rigid gapped patterns with length at most 10 occurring in disordered regions on the surface of multidomain proteins (Gould, 2009). The use of elementary motifs in phylogeny was probably hinted at for the first time in Höhl et al. (2006) and was then partially explored in McHardy et al. (2007).

Elementary motifs grow exponentially in the length of a string, thus limiting the values of k and h that can be probed in practice. To handle longer and sparser structures, we turn to maximal motifs and their bases.

Definition 2

(Maximal motif [Parida et al., 2000]). Let w be a pattern occurring at positions ℒ(w) = {i_0, i_1, …, i_{n−1}} in a string s, where n ≥ 2. We say that w is maximal in composition if no other motif v ≠ w of s has ℒ(v) = ℒ(w) and v[i] ⪳ w[i] for all i ∈ {0, …, |w| − 1}. We say that w is maximal in length if no other motif v ≠ w of s is such that |ℒ(v)| = |ℒ(w)| and w is a subpattern of v. We say that w is a maximal motif of s if it is both maximal in composition and maximal in length.
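
One way to test Definition 2 on small inputs is to compare w with the consensus of s around all occurrences of w: at every offset where all occurrences agree on a character, a maximal motif must already carry that character. The sketch below follows this idea; it is a brute-force check for illustration, not an efficient extraction algorithm, "." stands in for •, and the names maximal_extension and is_maximal are hypothetical.

```python
DONT_CARE = "."  # assumption: "." stands in for the don't care symbol


def occurrences(w, s):
    """Set of start positions at which pattern w matches string s."""
    return {i for i in range(len(s) - len(w) + 1)
            if all(c == DONT_CARE or c == s[i + j] for j, c in enumerate(w))}


def maximal_extension(w, s):
    """Most specific and longest pattern sharing the occurrence list of w (up to a shift)."""
    locs = sorted(occurrences(w, s))
    if len(locs) < 2:
        raise ValueError("w must occur at least twice in s")
    lo = -locs[0]                   # smallest offset that stays inside s for every occurrence
    hi = len(s) - 1 - locs[-1]      # largest such offset
    consensus = []
    for d in range(lo, hi + 1):
        column = {s[l + d] for l in locs}
        consensus.append(column.pop() if len(column) == 1 else DONT_CARE)
    return "".join(consensus).strip(DONT_CARE)


def is_maximal(w, s):
    """True if w is maximal in composition and in length in s."""
    return len(occurrences(w, s)) >= 2 and maximal_extension(w, s) == w
```

On the toy string "ABXCABYC", the pattern "A..C" is not maximal in composition and "B.C" is not maximal in length; both extend to the maximal motif "AB.C".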

Even maximal motifs can grow exponentially in the length of the input string (Parida et al., 2000). A first way to limit this explosion is imposing local density bounds.

Definition 3

(Dense maximal motif [Rigoutsos and Floratos, 1998]). A rigid gapped pattern w is a (k, h, n)-maximal motif of a string s if it is a maximal motif of s with |ℒ_s(w)| ≥ n, and if every subpattern of w with exactly k solid characters has length at most h > k.
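
The local density constraint of Definition 3 can be checked with a sliding window over the solid positions of w, as in the following sketch ("." stands in for •; the function name is hypothetical). Since a subpattern starts and ends with a solid character, it suffices to inspect every run of k consecutive solid characters.

```python
DONT_CARE = "."  # assumption: "." stands in for the don't care symbol


def satisfies_local_density(w, k, h):
    """True if every window of k consecutive solid characters of w spans at most h positions."""
    solid = [j for j, c in enumerate(w) if c != DONT_CARE]
    return all(solid[i + k - 1] - solid[i] + 1 <= h
               for i in range(len(solid) - k + 1))
```

For example, "A.B...C" satisfies the bound for k = 2 and h = 5, but not for k = 2 and h = 3, because the solid characters B and C span five positions.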

As mentioned, a second way is to consider compact generators of the whole set of maximal motifs.

Definition 4

(Irredundant motif [Parida et al., 2000]). A maximal motif w of a string s is redundant if there exist maximal motifs w_0, w_1, …, w_{n−1} of s such that ℒ_s(w) = ⋃_{i=0}^{n−1} ℒ_s(w_i). We call irredundant a maximal motif of s that is not redundant.

Definition 5

(Tiling motif [Pisanti et al., 2003b]). A maximal motif w of a string s is tiled if there exist maximal motifs w_0, w_1, …, w_{n−1} of s (w_i ≠ w ∀ i) and integers d_0, d_1, …, d_{n−1} such that ℒ_s(w) = ⋃_{i=0}^{n−1} (ℒ_s(w_i) + d_i). We call tiling a maximal motif of s that is not tiled.

The set of irredundant (respectively, tiling) motifs with at least n occurrences in a string s, together with their occurrence lists, contains sufficient information to generate any other maximal motif with at least n occurrences in s without knowing s itself (Parida et al., 2000; Pisanti et al., 2003a). It is thus standard to call this set a basis. For n = 2, the size of the irredundant (respectively, tiling) basis is bounded by a quadratic (respectively, linear) function of the length of s (Pisanti et al., 2003b). The tiling basis is a subset of the irredundant basis, as well as of another distinguished set of maximal motifs that we include in our analysis.
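
The following Python sketch illustrates Definitions 4 and 5 at the level of occurrence lists only. It assumes that the occurrence lists of all maximal motifs of s are already available in a dictionary, that the covering motifs in Definition 4 are different from w (as stated explicitly in Definition 5), and that every list is non-empty. The function name is_covered and the toy lists are hypothetical.

```python
def is_covered(w, lists, allow_shifts):
    """True if the occurrence list of w is a union of (possibly shifted) lists of other motifs.

    lists maps each maximal motif to its set of occurrences.
    allow_shifts=False tests redundancy; allow_shifts=True tests whether w is tiled.
    """
    target = lists[w]
    covered = set()
    for v, locs in lists.items():
        if v == w:
            continue
        if allow_shifts:
            # Only shifts that keep the whole list inside the span of target can fit.
            shifts = range(min(target) - min(locs), max(target) - max(locs) + 1)
        else:
            shifts = (0,)
        for d in shifts:
            shifted = {l + d for l in locs}
            if shifted <= target:       # this shifted list fits entirely inside target
                covered |= shifted
    return covered == target


# Hypothetical occurrence lists of three maximal motifs named "w", "u" and "v".
toy_lists = {"w": {1, 7, 20}, "u": {1, 7}, "v": {9, 22}}
```

If any valid cover exists, the union of all lists that fit inside the target is also a cover, so a single pass suffices. In the toy example, w is irredundant (no unshifted cover exists) but tiled (u covers {1, 7} and v shifted by −2 covers {7, 20}), in agreement with the tiling basis being a subset of the irredundant basis.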

Definition 6

(Autocorrelation [Apostolico and Tagliacollo, 2007]). For strings x and y in Σ⁺, let w = x ⨁ y be the string w ∈ (Σ ∪ {•})^max{|x|, |y|} such that w[i] = x[i] ⨁ y[i] for all i ∈ {0, …, max{|x|, |y|} − 1} (we assume x[i] = • for i < 0 and i ≥ |x|, and y[i] = • for i < 0 and i ≥ |y|). Furthermore, given a string w ∈ (Σ ∪ {•})⁺, we denote with [w] the pattern obtained by removing all leading and trailing don't cares from w. A pattern w is an autocorrelation of a string s if w = [s ⨁ suf_i], where suf_i is the suffix of s starting at position i ∈ {1, …, |s| − 1}.
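
Definition 6 translates almost directly into code. The sketch below computes [s ⨁ suf_i] for every shift i of a short string; "." stands in for •, the function names are hypothetical, and shifts whose result is empty after trimming are simply dropped. Results that contain no don't care are kept here, although they are not gapped patterns in the sense used in this article.

```python
DONT_CARE = "."  # assumption: "." stands in for the don't care symbol


def oplus_strings(x, y):
    """Character-wise meet of x and y, padding the shorter string with don't cares."""
    n = max(len(x), len(y))
    x = x + DONT_CARE * (n - len(x))
    y = y + DONT_CARE * (n - len(y))
    return "".join(a if a == b else DONT_CARE for a, b in zip(x, y))


def autocorrelations(s):
    """Map each shift i in 1..|s|-1 to [s (+) suf_i], skipping empty results."""
    result = {}
    for i in range(1, len(s)):
        w = oplus_strings(s, s[i:]).strip(DONT_CARE)   # remove leading/trailing don't cares
        if w:
            result[i] = w
    return result
```

On the toy string "ABCADC" only the shift i = 3 yields a non-empty result, namely "A.C". Since there are |s| − 1 shifts, a string has at most |s| − 1 distinct autocorrelations.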

To date, irredundant and tiling motifs have been used as guides for the alignment of multiple sequences (Parida et al., 1999), as codewords for lossy, as well as lossless, compression of texts (Apostolico and Parida, 2003) and images (Amelio et al., 2011), and as features of string kernels for protein classification (Comin and Verzotto, 2010, 2011).

As mentioned, we want to study how classification quality depends on the composition of autocorrelations and of elementary, maximal, irredundant and tiling motifs of a string. More specifically, unlike previous studies that assessed the performance of tree construction algorithms on a few phylogenetic trees or tried to settle specific controversies in phylogeny, we want to produce results that are independent of the specific set of organisms used. However, we are not interested in artificial sequences generated by models of sequence evolution. We thus set the 2329 metazoan mitochondrial proteomes available from NCBI in June 2011 as our dataset 𝒫, and we set the corresponding NCBI taxonomy 𝒯 as our reference taxonomy. Mitochondria strike a good balance between phylogenetic significance and manageable string length: datasets containing a few dozen mitochondria have been used repeatedly to assess the effectiveness of phylogeny reconstruction algorithms (Apostolico et al., 2006; Li et al., 2001; Stuart et al., 2002b; Ulitsky et al., 2006).

Given a string x ∈ 𝒫, we denote with X^(e,k,h) the corresponding composition vector indexed by all possible patterns with exactly k solid characters and length at most h. The component of X^(e,k,h) associated with pattern w contains the number of occurrences of w in x, normalized to the maximum possible number of occurrences of w in x, if w is a (k, h, 2)-elementary motif of x, and zero otherwise. For practical reasons, we set h = 20 and k ∈ {2, …, 8}, allowing a density that is approximately seven times smaller than the smallest density examined in previous studies on elementary motifs (Table 1). We will thus use the shorthand X^(e,k) for X^(e,k,20). Note that increasing k corresponds to the standard alignment-free methodology of increasing the length of substrings. Elementary motifs with the same k can however span different lengths, and thus have different densities. Similarly, we denote with X^(m,k,h) the composition vector indexed by all possible patterns w such that every substring of w that contains k solid characters spans at most h positions. The component of X^(m,k,h) associated with pattern w is zero if w is not a (k, h, 2)-maximal motif of x, and equals the normalized frequency of w in x otherwise. 
To render our experiments feasible, we are forced to impose k = 2 and h = 50: this allows a local density that is approximately two times smaller than the smallest local density considered in previous applications of maximal motifs (Table 1), and captures all maximal motifs with 7 or more solid characters that occur in $\mathcal{P}$. This constraint is also permissive enough to match approximately 98.5% of all gaps contained in release 20.75 of PROSITE (Califano, 2000; Hart et al., 2000). We will thus use the shorthand $\mathbf{X}^{m}$ for $\mathbf{X}^{m,2,50}$. Finally, we set $\mathbf{X}^{i}$ (respectively, $\mathbf{X}^{a}$ and $\mathbf{X}^{t}$) to denote the composition vectors indexed by all possible patterns w, and containing at component w the normalized frequency of w in x if w is an irredundant motif (respectively, an autocorrelation or a tiling motif) of x, and zero otherwise. Autocorrelations, irredundant and tiling motifs do not grow too fast in practice, thus we do not impose any density constraint on them.
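The construction of a composition vector for elementary motifs can be sketched as follows. This is only an illustration under several assumptions that are not stated in this section: patterns are written with a hypothetical wildcard symbol `.`, an elementary pattern starts and ends with a solid character, the 2 in (k, h, 2) is read as a minimum number of occurrences, and the maximum possible number of occurrences of w in x is taken to be |x| − |w| + 1. The brute-force enumeration is meant for short strings, not for the full dataset.

```python
from collections import Counter
from itertools import combinations

WILDCARD = "."  # hypothetical don't-care symbol (an assumption of this sketch)


def density(pattern: str) -> float:
    """Ratio between the number of solid characters and the pattern length."""
    solid = sum(1 for c in pattern if c != WILDCARD)
    return solid / len(pattern)


def elementary_patterns(x: str, k: int, h: int) -> Counter:
    """Count all rigid gapped patterns with exactly k solid characters and
    length at most h that occur in x (first and last positions are solid)."""
    counts = Counter()
    for start in range(len(x)):
        window = min(h, len(x) - start)
        # The first solid character sits at offset 0; choose the other k - 1.
        for offsets in combinations(range(1, window), k - 1):
            length = offsets[-1] + 1 if offsets else 1
            chars = [WILDCARD] * length
            chars[0] = x[start]
            for o in offsets:
                chars[o] = x[start + o]
            counts["".join(chars)] += 1
    return counts


def composition_vector(x: str, k: int, h: int, min_support: int = 2) -> dict:
    """Sparse composition vector: pattern -> normalized number of occurrences.
    Patterns below min_support are treated as zero components and omitted."""
    counts = elementary_patterns(x, k, h)
    return {
        w: n / (len(x) - len(w) + 1)  # assumed normalization, see lead-in
        for w, n in counts.items()
        if n >= min_support
    }
```

Storing only nonzero components in a dictionary mirrors the fact that the vector is indexed by all possible patterns but almost all components are zero.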

We are interested in studying how the quality of the reconstructed tree depends on the density of motifs (the ratio between the number of solid characters and length) and on their statistical significance. Given a composition vector $\mathbf{X}$, we denote with $[\mathbf{X}]_{d_0, d_1}$ the projection of $\mathbf{X}$ onto the subspace of patterns with density between $d_0$ and $d_1$ (inclusive). We measure the significance of seeing a pattern w occurring n times in a string x with the z-score of n, assuming that x has been generated by a Markov chain of order 1 whose transition probabilities match the empirical frequency of dimers in $\mathcal{P}$. We denote with $\langle \mathbf{X} \rangle_{z_0, z_1}$ the projection of $\mathbf{X}$ onto the subspace of patterns with z-score between $z_0$ and $z_1$ in x (inclusive).
Given a set $P_i \subset \mathcal{P}$, we denote with $\mathcal{T}_i$ the corresponding subtree of the reference $\mathcal{T}$, and with $[T_i]_{d_0, d_1}^{e,k}$ (respectively, with $[T_i]_{d_0, d_1}^{m}$, $[T_i]_{d_0, d_1}^{i}$, $[T_i]_{d_0, d_1}^{a}$, $[T_i]_{d_0, d_1}^{t}$) the tree built from the strings in $P_i$ as follows: first, we project each string $x \in P_i$ into the corresponding composition vector $[\mathbf{X}^{e,k}]_{d_0, d_1}$ (respectively, $[\mathbf{X}^{m}]_{d_0, d_1}$, $[\mathbf{X}^{i}]_{d_0, d_1}$, $[\mathbf{X}^{a}]_{d_0, d_1}$, $[\mathbf{X}^{t}]_{d_0, d_1}$); then, we build the matrix of pairwise Euclidean distances between such vectors; finally, we run neighbor joining on the resulting matrix. The trees $\langle T_i \rangle_{z_0, z_1}^{e,k}$, $\langle T_i \rangle_{z_0, z_1}^{m}$, $\langle T_i \rangle_{z_0, z_1}^{i}$, $\langle T_i \rangle_{z_0, z_1}^{a}$, $\langle T_i \rangle_{z_0, z_1}^{t}$ have similar definitions for z-scores. We are mainly interested in what happens at the two extremes of the density and z-score spectra.
To study such extremes in elementary motifs, we take 100 random samples $P_0, P_1, \ldots, P_{99}$ with replacement from $\mathcal{P}$, such that each $P_i$ contains the proteomes of 32 different organisms,¹ and we plot the functions:

\begin{align}
\overrightarrow{d_{e,k}}(\delta) &:= \frac{1}{100} \sum_{i=0}^{99} rf\left([T_i]^{e,k}_{0,\delta}, \mathcal{T}_i\right) \\
\overleftarrow{d_{e,k}}(\delta) &:= \frac{1}{100} \sum_{i=0}^{99} rf\left([T_i]^{e,k}_{\delta,1}, \mathcal{T}_i\right) \\
\overrightarrow{z_{e,k}}(z) &:= \frac{1}{100} \sum_{i=0}^{99} rf\left(\langle T_i \rangle^{e,k}_{-\infty,z}, \mathcal{T}_i\right) \\
\overleftarrow{z_{e,k}}(z) &:= \frac{1}{100} \sum_{i=0}^{99} rf\left(\langle T_i \rangle^{e,k}_{z,+\infty}, \mathcal{T}_i\right),
\end{align}

where $rf(T_0, T_1)$ is the Robinson-Foulds distance (Robinson and Foulds, 1981) (abbreviated as RF in what follows) between trees $T_0$ and $T_1$, ranging in this case from 0 to 58. For studying autocorrelations, maximal, irredundant and tiling motifs, we similarly sample $\mathcal{P}$ and define the homologous functions $\overrightarrow{d_{\alpha}}$, $\overleftarrow{d_{\alpha}}$, $\overrightarrow{z_{\alpha}}$, $\overleftarrow{z_{\alpha}}$, $\alpha \in \{m, i, a, t\}$.
In what follows, we will call $\overrightarrow{d_{\alpha}}$ and $\overrightarrow{z_{\alpha}}$ left-to-right analyses, and $\overleftarrow{d_{\alpha}}$ and $\overleftarrow{z_{\alpha}}$ right-to-left analyses.
$\overleftarrow{d_{\alpha}}$ approximates the average behavior of classification quality in $\mathcal{P}$ as progressively sparser motifs are added to an initial core of extremely dense ones. We expect that motifs under a given density threshold cease to carry phylogenetic information and start to be dominated by noise. Similarly, $\overleftarrow{z_{\alpha}}$ approximates the average behavior of classification quality in $\mathcal{P}$ as progressively less statistically significant motifs are added to an initial core of highly significant ones.
We expect the large mass of motifs with low z-score to be plesiomorphic features dominated by noise, and the few motifs with extremely high z-score to be peculiarities of each taxon that are difficult to find in other organisms. Apomorphic features should intuitively be found at “intermediate” z-scores, sufficiently high to distinguish them from random occurrences and yet low enough not to be idiosyncrasies of a given proteome.
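The density projection and the two nested families of motif sets behind these analyses can be sketched as follows. This is a minimal illustration, assuming composition vectors are stored as dictionaries from pattern to frequency and that `.` is a hypothetical wildcard symbol; the function names are invented for this sketch. Sweeping thresholds over the intervals [0, δ] starts from the sparsest motifs and adds denser ones, while sweeping over [δ, 1] starts from the densest motifs and adds sparser ones.

```python
WILDCARD = "."  # hypothetical don't-care symbol (an assumption of this sketch)


def density(pattern):
    """Ratio between the number of solid characters and the pattern length."""
    solid = sum(1 for c in pattern if c != WILDCARD)
    return solid / len(pattern)


def project_by_density(vector, d0, d1):
    """Keep only components whose pattern has density in [d0, d1], inclusive."""
    return {w: f for w, f in vector.items() if d0 <= density(w) <= d1}


def sweep_from_sparse(vector, thresholds):
    """Projections onto [0, delta] for increasing delta: each step adds denser motifs."""
    return [project_by_density(vector, 0.0, delta) for delta in sorted(thresholds)]


def sweep_from_dense(vector, thresholds):
    """Projections onto [delta, 1] for decreasing delta: each step adds sparser motifs."""
    return [
        project_by_density(vector, delta, 1.0)
        for delta in sorted(thresholds, reverse=True)
    ]
```

In the experiments described here, each projected set of vectors would then be turned into a distance matrix and a tree, and the RF distance to the reference subtree averaged over the samples; those steps are omitted from the sketch.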

The purpose of this work is not to achieve the best classification in a specific dataset, but rather to study the shape of $\overrightarrow{d_{\alpha}}$, $\overleftarrow{d_{\alpha}}$, $\overrightarrow{z_{\alpha}}$, and $\overleftarrow{z_{\alpha}}$ on a large number of samples. This is why we have selected the components of our pipeline to maximize speed. For example, unlike state-of-the-art alignment-free algorithms, we do not store z-scores in composition vectors: this makes the computation of the distance between two strings x and y extremely fast, because it allows us to discard motifs that occur in neither x nor y, and to set to zero all components of $\mathbf{X}$ that correspond to motifs that do not occur in x, a crucial advantage when composition vectors are indexed by all possible rigid gapped patterns. Even approximating the z-score of seeing no occurrence in y of a motif that occurs in x makes our experiments unbearably slow. Removing z-scores from composition vectors has the additional advantage of avoiding the comparison among the z-scores of different motifs that would be implicit in the resulting distance: this comparison is unreliable in cases, like ours, in which the z-scores of motifs do not follow a normal distribution (Sinha and Tompa, 2002).
Finally, the size of the alphabet, the number of motifs, their length and their sparsity, make considering Markov chains of order greater than one impractical.
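The speed argument above can be illustrated with a sparse distance computation. The sketch below assumes composition vectors are stored as dictionaries that hold only nonzero components (a representation chosen for this illustration, not described in the text), so that motifs occurring in neither string are never visited.

```python
import math


def euclidean_distance(x_vec, y_vec):
    """Euclidean distance between two sparse vectors stored as dicts.
    Only patterns present in at least one vector contribute; a pattern
    missing from a vector counts as a zero component."""
    keys = x_vec.keys() | y_vec.keys()
    return math.sqrt(
        sum((x_vec.get(w, 0.0) - y_vec.get(w, 0.0)) ** 2 for w in keys)
    )


def distance_matrix(vectors):
    """Symmetric matrix of pairwise distances, ready for a tree-building method."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = euclidean_distance(vectors[i], vectors[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

The cost of each distance is proportional to the number of motifs that actually occur in the two strings, rather than to the number of all possible rigid gapped patterns.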

We use the publicly available version of TEIRESIAS (Rigoutsos and Floratos, 1998) to extract dense elementary and maximal motifs, and we build fast implementations of the algorithms in Apostolico and Parida (2004); Pisanti et al. (2003b); and Sinha and Tompa (2000) to extract irredundant and tiling motifs, and to compute the z-score of rigid gapped patterns. We feed our distance matrices to the PHYLIP package (Felsenstein, 2005) for building neighbor-joining trees and for computing RF distances.
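The pipeline relies on an existing package for tree comparison, but what the RF distance measures can be illustrated with a small, self-contained sketch. It assumes trees are given as nested tuples of leaf names (a representation chosen only for this example) and treats them as unrooted: the distance is the number of non-trivial leaf bipartitions present in exactly one of the two trees.

```python
def leaves(tree):
    """Set of leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of the leaf set induced by internal edges.
    Each bipartition is stored as the side that does not contain a fixed
    anchor leaf, so the same split always gets the same representation."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        if 2 <= len(clade) <= len(all_leaves) - 2:
            side = clade if anchor not in clade else all_leaves - clade
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(t0, t1):
    """Robinson-Foulds distance: size of the symmetric difference of the split sets."""
    return len(splits(t0) ^ splits(t1))
```

A binary unrooted tree on n leaves has n − 3 non-trivial splits, so the distance between two such trees is at most 2(n − 3); for samples of 32 organisms this gives 58, the upper end of the range quoted above.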

4. Results

4.1. Classifying with elementary motifs

The set of elementary motifs with k solid characters and length 20 contains as much phylogenetic information as its supersets for any value of k: indeed, $\overrightarrow{d_{ek}}$ is approximately flat for any k (Fig. 1a). As density increases, the number of motifs per density decreases like a polynomial of low order, thus denser motifs do encode phylogenetic information themselves. As in standard k-mers, the number of solid characters is the main force behind classification quality. In elementary motifs, however, the smallest RF distance is achieved by $\overrightarrow{d_{e3}}$ and $\overrightarrow{d_{e4}}$, while increasing k to 5 or larger degrades classification.
Allowing elementary motifs with 3 solid characters to span up to 50 positions continues to show a flat $\overrightarrow{d_{e3}}$ curve (Fig. 1a, insert), implying that elementary motifs with exactly 3 solid characters and length 50 are not dominated by noise, but rather encode as much phylogenetic information as denser ones.

FIG. 1.
The classification quality of elementary motifs: $\overrightarrow{d_{ek}}$ (a) and $\overleftarrow{d_{ek}}$ (b) for $k \in \{2, \ldots, 8\}$. The insert in panel (a) shows $\overrightarrow{d_{e3}}$ at densities between 0 and 0.2 and at RF distances between 0 and 58. (d, e) The median, the 25th and 75th percentiles, and the maximum and minimum of the RF distances of all samples for k = 4. To avoid clutter, the horizontal axis is distorted so that densities are equally spaced.
(c) $\overrightarrow{z_{e4}}$ and $\overleftarrow{z_{e4}}$. Different values of k yield similar curves. The gray area indicates the approximate positions of the 5% and 95% values of the cumulative distribution of motifs, averaged over all strings in the dataset. The insert zooms the containing panel at z-scores between −2 and 10 and at distances 30 and larger.

The right-to-left analysis confirms that the sparsest motifs at all values of k carry phylogenetic signal: adding them to denser motifs makes tree topology converge, does not degrade classification quality at k < 5, and even reduces RF distance at k ≥ 5 (Fig. 1b). Dense motifs, on the other hand, belong to two different categories. Those with k ≥ 5 contain little phylogenetic information: classification quality is poor (or even null for k ≥ 7) when such very dense motifs are considered, and it gradually improves when progressively sparser motifs are added. Since the composition of standard ungapped k-mers with k ≥ 5 yields good classifications on the same dataset (Fig. 6a), these trends suggest that the performance of k-mer methods crucially depends on words that occur just once, or that do not occur, in mitochondrial proteomes. Elementary motifs with k < 5 and length k + 1, on the other hand, achieve the global minimum of the corresponding $\overleftarrow{d_{ek}}$, which remains constant when progressively sparser motifs are added. Once again, the smallest RF distance is achieved by $\overleftarrow{d_{e4}}$.
Finally, we note that for k > 2 most motifs occur just two or three times in each string, so our distance measure between the composition vectors of two strings becomes effectively the Jaccard distance between the corresponding sets of elementary motifs.
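The link with the Jaccard distance can be made explicit under an idealization that is introduced here only for illustration: suppose every nonzero component of the composition vectors of x and y takes the same value c, and let A and B be the sets of motifs that occur in x and in y, respectively. In practice the components are only roughly equal, since counts of two or three and different pattern lengths give slightly different normalized frequencies.

```latex
\begin{align}
\lVert \mathbf{X} - \mathbf{Y} \rVert_2^2
  &= \sum_{w \in A \setminus B} c^2 + \sum_{w \in B \setminus A} c^2
   = c^2 \, \lvert A \,\triangle\, B \rvert , \\
J(A, B)
  &= 1 - \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}
   = \frac{\lvert A \,\triangle\, B \rvert}{\lvert A \cup B \rvert} .
\end{align}
```

Motifs shared by both strings cancel in the first line, so under this idealization both quantities are driven by the size of the symmetric difference between the two motif sets, and they differ only in how that size is normalized.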

Elementary motifs preferentially amass at the low end of the z-score spectrum: for example, approximately 95% of all elementary motifs with k = 4 have z-score at most 6. Counterintuitively, motifs with low z-score carry a strong phylogenetic signal for each k: $\overleftarrow{z_{ek}}$ decreases or remains constant when such motifs are included, reaching its global minimum around 0; $\overrightarrow{z_{ek}}$ decreases or remains constant when motifs with progressively higher z-score are added, until it reaches a global minimum when the bulk of all elementary motifs with low z-score have been included.
Figure 1c details $\overrightarrow{z_{e4}}$ and $\overleftarrow{z_{e4}}$; curves for different values of k follow similar trends.

4.2. Classifying with maximal motifs

The distribution of maximal motifs with respect to density is quantized (see Fig. 4a below): in particular, the densities d for which at least one motif exists follow a 1/d trend. This is due to the fact that most maximal motifs have between 3 and 5 solid characters and variable lengths. Significantly, the distribution of the number of motifs on density is based on a single module that is iteratively repeated and scaled (see Fig. 4b below). Preliminary experiments show that this regular shape persists when proteomes are reshuffled, implying that it is a property of the density of characters rather than a regularity in mitochondrial sequences. Previous works have studied such a distribution in different datasets (Rigoutsos et al., 1999a,b), but none has considered densities smaller than 0.4 and sufficiently small bins to detect a quantization.

Maximal motifs are inherently infrequent and sparse: approximately 80% of all maximal motifs occur 3 or 4 times, and approximately 80% of all maximal motifs have density smaller than 0.1 (see Fig. 4 below). Our distance measure between the composition vectors of two strings thus becomes effectively the Jaccard distance between the corresponding sets of motifs. At the high end of the density spectrum, maximal motifs cluster mainly around a small, discrete set of densities: 0.6, approximately 0.67, 0.75 and 0.8. $\overleftarrow{d_{m}}$ sharply decreases when motifs at these densities are progressively added, reaching a value that is just 2 units larger than the global minimum as soon as density 0.75 is reached (Fig. 2b, insert). At this point, just 0.05% of all maximal motifs have been included. Using only motifs with density 0.8 is not sufficient to achieve the same RF distance, and using even denser motifs yields poor classifications. Counterintuitively, adding sparser motifs keeps $\overleftarrow{d_{m}}$ constant or slowly decreasing, and makes tree topology converge: the global minimum is reached at density approximately 0.135, when approximately 9% of all maximal motifs have been included (Fig. 2b).
Adding the remaining 91% of motifs with even lower density causes only minor oscillations in $\overleftarrow{d_{m}}$, indicating that such motifs too are rich in phylogenetic signal, and that the taxonomy encoded by the composition of such sparse motifs agrees with the taxonomy encoded by the composition of denser ones. The high signal-to-noise ratio of extremely sparse motifs is confirmed by the left-to-right analysis: $\overrightarrow{d_{m}}$ uniformly decreases when progressively denser motifs are added, until it plateaus around density 0.15, when approximately 93% of all maximal motifs have been included (Fig. 2a).
The minima of $\overrightarrow{d_{m}}$ and $\overleftarrow{d_{m}}$ differ by just 1.3 units, suggesting that very sparse and very dense motifs tell similar phylogenetic stories, despite being such different structures: indeed, one set has average length 70 and average density 0.067, while the other has average length 5 and average density 0.8.

FIG. 2.
The classification quality of maximal motifs. $\overrightarrow{d_m}$ (a) and $\overleftarrow{d_m}$ (b) are represented with thick lines. The plots also show the first and third quartiles (gray areas) and the minimum and maximum (dashed lines) of all samples taken. The insert in panel (b) shows $\overleftarrow{d_m}$ at densities 0.65 and larger and at RF distances 30 and larger. In the insert, horizontal grid lines occur every 5 units, and vertical grid lines occur every 0.05 units.
(d), (e) $\overrightarrow{d_{mk}}$ and $\overleftarrow{d_{mk}}$, respectively, for $k \in \{2, \ldots, 7\}$. The insert in panel (e) shows $\overleftarrow{d_{m3}}$ at densities 0.6 and larger and at RF distances 30 and larger. In the insert, horizontal grid lines occur every 5 units, and vertical grid lines occur every 0.05 units.
(c) $\overrightarrow{z_m}$ and $\overleftarrow{z_m}$. The gray area indicates the approximate position of the 25% and 75% values of the cumulative distribution of maximal motifs on z-score, averaged over all strings in the dataset. The insert zooms panel (c) at z-scores 30 and smaller and at RF distances between 34 and 44.

FIG. 3.
Classification quality of autocorrelations (A), tiling motifs (T), irredundant motifs (I), and irredundant motifs with exactly 3 solid characters (I3) as a function of density: left-to-right (a) and right-to-left (b) analysis. The insert in panel (b) zooms densities [0,0.2] and RF distances [35,53]. (c) $\overrightarrow{d_{ak}}$ and $\overleftarrow{d_{ak}}$ for $k \in \{4, 5, 6, 7\}$. Arrows show the direction of increasing k. Curves for k = 3 are not shown because they are similar to the corresponding curves for k = 5.
(d) $\overrightarrow{d_{ik}}$ and $\overleftarrow{d_{ik}}$ for $k \in \{3, 4, 5, 6\}$. Curves for k = 2 are not shown because they are similar to the corresponding curves for k = 4.

Motivated by the strong dependence between classification quality and number of solid characters k in elementary motifs, we perform the same analysis on maximal motifs. As above, let $\overrightarrow{d_{mk}}$ and $\overleftarrow{d_{mk}}$ be the curves of the left-to-right and of the right-to-left analysis of maximal motifs with exactly k solid characters. Again, k is a key factor in classification quality: quality improves going from k = 2 to k = 3 and 4, and degenerates for k ≥ 5. $\overrightarrow{d_{m3}}$ and $\overrightarrow{d_{m4}}$ are approximately equal (Fig. 2d), while $\overleftarrow{d_{m3}}$ is consistently the lowest in the right-to-left analysis (Fig. 2e). Remarkably, the minima of $\overrightarrow{d_{m3}}$ and $\overleftarrow{d_{m3}}$ are approximately equal to those of $\overrightarrow{d_m}$ and $\overleftarrow{d_m}$, respectively, even though maximal motifs with k = 3 are just 34% of all maximal motifs. Of all maximal motifs with exactly 3 solid characters, the 79% with density at most 0.1 and the 50% with density at least 0.065 are sufficient to achieve the corresponding minima. In particular, using only motifs with k = 3 and density 0.6 or larger (approximately 4.5% of all maximal motifs with k = 3) leads to a classification quality that is just one unit larger than the global minimum of $\overleftarrow{d_{m3}}$ (Fig. 2e, insert).
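The selections by number of solid characters and by density described above can be illustrated with a minimal sketch. It assumes, for illustration only, that a rigid gapped motif is written as a string in which `.` marks a don't-care position, and that density is the number of solid characters divided by the motif length; the function names are hypothetical and are not taken from the tools used in the experiments.

```python
WILDCARD = "."  # assumed don't-care symbol; the notation is an illustration choice


def solid_count(motif):
    """Number of solid (non-wildcard) characters k in a rigid gapped motif."""
    return sum(1 for symbol in motif if symbol != WILDCARD)


def density(motif):
    """Assumed definition: solid characters divided by total motif length."""
    return solid_count(motif) / len(motif)


def select(motifs, k=None, min_density=0.0, max_density=1.0):
    """Keep motifs with exactly k solid characters (if k is given)
    whose density lies in the closed range [min_density, max_density]."""
    return [
        m for m in motifs
        if (k is None or solid_count(m) == k)
        and min_density <= density(m) <= max_density
    ]


# Hypothetical motifs, not taken from the dataset.
example = ["A.C.G", "AC", "A....C....G", "AC.G"]
dense_k3 = select(example, k=3, min_density=0.5)
sparse_k3 = select(example, k=3, max_density=0.3)
```

With such a filter, a set like "k = 3 and density 0.6 or larger" corresponds to one call with `k=3` and `min_density=0.6`.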

The distribution of maximal motifs on z-score is concentrated between scores approximately 0 and 25; unlike elementary motifs, it has a long decreasing tail at high z-score: approximately 10% of all maximal motifs in a proteome have z-score equal to 200 or larger. The right-to-left analysis shows that motifs with z-score equal to 100 or larger contain limited phylogenetic signal, as they need to be complemented by motifs with lower z-score to reach (at z-score one) the global minimum of $\overleftarrow{z_m}$, which is approximately 2.6 units larger than the global minimum of $\overleftarrow{d_m}$ (Fig. 2c). In the left-to-right analysis, the bulk of motifs with z-score equal to 15 or lower contains sufficient phylogenetic signal to achieve the minimum of $\overrightarrow{z_m}$.

4.3. Classifying with motif bases

Consistent with previous studies (Gallé, 2011), autocorrelations, tiling motifs, and irredundant motifs are sparse, long, and infrequent: 90% or more of these motifs have density 0.2 or smaller, and approximately 50% of all autocorrelations, 70% of all tiling motifs, and 40% of all irredundant motifs have length 100 or larger, compared to just 10% of all maximal motifs. Moreover, approximately 67% of all irredundant motifs, 90% of all autocorrelations, and 99% of all tiling motifs occur 2 times, compared to just 4% of all maximal motifs (see Fig. 4 below). While the distribution of maximal motifs on length is unimodal, the distributions of autocorrelations and of irredundant and tiling motifs are multimodal, with peaks up to length 300 (see Fig. 4c below): these shapes persist when proteomes are reshuffled, implying that they are not attributable to some regularity in the sequence. The distribution of irredundant motifs and autocorrelations on density is very similar to the distribution of maximal motifs: densities are again quantized, and the overall shape is based on a single module that is repeated and scaled.
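To make the notion of support used in these statistics concrete, the following sketch counts the occurrences of a rigid gapped motif in a string. It is a naive scan, not the extraction algorithm used in the experiments; the `.` wildcard notation, the function names, and the choice to count overlapping occurrences are assumptions made for illustration.

```python
def occurrences(motif, text, wildcard="."):
    """Start positions at which the motif matches the text.
    A wildcard position matches any character; solid positions must match exactly."""
    hits = []
    for start in range(len(text) - len(motif) + 1):
        window = text[start:start + len(motif)]
        if all(m == wildcard or m == c for m, c in zip(motif, window)):
            hits.append(start)
    return hits


def support(motif, text, wildcard="."):
    """Number of occurrences of the motif in the text (overlaps are counted)."""
    return len(occurrences(motif, text, wildcard))


# Hypothetical toy string, not a proteome from the dataset.
toy_text = "ABCAXCAC"
toy_hits = occurrences("A.C", toy_text)
```

A motif such as `A.C` that matches the toy string at two positions has support 2, the value shared by most autocorrelations and tiling motifs according to the figures quoted above.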

FIG. 4.
Density, length, number of solid characters, and support in maximal motifs and their bases. (a) Number of maximal motifs with density d in string 1. The same shape recurs in all strings of the dataset. (b) Detail of panel (a): the distribution of maximal motifs on density is based on a single unit that is repeated and scaled. (c,d,e) Average number of maximal, irredundant, tiling motifs and autocorrelations with length l (c), with k solid characters (d), and with support s (e) in the dataset.

As for maximal motifs, $\overrightarrow{d_a}$, $\overrightarrow{d_t}$, and $\overrightarrow{d_i}$ uniformly decrease when progressively denser motifs are added, and they finally plateau at a global minimum at density approximately 0.4 for irredundant motifs and autocorrelations, and approximately 0.115 for tiling motifs² (Fig. 3a). This trend indicates that extremely sparse motifs do carry phylogenetic signal. Rather than remaining constant as $\overleftarrow{d_m}$ does, the curves $\overleftarrow{d_a}$, $\overleftarrow{d_t}$, and $\overleftarrow{d_i}$ decrease when progressively sparser motifs are added, until they reach corresponding global minima at density approximately 0.115 for autocorrelations and irredundant motifs, and approximately 0.15 for tiling motifs³ (Fig. 3b). Significantly, such minima are smaller than the values of the functions at 0, indicating that phylogenetic signal is differentially distributed along the density spectrum. Indeed, the right-to-left analysis highlights a specific band of densities as more affected by noise: $\overleftarrow{d_a}$, $\overleftarrow{d_t}$, and $\overleftarrow{d_i}$ sharply increase when tiling motifs with density between 0.1 and 0.15 (approximately 28% of all tiling motifs), and autocorrelations and irredundant motifs with density between 0.085 and 0.115 (approximately 30% of the respective totals) are added (Fig. 3b, insert). Including motifs with even lower density brings all curves back to values close to their global minima.
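The two sweeps discussed in this paragraph can be sketched as follows. The sketch assumes that the left-to-right analysis evaluates, for increasing thresholds, the motifs with density at most the threshold, and that the right-to-left analysis evaluates, for decreasing thresholds, the motifs with density at least the threshold. The `evaluate` callback is a hypothetical stand-in for the whole pipeline that builds a tree from a set of motifs and measures its distance from the reference taxonomy.

```python
def sweep(scored_motifs, thresholds, direction, evaluate):
    """Return a curve [(threshold, evaluate(subset)), ...].

    scored_motifs: list of (motif, score) pairs, e.g. score = density.
    direction: "left_to_right" keeps score <= t for increasing t
               (progressively denser motifs are added);
               "right_to_left" keeps score >= t for decreasing t
               (progressively sparser motifs are added).
    evaluate: hypothetical callback mapping a list of motifs to a quality value.
    """
    if direction == "left_to_right":
        ordered = sorted(thresholds)
        keep = lambda score, t: score <= t
    elif direction == "right_to_left":
        ordered = sorted(thresholds, reverse=True)
        keep = lambda score, t: score >= t
    else:
        raise ValueError("direction must be 'left_to_right' or 'right_to_left'")
    curve = []
    for t in ordered:
        subset = [motif for motif, score in scored_motifs if keep(score, t)]
        curve.append((t, evaluate(subset)))
    return curve


# Toy input; len() stands in for the tree-building and scoring step.
toy = [("A....C....G", 3 / 11), ("A.C", 1.0), ("A.........C", 2 / 11)]
forward = sweep(toy, [0.2, 0.5, 1.0], "left_to_right", len)
backward = sweep(toy, [0.2, 0.5, 1.0], "right_to_left", len)
```

The same function applies to z-score sweeps if the second element of each pair holds a z-score instead of a density.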

Despite having similar trends, the curves of autocorrelations, tiling and irredundant motifs differ considerably in absolute value. Tiling motifs display the worst performance: the distance computed using all tiling motifs is approximately 8.7 units larger than both the distance computed using all autocorrelations and the distance computed using all irredundant motifs, while the latter two differ by approximately 1.5 units from each other. The curve of tiling motifs is consistently higher than the curve of autocorrelations at any density: $\overleftarrow{d_t}$ is at least 8 units larger than $\overleftarrow{d_a}$ in approximately 48% of the sampled densities, and $\overrightarrow{d_t}$ is at least 8 units larger than $\overrightarrow{d_a}$ in approximately 39% of the sampled densities. This indicates that the 90% redundant autocorrelations that are discarded during the construction of the tiling basis contain a strong phylogenetic signal. Another indication that redundancy is important in classification comes from irredundant motifs, a superset of the tiling basis that is approximately 15 times larger. Irredundant motifs display the best performance among the sets of motifs considered in this section: $\overleftarrow{d_i}$ is approximately 2 units smaller than $\overleftarrow{d_a}$ at its minimum, and the difference reaches peaks of 9 at higher densities.

Together with density and redundancy, the number of solid characters again has a strong influence on classification quality. As above, let $\overrightarrow{d_{ak}}$ and $\overleftarrow{d_{ak}}$ be the curves of the left-to-right and right-to-left analysis of autocorrelations with exactly k solid characters, and assume that similar symbols are defined for tiling and irredundant motifs. In autocorrelations, distance decreases going from k = 3 to 4, then it monotonically increases for k ≥ 5, both in the left-to-right and in the right-to-left analysis (Fig. 3c,d). A similar trend characterizes irredundant motifs, in which distance decreases going from k = 2 to 3, then monotonically increases for k ≥ 4. Contrary to $\overleftarrow{d_a}$ and $\overleftarrow{d_i}$, no $\overleftarrow{d_{ak}}$ or $\overleftarrow{d_{ik}}$ increases when density decreases; thus the oscillations of $\overleftarrow{d_a}$ and $\overleftarrow{d_i}$ at low density are attributable to local changes in the abundance of motifs with different k.

Remarkably, using only irredundant motifs with exactly 3 solid characters (approximately 3% of the total) improves classification over using the whole set of irredundant motifs (Fig. 3a): the minimum of $\overrightarrow{d_{i3}}$ is 5.4 units smaller than the minimum of $\overrightarrow{d_i}$, and it is reached using approximately the sparsest 80% of all irredundant motifs with 3 solid characters; the minimum of $\overleftarrow{d_{i3}}$ is 2.2 units smaller than the minimum of $\overleftarrow{d_i}$, and it is reached using approximately the densest 78% of all irredundant motifs with 3 solid characters. $\overrightarrow{d_{i3}}$ improves by approximately 5 units over $\overrightarrow{d_i}$ at densities 0.4 and larger, and the difference reaches peaks of 12 at smaller densities. $\overleftarrow{d_{i3}}$ improves by approximately 4 units over $\overleftarrow{d_i}$ at densities 0.2 or smaller, with peaks of 10 around density 0.08. We stress that the minima of both $\overrightarrow{d_{i3}}$ and $\overleftarrow{d_{i3}}$ are achieved by long, sparse, and infrequent motifs: the minimum of $\overrightarrow{d_{i3}}$ corresponds to a set of motifs with average density 0.09, average length 132, and average support 2.8; the minimum of $\overleftarrow{d_{i3}}$ is achieved by motifs with average density 0.14, average length 98, and average support 3.2. Such minima are equal, suggesting that motifs at the two ends of the density spectrum support similar phylogenies.

No value of k has a comparably distinguished role in autocorrelations. For example, using only autocorrelations with k = 4 (approximately 6% of the total) improves by just 1.7 units over $\overrightarrow{d_a}$, and larger values of k degrade classification both in the left-to-right and in the right-to-left analysis.

5. Discussion

Rigid gapped motifs in polypeptides have traditionally been associated with signatures that group proteins into families with homologous function or structure. In this article, we have shown that the composition of gapped motifs can be used to construct phylogenies from mitochondrial proteomes. Phylogenies with comparable distance to a reference taxonomy can be built using either extremely dense or extremely sparse motifs. For example, elementary motifs with exactly k solid characters and length 20 yield phylogenies of the same or better quality than those produced by elementary motifs with k solid characters and length k + 1, and maximal motifs with density less than approximately 0.15 yield phylogenies of the same quality as those produced by maximal motifs with density 0.75 or larger. Using even lower densities degrades classification in maximal motifs and their bases, but surprisingly keeps groups of related organisms together (Fig. 5). The length and sparsity of such motifs resonate in interesting ways with long-range correlations of various kinds that are known to have a key role in proteins (Weiss and Herzel, 1998): studying the structure of such sparse motifs and their occurrences in the protein space, as well as extending the alphabet of motifs to allow groups of homologous amino acids, would thus be natural extensions of this work.

FIG. 5.
The composition of extremely sparse motifs carries a strong phylogenetic signal. (a) Reference tree from NCBI. See Ulitsky et al. (2006) and references therein for other algorithms applied to the reconstruction of this tree. (b) Tree built using elementary motifs with 3 solid characters and length 50. Average number of motifs per proteome: 30135. (c) Tree built using maximal motifs with 3 solid characters and density at most 0.0308. Average length: 98. Average number of motifs per proteome: 254. (d) Tree built using irredundant motifs with 3 solid characters and density at most 0.031. Average length: 118. Average number of motifs per proteome: 98.

In tiling motifs, irredundant motifs and autocorrelations, extremely dense motifs, as well as sparse motifs in a specific density range, contain comparatively little phylogenetic signal. Contrary to what has been observed in the remote homology detection of proteins (Comin and Verzotto, 2011), redundancy seems to be a key factor for the efficient reconstruction of phylogenies: classification quality improves when moving from the smallest tiling basis to its supersets, autocorrelations and irredundant motifs. Our analysis also highlights a third force behind classification quality: the number of solid characters. Contrary to the convergence seen when increasing the length of k-mers, classification with gapped motifs reaches its best at k = 3 or 4, and degenerates for larger k. In particular, considering only motifs with exactly 3 solid characters is sufficient (and sometimes even necessary) to achieve the best classification quality in elementary, maximal and irredundant motifs.

Another point in which our analyses differ from traditional k-mer approaches is the role of statistical correction. Downplaying k-mers with low statistical significance has been reported to be essential for achieving good classifications (Chu et al., 2004; Qi et al., 2004); our experiments, on the other hand, show that gapped motifs with z-score close to zero carry a strong phylogenetic signal, and classification quality degrades when such motifs are discarded.

Figure 6a,b summarizes the sets of motifs that achieve the best average classification quality in our experiments. Such sets are extremely fast to compute in practice, and turn out to be largely disjoint from PROSITE. Remarkably, the average classification quality of such sets is comparable to state-of-the-art methods based on substrings, even though all our motifs repeat at least two times in the input strings, and even though we use a simplistic setup based on Euclidean distance and raw frequencies (which in practice reduces to the Jaccard distance). This motivates further applications of gapped motifs to alignment-free sequence comparison, as well as a systematic search for the subsets of motifs that yield the best classification. The fact that substring-based and motif-based methods never push the average distance from the NCBI taxonomy below 30 also suggests the existence of a practical upper bound to the performance of alignment-free algorithms, which would be interesting to study more extensively.
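As an illustration of the two distance notions mentioned in this paragraph, the sketch below computes a Euclidean distance between raw motif counts and a Jaccard distance between motif repertoires. It does not reproduce the exact setup of the experiments: the dictionary representation, the function names, and the toy counts are assumptions, and no normalization is applied.

```python
from collections import Counter
from math import sqrt


def euclidean_distance(freq_a, freq_b):
    """Euclidean distance between two raw frequency vectors,
    indexed by the union of the motifs seen in either proteome."""
    motifs = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(m, 0) - freq_b.get(m, 0)) ** 2 for m in motifs))


def jaccard_distance(set_a, set_b):
    """One minus the ratio of shared motifs to all motifs; 0.0 for two empty sets."""
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Hypothetical motif counts for two proteomes.
proteome_a = Counter({"A.C": 2, "G..T": 1})
proteome_b = Counter({"A.C": 2, "T.T": 3})
d_euclid = euclidean_distance(proteome_a, proteome_b)
d_jaccard = jaccard_distance(set(proteome_a), set(proteome_b))
```

A matrix of such pairwise distances over all proteomes would then be handed to a tree-building method; that step is outside the scope of this sketch.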

FIG. 6.
Average classification quality (a) and average size (b) of the sets of motifs that performed best in our experiments. (a) shows median, 25th and 75th percentiles, and minimum and maximum of rf distance over 100 samples of size 32 from $\mathcal{P}$. (c) Distance between the tree produced by a set of motifs and the tree produced by another set of motifs, averaged over 100 random samples from our dataset. e4l: elementary, k = 4, length 20. e4r: elementary, k = 4, length 5. e3l: elementary, k = 3, length 20. e3r: elementary, k = 3, length 4. m3l: maximal, k = 3, density ≤ 0.1. m3r: maximal, k = 3, density ≥ 0.065. m3rr: maximal, k = 3, density ≥ 0.6. i3l: irredundant, k = 3, density ≤ 0.15. i3r: irredundant, k = 3, density ≥ 0.075; ncbi: ncbi taxonomy. For reference, we include a small sampler of string-based alignment-free algorithms. cvk: composition vectors using k-mers. ncd: Normalized Compression Distance with gzip −9. acs: Average Common Substring.

Although they all lie at comparable distances from the ncbi taxonomy, the sets of gapped motifs shown in Figure 6a do not all tell the same phylogenetic story. As a first, qualitative glimpse into the problem of which motifs support which phylogeny, we computed the matrix of pairwise rf distances between trees produced by the best-performing sets of gapped motifs and by a small sampler of substring-based alignment-free methods: cvtree with lengths ranging from 3 to 7 (Xu and Hao, 2009), the Normalized Compression Distance using gzip (Cilibrasi and Vitanyi, 2005), and the Average Common Substring (Ulitsky et al., 2006). Distances were averaged over 100 random samples from our dataset (Fig. 6c). The matrix shows at least two clusters: the first consisting of composition vectors with length greater than 3, ncd and acs, the second consisting of elementary and maximal motifs with 3 solid characters. This suggests that phylogenies produced by elementary and maximal motifs are more similar to each other than to phylogenies built from substrings. Interestingly, composition vectors with length 3 tend to be more similar to the cluster of gapped motifs than to the cluster of substrings, while elementary motifs with 4 solid characters and length 20 tend to be more similar to the cluster of substrings than to the cluster of gapped motifs. Irredundant motifs seem to form a third cluster on their own, and acs seems to be systematically different from all other substring-based methods. We leave a more detailed study of this topic to future research.
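For readers unfamiliar with the rf distance (Robinson and Foulds, 1981), the following minimal Python sketch computes it for two unrooted trees on the same leaf set, as the number of nontrivial bipartitions found in exactly one of the two trees. It is an illustration only and not the implementation used in our experiments. The nested-tuple tree encoding and all function names are assumptions made for the example.

```python
def leaves(tree):
    """Leaf labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Nontrivial bipartitions induced by the internal edges of the tree."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        if len(below) > 1 and len(other) > 1:
            # Store one canonical side so that a split has a single encoding.
            found.add(min(below, other, key=lambda s: (len(s), sorted(s))))
        return below

    visit(tree)
    return found


def rf_distance(t1, t2):
    """Size of the symmetric difference between the two sets of splits."""
    if leaves(t1) != leaves(t2):
        raise ValueError("trees must have the same leaf set")
    return len(splits(t1) ^ splits(t2))
```

For example, the trees `((("a", "b"), "c"), ("d", "e"))` and `((("a", "c"), "b"), ("d", "e"))` share the split that separates `d` and `e` from the rest but differ in one split each, so their distance under this sketch is 2. Averaging such values over the sampled datasets gives one entry of a matrix like the one in Figure 6c.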

The fact that gapped motifs carry phylogenetic signal could be a peculiarity of proteomes (long regions without solid characters could represent loops where mutations are more likely; e.g., see Califano, 2000; Hart et al., 2000) or even just of mitochondrial proteomes. It is natural to envision experiments that apply gapped motifs to the reconstruction of phylogenies from the genic and intergenic dna of longer genomes. Scaling to genomes would rule out the possibility of using the composition of all elementary and maximal motifs, and would shift the focus to autocorrelations, tiling motifs, irredundant motifs, and motifs with a controlled number of solid characters. Experiments with flexible gaps (Apostolico et al., 2010a; Jonassen et al., 1995) would also come naturally: statistically significant maximal flexible motifs have already been shown to identify biologically significant patterns in prosite families (Apostolico et al., 2005).

Footnotes

Acknowledgments

F.C. thanks Matthias Gallé for helpful discussion and for pointing out the distributional studies on tiling motifs in Gallé (2011), which inspired part of this article, and David Burstein for providing an implementation of the Average Common Substring algorithm.

Disclosure Statement

No competing financial interests exist.

1. A sample size of 32 is a good balance between computation time and realistic input size.

2. Including approximately 98% of all motifs in each basis.

3. Including approximately 29% of all autocorrelations, 3% of all tiling motifs, and 36% of all irredundant motifs.

References

1. Amelio, Apostolico, Rombo. 2011. Image compression by 2D motif basis. Proc. Data Compression Conf. 2011, 153–162.

2. Apostolico. 2010. Maximal words in sequence comparisons based on subword composition. Lect. Notes Comput. Sci., 6060:34–44.

3. Apostolico, Parida. 2003. Compression and the wheel of fortune. Proc. Data Compression Conf. 2003, 143–152.

4. Apostolico, Parida. 2004. Incremental paradigms of motif discovery. J. Comput. Biol., 11:15–25.

5. Apostolico, Tagliacollo. 2007. Optimal offline extraction of irredundant motif bases. Lect. Notes Comput. Sci., 4598:360–371.

6. Apostolico, Comin, Parida. 2005. Conservative extraction of over-represented extensible motifs. Bioinformatics, 21:i9–i18.

7. Apostolico, Comin, Parida. 2006. Mining, compressing and classifying with extensible motifs. Algorithms Mol. Biol., 1:4.

8. Apostolico, Comin, Parida. 2010a. VARUN: discovering extensible motifs under saturation constraints. IEEE/ACM Trans. Comput. Biol. Bioinform., 7:752–762.

9. Apostolico, Denas, Dress. 2010b. Efficient tools for comparative substring analysis. J. Biotechnol., 149:120–126.

10. Atteson. 1998. Calculating the exact probability of language-like patterns in biomolecular sequences. Proc. ISMB-98, 17–24.

11. Ben-Hur, Brutlag. 2003. Remote homology detection: a motif based approach. Bioinformatics, 19:i26–i33.

12. Califano. 2000. SPLASH: structural pattern localization analysis by sequential histograms. Bioinformatics, 16:341–357.

13. Chu K.H., Qi, Yu Z.-G., et al. 2004. Origin and phylogeny of chloroplasts revealed by a simple correlation analysis of complete genomes. Mol. Biol. Evol., 21:200–206.

14. Cilibrasi, Vitanyi. 2005. Clustering by compression. IEEE Trans. Inform. Theor., 51:1523–1545.

15. Comin, Parida. 2008. Detection of subtle variations as consensus motifs. Theor. Comput. Sci., 395:158–170.

16. Comin, Verzotto. 2010. Classification of protein sequences by means of irredundant patterns. BMC Bioinform., 11(Suppl. 1):S16.

17. Comin, Verzotto. 2011. The irredundant class method for remote homology detection of protein sequences. J. Comput. Biol., 18:1–11.

18. Darzentas, Rigoutsos. 2005. Sensitive detection of sequence similarity using combinatorial pattern discovery: a challenging study of two distantly related protein families. Proteins, 61:926–937.

19. Dong, Wang, Lin. 2006. Application of latent semantic analysis to protein remote homology detection. Bioinformatics, 22:285–290.

20. Edwards, Davey, Shields. 2007. SLiMFinder: a probabilistic method for identifying over-represented, convergently evolved, short linear motifs in proteins. PLoS ONE, 2:e967.

21. Felsenstein. 2005. PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle.

22. Gallé. 2011. Searching for compact hierarchical structures in DNA by means of the smallest grammar problem. Ph.D. dissertation. Université de Rennes 1, France.

23. Gould. 2010. ELM: the status of the 2010 eukaryotic linear motif resource. Nucleic Acids Res., 38:1–14.

24. Grossi, Pietracaprina, Pisanti, et al. 2011. MADMX—a strategy for maximal dense motif extraction. J. Comput. Biol., 18:535–545.

25. Han, Baker. 1995. Recurring local sequence motifs in proteins. J. Mol. Biol., 251:176–187.

26. Han, Bystroff, Baker. 1997. Three-dimensional structures and contexts associated with recurrent amino acid sequence patterns. Prot. Sci., 6:1587–1590.

27. Hart, Royyuru, Stolovitzky, et al. 2000. Systematic and fully automated identification of protein sequence patterns. J. Comput. Biol., 7:585–600.

28. Höhl, Rigoutsos, Ragan. 2006. Pattern-based phylogenetic distance estimation and tree reconstruction. Evol. Bioinform. Online, 2:359–375.

29. Jonassen, Collins, Higgins. 1995. Finding flexible patterns in unaligned protein sequences. Prot. Sci., 4:1587–1595.

30. Leslie, Kuang. 2003. Fast kernels for inexact string matching. 16th Annu. Conf. Learn. Theor. 7th Kernel Workshop, 114–128.

31. Li, Badger, Chen, et al. 2001. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17:149–154.

32. Lingner, Meinicke. 2006. Remote homology detection based on oligomer distances. Bioinformatics, 22:2224–2231.

33. Liu, Califano. 2001. Functional classification of proteins by pattern discovery and top-down clustering of primary sequences. IBM Syst. J., 40:379–393.

34. Liu, Zhang, Stolovitzky, et al. 2003. Motif-based construction of a functional map for mammalian olfactory receptors. Genomics, 81:443–456.

35. Lodhi, Saunders, Shawe-Taylor, et al. 2002. Text classification using string kernels. J. Mach. Learn. Res., 2:419–444.

36. McHardy, Martín, Tsirigos, et al. 2007. Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods, 4:63–72.

37. Nicodème, Doerks, Vingron. 2002. Proteome analysis based on motif statistics. Bioinformatics, 18:S161–S171.

38. Parida, Floratos, Rigoutsos. 1999. An approximation algorithm for alignment of multiple sequences using motif discovery. J. Combin. Optim., 3:247–275.

39. Parida, Rigoutsos, Floratos, et al. 2000. Pattern discovery on character sets and real-valued data: linear bound on irredundant motifs and an efficient polynomial time algorithm. Proc. 11th Annu. ACM-SIAM Symp. Discr. Algorithms (SODA 2000), 297–308.

40. Pisanti, Crochemore, Grossi, et al. 2003a. Bases of motifs for generating repeated patterns with don't cares. Technical report TR-03-02. University of Pisa, Italy.

41. Pisanti, Crochemore, Grossi, et al. 2003b. A basis of tiling motifs for generating repeated patterns and its complexity for higher quorum. Proc. 28th Math. Found. Comput. Sci. Symp., 622–631.

42. Pisanti, Crochemore, Grossi, et al. 2005. Bases of motifs for generating repeated patterns with wildcards. IEEE/ACM Trans. Comput. Biol. Bioinform., 2:40–50.

43. Pisanti, Carvalho, Marsan, et al. 2006. RISOTTO: Fast extraction of motifs with mismatches. Lect. Notes Comput. Sci., 3887:757–768.

44. Qi, Wang, Hao. 2004. Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach. J. Mol. Evol., 58:1–11.

45. Rigoutsos, Floratos. 1998. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics, 14:55–67.

46. Rigoutsos, Floratos, Ouzounis, et al. 1999a. Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins. Proteins, 37:264–277.

47. Rigoutsos, Gao, Floratos, et al. 1999b. Building dictionaries of 1D and 3D motifs by mining the unaligned 1D sequences of 17 archaeal and bacterial genomes. Proc. 7th Int. Conf. Intell. Syst. Mol. Biol., 223–233.

48. Rigoutsos, Huynh, Floratos, et al. 2002. Dictionary-driven protein annotation. Nucleic Acids Res., 30:3901–3916.

49. Robinson, Foulds. 1981. Comparison of phylogenetic trees. Math. Biosci., 53:131–147.

50. Rousu, Jaakkola. 2005. Efficient computation of gapped substring kernels on large alphabets. J. Mach. Learn. Res., 6:1323–1344.

51. Sigrist, Cerutti, de Castro, et al. 2010. PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res., 38:161–166.

52. Sims, Jun S.-R., Wu, et al. 2009. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc. Natl. Acad. Sci. USA, 106:2677–2682.

53. Sinha, Tompa. 2000. A statistical method for finding transcription factor binding sites. Proc. Int. Conf. Intell. Syst. Mol. Biol., 8:344–354.

54. Sinha, Tompa. 2002. Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res., 30:5549–5560.

55. Sosinsky, Honig, Mann, et al. 2007. Discovering transcriptional regulatory regions in Drosophila by a nonalignment method for phylogenetic footprinting. Proc. Natl. Acad. Sci. USA, 104:6305–6310.

56. Stuart, Moffett, Baker. 2002a. Integrated gene and species phylogenies from unaligned whole genome protein sequences. Bioinformatics, 18:100–108.

57. Stuart, Moffett, Leader. 2002b. A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes. Mol. Biol. Evol., 19:554–562.

58. Tsirigos, Rigoutsos. 2005. A new computational method for the detection of horizontal gene transfer events. Nucleic Acids Res., 33:922–933.

59. Tsirigos, Rigoutsos. 2008. Human and mouse introns are linked to the same processes and functions through each genome's most frequent non-conserved motifs. Nucleic Acids Res., 36:3484–3493.

60. Ulitsky, Burstein, Tuller, et al. 2006. The average common substring approach to phylogenomic reconstruction. J. Comput. Biol., 13:336–350.

61. Vinga. 2007. Biological sequence analysis by vector-valued functions: revisiting alignment-free methodologies for DNA and protein classification. In Pham, Yan, Crane, eds. Advanced Computational Methods for Biocomputing and Bioimaging. Nova Science Publishers, New York, 71–107.

62. Vinga, Almeida. 2003. Alignment-free sequence comparison—a review. Bioinformatics, 19:513–523.

63. Weiss, Herzel. 1998. Correlations in protein sequences and property codes. J. Theor. Biol., 190:341–353.

64. Xu, Hao. 2009. CVTree update: a newly designed phylogenetic study platform using composition vectors and whole genomes. Nucleic Acids Res., 37:W174–W178.

65. Zhang, Harrison, Gerstein. 2002. Digging deep for ancient relics: a survey of protein motifs in the intergenic sequences of four eukaryotic genomes. J. Mol. Biol., 323:811–822.