The Subsequence Composition of Polypeptides

Abstract

The quantitative underpinning of the information content of biosequences represents an elusive goal and yet also an obvious prerequisite to the quantitative modeling and study of biological function and evolution. Several past studies have addressed the question of what distinguishes biosequences from random strings, the latter being clearly unpalatable to the living cell. Such studies typically analyze the organization of biosequences in terms of their constituent characters or substrings and have, in particular, consistently exposed a tenacious lack of compressibility on the part of biosequences. This article attempts, perhaps for the first time, an assessment of the structure and randomness of polypeptides in terms of newly introduced parameters that relate to the vocabulary of their (suitably constrained) subsequences rather than their substrings. It is shown that such parameters grasp structural/functional information, and are related to each other under a specific set of rules that span biochemically diverse polypeptides. Measures on subsequences separate few amino acid strings from their random permutations, but show that the random permutations of most polypeptides amass along specific linear loci.

1. Introduction

Defining and measuring the amount of information contained in biological strings, and relating this information to structure, function, and chemical activity (Carothers et al., 2004) has long been an elusive problem (Brooks, 2003), both for the inherent difficulty of formalizing intuitive notions of “information” (Szpankowski and Konorski, 2007) and for the peculiar structure of these strings.

Proteins, for example, are optimized by selection to assume specific chemical properties and spatial conformations: this streamlining tends to remove redundancies (Anfinsen, 1972), yielding strings of amino acids in which every symbol carries information; such “slightly edited random strings” (Weiss et al., 2000) are therefore hardly distinguishable from their random permutations when measured with both statistical and algorithmic definitions of information. Not even translating amino acids with scales that capture relevant physico-chemical properties provides significantly more insight: for example, the distribution of hydrophobicity—a key property influencing folding and spatial stability—along the sequence of most proteins is well known to be indistinguishable from random (White and Jacobs, 1990; Schwartz and King, 2006). The very presence of repetitions and redundancies has been implicated in human diseases at the DNA level (Benson and Waterman, 1994) and in the formation of toxic fibrillar structures at the protein level (Broome and Hecht, 2000). Repetitions in polypeptides have also been conjectured to multiply the folding possibilities by introducing many interactions with similar energy (Wan and Wootton, 2000): these alternatives would prevent the convergence of the folding process into a global minimum. Wet-lab experiments with random polypeptides (Ptitsyn and Volkenstein, 1986; Davidson and Sauer, 1994) have shown that secondary and tertiary structures do appear frequently and spontaneously in random strings built upon suitably small alphabets. Many of the basic folding patterns of natural proteins can even be explained theoretically by assuming the randomness of their primary sequence (White, 1994): this suggests that the main carrier of folding information is the composition of amino acids rather than their linear ordering (Rahman and Rackovsky, 1995). 
All these clues, which fit nicely into the neutral theory of evolution, have oriented biochemists towards seeing modern proteins as memorized ancestral random polypeptides that have been slightly edited by selection to optimize their active sites and stability under specific physiological conditions.¹ As Jacques Monod has put it (Monod, 1972):

In 1952, F. Sanger described the first complete sequence of a globular protein. This turned out to be both a revelation and a deception. This sequence, defining the structure and therefore the elective properties of a functional protein (insulin), did not show any regularity, characteristic feature, or limit. In those days it was hoped that, with the addition of further documentation, it might be possible to find the general laws of association and some functional correlations. Today we know hundreds of sequences corresponding to the proteins extracted from many different organisms. From them and their systematic comparison, performed with the help of up-to-date analysis and calculation devices, we can now deduce the general law: the chance law. More precisely, these structures are “random” because by knowing precisely the order of 199 residues of a protein containing two hundred it is not possible to formulate a theoretical or empirical law which allows us to predict the nature of the only residue still to be analytically identified.

Along with this intrinsic, evolutionary randomness, two additional problems make the definition of information in polypeptides even more elusive. The first problem is context: the translation of an amino acid string into a three-dimensional structure is made possible by the cooperation of many distant substrings of the same and of different molecules (e.g., chaperones, multimers); the transport of many proteins to their proper cellular compartment and the acquisition of their final function depend on multiple post-translational modifications. Therefore, the information that leads a protein to assume its specific biological role is distributed in a context of interactions that transcends the single sequence (Adami and Cerf, 2000; Szpankowski and Konorski, 2007; Galas et al., 2008). The second problem is resolution: the key functions of a protein are often implemented by few atoms configured in specific spatial arrangements and bearing specific chemical properties. A single letter of the primary sequence of a protein hides tens of such atoms, positions and properties: these sub-symbolic signals are doomed to evade any measure of information that treats proteins as strings on the traditional amino acid alphabet.

Notwithstanding these fundamental issues, the question of what and how much information is carried by amino acid sequences has historically attracted a lot of attention, both for obvious purposes of classification, prediction and insight into folding and evolution, and for the screening and synthesis of artificial polypeptides for their use in new drugs (Davidson et al., 1995; Doi et al., 2005). Some successes have been recorded, especially in the context of large sets of non-homologous proteins—for example, the proteome of an organism (Macchiato et al., 1985; Adjeroh and Nan, 2006; Benedetto et al., 2007). Estimates of differential entropy and context-free grammar complexity (Weiss et al., 2000) have shown that the complexity of such large sets is lower than the complexity of a corresponding set of random strings by approximately 1%, about one third of which is caused by well-known low-complexity regions. Evidence of weak correlations at short, medium, and long range has also been found: positive correlations appear at medium range (≥100) and decrease with distance, implying that the amino acid distribution of proteins that are close in the genome is more similar than that of proteins far in the genome. The sign of the correlation between pairs of amino acids at medium distance forms groups that resemble those traced by widely accepted physico-chemical properties. Family-dependent, short-range periodicities in hydrophobicity, α-helix propensity, and charge have also been detected (Weiss and Herzel, 1998), and have been attributed to interactions between elements of the same secondary structures.

Both in statistical and in algorithmic information theory, the search for correlations and patterns is intimately related to the construction of compact models. Since a provocative 1999 study that advocated the incompressibility of proteomes (Nevill-Manning and Witten, 1999), there has been a modest flourishing of compression techniques tuned for long concatenations of polypeptides, spanning both the substitutional and the statistical realms (Giancarlo et al., 2009). We mention, among others, techniques consisting of instantiating the PPM algorithm with contexts of multiple lengths, weighted by amino acid mutation probabilities (Nevill-Manning and Witten, 1999); searching for exact and approximate reverse complements, repeats, and weighted context trees (Matsumoto et al., 2000); partitioning amino acids according to their frequency and invoking popular text compressors (Sampath, 2003); using amino acid substitution matrices to guide the creation of Huffman codes (Hategan and Tabus, 2004); building an off-line dictionary of variable-gap subsequences, constrained to be maximal in density and extension and to occur sufficiently frequently in the dataset (Apostolico et al., 2006); and using panels of weighted experts that estimate the probability of a symbol using Markov models encoding species information, local context information, and repeated and complementary reversed substrings (Cao et al., 2007). These methods achieve entropies that range from about 3.67 to 4.05 bits per symbol (bps), while other estimations based on the k-th order Shannon formula and Zipf curves reach 2.5 bps; incorporating secondary structure information in a gambling algorithm à la Shannon lowers this bound to about 2 bps (Strait and Dewey, 1996).

As expected, the analysis of stand-alone sequences has yielded more limited results. Measures of entropy over sliding windows have been shown to separate globular and fibrous proteins (Romero et al., 1999), and Lempel-Ziv complexity has been used to predict the cellular location of proteins (Xiao et al., 2005). Adding physico-chemical information to amino acids has enabled a Fourier analysis to detect characteristic periodicities in two protein families with similar structural architectures (Rackovsky, 1998); a mapping of recoded protein sequences onto one-dimensional Brownian bridges has revealed systematic deviations from randomness that have been related to energy minimization (Pande et al., 1994). The entropy of the primary sequence has also been shown to correlate with the inverse packing density and the hydrophobicity of residues in their spatial conformations (Liao et al., 2005).

In the present article, rather than identifying information with negentropy or compressibility, we correlate it to laws that govern the abundance of suitable combinatorial substructures of polypeptide strings. Rather than focusing on windows of fixed length or on substrings, we measure the composition of subsequences of any length, but constrained to occur with a predefined maximum gap between consecutive symbols, and to greedily choose the leftmost occurrence of each character. Subsequences can grasp long-range correlations in strings and, at short range, are well known to encode signatures and motifs that characterize protein families (Hulo et al., 2008). Rather than examining concatenations of many non-homologous polypeptides, we measure and compare the composition of each amino acid string in isolation. In addition to using our measures for comparing and classifying, we explicitly investigate their role in separating natural molecules from random permutations. This is perhaps the first time in which the vocabulary of all distinct subsequences of a set of structurally and functionally diverse polypeptides is systematically counted and analyzed.

The article is organized as follows: Section 2 formalizes the notion of constrained subsequence and of class of positional equivalence. In the spirit of Colosimo and De Luca (2000), Section 3 characterizes a set of sequences as extremal, in the sense that they cannot be enriched with characters without losing some occurrence in the string. Sequences and equivalence classes are then embedded in a natural spatial representation, in which they assume the form of paths and points, respectively. Section 4 describes suitable measures on this representation, and defines the set of polypeptides that have been selected for our experiments. Measures on subsequences are used for classification in Section 5, while Section 6 studies an array of laws that, in our dataset, are seen to relate these measures to string length and to the hiatus of subsequences. Finally, Section 7 analyzes regularities that constrain pairs of measures in random permutations of our dataset.

2. Preliminaries

Given a string s of characters from an alphabet Σ, a subsequence of s is any string v that can be obtained by removing from s one or more, not necessarily consecutive characters. An occurrence of v in s is specified by a list of positions of s matching the characters of v consecutively. The positions of s that correspond to the first (respectively, last) character of v form the left (respectively, right) list of v, denoted by $\mathcal{L}_v = \{l_1, l_2, \ldots, l_L\}$ (respectively, $\mathcal{R}_v = \{l_1, l_2, \ldots, l_R\}$). An ω-occurrence of v in s is an occurrence such that fewer than ω positions of s elapse between any two consecutive characters of v, that is, the positions matched by consecutive characters of v differ by at most ω. For any given entry of the left list, each substring of s containing an ω-occurrence of v is an ω-realization of v, and the ω-occurrence corresponding to the sequence of lexicographically smallest positions is called greedy.

Given an ω-occurrence $\mathbf{i} = \langle i_1, i_2, \ldots, i_k \rangle$ of v in s, the window of i is the word $w_{\mathbf{i}} = s[i_k + 1..i_k + \omega]$. We extend s by the segment $s[|s| + 1..|s| + \omega]$ filled with the extra character $\$ \notin \Sigma$, so that every ω-occurrence has a window.
For any position j of s, the set $\mathcal{H}_j$ of windows of the ω-occurrences of v starting at j is called the horizon of v at j. The windows of all ω-occurrences of v in s form the panorama $\mathcal{P}_v$ of v in s. We say that symbol $a \in \Sigma \cup \{\$\}$ is visible in $\mathcal{P}_v$ if there is at least one horizon $\mathcal{H}_j \in \mathcal{P}_v$ containing it.

If ω = 1, each horizon contains only one window and the panorama cannot contain more than |Σ| + 1 windows in total. To examine a more elaborate case, let

$$\omega = 3, \quad s = \text{ACCTATACGT\$\$\$}, \quad v = \text{ATAT}, \quad w = \text{ACT}.$$

Word v has ω-occurrences: i₁ = 〈1, 4, 5, 6〉, i₂ = 〈1, 4, 7, 10〉 and i₃ = 〈5, 6, 7, 10〉. Word w has ω-occurrences: j₁ = 〈1, 2, 4〉, j₂ = 〈1, 3, 4〉, j₃ = 〈1, 3, 6〉, j₄ = 〈5, 8, 10〉 and j₅ = 〈7, 8, 10〉. Therefore, $\mathcal{L}_v = \{1, 5\}$, $\mathcal{L}_w = \{1, 5, 7\}$, $\mathcal{R}_v = \{6, 10\}$ and $\mathcal{R}_w = \{4, 6, 10\}$.
Word v has panorama $\mathcal{P}_v = \{\text{ACG}, \text{\$\$\$}\}$; in particular, $\mathcal{H}_1 = \{\text{ACG}, \text{\$\$\$}\}$ and $\mathcal{H}_5 = \{\text{\$\$\$}\}$.
Word w has panorama $\mathcal{P}_w = \{\text{ATA}, \text{ACG}, \text{\$\$\$}\}$ with $\mathcal{H}_1 = \{\text{ATA}, \text{ACG}\}$, $\mathcal{H}_5 = \{\text{\$\$\$}\}$, $\mathcal{H}_7 = \{\text{\$\$\$}\}$. The greedy ω-occurrences of v are i₁ and i₃, those of w are j₁, j₄ and j₅.
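The definitions used in this example can be checked mechanically. The following is a minimal Python sketch, not the authors' implementation: the function names (`omega_occurrences`, `window`, `describe`) are hypothetical, positions are 1-based as in the text, and the enumeration is a plain depth-first search, which is adequate only for short strings.

```python
from collections import defaultdict


def omega_occurrences(s, v, omega):
    """Enumerate all omega-occurrences of v in s as tuples of 1-based positions.

    Consecutive matched positions may differ by at most omega. The search
    visits candidates in increasing order, so results are lexicographically sorted.
    """
    results = []

    def extend(prefix):
        if len(prefix) == len(v):
            results.append(tuple(prefix))
            return
        if not prefix:
            candidates = range(1, len(s) + 1)
        else:
            last = prefix[-1]
            candidates = range(last + 1, min(last + omega, len(s)) + 1)
        for p in candidates:
            if s[p - 1] == v[len(prefix)]:
                extend(prefix + [p])

    extend([])
    return results


def window(s, occ, omega):
    """Return the window s[i_k + 1 .. i_k + omega], padding s with omega '$'."""
    padded = s + "$" * omega
    end = occ[-1]
    return padded[end:end + omega]


def describe(s, v, omega):
    """Collect left/right lists, horizons, panorama and greedy occurrences of v."""
    occs = omega_occurrences(s, v, omega)
    horizons = defaultdict(set)
    greedy = {}
    for occ in occs:
        horizons[occ[0]].add(window(s, occ, omega))
        # The first occurrence seen for a start position is the lexicographically smallest.
        greedy.setdefault(occ[0], occ)
    return {
        "occurrences": occs,
        "left": sorted({o[0] for o in occs}),
        "right": sorted({o[-1] for o in occs}),
        "horizons": dict(horizons),
        "panorama": set().union(*horizons.values()) if horizons else set(),
        "greedy": sorted(greedy.values()),
    }


if __name__ == "__main__":
    for word in ("ATAT", "ACT"):
        print(word, describe("ACCTATACGT", word, 3))
```

Run on s = ACCTATACGT with ω = 3, the sketch reproduces the occurrence lists, left and right lists, horizons, panoramas, and greedy occurrences stated above for v = ATAT and w = ACT.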

Note that the number of ω-occurrences of subsequences of length k occurring at the same starting position in s is $\mathcal{O}(\omega^{k-1})$, and the maximum number of ω-occurrences of a specific subsequence v in s is $\mathcal{O}(\omega^{|v|-1} \cdot |s|)$. This upper bound is tight, being attained by the pair $s = \text{A}^{|s|}$, $v = \text{A}^{|v|}$. The maximum number of distinct subsequences of length k that ω-occur in s is $\mathcal{O}(\min(|\Sigma|^k, |s| \cdot \omega^{k-1}))$; this bound is attained by $\Sigma = \{\text{A}, \text{C}, \text{G}, \text{T}\}$, $s = (\text{ACGT})^N$, $\omega = 4$, $N \gg k$.
The number of greedy ω-occurrences of a specific subsequence v in s is $\mathcal{O}(|s|)$, and the maximum number of greedy ω-occurrences of subsequences of length k that start at the same position in s is $\mathcal{O}(\min(|\Sigma|, \omega)^{k-1})$. For ease of notation, from now on we will use “occurrence” to mean an ω-occurrence, and will indicate the value of ω by appending ω replicas of “$” at the end of string s. We now define some equivalence relations on subsequences.
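As an illustration of the bound on the number of ω-occurrences of a fixed subsequence, the count can be obtained without enumerating occurrences. The sketch below is an assumption of ours rather than a procedure described in the text: it uses a simple dynamic program over end positions, and the name `count_omega_occurrences` is hypothetical. It assumes v is non-empty.

```python
def count_omega_occurrences(s, v, omega):
    """Count the omega-occurrences of a non-empty word v in s.

    ends[p] holds the number of omega-occurrences of the current prefix of v
    that end at 0-based position p; each step extends the prefix by one character.
    """
    n = len(s)
    ends = [1 if c == v[0] else 0 for c in s]
    for ch in v[1:]:
        nxt = [0] * n
        for p in range(n):
            if s[p] == ch:
                # The previous character must sit at most omega positions to the left.
                nxt[p] = sum(ends[max(0, p - omega):p])
        ends = nxt
    return sum(ends)


if __name__ == "__main__":
    n, k, omega = 10, 2, 3
    count = count_omega_occurrences("A" * n, "A" * k, omega)
    # The count stays below |s| * omega^(|v| - 1); boundary effects explain the gap.
    print(count, "<=", n * omega ** (k - 1))
```

For the extremal pair s = A^|s|, v = A^|v| the count approaches |s| · ω^(|v|−1) as |s| grows, the difference being due to occurrences that would run past the end of the string.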

Definition 1 (Left equivalence)

Two subsequences v and w are left equivalent, denoted $v \equiv_l w$, if $\mathcal{L}_v = \mathcal{L}_w$.

We stipulate that strings never occurring in s are assigned to the class characterized by the empty list. We also assign the left list $\{1, 2, \cdots, |s|, |s| + 1\}$ to $v = \varepsilon$ and $\{|s| + 1\}$ to $v = \$$. With this proviso, the equivalence relation $\equiv_l$ introduced on text s induces a partition of $\{\Sigma^{*} \cup \{\$\}\}$, containing at most $\sum_{k=1}^{|s|} \binom{|s|}{k} + 3 = 2^{|s|} + 2$ left-equivalence classes. The right equivalence relation $\equiv_r$ and its corresponding classes are defined symmetrically. The following properties are immediate from the definitions.

Property 1

If $v \equiv_r w$, then $\mathcal{P}_v = \mathcal{P}_w$.

Property 2

The relation $\equiv_r$ is right-invariant, i.e., $v \equiv_r w$ implies $va \equiv_r wa \ \forall \ a \in \Sigma \cup \{\$\}$.

Note that v and w can have the same panorama even though they do not have the same number of occurrences in s, and that $\mathcal{P}_v = \mathcal{P}_w$ may occur even if the relation $v \equiv_r w$ does not hold: for example, in $s = \text{ATCACGTCAC\$\$}$ we have $\mathcal{P}_{\text{AT}} = \mathcal{P}_{\text{GT}} = \{\text{CAC}\}$ even if AT and GT are not right-equivalent.

Definition 2 (Implication)

We say that w implicates or induces v on s if for every occurrence $\mathbf{i}_1 = \langle i_{11}, i_{12}, \cdots, i_{1k} \rangle$ of w there is also an occurrence $\mathbf{i}_2 = \langle i_{21}, i_{22}, \cdots, i_{2l} \rangle$ of v such that $(i_{11} = i_{21}) \wedge (i_{1k} = i_{2l})$.

Clearly, implication is not symmetric, e.g., with s = ACAGTTT$$$, v = AGT and w = ACT, we have that w implicates v even though v does not implicate w.
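Implication depends only on the first and last position of each occurrence, so it can be tested by comparing sets of endpoint pairs. The following Python sketch illustrates this on the example above; it is not taken from the article, the names `endpoints` and `implicates` are hypothetical, positions are 1-based, and words are assumed non-empty.

```python
def endpoints(s, v, omega):
    """Return the set of (first, last) 1-based positions over all omega-occurrences of v in s."""
    n = len(s)
    pairs = {(p, p) for p in range(1, n + 1) if s[p - 1] == v[0]}
    for ch in v[1:]:
        # Extend every partial occurrence by one character within the gap constraint.
        pairs = {
            (first, q)
            for (first, last) in pairs
            for q in range(last + 1, min(last + omega, n) + 1)
            if s[q - 1] == ch
        }
    return pairs


def implicates(s, w, v, omega):
    """True if every occurrence of w shares its first and last position with some occurrence of v.

    A word w with no occurrence implicates every v vacuously under this reading.
    """
    return endpoints(s, w, omega) <= endpoints(s, v, omega)


if __name__ == "__main__":
    s, omega = "ACAGTTT", 3
    print(implicates(s, "ACT", "AGT", omega))  # w implicates v
    print(implicates(s, "AGT", "ACT", omega))  # v does not implicate w
```

Here the only occurrence of ACT spans positions 1 to 5, and AGT also has an occurrence from 1 to 5, whereas AGT has occurrences (for instance from 3 to 5) that no occurrence of ACT matches.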

Definition 3 (Equivalence)

Two subsequences v and w of s are equivalent, denoted v ≡ w, if they implicate one another.

We say that a class of the equivalence relation ≡ is a terminal class if the list of its right occurrences is {|s|}. Every subsequence in such a class is called a terminal subsequence.

Lemma 1

If $v \equiv w$, then $\mathcal{P}_v = \mathcal{P}_w$; moreover, v and w have the same horizon structure.

Proof. If $v \equiv w$, then $v \equiv_r w$; hence, $\mathcal{P}_v = \mathcal{P}_w$. For a generic i, consider the set I_i of the occurrences of v starting at i: the occurrences of w that also start at i are precisely the occurrences implicating the occurrences of I_i, therefore w has a horizon at i that coincides with the one of v. ▪

Lemma 2

The equivalence relation ≡ is right-invariant.

Proof. It is immediate that the generic occurrence of wa in s is implicated by at least one occurrence of va, and vice versa. Hence, v ≡ w implies $va \equiv wa \; \forall \; a \in \Sigma$. ▪

Note that $v \equiv w \Rightarrow (\mathcal{L}_v = \mathcal{L}_w) \wedge (\mathcal{R}_v = \mathcal{R}_w)$, but the converse is not true. Consider the example s = ACATCATCATCT$$$, v = AT, w = ACT, where $\mathcal{L}_v = \mathcal{L}_w = \{1, 3, 6, 9\}$ and $\mathcal{R}_v = \mathcal{R}_w = \{4, 7, 10, 12\}$: occurrence i₁ = 〈6, 7〉 of v does not have a corresponding occurrence of w starting at position 6 and ending at position 7, and the occurrence i₂ = 〈6, 8, 10〉 of w does not have a corresponding occurrence of v starting at position 6 and ending at position 10.
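The counterexample can be reproduced with a few lines of code. This is a sketch under the assumption that consecutive positions of an occurrence are at most ω = 3 apart, a value chosen because it reproduces the lists quoted above. The function `endpoint_pairs` is a hypothetical helper that tracks only the (first, last) positions of each occurrence.

```python
from typing import Optional, Set, Tuple


def endpoint_pairs(s: str, v: str, omega: Optional[int] = None) -> Set[Tuple[int, int]]:
    """(first, last) 1-based positions of every occurrence of a non-empty v in s,
    assuming consecutive positions are at most omega apart."""
    n = len(s)
    # Pairs reachable after matching only the first symbol of v.
    frontier = {(p, p) for p in range(1, n + 1) if s[p - 1] == v[0]}
    for ch in v[1:]:
        nxt = set()
        for first, last in frontier:
            stop = n if omega is None else min(n, last + omega)
            for q in range(last + 1, stop + 1):
                if s[q - 1] == ch:
                    nxt.add((first, q))
        frontier = nxt
    return frontier


def left_list(pairs):
    return sorted({first for first, _ in pairs})


def right_list(pairs):
    return sorted({last for _, last in pairs})


s = "ACATCATCATCT"
pv = endpoint_pairs(s, "AT", omega=3)
pw = endpoint_pairs(s, "ACT", omega=3)
print(left_list(pv), left_list(pw))    # [1, 3, 6, 9] [1, 3, 6, 9]
print(right_list(pv), right_list(pw))  # [4, 7, 10, 12] [4, 7, 10, 12]
print((6, 7) in pv, (6, 7) in pw)      # True False
print((6, 10) in pw, (6, 10) in pv)    # True False
```

The left and right lists agree, yet the sets of endpoint pairs differ, so neither string implicates the other.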

3. Special Subsequences and The ω-Suffix Space

It is of interest to single out the subsequences of s that cannot be expanded without losing support, i.e., their number of ω-occurrences in s. The following definition may be considered an extension to subsequences of the one applied to substrings in Colosimo and De Luca (2000).

Definition 4 (Special subsequence)

A string $v \in \Sigma^*$ occurring in s starting at positions in $\mathcal{L}_v \neq \emptyset$ is a special subsequence if $\mathcal{L}_{va} \subset \mathcal{L}_v$ for every symbol $a \in \Sigma \cup \{\$\}$ visible from $\mathcal{P}_v$.
String v is a non-special subsequence if there is a symbol $a \in \Sigma \cup \{\$\}$ visible from $\mathcal{P}_v$ such that $\mathcal{L}_{va} = \mathcal{L}_v$.
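Definition 4 translates into a direct test on left lists. The sketch below makes three assumptions that are not fixed by the text above: a symbol a is treated as visible from the panorama exactly when va has at least one occurrence, consecutive positions of an occurrence are at most ω apart, and s is padded with ω copies of $. The function names are illustrative.

```python
from typing import Set


def left_list(s: str, v: str, omega: int) -> Set[int]:
    """1-based starting positions of the occurrences of v in s (gaps <= omega)."""
    n = len(s)
    if not v:
        return set(range(1, n + 1))
    # ok holds the positions at which the current suffix of v can start.
    ok = {p for p in range(1, n + 1) if s[p - 1] == v[-1]}
    for ch in reversed(v[:-1]):
        prev = set()
        for p in range(1, n + 1):
            if s[p - 1] != ch:
                continue
            stop = min(n, p + omega)
            if any(q in ok for q in range(p + 1, stop + 1)):
                prev.add(p)
        ok = prev
    return ok


def is_special(s: str, v: str, omega: int) -> bool:
    """True if every extension va that occurs has a strictly smaller left list."""
    padded = s + "$" * omega
    base = left_list(padded, v, omega)
    if not base:
        return False
    for a in sorted(set(padded)):
        # The left list of va is always contained in that of v, so equality
        # is the only way an extension can fail to lose support.
        if left_list(padded, v + a, omega) == base:
            return False
    return True


print(is_special("ACCATT", "A", omega=2))  # True: AC starts only at 1, AT only at 4
print(is_special("ACCATT", "C", omega=2))  # False: CA starts wherever C does
```

In the toy string ACCATT with ω = 2, the symbol A is followed by different symbols at its two positions, so every extension drops one starting position.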

Note that, according to this definition, ɛ is a special subsequence. Special subsequences have the following properties.

Property 3

Let $i_1, i_2, \ldots, i_N$ be the starting positions of v in s. Then v is a special subsequence if and only if there are two starting positions i_h and i_k such that no window in $\mathcal{H}_{i_h}$ shares a symbol with a window in $\mathcal{H}_{i_k}$.

Property 4

If av is a special subsequence and $a \in \Sigma$, then the suffix v of av is a special subsequence.

Property 5

If v is a special subsequence, then so is any w ≡ v.

The following notion is dual to that of a special subsequence.

Definition 5 (Antispecial subsequence)

A subsequence v of s is antispecial if any extension va of v in s, $a \in \Sigma \cup \{\$\}$, results in va ≡_l v.

Therefore, a subsequence is antispecial if and only if every symbol with which it can be extended in s appears in every horizon. Notice that a subsequence v that is extensible in s in only one way, or such that $|\mathcal{L}_v| = 1$, is necessarily antispecial, but an antispecial subsequence can have any support in general. It is also easy to observe that the extensions of an antispecial subsequence, its prefixes and its suffixes are not necessarily antispecial. Finally, it is easy to see that if v is antispecial, then every sequence w ≡ v is antispecial, too.
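As a rough illustration of Definition 5, the check below treats va ≡_l v as equality of left lists. It assumes that consecutive positions of an occurrence are at most ω apart, so that an occurrence of v corresponds to a match of a regular expression with bounded gaps. The helper names are hypothetical and v is taken to be non-empty.

```python
import re
from typing import Set


def left_list(s: str, v: str, omega: int) -> Set[int]:
    """1-based starting positions of the occurrences of a non-empty v in s.

    A gap of at most omega between consecutive positions means at most
    omega - 1 skipped symbols, hence the bounded wildcard between symbols.
    """
    gap = ".{0,%d}" % (omega - 1)
    pattern = "(?=" + gap.join(re.escape(ch) for ch in v) + ")"
    return {m.start() + 1 for m in re.finditer(pattern, s)}


def is_antispecial(s: str, v: str, omega: int) -> bool:
    """True if every extension va that occurs keeps all starting positions of v."""
    padded = s + "$" * omega
    base = left_list(padded, v, omega)
    if not base:
        return False
    extensions = (left_list(padded, v + a, omega) for a in sorted(set(padded)))
    return all(ext == base for ext in extensions if ext)


print(is_antispecial("ACCATT", "CA", omega=2))  # True: T is the only possible extension
print(is_antispecial("ACCATT", "AC", omega=2))  # True: the left list has a single element
print(is_antispecial("ACCATT", "C", omega=2))   # False: CC loses the start at position 3
```

The first two calls match the two sufficient conditions stated above: a single possible extension, and a left list of size one.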

The definition of special sequence embodies a criterion to build all the ≡_l and ≡ classes of a string s. Assuming that all the occurrences $\mathbf{i_1}, \mathbf{i_2}, \ldots, \mathbf{i_m}$ of a sequence v in s have been found, we determine $\mathcal{L}_v = \{ i_1, i_2, \ldots, i_k \}$ and then organize the windows in groups $\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_k$ related to the same starting position of the occurrences: every symbol $a \in \Sigma \cup \{\$\}$ appearing in at least one window of the panorama $\mathcal{P}_v$ signals the occurrence of sequence va in s, which can be linked to v by a directed arc labeled by a, establishing a parent-child relationship between the sequences. If symbol a appears in at least one window of each group of v, then va ≡_l v; otherwise va belongs to a new ≡_l class identified by $\mathcal{L}_{va} \subset \mathcal{L}_v$. If no child of v belongs to the same class as v, then v is a special sequence, and if va and wb, with $a \neq b \in \Sigma$ and $v, w \in \Sigma^*$, have the same left and right lists, they belong to the same ≡ class.

Like standard common subsequences, those considered here admit a natural geometric representation. Let $i_1, i_2, \ldots, i_m$, m ≤ n = |s|, be the positions of symbol $a \in \Sigma$ in s. Align the suffixes s[i_j..|s|] ∀ 1 ≤ j ≤ m along the m coordinate axes of a multidimensional grid, such that s[i_j + k] occupies the k-th position along coordinate i_j. This space, denoted Ψ_a(s), will be called the ω-suffix space of s induced by character a.
With the convention that the origin matches any character of Σ, we mark a matching point (in what follows, often referred to simply as a point or match) in this space at every cell $\mathbf{x} = [x_1, x_2, \ldots, x_m] \neq \mathbf{0}$ of the grid such that, ∀ 1 ≤ i ≤ m, (s[x_i] = b) ∨ (x_i = 0), $b \in \Sigma$. Next, we define a partial order on the points by using a strict-dominance criterion:
$$\mathbf{x} < \mathbf{y} \ \mathrm{iff} \ x_1 < y_1, x_2 < y_2, \ldots, x_m < y_m.$$

A greedy ω-subsequence corresponds to a chain in this partial order such that, for each pair of consecutive points x and y we have $x_i < y_i \ \mathrm{for} \ i \in [1, m]$, and for no point z we have x < z < y. To connect all chains related to some greedy ω-subsequences, we start at the origin and connect matches in succession under the ω constraint and in such a way that whenever an arc is established between the points x and y, then for no point z we have x < z < y. Except for the fact that here we direct the arcs from the lower point to the higher one, this process results in a partial Hasse diagram for the poset (Gratzer, 1998), that is, the portion of the diagram that is compatible with the ω constraint. Still, there are greedy ω-subsequences that are not captured in this process.
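The covering condition used to draw arcs can be sketched on an abstract set of points. The code below assumes that the ω constraint means that no coordinate advances by more than ω along an arc; this is one plausible reading, not a detail confirmed by the text. The points in the example are arbitrary and are not derived from a real string.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[int, ...]


def strictly_below(x: Point, y: Point) -> bool:
    """x < y in the strict-dominance order: every coordinate of x is smaller."""
    return all(a < b for a, b in zip(x, y))


def cover_arcs(points: Sequence[Point], omega: Optional[int] = None) -> List[Tuple[Point, Point]]:
    """Arcs (x, y) with x < y and no z such that x < z < y.

    When omega is given, arcs whose step exceeds omega on some coordinate
    are dropped (assumed reading of the omega constraint).
    """
    arcs = []
    for x in points:
        for y in points:
            if not strictly_below(x, y):
                continue
            if any(strictly_below(x, z) and strictly_below(z, y) for z in points):
                continue
            if omega is not None and any(b - a > omega for a, b in zip(x, y)):
                continue
            arcs.append((x, y))
    return arcs


pts = [(0, 0), (1, 2), (2, 1), (2, 3), (3, 4)]
print(cover_arcs(pts))           # five covering arcs
print(cover_arcs(pts, omega=2))  # the arc (2, 1) -> (3, 4) is dropped
```

The pair (0, 0), (2, 3) is not connected because (1, 2) lies strictly between them, which is exactly the condition "for no point z we have x < z < y".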

The simple construction that we now proceed to describe traces all the sequences that ω-occur greedily in s, resulting in what constitutes an expansion of the constrained Hasse diagram above. The space Ψ_a(s) sets the natural stage also for such a construction, which starts at point $\mathbf{1} = [1, 1, \ldots, 1]$ and proceeds according to the following rule. Assume that, at the generic iteration, we are at point $\mathbf{x} = [x_1, x_2, \ldots, x_n]$, and let $y_1^c, y_2^c, \ldots, y_n^c$ be the nearest matches of $c \in \Sigma$ following x in the partial order and falling within an interval of ω on every axis: in the next iteration, we move to all such points and resume the process.
It is apparent that this construction explores only a subset of all matching points in Ψ_a(s), and that, in a generic space with k dimensions, it proceeds monotonically within a hyperpyramid with vertex in point 1, axis along the line passing through the points with equal coordinates, and edges identified by the k lines passing through the points $[1 + \omega, 2, 2, \ldots, 2], [2, 1 + \omega, 2, \ldots, 2], \ldots, [2, 2, \ldots, 1 + \omega]$ and through the vertex. This process mimics the construction of a particular trie, reminiscent of the special suffix tree of Chattaraj and Parida (2005), and for unbounded ω, the resulting graph is seen to incorporate the Hasse diagram of the poset of matches.

Let V be the set of the points in space Ψ_a(s) that are visited by the algorithm just described, and let A be the set of the arcs, oriented and labeled by symbols of Σ, that indicate the extensions of each point of V carried out by the algorithm. Graph G_a(s) = (V, A) is called the ω-suffix graph induced by symbol a on s.
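The construction can be prototyped as a breadth-first traversal. The sketch below is only one possible reading of the rule and is not the authors' implementation: each point stores, for every occurrence of a, the absolute position at which the current greedy occurrence ends (rather than an offset along the axis), a coordinate is set to 0 once that start can no longer be extended, and the $ terminators are omitted for brevity. The names `suffix_graph` and `nearest` are hypothetical.

```python
from collections import deque
from typing import List, Set, Tuple

Point = Tuple[int, ...]  # one entry per occurrence of a; 0 marks a dropped start


def nearest(s: str, c: str, end: int, omega: int) -> int:
    """Nearest 1-based position of c after `end` and within omega, or 0 if none."""
    for p in range(end + 1, min(len(s), end + omega) + 1):
        if s[p - 1] == c:
            return p
    return 0


def suffix_graph(s: str, a: str, omega: int) -> Tuple[Set[Point], List[Tuple[Point, str, Point]]]:
    """Visited points and labeled arcs of the greedy extension process started at a."""
    starts = [i for i in range(1, len(s) + 1) if s[i - 1] == a]
    alphabet = sorted(set(s))
    root: Point = tuple(starts)  # every occurrence of "a" ends where it starts
    nodes: Set[Point] = {root}
    arcs: List[Tuple[Point, str, Point]] = []
    queue = deque([root])
    while queue:
        x = queue.popleft()
        for c in alphabet:
            # Extend every surviving start greedily with the nearest c.
            y = tuple(nearest(s, c, end, omega) if end else 0 for end in x)
            if not any(y):
                continue  # c extends no surviving occurrence
            arcs.append((x, c, y))
            if y not in nodes:
                nodes.add(y)
                queue.append(y)
    return nodes, arcs


V, A = suffix_graph("ACCATT", "A", omega=2)
print(len(V), len(A))  # 8 9
print(sorted((c, y) for x, c, y in A if x == (1, 4)))  # [('C', (2, 0)), ('T', (0, 5))]
```

Strings that reach the same tuple of end positions are merged into a single point, so the number of points counts classes rather than individual strings. The traversal terminates because end positions strictly increase along every arc.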

Lemma 3

The points of G_a(s) represent all and only the classes of the equivalence relation ≡ among the sequences starting with a that ω-occur greedily in s.

Proof. Clearly, strings that share the same starting and ending positions in all their greedy occurrences in s are projected to the same point of Ψ_a(s). That these points identify all the equivalence classes of the ≡ relation derives from the fact that the procedure generates all the subsequences that ω-occur in a greedy way in s. Note that the left list of the class associated to point $\mathbf{x} = [x_1, x_2, \ldots, x_k]$ is given by $\mathcal{L} = \{ i \mid x_i \neq 0, \; 1 \leq i \leq k \}$, and that the right list is given by $\mathcal{R} = \{ i + x_i \mid x_i \neq 0, \; 1 \leq i \leq k \}$. ▪

Space Ψ_a(s) may have a very high number of dimensions, but it contains subspaces of smaller dimensionality:

Definition 6 (Subspace of Ψ_a(s))

Let $i_1, i_2, \ldots, i_m$ be the list of occurrences of symbol a in s. The subspace of Ψ_a(s) associated with the distinct coordinates $i_{j_1}, i_{j_2}, \ldots, i_{j_k}$ is the set of points in Ψ_a(s) having non-null values only along the coordinates $i_{j_1}, i_{j_2}, \ldots, i_{j_k}$.

Actually, strings belonging to the same class under ≡_l are projected to paths of G_a(s) ending at points located in the same subspace of Ψ_a(s).

In particular, the class $\mathcal{L}_a$ associated to point 1 is made up of points with exactly m non-null coordinates, the class formed by the strings that never occur in s corresponds to the point $\mathbf{0} = [0, 0, \ldots, 0]$ having zero non-null coordinates, and the transition from the string v of a class that comprises coordinate i to the string va of a class devoid of coordinate i corresponds to the connection of the last point associated with the ≡ class of v to a point with a null value along i. A special sequence is associated to a path of the graph ending in a point x from which it is only possible to connect to points located in subspaces with fewer dimensions than x; an antispecial sequence, on the other hand, corresponds to a path ending in a point w which is able to connect only to points with the same non-null coordinates as w. From this geometric interpretation it is seen that speciality, antispeciality and the way in which the support is reduced by extension are properties common to all the sequences belonging to the same equivalence class of relation ≡, as was noted earlier.

4. The Dataset and Its Rationale

In the following sections, we set up a battery of experiments to study the properties of the suffix graphs generated by amino acid strings and by their random permutations. Natural compositional measures that we will consider are the number of points (special, antispecial, normal, terminal), the number of arcs (internal and external) and the number of subsequences (special, antispecial, normal, terminal) at different values of ω. The following sections will aggregate and systematize over one million data points.

Many previous investigations (Nakai et al., 1988; Weiss and Herzel, 1998; Weiss et al., 2000; Lapinsh et al., 2002; Li et al., 2003b) have recoded the original amino acid strings with reduced alphabets that incorporate physico-chemical scales. The large number of scales published (Kawashima et al., 2008), and the lack of a standard methodology to perform such recoding haunted our preliminary experiments with many parameters that made the analysis dependent on the fine-tuning of the scale quantization. Amino acid similarities are also known not to be universal: at different positions of a protein, different sets of amino acids or of amino acid substrings can more likely substitute for one another (Han and Baker, 1995), making a fixed substitution scheme less biochemically significant. The necessity to make our results as general as possible led us to analyze polypeptides encoded in the original alphabet of amino acids.

Most proteins consist of modular subunits (called domains) whose spatial conformation and function are thought to be independent of other parts of the protein. A limited number of highly similar domains are seen to occur in all known proteins, both as parts of larger multidomain structures and by themselves (Richardson, 1981; Murzin et al., 1995; Orengo et al., 1997), suggesting that they are remnants of ancient functional polypeptides that have been assembled by evolution to produce the combinatorial variety of structures and functions that appear in modern proteins. The modular nature of domains makes them better candidates than whole proteins for investigating regularities and patterns, since the concatenation of different domains could be a source of noise. The sequential compactness and the moderate length of domains make them preferable to secondary structures, supersecondary structures, or motifs, that are typically shorter and non-contiguous subsequences.

The SCOP (Structural Classification of Proteins) database (Murzin et al., 1995) is a comprehensive ordering of all protein domains of known structure according to their evolutionary, structural and functional similarity. The basic classification unit is the domain, which is placed at the leaves of a tree having four more hierarchical levels: families, containing domains with a common evolutionary origin as testified by high sequence similarity or highly similar function and structure; superfamilies, containing domains with low sequence similarity but sharing structural and functional features that suggest a common evolutionary origin; folds, containing domains with a specific set of major secondary structures, a specific configuration of these structures in space, and a specific connection pattern; and classes, containing domains that share the same frequency of secondary structures (e.g., domains in which the large majority of secondary structures are α-helices). This classification is manually curated by biologists.

We elect a subset of 148 domains of SCOP as our main dataset: we will refer to this dataset as D₁ in what follows (Table 1). We choose domains to span two different classes (1 and 2), two different folds per class (1.1, 1.8, and 2.1, 2.2), two different superfamilies per fold (1.1.1, 1.1.2, 1.8.1, 1.8.4, and 2.1.1, 2.1.2, 2.2.2, 2.2.3), and two different families per superfamily² (1.1.1.3, 1.1.1.4, 1.1.2.1, 1.1.2.2, 1.8.1.1, 1.8.1.2, 1.8.4.5, 1.8.4.6, and 2.1.1.1, 2.1.1.2, 2.1.2.1, 2.1.2.2, 2.2.3.2, 2.2.3.3, 2.2.2.1, 2.2.2.2). The purpose of this selection is twofold: on one hand, we want to determine at which level of accuracy different measures on suffix graphs reconstruct the SCOP classification. For example, if a measure correctly separates domains in different classes but not in different folds, we can conjecture that the measure grasps information encoded in the dominant secondary structures and not in their spatial arrangement. On the other hand, we want to test whether the primary sequence of domains systematically differs from random strings, and if so whether this difference is a widespread phenomenon or it is confined to specific leaves of the classification. To do so, we analyze 100 random permutations of each string in D₁.
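The random controls described above can be generated with a composition-preserving shuffle. This is a minimal sketch: the seed, the function name and the placeholder sequence are assumptions made for illustration, and the placeholder is not one of the D₁ domains.

```python
import random
from collections import Counter
from typing import List


def random_permutations(seq: str, count: int = 100, seed: int = 0) -> List[str]:
    """Return `count` random permutations of seq.

    Each permutation keeps the amino acid composition of seq and only
    scrambles the linear order. The seed makes the sample reproducible.
    """
    rng = random.Random(seed)
    permutations = []
    for _ in range(count):
        chars = list(seq)
        rng.shuffle(chars)  # in-place Fisher-Yates shuffle
        permutations.append("".join(chars))
    return permutations


domain = "MKVLAAGIVGLCAKHE"  # made-up placeholder, not a D1 sequence
controls = random_permutations(domain)
print(len(controls))
print(all(Counter(p) == Counter(domain) for p in controls))
```

Because composition is preserved, any measure that separates a domain from its permutations must depend on the order of the residues and not on their frequencies.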

Table 1.
Strings Composing D₁, with Their SCOP Family and Their ASTRAL Identifier (Chandonia et al., 2004)

Family ASTRAL Name, species

1.1.1.3 d3sdha_ a.1.1.2 (A:) Hemoglobin I, Scapharca inaequivalvis

1.1.1.3 d1b0ba_ a.1.1.2 (A:) Hemoglobin I, Lucina pectinata

1.1.1.3 d1h97a_ a.1.1.2 (A:) Trematode hemoglobin/myoglobin, Paramphistomum epiclitum

1.1.1.3 d1jl7a_ a.1.1.2 (A:) Glycera globin, Glycera dibranchiata

1.1.1.3 d1a6ma_ a.1.1.2 (A:) Myoglobin, Physeter catodon

1.1.1.3 d1mbaa_ a.1.1.2 (A:) Myoglobin, Aplysia limacina

1.1.1.3 d1mbsa_ a.1.1.2 (A:) Myoglobin, Phoca vitulina

1.1.1.3 d1ecoa_ a.1.1.2 (A:) Erythrocruorin, Chironomus thummi thummi, fraction III

1.1.1.3 d2gdma_ a.1.1.2 (A:) Leghemoglobin, Lupinus luteus

1.1.1.3 d1fsla_ a.1.1.2 (A:) Leghemoglobin, Glycine max, isoform A

1.1.1.3 d1d8ua_ a.1.1.2 (A:) Non-symbiotic plant hemoglobin, Oryza sativa

1.1.1.3 d1irda_ a.1.1.2 (A:) Hemoglobin alpha-chain, Homo sapiens

1.1.1.3 d2d5xa1 a.1.1.2 (A:1-141) Hemoglobin alpha-chain, Equus caballus

1.1.1.3 d1hdsa_ a.1.1.2 (A:) Hemoglobin alpha-chain, Odocoileus virginianus

1.1.1.3 d1irdb_ a.1.1.2 (B:) Hemoglobin beta-chain, Homo sapiens

1.1.1.3 d2d5xb1 a.1.1.2 (B:1-146) Hemoglobin beta-chain, Equus caballus

1.1.1.3 d1hdsb_ a.1.1.2 (B:) Hemoglobin beta-chain, Odocoileus virginianus

1.1.1.3 d1it2a_ a.1.1.2 (A:) Hagfish hemoglobin, Eptatretus burgeri

1.1.1.4 d1phna_ a.1.1.3 (A:) Phycocyanin alpha subunit, Cyanidium caldarium

1.1.1.4 d1f99a_ a.1.1.3 (A:) Phycocyanin alpha subunit, Polysiphonia urceolata

1.1.1.4 d1cpca_ a.1.1.3 (A:) Phycocyanin alpha subunit, Fremyella diplosiphon

1.1.1.4 d1phnb_ a.1.1.3 (B:) Phycocyanin beta subunit, Cyanidium caldarium

1.1.1.4 d1f99b_ a.1.1.3 (B:) Phycocyanin beta subunit, Polysiphonia urceolata

1.1.1.4 d1cpcb_ a.1.1.3 (B:) Phycocyanin beta subunit, Fremyella diplosiphon

1.1.1.4 d1alla_ a.1.1.3 (A:) Allophycocyanin alpha subunit, Spirulina platensis

1.1.1.4 d1b33a_ a.1.1.3 (A:) Allophycocyanin alpha subunit, Mastigocladus laminosus

1.1.1.4 d1kn1a_ a.1.1.3 (A:) Allophycocyanin alpha subunit, Porphyra yezoensis

1.1.1.4 d1allb_ a.1.1.3 (B:) Allophycocyanin beta subunit, Spirulina platensis

1.1.1.4 d1b33b_ a.1.1.3 (B:) Allophycocyanin beta subunit, Mastigocladus laminosus

1.1.1.4 d1kn1b_ a.1.1.3 (B:) Allophycocyanin beta subunit, Porphyra yezoensis

1.1.1.4 d1liaa_ a.1.1.3 (A:) Phycoerythrin alpha subunit, Polysiphonia urceolata

1.1.1.4 d1b8da_ a.1.1.3 (A:) Phycoerythrin alpha subunit, Griffithsia monilis

1.1.1.4 d1eyxa_ a.1.1.3 (A:) Phycoerythrin alpha subunit, Gracilaria chilensis

1.1.1.4 d1liab_ a.1.1.3 (B:) Phycoerythrin beta subunit, Polysiphonia urceolata

1.1.1.4 d1b8db_ a.1.1.3 (B:) Phycoerythrin beta subunit, Griffithsia monilis

1.1.1.4 d1eyxb_ a.1.1.3 (B:) Phycoerythrin beta subunit, Gracilaria chilensis

1.1.2.1 d1nekb1 a.1.2.1 (B:107-238) Succinate dehydrogenase, Escherichia coli

1.1.2.1 d1kf6b1 a.1.2.1 (B:106-243) Fumarate reductase, Escherichia coli

1.1.2.1 d2bs2b1 a.1.2.1 (B:107-239) Fumarate reductase, Wolinella succinogenes

1.1.2.2 d1h7wa1 a.1.2.2 (A:2-183) Dihydropyrimidine dehydrogenase, N-terminal domain, Sus scrofa

1.8.1.1 d1p7ia_ a.4.1.1 (A:) Engrailed Homeodomain, Drosophila melanogaster

1.8.1.1 d1akha_ a.4.1.1 (A:) Mating type protein A1 Homeodomain, Saccharomyces cerevisiae

1.8.1.1 d1k61a_ a.4.1.1 (A:) mat alpha2 Homeodomain, Saccharomyces cerevisiae

1.8.1.1 d1lfba_ a.4.1.1 (A:) Hepatocyte nuclear factor 1a (LFB1/HNF1), Rattus rattus

1.8.1.1 d1ic8a1 a.4.1.1 (A:201-276) Hepatocyte nuclear factor 1a (LFB1/HNF1), Homo sapiens

1.8.1.1 d1e3oc1 a.4.1.1 (C:104-160) Oct-1 POU Homeodomain, Homo sapiens

1.8.1.1 d1au7a1 a.4.1.1 (A:103-160) Pit-1 POU homeodomain, Rattus norvegicus

1.8.1.1 d1ftta_ a.4.1.1 (A:) Thyroid transcription factor 1 homeodomain, Rattus norvegicus

1.8.1.1 d1hdpa_ a.4.1.1 (A:) Oct-2 POU Homeodomain, Homo sapiens

1.8.1.1 d1ocpa_ a.4.1.1 (A:) Oct-3 POU Homeodomain, Mus musculus

1.8.1.1 d1b72a_ a.4.1.1 (A:) Homeobox protein hox-b1, Homo sapiens

1.8.1.2 d1gdta1 a.4.1.2 (A:141-183) gamma,delta resolvase, (C-terminal domain), Escherichia coli

1.8.1.2 d1tc3c_ a.4.1.2 (C:) Transposase tc3a1-65, Caenorhabditis elegans

1.8.1.2 d2ezla_ a.4.1.2 (A:) Ibeta subdomain of the mu end DNA-binding domain of phage mu transposase, Bacteriophage mu

1.8.1.2 d2ezia_ a.4.1.2 (A:) Transposase, Bacteriophage mu

1.8.4.5 d1i5za1 a.4.5.4 (A:138-206) Catabolite gene activator protein (CAP), C-terminal domain, Escherichia coli

1.8.4.5 d1ft9a1 a.4.5.4 (A:134-213) CO-sensing protein CooA, C-terminal domain, Rhodospirillum rubrum

1.8.4.5 d2bgca1 a.4.5.4 (A:138-237) Listeriolysin regulatory protein PrfA, C-terminal domain, Listeria monocytogenes

1.8.4.5 d2gaua1 a.4.5.4 (A:152-232) Transcriptional regulator PG0396, C-terminal domain, Porphyromonas gingivalis

1.8.4.5 d1zyba1 a.4.5.4 (A:148-220) Probable transcription regulator BT4300, C-terminal domain, Bacteroides thetaiotaomicron

1.8.4.5 d2coha1 a.4.5.4 (A:118-199) Transcriptional regulator TTHA1359, C-terminal domain, Thermus thermophilus

1.8.4.6 d1i1ga1 a.4.5.32 (A:2-61) LprA, Archaeon Pyrococcus furiosus

1.8.4.6 d2cyya1 a.4.5.32 (A:5-64) Putative transcriptional regulator PH1519, Archaeon Pyrococcus horikoshii

1.8.4.6 d2cg4a1 a.4.5.32 (A:4-66) Regulatory protein AsnC, Escherichia coli

1.8.4.6 d2cfxa1 a.4.5.32 (A:1-63) Transcriptional regulator LrpC, Bacillus subtilis

2.1.1.1 d1bwwa_ b.1.1.1 (A:) Immunoglobulin light chain kappa variable domain VL-kappa, Homo sapiens cluster 1

2.1.1.1 d1mjul1 b.1.1.1 (L:1-107) Immunoglobulin light chain kappa variable domain VL-kappa, Mus musculus cluster 1.1

2.1.1.1 d1lk3l1 b.1.1.1 (L:1-106) Immunoglobulin light chain kappa variable domain VL-kappa, Rattus norvegicus

2.1.1.1 d1pewa_ b.1.1.1 (A:) Immunoglobulin light chain lambda variable domain VL-lambda, Homo sapiens cluster 1

2.1.1.1 d1mfal1 b.1.1.1 (L:1-111) Immunoglobulin light chain lambda variable domain VL-lambda, Mus musculus

2.1.1.1 d1oaql_ b.1.1.1 (L:) Immunoglobulin light chain lambda variable domain VL-lambda, Rattus norvegicus

2.1.1.1 d1rzfh1 b.1.1.1 (H:2-113) Immunoglobulin heavy chain variable domain VH, Homo sapiens cluster 1

2.1.1.1 d1dlfh_ b.1.1.1 (H:) Immunoglobulin heavy chain variable domain VH, Mus musculus cluster 1

2.1.1.1 d1oaqh_ b.1.1.1 (H:) Immunoglobulin heavy chain variable domain VH, Rattus norvegicus

2.1.1.1 d1xfpa_ b.1.1.1 (A:) Camelid IG heavy chain variable domain VHh, Camelus dromedarius

2.1.1.1 d1sjva_ b.1.1.1 (A:) Camelid IG heavy chain variable domain VHh, Lama glama

2.1.1.1 d1h5ba_ b.1.1.1 (A:) T-cell antigen receptor, Mus musculus alpha-chain

2.1.1.1 d1ogad1 b.1.1.1 (D:3-117) T-cell antigen receptor, Homo sapiens alpha-chain

2.1.1.1 d1beca1 b.1.1.1 (A:3-117) T-cell antigen receptor, Mus musculus beta-chain

2.1.1.1 d1i8lc_ b.1.1.1 (C:) Immunoreceptor CTLA-4 (CD152) N-terminal fragment, Homo sapiens

2.1.1.1 d1dqta_ b.1.1.1 (A:) Immunoreceptor CTLA-4 (CD152) N-terminal fragment, Mus musculus

2.1.1.1 d1neua_ b.1.1.1 (A:) Myelin membrane adhesion molecule P0, Rattus norvegicus

2.1.1.1 d1pkoa_ b.1.1.1 (A:) Myelin oligodendrocyte glycoprotein (MOG), Rattus norvegicus

2.1.1.1 d1eaja_ b.1.1.1 (A:) Coxsackie virus and adenovirus receptor (Car), domain 1, Homo sapiens

2.1.1.1 d1qfoa_ b.1.1.1 (A:) N-terminal domain of sialoadhesin, Mus musculus

2.1.1.2 d1a3rl2 b.1.1.2 (L:115-214) Immunoglobulin light chain kappa constant domain CL-kappa, Mus musculus

2.1.1.2 d1lk3l2 b.1.1.2 (L:107-210) Immunoglobulin light chain kappa constant domain CL-kappa, Rattus norvegicus

2.1.1.2 d1c5cl2 b.1.1.2 (L:108-214) Immunoglobulin light chain kappa constant domain CL-kappa, Homo sapiens

2.1.1.2 d1mfbl2 b.1.1.2 (L:112-212) Immunoglobulin light chain lambda constant domain CL-lambda, Mus musculus

2.1.1.2 d1q0xl2 b.1.1.2 (L:108-212) Immunoglobulin light chain lambda constant domain CL-lambda, Homo sapiens

2.1.1.2 d1nfde2 b.1.1.2 (E:108-215) Immunoglobulin light chain lambda constant domain CL-lambda, Cricetulus griseus

2.1.1.2 d1c5ch2 b.1.1.2 (H:114-230) Immunoglobulin heavy chain gamma constant domain 1, CH1-gamma, Homo sapiens

2.1.1.2 d1mjuh2 b.1.1.2 (H:114-230) Immunoglobulin heavy chain gamma constant domain 1, CH1-gamma, Mus musculus

2.1.1.2 d1lk3h2 b.1.1.2 (H:120-219) Immunoglobulin heavy chain gamma constant domain 1, CH1-gamma, Rattus norvegicus

2.1.1.2 d2fbjh2 b.1.1.2 (H:119-220) Immunoglobulin heavy chain alpha constant domain 1, CH1-alpha, Mus musculus

2.1.1.2 d1dn0b2 b.1.1.2 (B:121-225) Immunoglobulin heavy chain mu constant domain 1, CH1-mu, Homo sapiens

2.1.1.2 d1l6xa1 b.1.1.2 (A:237-341) Immunoglobulin heavy chain gamma constant domain 2, CH2-gamma, Homo sapiens

2.1.1.2 d1igyb3 b.1.1.2 (B:236-361) Immunoglobulin heavy chain gamma constant domain 2, CH2-gamma, Mus musculus

2.1.1.2 d1i1ca1 b.1.1.2 (A:239-341) Immunoglobulin heavy chain gamma constant domain 2, CH2-gamma, Rattus norvegicus

2.1.1.2 d1l6xa2 b.1.1.2 (A:342-443) Immunoglobulin heavy chain gamma constant domain 3, CH3-gamma, Homo sapiens

2.1.1.2 d1cqka_ b.1.1.2 (A:) Immunoglobulin heavy chain gamma constant domain 3, CH3-gamma, Mus musculus

2.1.1.2 d1i1ca2 b.1.1.2 (A:342-443) Immunoglobulin heavy chain gamma constant domain 3, CH3-gamma, Rattus norvegicus

2.1.1.2 d1ow0a1 b.1.1.2 (A:242-342) Immunoglobulin heavy chain alpha constant domain 2, CH2-alpha, Homo sapiens

2.1.1.2 d1ow0a2 b.1.1.2 (A:343-450) Immunoglobulin heavy chain alpha constant domain 3, CH3-alpha, Homo sapiens

2.1.1.2 d1o0va1 b.1.1.2 (A:228-330) Immunoglobulin heavy chain epsilon constant domain 2, CH2-epsilon, Homo sapiens

2.1.2.1 d1p7hl1 b.1.18.1 (L:576-678) T-cell transcription factor NFAT1 (NFATC2), Homo sapiens

2.1.2.1 d1imhc1 b.1.18.1 (C:368-468) T-cell transcription factor NFAT5 (TONEBP), Homo sapiens

2.1.2.1 d1u3ya_ b.1.18.1 (A:) p50 subunit of NF-kappa B transcription factor, Homo sapiens

2.1.2.1 d1bfsa_ b.1.18.1 (A:) p50 subunit of NF-kappa B transcription factor, Mus musculus

2.1.2.1 d1a3qa1 b.1.18.1 (A:227-327) p52 subunit of NF-kappa B (NFKB), Homo sapiens

2.1.2.1 d1bfta_ b.1.18.1 (A:) p65 subunit of NF-kappa B (NFKB) dimerization domain, Mus musculus

2.1.2.1 d1my7a_ b.1.18.1 (A:) p65 subunit of NF-kappa B (NFKB) dimerization domain, Homo sapiens

2.1.2.1 d1gjia1 b.1.18.1 (A:182-281) p65 subunit of NF-kappa B (NFKB) dimerization domain, Gallus gallus C-rel

2.1.2.1 d1ttua1 b.1.18.1 (A:542-660) DNA-binding protein LAG-1 (CSL), Caenorhabditis elegans

2.1.2.1 d2cxka1 b.1.18.1 (A:872-953) Calmodulin binding transcription activator 1, Homo sapiens

2.1.2.2 d1gofa1 b.1.18.2 (A:538-639) Galactose oxidase C-terminal domain, Dactylium dendroides

2.1.2.2 d1k3ia1 b.1.18.2 (A:538-639) Galactose oxidase C-terminal domain, Fusarium sp.

2.1.2.2 d1w8oa1 b.1.18.2 (A:403-505) Sialidase “linker” domain, Micromonospora viridifaciens

2.1.2.2 d1clca2 b.1.18.2 (A:35-134) CelD cellulase N-terminal domain, Clostridium thermocellum

2.1.2.2 d1ut9a2 b.1.18.2 (A:208-305) Cellulose 1,4-beta-cellobiosidase CbhA precatalytic domain, Clostridium thermocellum

2.1.2.2 d1f1sa2 b.1.18.2 (A:171-248) Hyaluronate lyase precatalytic domain, Streptococcus agalactiae

2.1.2.2 d1qbaa1 b.1.18.2 (A:781-885) Bacterial chitobiase (N-acetyl-beta-glucoseaminidase) C-terminal domain, Serratia marcescens

2.1.2.2 d1kcla1 b.1.18.2 (A:496-581) Cyclomaltodextrin glycanotransferase domain D, Bacillus circulans, different strains

2.1.2.2 d1cyga1 b.1.18.2 (A:492-574) Cyclomaltodextrin glycanotransferase domain D, Bacillus stearothermophilus

2.1.2.2 d1pama1 b.1.18.2 (A:497-582) Cyclomaltodextrin glycanotransferase domain D, Bacillus sp., strain 1011

2.1.2.2 d1ciua1 b.1.18.2 (A:496-578) Cyclomaltodextrin glycanotransferase domain D, Thermoanaerobacterium thermosulfurigenes EM1

2.1.2.2 d1qhoa1 b.1.18.2 (A:496-576) Five domain “maltogenic” alpha-amylase (glucan 1,4-alpha-maltohydrolase), domain D, Bacillus stearothermophilus

2.1.2.2 d1gvia1 b.1.18.2 (A:1-123) Maltogenic amylase N-terminal domain N, Thermus sp.

2.1.2.2 d1ea9c1 b.1.18.2 (C:1-121) Maltogenic amylase N-terminal domain N, Bacillus sp. cyclomaltodextrinase

2.1.2.2 d1ji1a1 b.1.18.2 (A:1-122) Maltogenic amylase N-terminal domain N, Thermoactinomyces vulgaris TVAI

2.1.2.2 d1j0ha1 b.1.18.2 (A:1-123) Neopullulanase N-terminal domain, Bacillus stearothermophilus

2.2.3.2 d1n67a1 b.2.3.4 (A:229-369) Clumping factor A, Staphylococcus aureus

2.2.3.2 d1r17a1 b.2.3.4 (A:276-424) Fibrinogen-binding adhesin SdrG, Staphylococcus epidermidis

2.2.3.3 d1uwfa1 b.2.3.2 (A:1-158) Mannose-specific adhesin FimH, Escherichia coli

2.2.3.3 d1pdkb_ b.2.3.2 (B:) PapK pilus subunit, Escherichia coli

2.2.3.3 d1n12a_ b.2.3.2 (A:) PapE pilus subunit, Escherichia coli

2.2.3.3 d1p5vb_ b.2.3.2 (B:) F1 capsule antigen Caf1, Yersinia pestis

2.2.3.3 d2co3a1 b.2.3.2 (A:10-142) SafA pilus subunit, Salmonella typhimurium

2.2.2.1 d1exha_ b.2.2.1 (A:) Exo-1,4-beta-D-glycanase (cellulase xylanase) cellulose-binding domain CBD, Cellulomonas fimi

2.2.2.1 d1xbda_ b.2.2.1 (A:) Endo-1,4-beta xylanase D xylan binding domain XBD, Cellulomonas fimi

2.2.2.2 d1nbca_ b.2.2.2 (A:) Cellulosomal scaffolding protein A, scaffoldin, Clostridium thermocellum

2.2.2.2 d1g43a_ b.2.2.2 (A:) Cellulosomal scaffolding protein A, scaffoldin, Clostridium cellulolyticum

2.2.2.2 d1tf4a2 b.2.2.2 (A:461-605) Endo/exocellulase:cellobiose E-4 C-terminal domain, Thermomonospora fusca

2.2.2.2 d1g87a2 b.2.2.2 (A:457-614) Endo/exocellulase:cellobiose E-4 C-terminal domain, Clostridium cellulolyticum atcc 35319

2.2.2.2 d1aoha_ b.2.2.2 (A:) Cohesin domain, Clostridium thermocellum, cellulosome, various modules

2.2.2.2 d1zv9a1 b.2.2.2 (A:3-173) Cellulosomal scaffoldin adaptor protein B, ScaB, Acetivibrio cellulolyticus

2.2.2.2 d1tyja1 b.2.2.2 (A:2-171) Cellulosomal scaffoldin ScaA, Bacteroides cellulosolvens

2.2.2.2 d2bm3a1 b.2.2.2 (A:5-166) Scaffolding dockerin binding protein A SdbA, Clostridium thermocellum

As shown in Figure 1, all domains use 15–20 symbols and have empirical entropy³ of 3.5–4.2; however, entropy in the same SCOP leaf can vary widely inside this range. Conversely, string length and the compression ratio achieved by a popular string compressor are uniform inside many leaves (notably 1.1.1.3, 1.1.1.4, 1.1.2.1, 1.8.4.6, 2.1.1.1, 2.1.1.2, 2.1.2.1, 2.1.2.2, 2.2.3.3, 2.2.2.2), and they display trends that are very similar to each other. Not surprisingly, a significant proportion of domains are either expanded or not compressed.

FIG. 1.
Measures on strings in D₁ ∪ D₂. Empirical entropy is computed using logarithms to base 2, where 0 · log₂(0) is set to 0. Compression ratio is defined as (|s| − |s′|)/|s|, where s is the original string, s′ is its compression with gzip --best, and | · | is the size in bytes.
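The two measures defined in this caption can be sketched as follows. This is a minimal illustration, not the code used for Figure 1: it assumes that "empirical entropy" means the order-0 (single-symbol) entropy, and it uses Python's gzip module at level 9 as a stand-in for the command-line gzip --best, so byte counts may differ slightly from those of the command-line tool.

```python
import gzip
import math
from collections import Counter


def empirical_entropy(s: str) -> float:
    """Order-0 empirical entropy in bits per symbol (base-2 logarithms)."""
    n = len(s)
    if n == 0:
        return 0.0
    counts = Counter(s)
    # Absent symbols never appear in `counts`, which matches the
    # convention 0 * log2(0) = 0 stated in the caption.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def compression_ratio(s: str) -> float:
    """(|s| - |s'|) / |s| in bytes; assumes a non-empty ASCII string."""
    raw = s.encode("ascii")
    packed = gzip.compress(raw, compresslevel=9)  # assumed equivalent of --best
    return (len(raw) - len(packed)) / len(raw)


if __name__ == "__main__":
    region = "DSHRDASQNGSGDSQ"  # DisProt 9, as quoted in Section 5
    print(empirical_entropy(region), compression_ratio(region))
```

A negative ratio means that the compressed file is larger than the original (the string is "expanded"), which is expected for short strings because of the fixed gzip header and trailer.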

Modern proteins do not consist entirely of domains: some regions have no fixed spatial configuration under physiological conditions, but are capable of dynamically transitioning through an ensemble of structures (Richardson, 1981; Wright and Dyson, 1999; Sickmeier et al., 2007). The flexibility of these unstructured (or disordered) segments allows them to fold and bind to a target simultaneously, transitioning from disorder to order according to their biochemical environment: this allows a single protein to bind multiple targets, and different proteins to bind the same target, an important feature in signalling and regulation networks. The fluctuation of spatial conformation is also exploited to create regions of exclusion in space, to facilitate phosphorylation and acetylation, and to capture small molecules. In disordered regions, the relationship between sequence and structure differs from that in typical folded domains: disordered regions are known to be enriched in charged and polar residues, and depleted in hydrophobic residues. Along with other chemical and spatial indicators, these biases have been used to construct various disorder prediction heuristics (Li et al., 2000) and to classify disordered regions into subclasses. From the purely syntactic point of view, disordered regions tend to have low entropy (Romero et al., 2000; Weathers et al., 2006); however, some disordered sequences have high entropy and some low-entropy sequences are not disordered.

DisProt (Sickmeier et al., 2007) is a comprehensive functional classification of all polypeptide regions for which there is experimental evidence of disorder. We collect a subset of 23 regions of DisProt in a secondary dataset (called D₂ in what follows; Table 2); the choice of strings in this set is again arbitrary, except that for efficiency and consistency we consider only proteins having a single disordered region of length at most 200. The purpose of this dataset is twofold. On the one hand, we want to test whether disordered regions differ from domains according to measures on suffix graphs: we conjecture that if a measure clearly separates D₁ from D₂, then it grasps information that only polypeptides with a fixed spatial conformation encode. On the other hand, we want to test whether disordered regions can be distinguished from random strings, and whether such differences resemble those that occur between domains and random strings. To do so, we analyze again 100 random permutations of each string in D₂.
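Generating the random permutations used as a null model can be sketched as below. The function name, the seed, and the choice of Python's standard pseudo-random generator are assumptions made for illustration; the text does not specify how the permutations were drawn.

```python
import random
from collections import Counter


def random_permutations(s: str, count: int = 100, seed: int = 0) -> list:
    """Return `count` uniformly shuffled copies of `s`.

    Every copy has exactly the same symbol composition as `s`;
    only the linear order of the symbols changes.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    permutations = []
    for _ in range(count):
        symbols = list(s)
        rng.shuffle(symbols)
        permutations.append("".join(symbols))
    return permutations


if __name__ == "__main__":
    region = "DSHRDASQNGSGDSQ"  # DisProt 9, as quoted in Section 5
    shuffled = random_permutations(region)
    print(len(shuffled), Counter(shuffled[0]) == Counter(region))
```

Because composition is preserved, any measure that depends only on symbol frequencies (such as order-0 entropy) is identical for a string and all of its permutations; differences can arise only from measures that are sensitive to order.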

Table 2.
Strings Composing D₂, with Their DisProt Identifier and Their Location in the Containing Polypeptides

DisProt Containing polypeptide Location

DP00001 60S acidic ribosomal protein P1-B 1-108

DP00002 60S acidic ribosomal protein P2-beta 1-110

DP00004 Cathelicidin antimicrobial peptide 134-170

DP00005 Antitermination protein N 1-107

DP00006 Cytochrome c 1-104

DP00009 Transcription initiation factor IIA small subunit 89-103

DP00012 Cystic fibrosis transmembrane conductance regulator 708-831

DP00013 Choriogonadotropin subunit beta 112-145

DP00015 cAMP-dependent protein kinase inhibitor alpha 1-75

DP00016 Cyclin-dependent kinase inhibitor 1 1-164

DP00018 Cyclin-dependent kinase inhibitor 1B 22-106

DP00019 Cytochrome b-c1 complex subunit Rieske, mitochondrial 1-45

DP00020 DNA-binding protein RAP1 482-512

DP00022 EMB-1 protein 1-92

DP00024 Protein E7 1-98

DP00025 Fibronectin-binding protein A 745-873

DP00027 Negative regulator of flagellin synthesis 1-97

DP00028 Eukaryotic translation initiation factor 4E-binding protein 1 1-118

DP00031 Glycine N-methyltransferase 1-40

DP00032 Glycyl-tRNA synthetase 91-158

DP00034 Attachment protein G3P 236-274

DP00035 Guanine nucleotide-binding protein G(i), alpha-1 subunit 1-31

DP00038 Non-histone chromosomal protein HMG-14 1-99

As shown in Figure 1, just 8 regions in D₂ use fewer than 15 symbols, and just 5 regions have empirical entropy less than 3.5 (these thresholds being the minima in D₁). Furthermore, just 8 regions have positive compression ratio, and compression ratios in D₂ are never larger than the largest compression ratio achieved in D₁. Therefore, strings in D₂ do not appear as systematically “less complex” than strings in D₁. Six disordered regions have compression ratio smaller than −0.58, the minimum in D₁, but this does not allow us to conclude that disordered regions are systematically “more complex” than strings in D₁, either.

5. Classifying with Suffix Graphs

Figure 1 shows that entropy and number of symbols cannot be used to group the dataset into SCOP regions, but that compression ratio and length are largely constant inside some SCOP leaves, and that values in different leaves can be clustered around approximately three levels: “high” (for folds 1.1 and 2.2), “medium” (for 2.1), and “low” (for 1.8). We want to test whether any measure on suffix graphs achieves a similar grouping, and whether these groups have a parallel in the classification of SCOP.

No single measure considered in its absolute value displays such clustered behavior;⁴ however, groups of similar values seem to appear when measures are taken relative to the total number of their respective elements. For example, the number of special points divided by the total number of points (Fig. 2) assumes approximately three distinct values at every ω < 3: “low” (for folds 1.1 and 2.2), “medium” (for 2.1), and “high” (for 1.8). D₂ does not form a group of its own: some disordered regions assume values that are significantly larger than domains, while others fall in the range of D₁. A similar pattern occurs in antispecial points, internal arcs and external arcs at ω < 3; the uniformity inside SCOP leaves, on the other hand, decreases in all measures when increasing ω beyond 3. Terminal points are an exception: both their clustered trend and their values remain approximately the same up to ω = 8.

FIG. 2.
Number of special points divided by the total number of points across D₁ (bottom) and D₁ ∪ D₂ (top) at ω = 2. Similar trends appear at ω = 1, 3.

This consistency across different measures suggests that the relative abundance of points and arcs could be the foundation of a coherent clustering criterion. Representing each string as a vector in the four-dimensional simplex of the relative number of special, antispecial, normal and terminal points, and setting ω = 1, 2, 3, 4, some disordered regions (respectively 8, 7, 9, and 7 out of the 23 of D₂) are separated from the rest of D₁ ∪ D₂ into an independent subtree, accompanied by some members of fold 1.8 (Fig. 3). This group of strong outliers is partitioned, in its turn, into two or three well-separated subgroups at all ω ≤ 4. The following strings are outliers at ω ≤ 4:
DisProt 34: GGGSGGGSGGGSEGGGSEGGGSEGGGSEGGGSGGGSGSG, part of attachment protein G3P;

DisProt 9: DSHRDASQNGSGDSQ, part of transcription initiation factor IIA small subunit;

DisProt 13: DPRFQDSSSSKAPPPSLPSPSRLPGPSDTPILPQ, part of choriogonadotropin subunit beta;

FIG. 3.
UPGMA trees between four-dimensional vectors that collect the relative number of special, antispecial, normal, and terminal points of each string in D₁ ∪ D₂. Distances are Euclidean. Branch lengths are drawn proportional to distance. Disordered regions are denoted with “Dis.”

and the following additional strings are outliers at ω ≤ 3:
DisProt 35: GCTLSAEDKAAVERSKMIDRNLREDGEKAAR, part of guanine nucleotide-binding protein G(i), alpha-1 subunit;

DisProt 4: LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES, part of cathelicidin antimicrobial peptide.

These strings are among the shortest in D₂, but only DisProt 34 stands out for its apparent regularity, and is put farthest from the rest of the dataset in the trees. At ω ≤ 3, most strings in fold 1.8 are also strongly separated from the rest of D₁: these strings form two subtrees, one containing some strings in D₂ and the other containing some strings in D₂ and 2.1. A closer look at the densities at ω < 4 (Fig. 5) shows that the average numbers of special, normal and terminal points in D₂ are at least 2, 4, and 2 times larger, respectively, than the corresponding values in D₁, and that the average number of terminal points in fold 1.8 is approximately two times larger than the average number of terminal points in the rest of D₁. Strong outliers in D₂ have average numbers of special, normal and terminal points that are at least 3, 5, and 2 times those of the rest of D₂, respectively, while special and normal points do not separate 1.8 from D₁ at any ω > 2.
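The tree construction behind Figure 3 (UPGMA over Euclidean distances between four-dimensional composition vectors) can be sketched as follows. This is a plain, quadratic-memory illustration rather than the implementation used for the figures; the function name and the example vectors are hypothetical, and branch lengths are not computed.

```python
import math


def upgma(vectors):
    """Cluster vectors by UPGMA on Euclidean distance.

    Returns the tree topology as nested tuples of input indices.
    Assumes at least one vector.
    """
    # Each active cluster id maps to (subtree, number of leaves).
    clusters = {i: (i, 1) for i in range(len(vectors))}
    # Distances between active clusters, keyed by (smaller id, larger id).
    dist = {
        (i, j): math.dist(vectors[i], vectors[j])
        for i in clusters for j in clusters if i < j
    }
    next_id = len(vectors)
    while len(clusters) > 1:
        # Merge the two closest active clusters.
        i, j = min(dist, key=dist.get)
        del dist[(i, j)]
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        for k in clusters:
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            # Size-weighted mean, i.e. the average over all leaf pairs.
            dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


if __name__ == "__main__":
    # Hypothetical (special, antispecial, normal, terminal) fractions.
    composition = [
        (0.10, 0.20, 0.30, 0.40),
        (0.10, 0.20, 0.31, 0.39),
        (0.40, 0.30, 0.20, 0.10),
        (0.38, 0.32, 0.20, 0.10),
    ]
    print(upgma(composition))
```

Since each vector lists fractions that sum to one, two strings are close in this space only when their points are distributed among the four types in similar proportions, regardless of which substrings the strings actually share.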

At ω = 5, the shape of the tree seems to experience a phase transition: there are still at least three sharp partitions with clear subpartitions, but none of these subtrees either consists predominantly of disordered regions, or reflects a SCOP class. At ω = 6, another transition seems to occur: the tree becomes separated into two groups associated with folds 1.1–2.2 and 2.1–1.8, respectively, but the interior of each group does not further reflect the SCOP hierarchy or the division between D₁ and D₂. At ω = 8 the tree becomes organized into 3 groups: one containing members of D₂, 1.8 and 2.1, another one containing strings in 2.1, and the third containing strings in 1.1–2.2. DisProt 34, 9 and 13 are put consistently far from the other groups at all ω ≥ 5, but they never form an independent subtree.

By discarding branch length information, more details about the topology of the trees can be appreciated (Fig. 4). At ω = 1, three distinct subtrees⁵ for folds 1.1, 2.1, and for a mixture of 1.8 and D₂ can be identified, as well as subtrees for families 2.1.1.1, 2.1.1.2, 1.1.1.3, and 1.1.1.4. Most members of fold 2.2 lie in the subtree of fold 1.1, while the disordered regions that are not strong outliers are dispersed mainly in the 2.1 and in the 1.8-D₂ subtree. Setting ω = 2 shows a similar structure, but in this case the 1.1 subtree contains a large subtree of 2.1 strings, and families 2.1.1.1 and 2.1.1.2 do not form compact trees. Fold 2.2 still tends to merge with fold 1.1. At ω = 3 only the subtree of strings in 1.8 and D₂ survives, while the rest of the tree does not resemble SCOP any more. At ω = 4, 5 no subtree keeps a strong resemblance to a SCOP group. Finally, at ω ≥ 6, no additional structure can be detected inside the subtrees marked in Figure 3.

FIG. 4.
The trees of Figure 3 with branch lengths not drawn proportional to distance. Each string is labeled by the identifier of its SCOP family; disordered regions are labeled by their DisProt ID. White dots mark subtrees for which a SCOP group exists whose Jaccard coefficient with the subtree is at least 0.3; black dots mark subtrees for which a SCOP group exists whose Jaccard coefficient with the subtree is at least 0.5. The identifiers of the SCOP groups that have maximum Jaccard coefficient with each subtree are noted near the dots.
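The dot-marking rule of this caption can be sketched as below. The helper names are hypothetical, and the example subtree is invented for illustration; only the domain identifiers and their group labels are taken from Table 1.

```python
def jaccard(a, b) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_matching_group(subtree_leaves, groups):
    """Return (group id, coefficient) for the group closest to the subtree."""
    scored = ((gid, jaccard(subtree_leaves, members))
              for gid, members in groups.items())
    return max(scored, key=lambda pair: pair[1])


def dot_colour(coefficient: float):
    """Thresholds from the caption: black at >= 0.5, white at >= 0.3."""
    if coefficient >= 0.5:
        return "black"
    if coefficient >= 0.3:
        return "white"
    return None


if __name__ == "__main__":
    groups = {
        "1.1.2.1": {"d1nekb1", "d1kf6b1", "d2bs2b1"},
        "1.1.2.2": {"d1h7wa1"},
    }
    subtree = {"d1nekb1", "d1kf6b1", "d1h7wa1"}  # hypothetical subtree
    gid, score = best_matching_group(subtree, groups)
    print(gid, score, dot_colour(score))
```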

FIG. 5.
Relative proportion of special points (black, bottom), antispecial points (light gray), normal points (black, top), terminal points (white) at ω ≤ 4. On the horizontal axis, from left to right (separated by dashed lines): the subset of strong outliers of D₂, the rest of D₂, fold 1.8, and the rest of D₁. For clarity of presentation, not all members of D₁ are shown.

The topological phase transition that takes place at ω = 6 can be assessed quantitatively using the approach proposed in Ferragina et al. (2007). Given a threshold distance τ, we treat a distance matrix M as a binary classifier that puts strings i and j in the same group iff M(i, j) < τ. Using a specific level of the SCOP tree as reference, we measure the false positive (FP) and the true positive (TP) rate of this classifier for different choices of τ: plotting these values on the FP/TP space traces the receiver operating characteristic (ROC) of the classifier. The characteristic of a classifier that perfectly reproduces the SCOP classification would be the locus {FP = 0} ∪ {TP = 1}, while the ROC of a random guess would be {FP = TP}; in general, the larger the area under the curve, the closer the classifier is to SCOP. The ROC analysis for folds, superfamilies and families (Fig. 6) shows that adherence to SCOP decreases when ω is increased from 1 to 5, but then increases when ω is set to 6, 7, and 8. At the class level, the increase of classification ability at high ω is less evident. Classification performance is approximately the same at the fold, superfamily and family level.
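The thresholded classifier described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the reference labels stand for the groups at one chosen SCOP level, the area is estimated with the trapezoidal rule, and the reference is assumed to contain at least one same-group pair and one different-group pair.

```python
from itertools import combinations


def roc_points(dist, labels, thresholds):
    """Return one (FP rate, TP rate) point per threshold tau.

    dist:   symmetric distance matrix M as a list of lists.
    labels: reference group of each string.
    A pair (i, j) is predicted "same group" iff dist[i][j] < tau.
    """
    pairs = list(combinations(range(len(labels)), 2))
    positives = [(i, j) for i, j in pairs if labels[i] == labels[j]]
    negatives = [(i, j) for i, j in pairs if labels[i] != labels[j]]
    points = []
    for tau in thresholds:
        tp = sum(1 for i, j in positives if dist[i][j] < tau)
        fp = sum(1 for i, j in negatives if dist[i][j] < tau)
        points.append((fp / len(negatives), tp / len(positives)))
    return points


def area_under_curve(points) -> float:
    """Trapezoidal area under the ROC, anchored at (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


if __name__ == "__main__":
    # Toy matrix (hypothetical): two tight groups, far from each other.
    M = [[0.0, 0.1, 0.9, 0.9],
         [0.1, 0.0, 0.9, 0.9],
         [0.9, 0.9, 0.0, 0.1],
         [0.9, 0.9, 0.1, 0.0]]
    curve = roc_points(M, ["a", "a", "b", "b"], [0.0, 0.5, 1.0])
    print(curve, area_under_curve(curve))
```

In the toy example the curve passes through (0, 1), the ideal corner, so the area equals 1; a matrix carrying no information about the reference groups would instead yield points near the diagonal FP = TP and an area near 0.5.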

FIG. 6.
The relative composition of points used as a classifier of SCOP levels. Gray dotted lines are ROC curves for ω ≤ 5; black lines are ROC curves for ω ≥ 6, with ω increasing in the direction of the arrow. D₂ is considered only at the class level.

Describing the trees produced by different choices of the tree-construction algorithm, as well as fine-tuning the distance and the combination of measures to achieve the best resemblance to SCOP, is beyond the scope of this paper. Here we just note that arcs give results similar to points, but the relative proportion of special, antispecial, normal and terminal subsequences is not capable of partitioning D₁ into SCOP groups for any ω > 1 (Fig. 7): a few strings are put very far from the rest of the dataset (among which DisProt 34 and DisProt 9), but this group of outliers does not correspond to any SCOP cluster. Other well-separated subtrees do appear in this classification, but they have no clear SCOP analogue.

FIG. 7.
The relative composition of subsequences (ω > 1) used as a classifier of SCOP levels. D₂ is considered only at the class level.

We remark that the ability of suffix graph measures to detect biologically meaningful groups of polypeptides is surprising, because the distance between two strings is computed by comparing their compositions, and not by counting patterns or substructures that occur in both. Recall also that strings have not been recoded with alphabets that incorporate biochemical information. It is therefore natural to ask how these trees compare to those derived from techniques that do exploit common features and biochemical properties: we expect the latter to strongly separate all strings in D₂ from D₁, and to reproduce the shape of SCOP more accurately.

Current measures of dissimilarity between biological strings can be grouped into essentially two families. Alignment-based measures (Apostolico and Giancarlo, 1998; Sankoff and Kruskal, 1999), popularized by the BLAST package, define similarity as a function of the local alignment score computed using the Smith-Waterman algorithm with a highly tuned penalty function. Even though alignment-based similarity notoriously suffers from methodological and technical limits when applied to whole genomes, it is ideally suited for searching proteins and short nucleotides in databases. We test the classification ability of alignment-based similarity by computing, for each pair of strings x,y in D₁ ∪ D₂, the distance d(x, y) = (1 − b(x, y)/b(x, x)) · (1 − b(y, x)/b(y, y)), where b(x, y) is the bit score of the alignment between x and y computed by blastp, a version of BLAST specifically designed for protein-protein comparison and using the BLOSUM62 amino acid similarity matrix.
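The distance d(x, y) defined above can be sketched as follows, given a table of bit scores that has already been extracted from blastp output. The function name and the toy scores are hypothetical. Treating a pair with no reported alignment as having bit score 0 is an assumption made here for illustration; the text does not say how such pairs were handled.

```python
def blast_distance(bits, x, y) -> float:
    """d(x, y) = (1 - b(x, y)/b(x, x)) * (1 - b(y, x)/b(y, y)).

    bits maps (query, subject) to the blastp bit score b(query, subject).
    Self-scores b(x, x) and b(y, y) must be present and positive.
    """
    b_xy = bits.get((x, y), 0.0)  # assumption: missing alignment scores 0
    b_yx = bits.get((y, x), 0.0)
    return (1 - b_xy / bits[(x, x)]) * (1 - b_yx / bits[(y, y)])


if __name__ == "__main__":
    # Toy bit scores (hypothetical), not taken from any real blastp run.
    bits = {
        ("p", "p"): 100.0, ("q", "q"): 80.0,
        ("p", "q"): 50.0, ("q", "p"): 40.0,
        ("r", "r"): 60.0,
    }
    print(blast_distance(bits, "p", "q"))  # both factors equal 0.5
    print(blast_distance(bits, "p", "r"))  # no alignment reported
```

Normalizing each score by the self-score of its query makes the two factors dimensionless, and the product is symmetric in x and y even when b(x, y) and b(y, x) differ.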

Alignment-free techniques (Vinga and Almeida, 2003; Qi et al., 2004; Ulitsky et al., 2006; Apostolico and Denas, 2008; Sims et al., 2009; Apostolico et al., 2010) compare two strings using the frequency of common substructures (typically short substrings or subsequences), thereby implicitly projecting a string onto a vector that belongs to the space generated by the family of these substructures. The oldest such technique projects a string into the space of all possible substrings of a fixed length k, and uses geometric or statistical measures to estimate the distance between two vectors. Since the effectiveness of this technique critically depends on k (Wu et al., 2005), it will not be examined here. Another notable subclass of alignment-free techniques resorts to mutual compressibility: the Normalized Compression Distance (NCD) (Cilibrasi and Vitányi, 2005; Kocsor et al., 2006; Ferragina et al., 2007), for example, approximates the Normalized Information Distance, a non-computable measure of dissimilarity hinged on Kolmogorov complexity that has been shown to be a lower bound to every possible distance measure between two strings (Li et al., 2003a). NCD summons a textual compressor on the concatenation of a pair of strings, thereby exploiting the features occurring in both that the compressor is designed to detect.

We test the classification ability of alignment-free distances by using NCD with the general-purpose Lempel-Ziv compressor gzip. A special-purpose, highly tuned protein compressor could better approximate the ideal compressor that should be used with NCD. Unfortunately, none of the protein compressors listed in Section 1 is both publicly available and mature enough for use. Since many of them resort to exact and approximate reverse complements, we experiment with the popular gencompress (Chen et al., 2000), based on similar principles, even though we are aware that this program was designed for DNA compression.

The classification performance of the composition of points at ω ≤ 3 turns out to be comparable or superior to that of blastp, gzip and gencompress at the fold and superfamily levels (Fig. 8). At the family level, the composition of points has a performance comparable or superior to both gzip and gencompress, but inferior to blastp, while at the class level the composition of points is inferior to gzip and gencompress, but comparable to blastp. At ω ≥ 7 (Fig. 9) the composition of points is comparable to gzip and gencompress at all levels except class, but is inferior to blastp at all levels except fold and superfamily.

FIG. 8.
ROC curves for blastp (with default parameters), gzip --best and gencompress distances (gray dotted lines), compared to the ROC curves for the relative number of points at ω ≤ 3 (black solid lines). The blastp curve is truncated because low-scoring alignments are not reported by the program. D₂ is considered only at the class level.

FIG. 9.
ROC curves for blastp (with default parameters), gzip --best and gencompress distances (gray dotted lines), compared to the ROC curves for the relative number of points at ω ≥ 7 (black solid lines). The blastp curve is truncated because low-scoring alignments are not reported by the program. D₂ is considered only at the class level.

UPGMA trees built with both alignment-free distances sharply separate D₁ from D₂ (Fig. 10), while the blastp tree is not capable of grouping all members of D₂ under a single subtree. Both alignment-based and alignment-free trees contain macroscopic differences with respect to the SCOP hierarchy, putting a class or fold in the subtree associated with another. Furthermore, drawing branch lengths proportional to distance produces alignment-free trees that are largely devoid of structure (Fig. 10): most strings in D₁ are put at approximately the same distance from all other strings in D₁. In comparison, the blastp tree is more structured, but longer branches do not correspond to major SCOP divisions. Finally, no strong outlier can be detected in any tree.

FIG. 10.
UPGMA trees induced by (from left to right): blastp scores (with default parameters), the Normalized Compression Distance with compressor gzip --best, and the Normalized Compression Distance with compressor gencompress. First row: branches not drawn proportional to distance. Second row: branches drawn proportional to distance. NCD(x, y) = (min{C(xy), C(yx)} − min{C(x), C(y)})/max{C(x), C(y)}, where C is a normal string compressor.
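The NCD formula of this caption can be sketched as follows, taking C(s) to be the size in bytes of the gzip output. Using Python's gzip module at level 9 in place of the command-line compressor is an assumption, and the function names are hypothetical.

```python
import gzip


def compressed_size(data: bytes) -> int:
    """C(s): size in bytes of the gzip level-9 compression of s."""
    return len(gzip.compress(data, compresslevel=9))


def ncd(x: str, y: str) -> float:
    """NCD(x, y) = (min{C(xy), C(yx)} - min{C(x), C(y)}) / max{C(x), C(y)}."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    # Both concatenation orders are tried, as in the formula above.
    c_joint = min(compressed_size(bx + by), compressed_size(by + bx))
    return (c_joint - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    g3p = "GGGSGGGSGGGSEGGGSEGGGSEGGGSEGGGSGGGSGSG"  # DisProt 34
    cath = "LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES"  # DisProt 4
    print(ncd(g3p, g3p), ncd(g3p, cath))
```

On strings this short, the fixed gzip header and trailer account for a large share of every compressed size, which compresses the range of observable NCD values.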

6. Laws Governing Polypeptides

As stated in the introduction, our aim is to identify laws constraining the suffix graphs of polypeptides. In this section, we investigate the dependence of suffix graph measures on string length and on the hiatus of subsequences.

6.1. Dependence on string length

The comparison of Figures 1 and 2 suggests that the number of special points taken relative to the total number of points is inversely proportional to string length. At ω = 1 this inverse proportionality is not unexpected: the total number of points (in this case, distinct substrings) grows at most as the square of string length, and the number of special points (in this case, equivalent to the number of distinct special substrings) grows at most linearly with string length; therefore, the ratio (Special points/Total points) should behave like a/n + b, where n is string length and a, b are suitable constants, assuming the strings in the dataset to be approximately random. In principle, every string in D₁ ∪ D₂ could obey a different set of parameters, making D₁ ∪ D₂ appear as a disordered cloud in the space generated by the (Special points/Total points) ratio and string length. However, in light of the homogeneities of Figures 1 and 2, and of the trees built in the previous section, we expect to see a limited number of distinct curves along which domains in similar SCOP groups are aligned. These curves (that we will also call loci) should be signatures of such groups, and their detection could guide classification.

Plotting the relative number of special points versus string length at ω = 1 (Fig. 11) shows indeed the expected 1/n proportionality, but surprisingly the majority of strings in D₁ ∪ D₂ are aligned along the same locus with coefficients⁶ a ≈ 1.435, b ≈ 0. Significantly, the only strong outlier is DisProt 34. There are three features that make DisProt 34 unique in the dataset: its highly repetitive structure, the use of just 3 distinct symbols, and its small entropy ( ≈ 1.215). The locus could therefore reflect a property of all strings that lack a strong periodic structure, that have a sufficiently high number of symbols, a sufficiently high entropy, or any combination of these three features. To test this hypothesis, we collect an additional dataset consisting of 89 distinct SCOP domains of length at most 30: we will refer to this set as D₃ in what follows. It turns out that D₃ contains at least two strings (tumor necrosis factor receptor superfamily member 17, BCMA, csqneyfdsllhacipcqlrc; nucleic acid binding protein p14, kgpvcfscgktghikrdckee) that lack a strong periodic structure, use a number of symbols comparable to strings in D₁ ∪ D₂ (14 and 13, respectively), have entropy comparable to strings in D₁ ∪ D₂ ( ≈ 3.5398 and ≈ 3.8643, respectively), but that do not lie on the locus. This proves that the locus cannot be explained by any combination of the candidate quantities alone. We also note that low entropy alone does not force a string to lie outside the locus: Figure 12 shows that random strings on 20 symbols and minimum entropy ( ≈ 1.1169, even lower than DisProt 34) can lie on the curve.
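Two of the candidate quantities discussed above, the number of distinct symbols and the entropy of a string, are simple to compute. The sketch below assumes that entropy means the Shannon entropy of the empirical symbol frequencies with base-2 logarithms; this section does not state the base, so that choice and the function names are assumptions.

```python
import math
from collections import Counter


def alphabet_size(s: str) -> int:
    """Number of distinct symbols occurring in s."""
    return len(set(s))


def shannon_entropy(s: str) -> float:
    """Shannon entropy (in bits) of the symbol frequencies of s."""
    if not s:
        raise ValueError("entropy of the empty string is undefined here")
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    # Toy string; any amino acid string can be passed in the same way.
    s = "aabbbcccc"
    print(alphabet_size(s), shannon_entropy(s))
```

Under this definition the entropy of a string with k distinct symbols is at most log₂ k, which gives a quick consistency check on any reported pair of values.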

FIG. 11.
Relative number of special points versus string length, in domains (dots) and disordered regions (circles). Strings in D₃ are represented with gray crosses. The best interpolating a/n + b curve is shown as a black line.

FIG. 12.
The graph of Figure 11 at ω = 1. (Left) Some strings in D₃ (gray crosses) do not lie in the locus. (Right) Some random strings on 20 symbols and minimum entropy (gray crosses) enter the locus.

A locus with similar parameters persists at ω = 2, 3, with DisProt 34 and some strings in D₃ continuing to be outliers. Increasing ω beyond 3 gradually transforms the sharp locus into a dispersed cloud that bears no resemblance to the original curve. At ω ≥ 6 three disordered regions (DisProt 34, 13, and 19) become clearly separated from the rest of the dataset, along with a few strings in D₁.

We expect a direct proportionality between the number of special points y (not normalized) and string length n, and in particular a linear relationship y = an + b when ω = 1. Plotting these two quantities together (Fig. 13) shows indeed a linear bundle centered around a ≈ 0.317, b ≈ 0.318 for all domains and disordered regions except DisProt 34 and 25 (part of Fibronectin-binding protein A). The locus remains linear at ω = 2, but from ω = 3 it progressively tends towards a sparse nonlinear shape that we will call “horn.” We note that, at ω ≥ 6, this shape includes strings that were outliers at lower ω, in particular the highly regular DisProt 34. Strings in D₃ are never seen to escape the locus, but random strings on 20 symbols with minimum entropy turn out to be outliers for every ω, proving that the curve is not an unavoidable regularity of all strings.

FIG. 13.
Number of special points (not normalized) versus string length, in domains (dots) and disordered regions (circles). Gray crosses are random strings on 20 symbols with minimum entropy.

The total number of points assumes the expected quadratic shape at ω = 1, which persists up to ω = 4; it then gradually becomes a horn (Fig. 14). Neither D₃ nor DisProt 34 escapes the locus, but a few other disordered regions do. As before, it can be shown that there is at least one string that does not lie on the locus.
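The quadratic growth at ω = 1 can be checked directly on small inputs, since the text identifies points at ω = 1 with distinct substrings. The sketch below counts distinct nonempty substrings by brute force; whether the empty string is also counted as a point is not stated in this section, so excluding it is an assumption, and the function names are hypothetical.

```python
def count_distinct_substrings(s: str) -> int:
    """Number of distinct nonempty substrings of s (brute force, O(n^2) sets)."""
    n = len(s)
    return len({s[i:j] for i in range(n) for j in range(i + 1, n + 1)})


def max_distinct_substrings(n: int) -> int:
    """Upper bound n(n+1)/2, reached when no substring occurs twice."""
    return n * (n + 1) // 2


if __name__ == "__main__":
    for s in ("abc", "aaa", "abab"):
        print(s, count_distinct_substrings(s), max_distinct_substrings(len(s)))
```

The leading coefficient of the bound n(n + 1)/2 is 1/2, which is close to the value a ≈ 0.4963 reported for ω = 1 in the caption of Figure 14.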

FIG. 14.
Total number of points versus string length, in domains (dots) and disordered regions (circles). The interpolating line is y = an² + bn + c. At ω = 1, a ≈ 0.4963. At ω = 2, a ≈ 0.5453. At ω = 3, a ≈ 0.5585. At ω = 4, a ≈ 0.6057.

Similar curves appear when other measures are considered: in all cases, at most one sharp curve appears, collecting the majority of D₁ ∪ D₂ with the possible exception of a few outliers. Rather than analyzing each of these curves in detail, we concentrate on two measures in which the shape of the locus changes in a different way as a function of ω. In the relative number of antispecial points (Fig. 15) a sharp nonlinear curve of direct proportionality with a few strong outliers persists up to ω = 2, then it becomes disordered at ω = 3, 4, 5, and finally transitions towards a bundle of inverse proportionality at ω = 8. The second notable example is the relative number of normal points: no clear locus appears at ω ≤ 6, but a linear bundle starts to emerge at ω = 7, 8 (Fig. 16).

FIG. 15.
Relative number of antispecial points versus string length, in domains (dots) and disordered regions (circles). At ω = 2, 3, 4 some outliers fall outside the displayed range.

FIG. 16.
Relative number of normal points versus string length, in domains (dots) and disordered regions (circles). At ω = 2, 3 some outliers fall outside the displayed range.

These patterns of dependency of the relative number of special, antispecial and normal points on string length for specific values of ω largely explain the phase transitions seen in the trees of the previous section. At ω ≤ 3, the relative abundance of special and antispecial points is strictly controlled by string length, special and antispecial points are the most numerous types of point, and the tree reflects the three clusters induced by string length. At ω ≥ 7, antispecial and normal points are the most abundant type, their density is loosely influenced by length, and the trees show again a bi- or tripartition that reflects string length. When ω ∈ [4, 6], on the other hand, the relative abundance of neither special nor antispecial nor normal points depends on length, and the correspondence between trees and string length groups disappears.

Even though the trees of Figure 3 are largely a reflection of the groups induced by string length, we remark that this variable alone is not sufficient to explain some key features of the trees, like the strong separation of some members of D₂ from D₁ and of fold 1.8 from D₁.

The trees described in the previous section showed that the information encoded by the relative abundance of subsequences differs from the information encoded by the relative abundance of points. Plotting measures on subsequences versus string length reveals that what these measures are partially lacking is the dependence on string length itself. For example, the number of antispecial subsequences (not normalized) grows exponentially with string length, and their growth is confined inside a bundle whose width expands with length (Fig. 17). However, there is no correlation between the relative number of antispecial subsequences and string length when ω > 1 (Fig. 18). On the other hand, the relative number of normal subsequences depends on string length under a relationship of exponential inverse proportionality for all ω ≤ 5 (Fig. 19); this locus disappears into a disordered cloud at ω ≥ 6.

FIG. 17.
Number of antispecial subsequences versus string length, in domains (dots) and disordered regions (circles).

FIG. 18.
Relative number of antispecial subsequences versus string length, in domains (dots) and disordered regions (circles). At ω > 1 a few strings lie far from the main cloud and fall outside the displayed range.

FIG. 19.
Relative number of normal subsequences versus string length, in domains (dots) and disordered regions (circles).

A measure can have no dependence on string length at a specific ω, but it could nonetheless obey other rules. We have just seen that the relative number of normal points has no locus at ω < 7, and that the relative number of normal subsequences has no locus at ω ≥ 6: plotting these two measures one versus the other shows again no correlation for D₁ ∪ D₂, but it reveals that D₃ is aligned along a horn at all values of ω > 1 (Fig. 20). No such regularity occurs, however, when we plot the relative number of antispecial points versus the relative number of antispecial subsequences.

FIG. 20.
Relative number of normal subsequences versus relative number of normal points, in D₃ (black crosses), D₁ (light gray dots), and D₂ (light gray circles).

6.2. Dependence on ω

In the previous section, we have shown that the shape of the curves relating suffix graph measures to string length changes with ω: it is therefore natural to investigate the dependence of these measures on ω. Displaying the relative number of special, antispecial, normal, and terminal points on the same graph reveals a recurrent motif: many strings pass through five phases, marked by the following events (Fig. 21): (1) the increase of special points above terminal points; (2) the increase of normal points above terminal points; (3) the overtaking of special points by normal points; (4) the final overtaking of antispecial points by normal points, after which normal points become the most abundant type in the suffix graph.
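The events listed above are crossings between two curves indexed by ω, so each can be located with a simple scan. The sketch below is one possible reading, in which an "overtaking" is the smallest ω from which one measure stays strictly above the other; this definition, the function name, and the toy values are assumptions rather than the procedure used for Figure 22.

```python
from typing import Optional, Sequence


def final_overtake(omegas: Sequence[int],
                   rising: Sequence[float],
                   other: Sequence[float]) -> Optional[int]:
    """Smallest omega from which `rising` stays strictly above `other`.

    Returns None if `rising` is not above `other` at the last omega.
    If `rising` is above `other` throughout, the first omega is returned.
    """
    if not (len(omegas) == len(rising) == len(other)):
        raise ValueError("all three sequences must have the same length")
    start = None
    for w, r, o in zip(omegas, rising, other):
        if r > o:
            if start is None:
                start = w        # candidate start of the final run
        else:
            start = None         # run broken: discard the candidate
    return start


if __name__ == "__main__":
    # Synthetic curves for omega = 1..8 (not measurements from the paper).
    omegas = list(range(1, 9))
    normal = [0.05, 0.10, 0.20, 0.30, 0.45, 0.55, 0.60, 0.65]
    antispecial = [0.50] * 8
    print(final_overtake(omegas, normal, antispecial))
```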

FIG. 21.
Dependence of the relative number of special, antispecial, normal and terminal points on ω. (Left) A member of family 1.1.1.1. (Center) A member of family 1.8.1.1. (Right) DisProt 34.

A view of the whole dataset (Fig. 22) shows that the value of ω at which each of these transitions occurs is not constant in D₁: while there is little variation in the value at which normal points overtake terminal and special points (always around ω = 4–5) and in the value at which special points overtake terminal points (always around ω = 3–4), the value at which normal points overtake antispecial points varies significantly and does not respect SCOP boundaries. Superfamily 1.1.1 and fold 2.2 transition at ω ≥ 6, while fold 1.8 either presents no phase transition or transitions at ω = 8. In family 1.8.1.1 normal points never overtake antispecial points, and special and normal points increase above terminal points at relatively high ω. Other exceptions to the pattern occur in D₂ where, not surprisingly, DisProt 34 is a good example of abnormality (Fig. 21).

FIG. 22.
Values of ω at which phase transitions occur in D₁ ∪ D₂.

This extended heterogeneity prompts us to test whether the relationship between each suffix graph measure and ω is controlled by general rules, like those seen for string length (Fig. 23). It turns out that the relative numbers of special, antispecial, normal and terminal points trace wide sigmoid bundles in which all strings have similar shape but possibly very different values.⁷ These loci are not universal: apart from a few obvious outliers in D₂ (DisProt 34 in special and normal points, DisProt 13 in special points, DisProt 9 in antispecial and terminal points, DisProt 20 in normal and terminal points), strings in D₃ turn out to follow very different rules (Fig. 23). Similar conclusions can be drawn for the absolute number of points and subsequences.

FIG. 23.
Relative number of points versus ω. (First row) D₁: folds 1.1–2.2 (black) compared to 1.8–2.1 (gray). (Second row) D₂ (black) compared to D₁ (gray). (Third row) D₃ (black) compared to D₁ (gray).

7. Laws Governing Random Permutations of Polypeptides

The alignment of most polypeptides along the same loci, and the presence of outliers to such loci, proves that the strings in D₁ ∪ D₂ are not random, but follow a specific compositional pattern, and that this pattern is shared by proteins lying in very different groups of the SCOP hierarchy. We expect the information that encodes these loci to be carried by the sequences of amino acids: in this case, randomly permuting the strings should destroy the signal and cause the loci to degenerate into random clouds. We analyze a set of 100 random permutations for each string in D₁ ∪ D₂. Surprisingly, it turns out that such permutations still trace the same loci in all graphs of the previous sections (Fig. 24): this proves that, in the dataset analyzed, it is the composition of symbols, not their sequence, that is responsible for the alignment of polypeptides along loci. However, the relationship between distribution of symbols and locus is not bijective: among the members of D₁, there are significant variations in how the frequencies of symbols are distributed (Fig. 25); therefore, a locus does not map onto similar distributions of symbols.
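A random permutation rearranges the symbols of a string while leaving its composition unchanged, which is what allows the effect of composition to be separated from the effect of sequence. A minimal sketch of generating such a set follows; the function name, the fixed seed, and the use of Python's pseudo-random generator are assumptions, since the sampling procedure is not specified in this section.

```python
import random
from collections import Counter
from typing import List


def random_permutations(s: str, k: int = 100, seed: int = 0) -> List[str]:
    """Return k uniformly shuffled copies of s (repetitions are possible)."""
    rng = random.Random(seed)    # local generator, so results are reproducible
    result = []
    for _ in range(k):
        chars = list(s)
        rng.shuffle(chars)       # in-place Fisher-Yates shuffle
        result.append("".join(chars))
    return result


if __name__ == "__main__":
    original = "csqneyfdsllhacipcqlrc"   # one of the D3 strings quoted earlier
    perms = random_permutations(original)
    # Every permutation has exactly the same symbol counts as the original.
    print(all(Counter(p) == Counter(original) for p in perms))
```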

FIG. 24.
Relative number of special points versus string length in D₁ ∪ D₂ (light gray) and in a copy of D₁ ∪ D₂ in which each string has been randomly permuted (black).

FIG. 25.
Sorted frequencies of symbols in some strings of D₁. (Dotted line) Fumarate reductase, Escherichia coli. (Dashed line) CelD cellulase, N-terminal domain, Clostridium thermocellum. (Continuous line) Endo-1,4-beta xylanase D, xylan binding domain, XBD, Cellulomonas fimi.

On the other hand, sequence does influence suffix graph measures: in Figures 12 and 13, for example, random arrangements of a set of 20 symbols with minimum entropy assume a spectrum of different values. We conclude that, in the strings of D₁ ∪ D₂, the effect of sequence on suffix graph measures is weaker than the effect of symbol composition.

The influence of sequence on suffix graph measures deserves more attention. We project the 100 random permutations of each string in D₁ ∪ D₂ onto the space generated by every possible (x, y) pair, where x and y are suffix graph measures. We concentrate first on the relationship between special points and total points: visual inspection of the graphs of a few strings in D₁ shows that the set of random permutations forms a well-defined, linear bundle at ω = 1 and ω ≥ 6, while random clouds appear at 2 ≤ ω ≤ 5 (Fig. 26). Polypeptides are always seen to belong to these bundles.

FIG. 26.
Number of special points versus number of total points in Hemoglobin I from Scapharca inaequivalvis (a protein in D₁, circle) and in 100 random permutations of its sequence (dots). Lines indicate the best linear interpolation of the set of permutations. Other strings in D₁ produce similar shapes.

Probing the extent to which this relationship is supported by D₁, D₂, and D₃ is clearly infeasible by drawing and analyzing each graph visually. Therefore, we measure the correlation coefficient between special and total points for each string: the correlation coefficient represents the strength of the linear relationship between two variables as a value between −1 (strong linear negative relationship) and 1 (strong linear positive relationship). We conservatively consider “strong” a correlation that has absolute value ≥0.5 and p-value ≤10⁻⁵. The graph of the correlation coefficients for all three datasets (Fig. 27) shows that at ω = 1 the random permutations of all strings have a strong linear negative correlation, except for DisProt 34 and a few members of D₁ and D₃. At ω = 2, 3, 4 the correlation progressively becomes weaker, except in DisProt 34, in which it is strong and positive for all ω > 1 (Fig. 28). At ω ≥ 5 correlation progressively becomes strong and positive for the majority of D₁ ∪ D₂, but it remains weak in most of D₃ (except, for example, thyroid receptor interacting protein 6, Homo sapiens), in some members of D₂ (e.g., DisProt 20; Fig. 31), and in fold 1.8 (e.g., in the C-terminal domain of γ, δ resolvase, Escherichia coli). In all cases in which most strings have a strong correlation, strings with a weak correlation can be found, and vice versa, proving that the pattern of strong and weak correlations as a function of ω is not an unavoidable regularity of all polypeptides.
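The "strong correlation" criterion above can be sketched as follows. The code computes Pearson's coefficient directly and, as an assumption, approximates the two-sided p-value with the Fisher z-transformation and a normal tail; the test actually used in the study is not stated here, and an exact p-value could instead be obtained from a statistics library such as SciPy's `scipy.stats.pearsonr`. All function names are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 4:
        raise ValueError("need two samples of equal length >= 4")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value of r via the Fisher z-transformation."""
    r = max(min(r, 1 - 1e-15), -1 + 1e-15)   # keep atanh finite at |r| = 1
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def is_strong(xs: Sequence[float], ys: Sequence[float],
              r_min: float = 0.5, p_max: float = 1e-5) -> bool:
    """Apply the thresholds quoted in the text: |r| >= 0.5 and p <= 1e-5."""
    r = pearson_r(xs, ys)
    return abs(r) >= r_min and approx_p_value(r, len(xs)) <= p_max


if __name__ == "__main__":
    # Synthetic sample of 100 points on a decreasing line (not paper data).
    xs = list(range(100))
    ys = [-2 * x + 1 for x in xs]
    print(pearson_r(xs, ys), is_strong(xs, ys))
```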

FIG. 27.
Correlation between the number of special and total points in the dataset. For clarity of presentation, p-values are not shown: they are ≤10⁻⁵ wherever the correlation is ≥0.5 in absolute value. Vertical dashed lines highlight fold 1.8 and D₂. γδR: C-terminal domain of γ, δ resolvase, Escherichia coli. TRIP6: thyroid receptor interacting protein 6, Homo sapiens.

FIG. 28.
Number of special points versus number of total points in DisProt 34.

The pattern of correlations neither reveals clear distinctions among SCOP groups, nor between polypeptides and their permutations: all strings in the datasets belong to the linear loci of their random permutations, with the exception of DisProt 28, 25 and 34 at ω = 1 (Figs. 28–30). The coefficient a of the linear interpolations is approximately constant inside D₁ ∪ D₂ ∪ D₃ at ω = 1, and approximately constant in D₁ ∪ D₂ at ω ≥ 6 (Fig. 32). The coefficient b, on the other hand, oscillates widely inside D₁ ∪ D₂ (Fig. 33).

FIG. 29.
Number of special (left), antispecial (center), and normal (right) points versus number of total points in DisProt 25.

FIG. 30.
Number of special points (left) and antispecial points (right) versus number of total points in DisProt 28 (ω = 1).

FIG. 31.
Number of special points versus number of total points in DisProt 20.

FIG. 32.
Coefficient a of the linear interpolation across the dataset. The coefficient is computed on the measures divided by their respective maximum values. Vertical dashed lines highlight fold 1.8 and D₂.

FIG. 33.
Coefficient b of the linear interpolation across the dataset. The coefficient is computed on the measures divided by their respective maximum values. Vertical dashed lines highlight fold 1.8 and D₂.

Analyzing in detail the effect of sequence on each suffix graph measure is outside the scope of this article. Here we just observe that, except for a few cases, normal points are not strongly correlated with total points at ω ≤ 4, but at ω ≥ 5 D₁ ∪ D₂ reaches a strong correlation, while the correlation of D₃ remains lower (Fig. 34). Antispecial points are highly correlated with total points at all values of ω; terminal points are not correlated with total points at ω = 1, but they become strongly correlated at ω = 2, 3, and finally their correlation stabilizes around a lower value at ω ≥ 5.

FIG. 34.
Correlation between the number of normal and total points in the datasets. Vertical dashed lines highlight fold 1.8 and D₂.

8. Conclusion

Some natural measures on the abundance of points, arcs, and subsequences in the suffix graphs of polypeptides appear to grasp structural/functional information. The values of these measures depend on string length and on the hiatus of subsequences under a specific set of rules, as shared by a spectrum of structurally and functionally diverse polypeptides. In accordance with the current consensus that sees proteins as random strings, these rules are influenced by the distributions of symbols more strongly than by their organization within the sequences. In most polypeptides, it is seen that even their random permutations amass along specific linear loci. Counterexamples show that none of the rules is an unavoidable property of all polypeptides or distributions of symbols, thereby suggesting that the shapes of these loci are specific signatures of the dataset or of its parts.

It is well known that amino acid composition varies with functional class and cellular localization (Karlin, 1992). The fact that most of the rules described in this paper are common to structurally and evolutionarily diverse polypeptides might suggest that they capture some organizational constraint that crosses the boundaries of protein families, and is implied in the chemical or spatial stability of amino acid chains (Han and Baker, 1995; Han et al., 1997), or in the mechanism by which secondary structures aggregate and connect to each other (Przytycka et al., 2002). Alternatively, these rules might capture evolutionary regularities, like properties of the limited number of peptides that have arguably constituted the primitive peptide world (White and Jacobs, 1990; Qi et al., 2004), or laws behind the assemblage of these original segments into modern domains (Lupas, 2001).

It is also well known that amino acid abundance is highly influenced by the structure of the genome and varies with species (Karlin, 1992). The regularities described in this paper could therefore reflect biases and optimizations in the translation machinery (Schachtel et al., 1991; White, 1994; Dufton, 1997; Wan and Wootton, 2000; Rost, 2002), or be the image of corresponding constraints in the genome.

From the experimental point of view, this work stimulates the analysis of the whole SCOP with the purpose of counting and mapping the repertoire of rules that occur therein. Studying what happens at higher values of ω would also be a natural extension. From the theoretical viewpoint, this work opens the problem of explaining the effect of the sequence and of the distribution of symbols of a string on the described rules. A related avenue consists in defining a complexity measure for strings and distributions hinged on these rules, and in comparing it to other state-vector complexity measures, like Shannon's entropy and the global compositional measure of Wan and Wootton (2000). The problems of computing such measures efficiently and of structurally comparing the suffix graphs of different strings lend themselves to the formulation of interesting algorithmic extensions.

Family	ASTRAL	Name, species
1.1.1.3	d3sdha_ a.1.1.2 (A:)	Hemoglobin I, Scapharca inaequivalvis
1.1.1.3	d1b0ba_ a.1.1.2 (A:)	Hemoglobin I, Lucina pectinata
1.1.1.3	d1h97a_ a.1.1.2 (A:)	Trematode hemoglobin/myoglobin, Paramphistomum epiclitum
1.1.1.3	d1jl7a_ a.1.1.2 (A:)	Glycera globin, Glycera dibranchiata
1.1.1.3	d1a6ma_ a.1.1.2 (A:)	Myoglobin, Physeter catodon
1.1.1.3	d1mbaa_ a.1.1.2 (A:)	Myoglobin, Aplysia limacina
1.1.1.3	d1mbsa_ a.1.1.2 (A:)	Myoglobin, Phoca vitulina
1.1.1.3	d1ecoa_ a.1.1.2 (A:)	Erythrocruorin, Chironomus thummi thummi, fraction III
1.1.1.3	d2gdma_ a.1.1.2 (A:)	Leghemoglobin, Lupinus luteus
1.1.1.3	d1fsla_ a.1.1.2 (A:)	Leghemoglobin, Glycine max, isoform A
1.1.1.3	d1d8ua_ a.1.1.2 (A:)	Non-symbiotic plant hemoglobin, Oryza sativa
1.1.1.3	d1irda_ a.1.1.2 (A:)	Hemoglobin alpha-chain, Homo sapiens
1.1.1.3	d2d5xa1 a.1.1.2 (A:1-141)	Hemoglobin alpha-chain, Equus caballus
1.1.1.3	d1hdsa_ a.1.1.2 (A:)	Hemoglobin alpha-chain, Odocoileus virginianus
1.1.1.3	d1irdb_ a.1.1.2 (B:)	Hemoglobin beta-chain, Homo sapiens
1.1.1.3	d2d5xb1 a.1.1.2 (B:1-146)	Hemoglobin beta-chain, Equus caballus
1.1.1.3	d1hdsb_ a.1.1.2 (B:)	Hemoglobin beta-chain, Odocoileus virginianus
1.1.1.3	d1it2a_ a.1.1.2 (A:)	Hagfish hemoglobin, Eptatretus burgeri
1.1.1.3	d1phna_ a.1.1.3 (A:)	Phycocyanin alpha subunit, Cyanidium caldarium
1.1.1.4	d1f99a_ a.1.1.3 (A:)	Phycocyanin alpha subunit, Polysiphonia urceolata
1.1.1.4	d1cpca_ a.1.1.3 (A:)	Phycocyanin alpha subunit, Fremyella diplosiphon
1.1.1.4	d1phnb_ a.1.1.3 (B:)	Phycocyanin beta subunit, Cyanidium caldarium
1.1.1.4	d1f99b_ a.1.1.3 (B:)	Phycocyanin beta subunit, Polysiphonia urceolata
1.1.1.4	d1cpcb_ a.1.1.3 (B:)	Phycocyanin beta subunit, Fremyella diplosiphon
1.1.1.4	d1alla_ a.1.1.3 (A:)	Allophycocyanin alpha subunit, Spirulina platensis
1.1.1.4	d1b33a_ a.1.1.3 (A:)	Allophycocyanin alpha subunit, Mastigocladus laminosus
1.1.1.4	d1kn1a_ a.1.1.3 (A:)	Allophycocyanin alpha subunit, Porphyra yezoensis
1.1.1.4	d1allb_ a.1.1.3 (B:)	Allophycocyanin beta subunit, Spirulina platensis
1.1.1.4	d1b33b_ a.1.1.3 (B:)	Allophycocyanin beta subunit, Mastigocladus laminosus
1.1.1.4	d1kn1b_ a.1.1.3 (B:)	Allophycocyanin beta subunit, Porphyra yezoensis
1.1.1.4	d1liaa_ a.1.1.3 (A:)	Phycoerythrin alpha subunit, Polysiphonia urceolata
1.1.1.4	d1b8da_ a.1.1.3 (A:)	Phycoerythrin alpha subunit, Griffithsia monilis
1.1.1.4	d1eyxa_ a.1.1.3 (A:)	Phycoerythrin alpha subunit, Gracilaria chilensis
1.1.1.4	d1liab_ a.1.1.3 (B:)	Phycoerythrin beta subunit, Polysiphonia urceolata
1.1.1.4	d1b8db_ a.1.1.3 (B:)	Phycoerythrin beta subunit, Griffithsia monilis
1.1.1.4	d1eyxb_ a.1.1.3 (B:)	Phycoerythrin beta subunit, Gracilaria chilensis
1.1.2.1	d1nekb1 a.1.2.1 (B:107-238)	Succinate dehydrogenase, Escherichia coli
1.1.2.1	d1kf6b1 a.1.2.1 (B:106-243)	Fumarate reductase, Escherichia coli
1.1.2.1	d2bs2b1 a.1.2.1 (B:107-239)	Fumarate reductase, Wolinella succinogenes
1.1.2.2	d1h7wa1 a.1.2.2 (A:2-183)	Dihydropyrimidine dehydrogenase, N-terminal domain, Sus scrofa
1.8.1.1	d1p7ia_ a.4.1.1 (A:)	Engrailed Homeodomain, Drosophila melanogaster
1.8.1.1	d1akha_ a.4.1.1 (A:)	Mating type protein A1 Homeodomain, Saccharomyces cerevisiae
1.8.1.1	d1k61a_ a.4.1.1 (A:)	mat alpha2 Homeodomain, Saccharomyces cerevisiae
1.8.1.1	d1lfba_ a.4.1.1 (A:)	Hepatocyte nuclear factor 1a (LFB1/HNF1), Rattus rattus
1.8.1.1	d1ic8a1 a.4.1.1 (A:201-276)	Hepatocyte nuclear factor 1a (LFB1/HNF1), Homo sapiens
1.8.1.1	d1e3oc1 a.4.1.1 (C:104-160)	Oct-1 POU Homeodomain, Homo sapiens
1.8.1.1	d1au7a1 a.4.1.1 (A:103-160)	Pit-1 POU homeodomain, Rattus norvegicus
1.8.1.1	d1ftta_ a.4.1.1 (A:)	Thyroid transcription factor 1 homeodomain, Rattus norvegicus
1.8.1.1	d1hdpa_ a.4.1.1 (A:)	Oct-2 POU Homeodomain, Homo sapiens
1.8.1.1	d1ocpa_ a.4.1.1 (A:)	Oct-3 POU Homeodomain, Mus musculus
1.8.1.1	d1b72a_ a.4.1.1 (A:)	Homeobox protein hox-b1, Homo sapiens
1.8.1.2	d1gdta1 a.4.1.2 (A:141-183)	gamma,delta resolvase, (C-terminal domain), Escherichia coli
1.8.1.2	d1tc3c_ a.4.1.2 (C:)	Transposase tc3a1-65, Caenorhabditis elegans
1.8.1.2	d2ezla_ a.4.1.2 (A:)	Ibeta subdomain of the mu end DNA-binding domain of phage mu transposase, Bacteriophage mu
1.8.1.2	d2ezia_ a.4.1.2 (A:)	Transposase, Bacteriophage mu
1.8.4.5	d1i5za1 a.4.5.4 (A:138-206)	Catabolite gene activator protein (CAP), C-terminal domain, Escherichia coli
1.8.4.5	d1ft9a1 a.4.5.4 (A:134-213)	CO-sensing protein CooA, C-terminal domain, Rhodospirillum rubrum
1.8.4.5	d2bgca1 a.4.5.4 (A:138-237)	Listeriolysin regulatory protein PrfA, C-terminal domain, Listeria monocytogenes
1.8.4.5	d2gaua1 a.4.5.4 (A:152-232)	Transcriptional regulator PG0396, C-terminal domain, Porphyromonas gingivalis
1.8.4.5	d1zyba1 a.4.5.4 (A:148-220)	Probable transcription regulator BT4300, C-terminal domain, Bacteroides thetaiotaomicron
1.8.4.5	d2coha1 a.4.5.4 (A:118-199)	Transcriptional regulator TTHA1359, C-terminal domain, Thermus thermophilus
1.8.4.6	d1i1ga1 a.4.5.32 (A:2-61)	LprA, Archaeon Pyrococcus furiosus
1.8.4.6	d2cyya1 a.4.5.32 (A:5-64)	Putative transcriptional regulator PH1519, Archaeon Pyrococcus horikoshii
1.8.4.6	d2cg4a1 a.4.5.32 (A:4-66)	Regulatory protein AsnC, Escherichia coli
1.8.4.6	d2cfxa1 a.4.5.32 (A:1-63)	Transcriptional regulator LrpC, Bacillus subtilis
2.1.1.1	d1bwwa_ b.1.1.1 (A:)	Immunoglobulin light chain kappa variable domain VL-kappa, Homo sapiens cluster 1
2.1.1.1	d1mjul1 b.1.1.1 (L:1-107)	Immunoglobulin light chain kappa variable domain VL-kappa, Mus musculus cluster 1.1
2.1.1.1	d1lk3l1 b.1.1.1 (L:1-106)	Immunoglobulin light chain kappa variable domain VL-kappa, Rattus norvegicus
2.1.1.1	d1pewa_ b.1.1.1 (A:)	Immunoglobulin light chain lambda variable domain VL-lambda, Homo sapiens cluster 1
2.1.1.1	d1mfal1 b.1.1.1 (L:1-111)	Immunoglobulin light chain lambda variable domain VL-lambda, Mus musculus
2.1.1.1	d1oaql_ b.1.1.1 (L:)	Immunoglobulin light chain lambda variable domain VL-lambda, Rattus norvegicus
2.1.1.1	d1rzfh1 b.1.1.1 (H:2-113)	Immunoglobulin heavy chain variable domain VH, Homo sapiens cluster 1
2.1.1.1	d1dlfh_ b.1.1.1 (H:)	Immunoglobulin heavy chain variable domain VH, Mus musculus cluster 1
2.1.1.1	d1oaqh_ b.1.1.1 (H:)	Immunoglobulin heavy chain variable domain VH, Rattus norvegicus
2.1.1.1	d1xfpa_ b.1.1.1 (A:)	Camelid IG heavy chain variable domain VHh, Camelus dromedarius
2.1.1.1	d1sjva_ b.1.1.1 (A:)	Camelid IG heavy chain variable domain VHh, Lama glama
2.1.1.1	d1h5ba_ b.1.1.1 (A:)	T-cell antigen receptor, Mus musculus alpha-chain
2.1.1.1	d1ogad1 b.1.1.1 (D:3-117)	T-cell antigen receptor, Homo sapiens alpha-chain
2.1.1.1	d1beca1 b.1.1.1 (A:3-117)	T-cell antigen receptor, Mus musculus beta-chain
2.1.1.1	d1i8lc_ b.1.1.1 (C:)	Immunoreceptor CTLA-4 (CD152) N-terminal fragment, Homo sapiens
2.1.1.1	d1dqta_ b.1.1.1 (A:)	Immunoreceptor CTLA-4 (CD152) N-terminal fragment, Mus musculus
2.1.1.1	d1neua_ b.1.1.1 (A:)	Myelin membrane adhesion molecule P0, Rattus norvegicus
2.1.1.1	d1pkoa_ b.1.1.1 (A:)	Myelin oligodendrocyte glycoprotein (MOG), Rattus norvegicus
2.1.1.1	d1eaja_ b.1.1.1 (A:)	Coxsackie virus and adenovirus receptor (Car), domain 1, Homo sapiens
2.1.1.1	d1qfoa_ b.1.1.1 (A:)	N-terminal domain of sialoadhesin, Mus musculus
2.1.1.2	d1a3rl2 b.1.1.2 (L:115-214)	Immunoglobulin light chain kappa constant domain CL-kappa, Mus musculus
2.1.1.2	d1lk3l2 b.1.1.2 (L:107-210)	Immunoglobulin light chain kappa constant domain CL-kappa, Rattus norvegicus
2.1.1.2	d1c5cl2 b.1.1.2 (L:108-214)	Immunoglobulin light chain kappa constant domain CL-kappa, Homo sapiens
2.1.1.2	d1mfbl2 b.1.1.2 (L:112-212)	Immunoglobulin light chain lambda constant domain CL-lambda, Mus musculus
2.1.1.2	d1q0xl2 b.1.1.2 (L:108-212)	Immunoglobulin light chain lambda constant domain CL-lambda, Homo sapiens
2.1.1.2	d1nfde2 b.1.1.2 (E:108-215)	Immunoglobulin light chain lambda constant domain CL-lambda, Cricetulus griseus
2.1.1.2	d1c5ch2 b.1.1.2 (H:114-230)	Immunoglobulin heavy chain gamma constant domain 1, CH1-gamma, Homo sapiens
2.1.1.2	d1mjuh2 b.1.1.2 (H:114-230)	Immunoglobulin heavy chain gamma constant domain 1, CH1-gamma, Mus musculus
2.1.1.2	d1lk3h2 b.1.1.2 (H:120-219)	Immunoglobulin heavy chain gamma constant domain 1, CH1-gamma, Rattus norvegicus
2.1.1.2	d2fbjh2 b.1.1.2 (H:119-220)	Immunoglobulin heavy chain alpha constant domain 1, CH1-alpha, Mus musculus
2.1.1.2	d1dn0b2 b.1.1.2 (B:121-225)	Immunoglobulin heavy chain mu constant domain 1, CH1-mu, Homo sapiens
2.1.1.2	d1l6xa1 b.1.1.2 (A:237-341)	Immunoglobulin heavy chain gamma constant domain 2, CH2-gamma, Homo sapiens
2.1.1.2	d1igyb3 b.1.1.2 (B:236-361)	Immunoglobulin heavy chain gamma constant domain 2, CH2-gamma, Mus musculus
2.1.1.2	d1i1ca1 b.1.1.2 (A:239-341)	Immunoglobulin heavy chain gamma constant domain 2, CH2-gamma, Rattus norvegicus
2.1.1.2	d1l6xa2 b.1.1.2 (A:342-443)	Immunoglobulin heavy chain gamma constant domain 3, CH3-gamma, Homo sapiens
2.1.1.2	d1cqka_ b.1.1.2 (A:)	Immunoglobulin heavy chain gamma constant domain 3, CH3-gamma, Mus musculus
2.1.1.2	d1i1ca2 b.1.1.2 (A:342-443)	Immunoglobulin heavy chain gamma constant domain 3, CH3-gamma, Rattus norvegicus
2.1.1.2	d1ow0a1 b.1.1.2 (A:242-342)	Immunoglobulin heavy chain alpha constant domain 2, CH2-alpha, Homo sapiens
2.1.1.2	d1ow0a2 b.1.1.2 (A:343-450)	Immunoglobulin heavy chain alpha constant domain 3, CH3-alpha, Homo sapiens
2.1.1.2	d1o0va1 b.1.1.2 (A:228-330)	Immunoglobulin heavy chain epsilon constant domain 2, CH2-epsilon, Homo sapiens
2.1.2.1	d1p7hl1 b.1.18.1 (L:576-678)	T-cell transcription factor NFAT1 (NFATC2), Homo sapiens
2.1.2.1	d1imhc1 b.1.18.1 (C:368-468)	T-cell transcription factor NFAT5 (TONEBP), Homo sapiens
2.1.2.1	d1u3ya_ b.1.18.1 (A:)	p50 subunit of NF-kappa B transcription factor, Homo sapiens
2.1.2.1	d1bfsa_ b.1.18.1 (A:)	p50 subunit of NF-kappa B transcription factor, Mus musculus
2.1.2.1	d1a3qa1 b.1.18.1 (A:227-327)	p52 subunit of NF-kappa B (NFKB), Homo sapiens
2.1.2.1	d1bfta_ b.1.18.1 (A:)	p65 subunit of NF-kappa B (NFKB) dimerization domain, Mus musculus
2.1.2.1	d1my7a_ b.1.18.1 (A:)	p65 subunit of NF-kappa B (NFKB) dimerization domain, Homo sapiens
2.1.2.1	d1gjia1 b.1.18.1 (A:182-281)	p65 subunit of NF-kappa B (NFKB) dimerization domain, Gallus gallus C-rel
2.1.2.1	d1ttua1 b.1.18.1 (A:542-660)	DNA-binding protein LAG-1 (CSL), Caenorhabditis elegans
2.1.2.1	d2cxka1 b.1.18.1 (A:872-953)	Calmodulin binding transcription activator 1, Homo sapiens
2.1.2.2	d1gofa1 b.1.18.2 (A:538-639)	Galactose oxidase C-terminal domain, Dactylium dendroides
2.1.2.2	d1k3ia1 b.1.18.2 (A:538-639)	Galactose oxidase C-terminal domain, Fusarium sp.
2.1.2.2	d1w8oa1 b.1.18.2 (A:403-505)	Sialidase “linker” domain, Micromonospora viridifaciens
2.1.2.2	d1clca2 b.1.18.2 (A:35-134)	CelD cellulase N-terminal domain, Clostridium thermocellum
2.1.2.2	d1ut9a2 b.1.18.2 (A:208-305)	Cellulose 1,4-beta-cellobiosidase CbhA precatalytic domain, Clostridium thermocellum
2.1.2.2	d1f1sa2 b.1.18.2 (A:171-248)	Hyaluronate lyase precatalytic domain, Streptococcus agalactiae
2.1.2.2	d1qbaa1 b.1.18.2 (A:781-885)	Bacterial chitobiase (N-acetyl-beta-glucoseaminidase) C-terminal domain, Serratia marcescens
2.1.2.2	d1kcla1 b.1.18.2 (A:496-581)	Cyclomaltodextrin glycanotransferase domain D, Bacillus circulans, different strains
2.1.2.2	d1cyga1 b.1.18.2 (A:492-574)	Cyclomaltodextrin glycanotransferase domain D, Bacillus stearothermophilus
2.1.2.2	d1pama1 b.1.18.2 (A:497-582)	Cyclomaltodextrin glycanotransferase domain D, Bacillus sp., strain 1011
2.1.2.2	d1ciua1 b.1.18.2 (A:496-578)	Cyclomaltodextrin glycanotransferase domain D, Thermoanaerobacterium thermosulfurigenes EM1
2.1.2.2	d1qhoa1 b.1.18.2 (A:496-576)	Five domain “maltogenic” alpha-amylase (glucan 1,4-alpha-maltohydrolase), domain D, Bacillus stearothermophilus
2.1.2.2	d1gvia1 b.1.18.2 (A:1-123)	Maltogenic amylase N-terminal domain N, Thermus sp.
2.1.2.2	d1ea9c1 b.1.18.2 (C:1-121)	Maltogenic amylase N-terminal domain N, Bacillus sp. cyclomaltodextrinase
2.1.2.2	d1ji1a1 b.1.18.2 (A:1-122)	Maltogenic amylase N-terminal domain N, Thermoactinomyces vulgaris TVAI
2.1.2.2	d1j0ha1 b.1.18.2 (A:1-123)	Neopullulanase N-terminal domain, Bacillus stearothermophilus
2.2.3.2	d1n67a1 b.2.3.4 (A:229-369)	Clumping factor A, Staphylococcus aureus
2.2.3.2	d1r17a1 b.2.3.4 (A:276-424)	Fibrinogen-binding adhesin SdrG, Staphylococcus epidermidis
2.2.3.3	d1uwfa1 b.2.3.2 (A:1-158)	Mannose-specific adhesin FimH, Escherichia coli
2.2.3.3	d1pdkb_ b.2.3.2 (B:)	PapK pilus subunit, Escherichia coli
2.2.3.3	d1n12a_ b.2.3.2 (A:)	PapE pilus subunit, Escherichia coli
2.2.3.3	d1p5vb_ b.2.3.2 (B:)	F1 capsule antigen Caf1, Yersinia pestis
2.2.3.3	d2co3a1 b.2.3.2 (A:10-142)	SafA pilus subunit, Salmonella typhimurium
2.2.2.1	d1exha_ b.2.2.1 (A:)	Exo-1,4-beta-D-glycanase (cellulase xylanase) cellulose-binding domain CBD, Cellulomonas fimi
2.2.2.1	d1xbda_ b.2.2.1 (A:)	Endo-1,4-beta xylanase D xylan binding domain XBD, Cellulomonas fimi
2.2.2.2	d1nbca_ b.2.2.2 (A:)	Cellulosomal scaffolding protein A, scaffoldin, Clostridium thermocellum
2.2.2.2	d1g43a_ b.2.2.2 (A:)	Cellulosomal scaffolding protein A, scaffoldin, Clostridium cellulolyticum
2.2.2.2	d1tf4a2 b.2.2.2 (A:461-605)	Endo/exocellulase:cellobiose E-4 C-terminal domain, Thermomonospora fusca
2.2.2.2	d1g87a2 b.2.2.2 (A:457-614)	Endo/exocellulase:cellobiose E-4 C-terminal domain, Clostridium cellulolyticum atcc 35319
2.2.2.2	d1aoha_ b.2.2.2 (A:)	Cohesin domain, Clostridium thermocellum, cellulosome, various modules
2.2.2.2	d1zv9a1 b.2.2.2 (A:3-173)	Cellulosomal scaffoldin adaptor protein B, ScaB, Acetivibrio cellulolyticus
2.2.2.2	d1tyja1 b.2.2.2 (A:2-171)	Cellulosomal scaffoldin ScaA, Bacteroides cellulosolvens
2.2.2.2	d2bm3a1 b.2.2.2 (A:5-166)	Scaffolding dockerin binding protein A SdbA, Clostridium thermocellum

DisProt	Containing polypeptide	Location
DP00001	60S acidic ribosomal protein P1-B	1-108
DP00002	60S acidic ribosomal protein P2-beta	1-110
DP00004	Cathelicidin antimicrobial peptide	134-170
DP00005	Antitermination protein N	1-107
DP00006	Cytochrome c	1-104
DP00009	Transcription initiation factor IIA small subunit	89-103
DP00012	Cystic fibrosis transmembrane conductance regulator	708-831
DP00013	Choriogonadotropin subunit beta	112-145
DP00015	cAMP-dependent protein kinase inhibitor alpha	1-75
DP00016	Cyclin-dependent kinase inhibitor 1	1-164
DP00018	Cyclin-dependent kinase inhibitor 1B	22-106
DP00019	Cytochrome b-c1 complex subunit Rieske, mitochondrial	1-45
DP00020	DNA-binding protein RAP1	482-512
DP00022	EMB-1 protein	1-92
DP00024	Protein E7	1-98
DP00025	Fibronectin-binding protein A	745-873
DP00027	Negative regulator of flagellin synthesis	1-97
DP00028	Eukaryotic translation initiation factor 4E-binding protein 1	1-118
DP00031	Glycine N-methyltransferase	1-40
DP00032	Glycyl-tRNA synthetase	91-158
DP00034	Attachment protein G3P	236-274
DP00035	Guanine nucleotide-binding protein G(i), alpha-1 subunit	1-31
DP00038	Non-histone chromosomal protein HMG-14	1-99

Footnotes

Acknowledgments

A.A. was supported in part by the Italian Ministry of University and Research (under the Bi-National Project FIRB RBIN04BYZ7), by the United States–Israel Binational Science Foundation (BSF) (grant 008217), and by the Research Program of Georgia Tech. F.C. was supported in part by the Research Program of Georgia Tech and by the A. Gini Foundation, Padova, Italy.

Disclosure Statement

No competing financial interests exist.

1. An exception to this universal rule of disorder is represented by strongly nonrandom polypeptides: about 25% of all amino acids in current databases are estimated to be in “low complexity,” highly regular regions, and about 34% of all proteins in current databases are estimated to contain at least one such low complexity region (Wootton, 1994). These segments are routinely searched for and masked before local alignment searches (Wise, 2001; Jiménez-Montaño, ).

2. For each domain, we use at most three homologues from different species. Our choice of branches at each level, and of domains in each class, is arbitrary. For reasons of practical efficiency, domains in D₁ have lengths of 40–200.

3. By “empirical entropy,” we mean the approximation of the entropy of the ergodic source that generated each sequence, computed from the observed frequencies of amino acids.
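As a minimal sketch of the quantity described in the footnote on empirical entropy, the Python function below computes the entropy of the observed single-symbol frequencies of a sequence. The base-2 logarithm, the function name `empirical_entropy`, and the toy input strings are assumptions made for illustration; the article does not specify them.

```python
from collections import Counter
from math import log2


def empirical_entropy(sequence: str) -> float:
    """Entropy (bits per symbol) of the observed symbol frequencies.

    Assumption: single-symbol frequencies and log base 2.
    """
    if not sequence:
        raise ValueError("sequence must be non-empty")
    n = len(sequence)
    counts = Counter(sequence)  # occurrences of each amino acid symbol
    # H = -sum_a p_a * log2(p_a), with p_a = count(a) / n
    return -sum((c / n) * log2(c / n) for c in counts.values())


if __name__ == "__main__":
    # Toy strings, not taken from the dataset.
    for s in ("AAAA", "ACAC", "ACDE"):
        print(s, empirical_entropy(s))
```

A string made of a single repeated symbol has entropy zero under this estimate, while a string that uses k symbols equally often reaches log2(k) bits per symbol.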

4. Only the number of normal subsequences is approximately constant at ω = 2, but across the whole D₁.

5. When we write that a subtree corresponds to a SCOP group, we mean that the Jaccard coefficient between the subtree and the corresponding SCOP group is at least 0.3, and that there is no other subtree that achieves a significantly larger Jaccard coefficient when compared to the same SCOP group.
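The correspondence criterion in the footnote on Jaccard coefficients can be sketched as follows, treating a subtree and a SCOP group as sets of domain identifiers. The footnote does not quantify “significantly larger,” so the `margin` parameter is a hypothetical stand-in, as are the function names; only the 0.3 threshold comes from the text.

```python
from typing import FrozenSet, Iterable


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard coefficient |a ∩ b| / |a ∪ b| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def corresponds(subtree: FrozenSet[str],
                group: FrozenSet[str],
                all_subtrees: Iterable[FrozenSet[str]],
                threshold: float = 0.3,
                margin: float = 0.05) -> bool:
    """True if `subtree` matches `group` under the footnote's criterion.

    `margin` is an assumed reading of "significantly larger".
    """
    score = jaccard(subtree, group)
    if score < threshold:
        return False
    # Best coefficient achieved by any competing subtree for the same group.
    best_other = max(
        (jaccard(t, group) for t in all_subtrees if t is not subtree),
        default=0.0,
    )
    return best_other <= score + margin


if __name__ == "__main__":
    group = frozenset({"d1gofa1", "d1k3ia1", "d1w8oa1", "d1clca2"})
    sub_a = frozenset({"d1gofa1", "d1k3ia1", "d1w8oa1"})
    sub_b = frozenset({"d1clca2", "d1p7hl1"})
    print(corresponds(sub_a, group, [sub_a, sub_b]))
    print(corresponds(sub_b, group, [sub_a, sub_b]))
```

The domain identifiers in the usage example are borrowed from the dataset table only as labels; the subtrees themselves are invented for the illustration.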

6. All the coefficients reported here are computed using the fit function of the Matlab Curve Fitting Toolbox. A detailed investigation of the coefficients of such best interpolations is outside the scope of this paper.

7. There is a tendency for strings in folds 1.1–2.2 to assume a smaller proportion of antispecial and terminal points, and a larger proportion of special and normal points, compared to the rest of D₁. We choose not to investigate this detail in the present article.

References

1. Adami, Cerf 2000. Physical complexity of symbolic sequences. Physica D, 137:62–69.

2. Adjeroh, Nan 2006. On compressibility of protein sequences. Proc. Data Compression Conf., 422–434.

3. Anfinsen 1972. The formation and stabilization of protein structure. J. Biochem., 128:737–749.

4. Apostolico, Comin, Parida 2006. Mining, compressing and classifying with extensible motifs. Algorithms Mol. Biol., 1:4.

5. Apostolico, Denas 2008. Fast algorithms for computing sequence distances by exhaustive substring composition. Algorithms Mol. Biol., 3:13.

6. Apostolico, Denas, Dress 2010. Efficient tools for comparative substring analysis (submitted).

7. Apostolico, Giancarlo 1998. Sequence alignment in molecular biology. J. Comput. Biol., 5:173–196.

8. Benedetto, Caglioti, Chica 2007. Compressing proteomes: the relevance of medium range correlations. EURASIP J. Bioinform. Syst. Biol., 60723.

9. Benson, Waterman 1994. A method for fast database search for all k-nucleotide repeats. Nucleic Acids Res., 22:4828–4836.

10. Brooks F.P. Jr. 2003. Three great challenges for half-century-old computer science. J. ACM, 50:25–26.

11. Broome, Hecht 2000. Nature disfavors sequences of alternating polar and nonpolar amino acids. J. Mol. Biol., 296:961–968.

12. Cao, Dix, Allison et al. 2007. A simple statistical algorithm for biological sequence compression. Proc. 2007 Data Compression Conf., 43–52.

13. Carothers J.M., Oestreich S.C., Davis J.H. et al. 2004. Informational complexity and functional activity of RNA structures. J. Am. Chem. Soc., 126:5130–5137.

14. Chandonia, Hon, Walker et al. 2004. The ASTRAL compendium in 2004. Nucleic Acids Res., 32:D189–D192.

15. Chattaraj, Parida 2005. Varun: an inexact-suffix tree based algorithm for detecting extensible patterns. Theor. Comput. Sci., 335.

16. Chen, Kwong, Li 2000. A compression algorithm for DNA sequences and its applications in genome comparison. Proc. 4th Annu. Int. Conf. Comput. Mol. Biol., 107.

17. Cilibrasi, Vitányi 2005. Clustering by compression. IEEE T. Inform. Theory, 51:1523–1545.

18. Colosimo, De Luca 2000. Special factors in biological strings. J. Theor. Biol., 58:29–46.

19. Davidson, Lumb, Sauer 1995. Cooperatively folded proteins in random sequence libraries. Nat. Struct. Biol., 2:856–864.

20. Davidson, Sauer 1994. Folded proteins occur frequently in libraries of random amino acid sequences. Proc. Natl. Acad. Sci. USA, 91:2146–2150.

21. Doi, Kakukawa, Oishi et al. 2005. High solubility of random-sequence proteins consisting of five kinds of primitive amino acids. Protein Eng., 18:279–284.

22. Dufton 1997. Genetic code synonym quotas and amino acid complexity: cutting the cost of proteins? J. Theor. Biol., 187:165–173.

23. Ferragina, Giancarlo, Greco et al. 2007. Compression-based classification of biological sequences and structures via the universal similarity metric: experimental assessment. BMC Bioinform., 8:252.

24. Galas, Nykter, Carter et al. 2008. Set-based complexity and biological information. arXiv:0801.4024v1.

25. Giancarlo, Scaturro, Utro 2009. Textual data compression in computational biology: a synopsis. Bioinformatics, 25:1575–1586.

26. Gratzer 1998. General Lattice Theory, 2nd ed. Birkhauser: Berlin.

27. Han, Baker 1995. Recurring local sequence motifs in proteins. J. Mol. Biol., 251:176–187.

28. Han, Bystroff, Baker 1997. Three-dimensional structures and contexts associated with recurrent amino acid sequence patterns. Protein Sci., 6:1587–1590.

29. Hategan, Tabus 2004. Protein is compressible. Proc. 6th Nordic Signal Process. Symp., 192–195.

30. Hulo, Bairoch, Bulliard et al. 2008. The 20 years of PROSITE. Nucleic Acids Res., 36:D245–D249.

31. Jiménez-Montaño 1984. On the syntactic structure of protein sequences and the concept of grammar complexity. B. Math. Biol., 46:641–659.

32. Karlin 1992. Quantile distributions of amino acid usage in protein classes. Protein Eng., 5:729–738.

33. Kawashima, Pokarowski, Pokarowska et al. 2008. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res., 36:D202–D205.

34. Kocsor, Kertész-Farkas, Kaján et al. 2006. Application of compression-based distance measures to protein sequence classification: a methodological study. Bioinformatics, 22:407–412.

35. Lapinsh, Gutcaits, Prusis et al. 2002. Classification of G-protein coupled receptors by alignment-independent extraction of principal chemical properties of primary amino acid sequences. Protein Sci., 11:795–805.

36. , Chen, Li et al. 2003a. The similarity metric. Proc. 14th Annu. ACM-SIAM Symp. Discrete Algorithms, 863–872.

37. , Fan, Wang et al. 2003b. Reduction of protein sequence complexity by residue grouping. Protein Eng., 16:323–330.

38. , Obradovic, Brown et al. 2000. Comparing predictors of disordered protein. Genome Inform. Ser. Workshop Genome Inform., 11:172–184.

39. Liao, Yeh, Chiang et al. 2005. Protein sequence entropy is closely related to packing density and hydrophobicity. Protein Eng., 18:59–64.

40. Lupas 2001. On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J. Struct. Biol., 134:191–203.

41. Macchiato, Cuomo, Tramontano 1985. Determination of the autocorrelation orders of proteins. Eur. J. Biochem., 149:375–379.

42. Matsumoto, Sadakane, Imai 2000. Biological sequence compression algorithms. Genome Inform. Ser., 11:43–52.

43. Monod 1972. Chance and Necessity. Collins: London.

44. Murzin, Brenner, Hubbard et al. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247:536–540.

45. Nakai, Kidera, Kanehisa 1988. Cluster analysis of amino acid indices for prediction of protein structure and function. Protein Eng., 2:93–100.

46. Nevill-Manning, Witten 1999. Protein is incompressible. Proc. Conf. Data Compress., 257.

47. Orengo, Michie, Jones et al. 1997. CATH: a hierarchic classification of protein domain structures. Structure, 5:1093–1109.

48. Pande, Grosberg, Tanaka 1994. Nonrandomness in protein sequences: evidence for a physically driven stage of evolution. Proc. Natl. Acad. Sci. USA, 91:12972–12975.

49. Przytycka, Srinivasan, Rose 2002. Recursive domains in proteins. Protein Sci., 11:409–417.

50. Ptitsyn, Volkenstein 1986. Protein structure and neutral theory of evolution. J. Biomol. Struct. Dyn., 4:137–156.

51. , Wang, Hao 2004. Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach. J. Mol. Evol., 58:1–11.

52. Rackovsky 1998. Hidden sequence periodicities and protein architecture. Proc. Natl. Acad. Sci. USA, 95:8580–8584.

53. Rahman, Rackovsky 1995. Protein sequence randomness and sequence-structure correlations. Biophys. J., 68:1531–1539.

54. Richardson 1981. The anatomy and taxonomy of protein structure. Adv. Protein Chem., 34:167–339.

55. Romero, Obradovic, Dunker 1999. Folding minimal sequences: the lower bound for sequence complexity of globular proteins. FEBS Lett., 462:363–367.

56. Romero, Obradovic, Li et al. 2000. Sequence complexity of disordered protein. Proteins, 42:38–48.

57. Rost 2002. Did evolution leap to create the protein universe? Curr. Opin. Struct. Biol., 12:409–416.

58. Sampath 2003. A block coding method that leads to significantly lower entropy values for the proteins and coding sections of Haemophilus influenzae. Proc. IEEE Comput. Soc. Conf. Bioinform., 287.

59. Sankoff, Kruskal 1999. Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. CSLI Publications.

60. Schachtel, Bucher, Mocarski et al. 1991. Evidence for selective evolution in codon usage in conserved amino acid segments of human alphaherpesvirus proteins. J. Mol. Evol., 33:483–494.

61. Schwartz, King 2006. Frequencies of hydrophobic and hydrophilic runs and alternations in proteins of known structure. Protein Sci., 15:102–112.

62. Sickmeier, Hamilton, LeGall et al. 2007. DisProt: the database of disordered proteins. Nucleic Acids Res., 35:D786–D793.

63. Sims, Juna, Wu et al. 2009. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc. Natl. Acad. Sci. USA, 106:2677–2682.

64. Strait, Dewey 1996. The Shannon information entropy of protein sequences. Biophys. J., 71:148–155.

65. Szpankowski, Konorski 2007. What is information? Zeszyty Politechniki Gdanskiej, 5:171–177.

66. Ulitsky, Burstein, Tuller et al. 2006. The average common substring approach to phylogenomic reconstruction. J. Comput. Biol., 13:336–350.

67. Vinga, Almeida 2003. Alignment-free sequence comparison: a review. Bioinformatics, 19:513–523.

68. Wan, Wootton 2000. A global compositional complexity measure for biological sequences. Comput. Chem., 24:71–94.

69. Weathers, Paulaitis, Woolf et al. 2006. Insights into protein structure and function from disorder-complexity space. Proteins, 66:16–28.

70. Weiss, Herzel 1998. Correlations in protein sequences and property codes. J. Theor. Biol., 190:341–353.

71. Weiss, Jiménez-Montaño, Herzel 2000. Information content of protein sequences. J. Theor. Biol., 206:379–386.

72. White 1994. Global statistics of protein sequences: implications for the origin evolution and prediction of structure. Annu. Rev. Biophys. Biomol. Struct., 23:407–439.

73. White, Jacobs 1990. Statistical distribution of hydrophobic residues along the length of protein chains. Biophys. J., 57:911–921.

74. Wise 2001. 0j.py: a software tool for low complexity proteins and protein domains. Bioinformatics, 17:288–295.

75. Wootton 1994. Sequences with unusual amino acid compositions. Curr. Opin. Struct. Biol., 4:413–421.

76. Wright, Dyson 1999. Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J. Mol. Biol., 293:321–331.

77. , Huang, Li 2005. Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences. Bioinformatics, 21:4125–4132.

78. Xiao, Shao, Ding et al. 2005. Using complexity measure factor to predict protein subcellular location. Amino Acids, 28:57–61.
