An extension of fuzzy topological approach for comparison of genetic sequences

Abstract

Bioinformatics is a relatively new discipline in which mathematics is applied to the analysis of genetic sequences. The analysis of the genetic material of living organisms, which consists of the nucleic acids DNA and RNA, is of great importance for diagnosis and taxonomy. In the present paper we propose a new methodology for the representation of genetic sequences as fuzzy sets in the I ¹² space, which can significantly improve the results of Sadegh-Zadeh and Torres & Nieto. An important characteristic of the proposed methodology is that the location of amino acids along the genetic sequence plays an important role, thus significantly extending the computational efficiency advantage of the I ¹² representation of genetic sequences. We present some characteristic examples using the proposed methodology, in which we calculate the distance and similarity degree of given polynucleotides.

Keywords

DNA, RNA, polynucleotide, fuzzy sets, multi-fuzzy sets, metric spaces

1 Introduction

The study of genetic sequences is of great importance in biology and medicine, with sequence analysis and taxonomy being two major fields of application of bioinformatics. Two basic strategies are commonly employed: a) sequence analysis, i.e. determination of the building blocks of a nucleic acid (nucleotides) and their order in the molecular chain, and b) sequence comparison, used to identify the degree of difference/similarity between polynucleotides, e.g. in order to identify similarity with known viruses.

DNA and RNA are read as codons, i.e. triplets XYZ, where each of X, Y and Z is one of four nucleotides: {T, C, A, G} in the case of DNA and {U, C, A, G} in the case of RNA (A=Adenine, C=Cytosine, G=Guanine, T=Thymine, U=Uracil). DNA sequencing methodologies are important for biology and medicine in a broad range of applications such as molecular cloning, breeding, finding pathogenic genes, and comparative and evolution studies. DNA and RNA are in general very lengthy chains, and with the rapidly increasing number of genome sequences, DNA sequencing technologies should be fast and accurate [1]. A sequence similarity search often provides the first information about a new DNA or protein sequence. A search allows scientists to infer the function of a sequence from similar sequences [33]. Several applications concern the prediction of protein DNA-binding sites [85], since interactions between proteins and DNA play an important role in a large number of biological processes such as DNA replication, splicing, and repair. Identification of the amino acid residues involved in DNA-binding sites is important for understanding mechanisms of biological activities. Other applications concern bacterial sequencing [40] and microRNAs (miRNAs), which are small, non-coding, endogenous RNA molecules that play important roles in a variety of normal and diseased biological processes [86] as well as in pharmacogenetics [88]. There are several methodologies for sequence analysis, such as covariance discriminant [6], neural networks [30], support vector machine [31], random forest [38], nearest neighbor [12], K-nearest neighbor [13], and fuzzy K-nearest neighbor [82], to mention just a few.

Fuzzy methods have the advantage of incorporating into the model the uncertainties that exist in the data ([54], [79]). Sadegh-Zadeh [66] demonstrated that nucleic acids (DNA and RNA) can be treated as ordered fuzzy sets in a 12-dimensional space. In this frame the genetic code can be represented in a 12-dimensional space because a triplet codon XYZ has a 3 × 4 = 12 dimensional fuzzy code (a ₁, . . . , a ₁₂) corresponding to a point in the 12-dimensional fuzzy polynucleotide space [0, 1] ¹². Sadegh-Zadeh [66] and Nieto et al. (see [78], [53] and [54]) introduced the Fuzzy Polynucleotide Space (FPS) based on the principle of the fuzzy hypercube [39]. A polynucleotide consisting of a sequence of k triplets XYZ is a point in the $I^{12 \times k}$ space, where I = [0, 1]. However, in [78] the authors mapped a polynucleotide onto the I ¹² space by considering the frequencies of appearance of the nucleotides at the three base sites of a codon in the coding sequence. In that work, using a metric motivated by publications [44] and [66], the authors calculated distances between nucleotides and applied their algorithm to the comparison of complete genomes (such as M. tuberculosis and E. coli). Further work has recently been performed using the same principle (see [55]), in which the influence of several metrics on the comparison of nucleotides has been examined. The advantages of this methodology, which in fact reduces the information to a 12-dimensional space, are:

a) one can compare polynucleotides of very large length in a computationally very efficient way, since the whole information on the location of codons is transformed to frequencies of presence in a 12-dimensional space, and

b) one can apply the algorithm in order to compare polynucleotides of different length as it is the case for genomes of different organisms since all chains are reduced to a 12-dimensional space.

Another very interesting approach to representing polynucleotides for comparison purposes is the concept of pseudo amino acid (pseAA) composition, which was introduced by Chou (see [8]). In this approach a significant effort has been made to use various digital numbers to represent the 20 amino acids in order to better reflect the sequence-order effects, taking into account physical properties. The pseAA composition was originally introduced to improve the prediction quality for protein subcellular localization and membrane protein type ([8]), as well as for enzyme functional class ([9]). This concept can be employed in the representation of a protein sequence with a discrete model without, however, completely losing its sequence-order information ([14], [15]). Thus it is particularly useful for the analysis of a large amount of complicated protein sequences by means of the taxonomic approach. This methodology has been widely used to study various protein attributes, such as protein structural class ([2, 3], [81], [45], [22]), protein subcellular localization ([19], [14, 15], [74]), protein subnuclear localization ([70, 71], [52]), protein submitochondrial localization ([25]), protein oligomer type ([10]), conotoxin super-family classification ([49], [46]), membrane protein type ([47], [70], [71], [80], [73], [16], [17], [7]), apoptosis protein subcellular localization ([4], [5], [58]), enzyme functional classification ([9], [11], [87], [75]), protein fold pattern ([72]), signal peptide ([18], [76]) and mycobacterial protein subcellular locations ([29]). Some more recent work based on this principle can be found in [59], [58], [26], [83] and [68]. We point out that the metrics employed to calculate the difference/similarity play an important role in computational biology. Different metrics have been used to study secondary structures (see [51]) or biopolymer contact structures (see [43]).

The approach of Nieto et al. mentioned above, though very interesting from a computational point of view, presents some drawbacks since it does not take into account the detailed location of the amino acids along the polynucleotide chain. A vector defined in a discrete model may completely lose the sequence-order information. Thus it cannot distinguish two polynucleotides having the same frequencies of appearance of the amino acids but at different locations along the polynucleotide chain. However, it is very important to be able to determine how close two genetic sequences are, since there are many important biological and medical implications (see [20], [32], [34], [37], [41] and [42]).

In the present paper we introduce a new methodology for the representation of genetic sequences in the I ¹² space in which the location of the amino acids along the genome sequence is taken into account along with the frequency of presence. It is thus more effective in distinguishing polynucleotides that have the same number of amino acids located at different positions. To that end we present some characteristic examples comparing the results obtained in [66] and [78] with those obtained using the methodology proposed in the present paper.

The structure of the paper is as follows. In Section 2, we present some known notions for fuzzy sets and fuzzy hypercube. In Section 3, we give a new representation of genetic sequences in I ¹² and compare the results with the results of [66] and [78]. Finally in Section 4 the conclusions of the present work are summarized.

2 Fuzzy sets and fuzzy hypercube

2.1 Fuzzy sets

Let X be a set. A is a fuzzy subset of X if A = {(x, μ _A (x)) : x ∈ X}, where μ _A is a function of X into [0, 1] = I, that is A is the set of all pairs (x, μ _A (x)) such that x ∈ X and μ _A (x) is the degree of its membership in A.

In what follows, if X = {x ₁, x ₂, . . . , x _n} and $A = \{(x_{1}, μ_{A}(x_{1})), \ldots, (x_{n}, μ_{A}(x_{n}))\},$ then we write $A = (μ_{A}(x_{1}), \ldots, μ_{A}(x_{n})) .$

For a given set X = {x ₁, x ₂, . . . , x _n}, the set of all fuzzy subsets (of X) is precisely the unit hypercube I ⁿ = [0, 1] ⁿ, since any fuzzy subset A determines a point P ∈ I ⁿ given by P = (μ _A (x ₁) , . . . , μ _A (x _n)) (see [39]).

Also, any point P = (a ₁, . . . , a _n) ∈ I ⁿ generates a fuzzy subset A of X defined by the map μ _A : X → [0, 1] such that μ _A (x _i) = a _i, i = 1, 2, . . . , n.

Nonfuzzy or crisp subsets of X = {x ₁, . . . , x _n} are given by mappings $μ : X \to \{0, 1\}$ from the set X into the set {0, 1} and they are located at the 2ⁿ corners of the n-dimensional unit hypercube I ⁿ. So, the ground set X = {x ₁, . . . , x _n} is itself the fuzzy set (1, 1, . . . , 1) ∈ I ⁿ. Also, the empty fuzzy set is the fuzzy set (0, 0, . . . , 0) ∈ I ⁿ, denoted by ∅.

Hypercubical calculus is developed in [84], and some applications of the fuzzy unit hypercube are given in [54], [65] and [35]. In this context a codon corresponds to a corner of the 12-dimensional unit hypercube I ¹². Any element of I ¹² may be viewed as a fuzzy codon.

DNA and RNA can be treated as a language written using an alphabet of strings. The role of strings is played by several chemical compounds. In fact the alphabet for DNA is {T, C, A, G} while for RNA it is {U, C, A, G}, where A, C, G, T and U stand for Adenine, Cytosine, Guanine, Thymine and Uracil respectively. In this context, in the case of the RNA alphabet, U being the first letter of the alphabet is coded as (1, 0, 0, 0), that is a four-dimensional multi-fuzzy membership value (see [60], [61], [62], [63], and [64]): 1 because the first letter U is present, 0 since the second letter C does not appear, 0 since the third letter A is not present and 0 since the fourth letter G does not appear. In a similar way C is represented as (0, 1, 0, 0), A as (0, 0, 1, 0) and G as (0, 0, 0, 1). So if we have a polynucleotide described by the codon UCG (serine), this would be written in the I ¹² hypercube as: $(1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1) .$ There are cases where the exact chemical structure is not known for the complete sequence. In such a case some components of its fuzzy code are neither 0 nor 1 but take a value in the interval (0, 1).
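The coding of a codon as a corner of I ¹² can be illustrated with a short sketch. The function name `encode_codon` and the constant `ALPHABET` are illustrative choices, not part of the original formulation; the letter order (U, C, A, G) is the one used in the text.

```python
# Letter order (U, C, A, G) follows the RNA coding described in the text.
ALPHABET = "UCAG"


def encode_codon(codon):
    """Return the 12-component crisp code of a codon XYZ.

    Each of the three letters contributes four components: 1 at the
    position of the letter in ALPHABET and 0 elsewhere.
    """
    if len(codon) != 3:
        raise ValueError("a codon has exactly three letters")
    code = []
    for letter in codon:
        if letter not in ALPHABET:
            raise ValueError(f"unknown nucleotide: {letter!r}")
        code.extend(1 if letter == w else 0 for w in ALPHABET)
    return tuple(code)


print(encode_codon("UCG"))
# (1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1)
```

A partially known codon would replace some of these 0/1 entries by values in (0, 1); the sketch covers only the crisp case.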

2.2 NTV metric and similarity

Consider the n-dimensional unit hypercube I ⁿ. If $p = (p_{1}, \ldots, p_{n}), q = (q_{1}, \ldots, q_{n}) \in I^{n}$ are two different fuzzy polynucleotides, then we consider the following distance between the elements p and q: $d (p, q) = \frac{\sum_{i = 1}^{n} | p_{i} - q_{i} |}{\sum_{i = 1}^{n} \max \{p_{i}, q_{i}\}}$ (1) Also, if p = q = ∅ = (0, . . . , 0), then d (∅ , ∅) = 0 (see [53]).

The distance d is motivated by publications [44] and [66]. It is known that d is a metric [53], and it has already been employed in [78] and [55]. In [23] (see also [24]) it is proposed to call this metric the NTV metric.
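As a minimal sketch of Eq. (1), the distance can be computed as follows. The function name `ntv_distance` is an illustrative choice, and the two test points are the crisp codes of the codons UCG and UCA under the coding of Section 2.1.

```python
def ntv_distance(p, q):
    """Distance of Eq. (1) between two points of the unit hypercube."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same dimension")
    denominator = sum(max(a, b) for a, b in zip(p, q))
    if denominator == 0:
        # Both points are the empty fuzzy set; d is defined as 0 there.
        return 0.0
    return sum(abs(a - b) for a, b in zip(p, q)) / denominator


ucg = (1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1)  # codon UCG
uca = (1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0)  # codon UCA
print(ntv_distance(ucg, uca))  # numerator 2, denominator 4, so 0.5
```

The two codons differ only at the third base site, which contributes 2 to the numerator, while the denominator counts the 4 components where at least one of the points is nonzero.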

Let X = {x ₁, . . . , x _n} be a set and A = (a ₁, . . . , a _n) , B = (b ₁, . . . , b _n), where a _i, b _i ∈ [0, 1], two fuzzy sets of X. The degree of similarity between A and B (see [55]), denoted by sim (A, B), is defined as follows: if A and B are not both equal to (0, . . . , 0), then $sim (A, B) = \frac{\min \{a_{1}, b_{1}\} + \ldots + \min \{a_{n}, b_{n}\}}{\frac{a_{1} + b_{1}}{2} + \ldots + \frac{a_{n} + b_{n}}{2}}$ (2) and if A = B = (0, . . . , 0), then sim (A, B) = 1.
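Eq. (2) can be sketched in the same way. The function name `similarity` is an illustrative choice; the test points are again the crisp codes of the codons UCG and UCA.

```python
def similarity(a, b):
    """Degree of similarity of Eq. (2) between two fuzzy sets."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same dimension")
    denominator = sum((x + y) / 2 for x, y in zip(a, b))
    if denominator == 0:
        # Both fuzzy sets are empty; sim is defined as 1 in this case.
        return 1.0
    return sum(min(x, y) for x, y in zip(a, b)) / denominator


ucg = (1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1)  # codon UCG
uca = (1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0)  # codon UCA
print(similarity(ucg, uca))  # numerator 2, denominator 3
```

Here the two codes share two nonzero components (numerator 2) and each has three nonzero components (denominator 3), so the degree of similarity is 2/3.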

3 A new representation of genetic sequences in I¹²

Let r = X ₁ Y ₁ Z ₁ . . . X _k Y _k Z _k, where X _i, Y _i, Z _i ∈ {U, C, A, G} be a genetic sequence with k-triplets. Hence the sequence has 3k letters (nucleotides).

Let W ∈ {U, C, A, G}, i = 1, 2, 3, and j ∈ {1, . . . , k}. Then, by $s_{ij}^{W} \in \{0, 1\}$ we denote the number which is defined as follows: $s_{ij}^{W} = 1$ if W appears at the i-th coordinate of the j-th triplet, otherwise $s_{ij}^{W} = 0$ . Also, by $a_{i}^{W}$ we denote the following number: $a_{i}^{W} = \sum_{j = 1}^{k} s_{ij}^{W} .$ This number is the number of nucleotides W at base site i.
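The counts $a_{i}^{W}$ can be obtained by a single pass over the sequence. The following is a minimal sketch; the function name `site_counts` and the dictionary layout are illustrative assumptions, and the sequence UAC UGU is used only as a sample input.

```python
ALPHABET = "UCAG"


def site_counts(sequence):
    """Return a dict {(i, W): a_i^W} for base sites i = 1, 2, 3."""
    if len(sequence) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    counts = {(i, w): 0 for i in (1, 2, 3) for w in ALPHABET}
    for position, letter in enumerate(sequence):
        # position % 3 + 1 is the base site of this letter in its triplet
        counts[(position % 3 + 1, letter)] += 1
    return counts


counts = site_counts("UACUGU")
print(counts[(1, "U")], counts[(2, "U")], counts[(3, "U")])  # 2 0 1
```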

3.1 Nieto and Torres representation of genetic sequences in I¹²

Following the methodology of [78] we calculate the frequencies (fractions) of the nucleotide at the three base sites in order to obtain their fuzzy representation in the I ¹² hyperspace.

To compute the fractions $f_{i}^{W}$ of nucleotides at the three base sites of triplet we divide $a_{i}^{W}$ by the number k of triplets. So, we have: $f_{i}^{W} = \frac{a_{i}^{W}}{k},$ for i = 1, 2, 3 and W ∈ {U, C, A, G}.

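Finally, the genetic sequence r is written as a point in the hypercube I ¹² as follows: $r^{'} = (f_{1}^{U}, f_{1}^{C}, f_{1}^{A}, f_{1}^{G}, f_{2}^{U}, f_{2}^{C}, f_{2}^{A}, f_{2}^{G}, f_{3}^{U}, f_{3}^{C}, f_{3}^{A}, f_{3}^{G}) ,$ that is, the components are ordered by base site and, within each base site, by nucleotide in the order U, C, A, G, which is the ordering used in the examples below.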

Example 1. Suppose we have the polynucleotide described by the sequence $r_{1} = UAC UGU (tyrosine / cysteine)$ Then, we have k = 2 and respectively

$a_{1}^{U} = s_{11}^{U} + s_{12}^{U} = 2$ since U appears 2 times at the first position of the codons of the sequence,

$a_{2}^{U} = s_{21}^{U} + s_{22}^{U} = 0$ since U does not appear at any second position of the codons of the sequence,

and $a_{3}^{U} = s_{31}^{U} + s_{32}^{U} = 1$ since the U appears 1 time at the third position of the codons of the sequence.

In the same way we obtain $a_{1}^{C} = 0$ , $a_{2}^{C} = 0$ , $a_{3}^{C} = 1$ , $a_{1}^{A} = 0$ , $a_{2}^{A} = 1$ , $a_{3}^{A} = 0$ , $a_{1}^{G} = 0$ , $a_{2}^{G} = 1$ , $a_{3}^{G} = 0$ (see Table 1).

Following the methodology of Nieto and Torres we calculate the frequencies (fractions) of the nucleotides at the three base sites in order to obtain their fuzzy representation in the I ¹² hyperspace. So, we have: $f_{1}^{U} = 1$ , $f_{2}^{U} = 0$ , $f_{3}^{U} = 0.5$ , $f_{1}^{C} = 0$ , $f_{2}^{C} = 0$ , $f_{3}^{C} = 0.5,$ $f_{1}^{A} = 0$ , $f_{2}^{A} = 0.5$ , $f_{3}^{A} = 0,$ $f_{1}^{G} = 0$ , $f_{2}^{G} = 0.5$ , and $f_{3}^{G} = 0$ (see Table 2).

As a result sequence r ₁ would be written in the I ¹² space as $r_{1}^{'} = (1, 0, 0, 0, 0, 0, 0.5, 0.5, 0.5, 0.5, 0, 0) .$

Remark. If we consider the sequence $UGU UAC (cysteine / tyrosine)$ we obtain the same results as in Tables 1 and 2 above, although the two sequences present different codon locations. In order to distinguish the sequence UGU UAC from UAC UGU, we incorporate the influence of the order into our new representation of genetic sequences in I ¹².
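The observation of the Remark can be checked with a short sketch of the frequency representation. The function name `frequency_representation` is an illustrative choice; the components are ordered by base site and then by nucleotide (U, C, A, G), which is the ordering of the numerical vectors in the examples.

```python
ALPHABET = "UCAG"


def frequency_representation(sequence):
    """Map a sequence of k triplets to the point of I^12 whose components
    are the fractions f_i^W = a_i^W / k."""
    if not sequence or len(sequence) % 3 != 0:
        raise ValueError("sequence must consist of complete triplets")
    codons = [sequence[j:j + 3] for j in range(0, len(sequence), 3)]
    k = len(codons)
    return tuple(
        sum(codon[i] == w for codon in codons) / k
        for i in range(3)        # base site
        for w in ALPHABET        # nucleotide
    )


r1 = frequency_representation("UACUGU")
swapped = frequency_representation("UGUUAC")
print(r1)
print(r1 == swapped)  # True: the codon order is lost
```

Since only the counts per base site enter the fractions, any permutation of the codons of a sequence yields the same point of I ¹².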

3.2 A new representation of genetic sequences in I¹²

Let r = X ₁ Y ₁ Z ₁ . . . X _k Y _k Z _k, where X _i, Y _i, Z _i ∈ {U, C, A, G} be a genetic sequence, W ∈ {U, C, A, G}, i = 1, 2, 3, and j ∈ {1, . . . , k}. In the new representation of genetic sequence in I ¹² we follow two steps described below.

Step 1. For each base site i and each nucleotide W we calculate the number $a_{i}^{W}$ of occurrences of W at base site i of the codons of the sequence r, together with the sum $\sum_{j = 1}^{k} j \cdot s_{ij}^{W}$ of the locations j of the triplets which contain W at base site i.

Step 2. The numbers $φ_{i}^{W} = \frac{a_{i}^{W}}{\sum_{j = 1}^{k} j \cdot s_{ij}^{W}}$ are obtained for i = 1, 2, 3 and W ∈ {U, C, A, G}. Of course, if $s_{ij}^{W} = 0$ for every j = 1, 2, . . . , k, i.e. there is no W at position i, then we set $φ_{i}^{W} = 0$ . Since j ≥ 1, the denominator is never smaller than $a_{i}^{W}$, so every $φ_{i}^{W}$ lies in [0, 1].

As a result, the sequence r is written in the I ¹² hypercube following the new methodology as $r^{''} = (φ_{1}^{U}, φ_{1}^{C}, φ_{1}^{A}, φ_{1}^{G}, φ_{2}^{U}, φ_{2}^{C}, φ_{2}^{A}, φ_{2}^{G}, φ_{3}^{U}, φ_{3}^{C}, φ_{3}^{A}, φ_{3}^{G}) ,$ with the components ordered by base site and, within each base site, by nucleotide in the order U, C, A, G. In this way we take into account the position of the nucleotides.
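The two steps can be sketched as follows. The function name `positional_representation` is an illustrative choice, exact fractions are used to avoid rounding, and the components are ordered by base site and then by nucleotide (U, C, A, G), the ordering of the numerical vectors in the examples.

```python
from fractions import Fraction

ALPHABET = "UCAG"


def positional_representation(sequence):
    """Map a sequence of k triplets to the point of I^12 whose components
    are phi_i^W = a_i^W / (sum of the locations j of the triplets that
    carry W at base site i), or 0 if W never occurs at base site i."""
    if not sequence or len(sequence) % 3 != 0:
        raise ValueError("sequence must consist of complete triplets")
    codons = [sequence[j:j + 3] for j in range(0, len(sequence), 3)]
    point = []
    for i in range(3):                      # base site
        for w in ALPHABET:                  # nucleotide
            # Step 1: locations (1-based) of the triplets with W at site i
            locations = [j for j, codon in enumerate(codons, start=1)
                         if codon[i] == w]
            # Step 2: count divided by the sum of locations
            if locations:
                point.append(Fraction(len(locations), sum(locations)))
            else:
                point.append(Fraction(0))
    return tuple(point)


print(positional_representation("UACUGU"))
print(positional_representation("UACUGU")
      == positional_representation("UGUUAC"))  # False
```

For UAC UGU the first component is 2/(1 + 2) = 2/3, and swapping the two codons changes the point, in contrast to the frequency representation.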

In the following we present some applications of the new methodology, calculating the distance between selected polynucleotides and their degree of similarity, and we compare the results with those obtained using the method of Torres & Nieto in order to stress the advantages of the new method. We note that a single prime corresponds to the NTV representation and a double prime to the new representation.

Example 2. Consider the polynucleotide of Example 1, $r_{1} = UAC UGU (tyrosine / cysteine).$ It is a point in $I^{2 \times 12} = I^{24}$, represented by $r_{1} = (1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0) .$

We describe now how one can represent a sequence in the frame of the new proposed methodology using the number of nucleotides at the three base sites of a codon and the location of triplets in the genetic sequence. We describe this method by an application on the sequence r ₁. We follow the two steps and the results are summarized in Table 3.

Step 1. For each base site i and each nucleotide W we calculate the number $a_{i}^{W}$ of occurrences of W at base site i of the codons of the sequence r ₁, together with the sum $\sum_{j = 1}^{k} j s_{ij}^{W}$ of the locations j of the triplets of r ₁ which contain W at base site i.

Step 2. The numbers $φ_{i}^{W} = \frac{a_{i}^{W}}{\sum_{j = 1}^{k} j s_{ij}^{W}}$ are obtained for i = 1, 2, 3 and W ∈ {U, C, A, G} (see Table 3).

As a result sequence r ₁ would be written in the I ¹² hypercube following the new methodology as $r_{1}^{''} = (\frac{2}{3}, 0, 0, 0, 0, 0, 1, \frac{1}{2}, \frac{1}{2}, 1, 0, 0) .$
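For the reader's convenience, the nonzero components of this vector can be traced back to the definition of $φ_{i}^{W}$ with k = 2. The following worked computation is a sketch derived directly from the sequence UAC UGU; in the block, \varphi stands for φ.

```latex
\begin{align*}
\varphi_{1}^{U} &= \frac{a_{1}^{U}}{1 \cdot s_{11}^{U} + 2 \cdot s_{12}^{U}}
                 = \frac{2}{1 + 2} = \frac{2}{3}, \\
\varphi_{2}^{A} &= \frac{1}{1 \cdot 1 + 2 \cdot 0} = 1, \qquad
\varphi_{2}^{G}  = \frac{1}{1 \cdot 0 + 2 \cdot 1} = \frac{1}{2}, \\
\varphi_{3}^{C} &= \frac{1}{1 \cdot 1 + 2 \cdot 0} = 1, \qquad
\varphi_{3}^{U}  = \frac{1}{1 \cdot 0 + 2 \cdot 1} = \frac{1}{2}.
\end{align*}
```

All remaining components vanish because the corresponding nucleotide does not occur at that base site.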

Example 3. Consider the sequences $r_{2} = CAA UGU (glutamine / cysteine)$ and $r_{3} = UGU CAA (cysteine / glutamine),$ which, as we can see below, have the same frequency of presence of amino acids but at different positions along the polynucleotide chain.

The sequence r ₂ has the characteristic that it differs from r ₃ at the location of triplets.

In the I ²⁴ space these sequences are written: $r_{2} = (0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0)$ and $r_{3} = (1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0) .$
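The two I ²⁴ codes can be reproduced by concatenating the four-component codes of the individual letters. The sketch below uses the illustrative name `encode_sequence`; it also counts the components in which both codes equal 1.

```python
ALPHABET = "UCAG"


def encode_sequence(sequence):
    """Concatenate the crisp 4-component codes of all letters, giving a
    point of I^(12k) for a sequence of k triplets."""
    if len(sequence) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return tuple(int(letter == w) for letter in sequence for w in ALPHABET)


r2 = encode_sequence("CAAUGU")
r3 = encode_sequence("UGUCAA")
print(r2)
print(r3)

# Number of components in which both codes are 1
shared = sum(a * b for a, b in zip(r2, r3))
print(shared)  # 0
```

Because the two codes share no nonzero component, the sum of minima in Eq. (2) vanishes and the numerator and denominator of Eq. (1) coincide.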

Following the methodology of [78] we have the corresponding results in Tables 4 and 5.

As a result sequence r ₂ would be written in the I ¹² space as $r_{2}^{'} = (0.5, 0.5, 0, 0, 0, 0, 0.5, 0.5, 0.5, 0, 0.5, 0) .$ Following the methodology of Torres & Nieto the sequence r ₃ is represented in the I ¹² as: $r_{3}^{'} = (0.5, 0.5, 0, 0, 0, 0, 0.5, 0.5, 0.5, 0, 0.5, 0) .$

Now, following our new methodology for the sequence r ₂ we present the corresponding results in Table 6.

As a result sequence r ₂ would be written in the I ¹² hypercube as: $r_{2}^{''} = (0.5, 1, 0, 0, 0, 0, 1, 0.5, 0.5, 0, 1, 0) .$

In a similar way, following the new proposed methodology, sequence r ₃ gives the corresponding results appearing in Table 7.

As a result sequence r ₃ would be written in the I ¹² hypercube as: $r_{3}^{''} = (1, 0.5, 0, 0, 0, 0, 0.5, 1, 1, 0, 0.5, 0) .$

Using Eq. (1) (respectively, Eq. (2)) we compute the distance (respectively, the degree of similarity) between the sequences r ₂ and r ₃ for the above three representations. So, we have:

In the case of the Sadegh-Zadeh representation: $r_{2} = (0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0)$ and $r_{3} = (1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0),$ then $d (r_{2}, r_{3}) = 1$ and $sim (r_{2}, r_{3}) = 0 .$ Thus sequences r ₂, r ₃ appear to be completely different.

In the case of the Torres & Nieto representation: $r_{2}^{'} = (0.5, 0.5, 0, 0, 0, 0, 0.5, 0.5, 0.5, 0, 0.5, 0)$ and $r_{3}^{'} = (0.5, 0.5, 0, 0, 0, 0, 0.5, 0.5, 0.5, 0, 0.5, 0),$ then $d (r_{2}^{'}, r_{3}^{'}) = 0$ and $sim (r_{2}^{'}, r_{3}^{'}) = 1 .$ Thus sequences r ₂, r ₃ appear to be exactly the same.

In the case of the new proposed representation: $r_{2}^{''} = (0.5, 1, 0, 0, 0, 0, 1, 0.5, 0.5, 0, 1, 0)$ and $r_{3}^{''} = (1, 0.5, 0, 0, 0, 0, 0.5, 1, 1, 0, 0.5, 0),$ then $d (r_{2}^{''}, r_{3}^{''}) = 0.5$ and $sim (r_{2}^{''}, r_{3}^{''}) = \frac{2}{3} \approx 0.67 .$
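The values for the new representation can be verified by hand from Eqs. (1) and (2). In each of the six nonzero components one of the two vectors has the entry 0.5 and the other the entry 1, so that, as a worked sketch:

```latex
\begin{align*}
\sum_{i=1}^{12} |p_{i} - q_{i}| &= 6 \cdot 0.5 = 3, &
\sum_{i=1}^{12} \max\{p_{i}, q_{i}\} &= 6 \cdot 1 = 6, &
d &= \frac{3}{6} = 0.5, \\
\sum_{i=1}^{12} \min\{p_{i}, q_{i}\} &= 6 \cdot 0.5 = 3, &
\sum_{i=1}^{12} \frac{p_{i} + q_{i}}{2} &= 6 \cdot 0.75 = 4.5, &
sim &= \frac{3}{4.5} = \frac{2}{3}.
\end{align*}
```

Here p and q denote the vectors $r_{2}^{''}$ and $r_{3}^{''}$ respectively.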

The increase (respectively, decrease) in distance (respectively, in similarity) between polynucleotides is a quantitative measure of the number of differences in the chemical composition and sequential order of the polynucleotides (see [55]).

As we have seen, the calculated distances and similarities differ in the three cases, and it is of interest to discuss the observed differences. In the case of the Sadegh-Zadeh representation the sequences appear to be completely different, which is not exactly the case since there is a certain similarity of some parts. In the case of the Torres & Nieto representation they appear to be exactly the same, which is not true since they present differences in the location of the codons. However, in the frame of the new proposed representation they have a degree of similarity which in our opinion represents reality in a more exact way, while at the same time we maintain the advantage of the I ¹² space representation, which keeps the computational cost very low.

Example 4. In this example we consider longer polynucleotides. Consider the sequences: $r_{4} = UAC UCG UAC (tyrosine / serine / tyrosine),$ $r_{5} = UAC UAC UCG (tyrosine / tyrosine / serine),$ and $r_{6} = UCG UAC UAC (serine / tyrosine / tyrosine) .$

The sequences r ₄, r ₅ and r ₆ have three triplets with the same frequency of presence of the amino acids but at different locations.

Following the methodology of Torres and Nieto [78] the corresponding results for r ₄ appear in Tables 8 and 9.

As a result following the methodology of Torres & Nieto sequence r ₄ can be written in the I ¹² space as: $r_{4}^{'} = (1, 0, 0, 0, 0, \frac{1}{3}, \frac{2}{3}, 0, 0, \frac{2}{3}, 0, \frac{1}{3}),$

sequence r ₅ as: $r_{5}^{'} = (1, 0, 0, 0, 0, \frac{1}{3}, \frac{2}{3}, 0, 0, \frac{2}{3}, 0, \frac{1}{3}) .$

and sequence r ₆ as: $r_{6}^{'} = (1, 0, 0, 0, 0, \frac{1}{3}, \frac{2}{3}, 0, 0, \frac{2}{3}, 0, \frac{1}{3}) .$

This is a characteristic example of a case where several sequences correspond to the same point in the frame of the Torres and Nieto representation although they differ in the location of the amino acids along the polynucleotide chain.

Now, following our new methodology for sequence r ₄ we obtain the corresponding results appearing in Table 10.

As a result sequence r ₄ would be written in the I ¹² hypercube as $r_{4}^{''} = (\frac{3}{6}, 0, 0, 0, 0, \frac{1}{2}, \frac{2}{4}, 0, 0, \frac{2}{4}, 0, \frac{1}{2})$ or $r_{4}^{''} = (\frac{1}{2}, 0, 0, 0, 0, \frac{1}{2}, \frac{1}{2}, 0, 0, \frac{1}{2}, 0, \frac{1}{2}) .$

Similarly for the sequence r ₅ we have the corresponding results in Table 11.

As a result sequence r ₅ would be written in the I ¹² hypercube as $r_{5}^{''} = (\frac{1}{2}, 0, 0, 0, 0, \frac{1}{3}, \frac{2}{3}, 0, 0, \frac{2}{3}, 0, \frac{1}{3}) .$

Now, following our new methodology for the sequence r ₆ we obtain the results appearing in Table 12.

As a result sequence r ₆ would be written in the I ¹² hypercube as $r_{6}^{''} = (\frac{1}{2}, 0, 0, 0, 0, \frac{1}{1}, \frac{2}{5}, 0, 0, \frac{2}{5}, 0, \frac{1}{1}) .$

We see that the three sequences have different representations which is the advantage of the new methodology.

Using the Eq. (1) (respectively, the Eq. (2)) we compute the distance (respectively, the degree of similarity) between the sequences r ₄, r ₅ and r ₆ for the above three representations. So, we have:

In the case of Torres & Nieto the representations of the sequences are the following: $r_{4}^{'} = (1, 0, 0, 0, 0, \frac{1}{3}, \frac{2}{3}, 0, 0, \frac{2}{3}, 0, \frac{1}{3}),$ $r_{5}^{'} = (1, 0, 0, 0, 0, \frac{1}{3}, \frac{2}{3}, 0, 0, \frac{2}{3}, 0, \frac{1}{3}),$ and $r_{6}^{'} = (1, 0, 0, 0, 0, \frac{1}{3}, \frac{2}{3}, 0, 0, \frac{2}{3}, 0, \frac{1}{3}) .$

And the resulting distances and similarities are $d (r_{4}^{'}, r_{5}^{'}) = d (r_{4}^{'}, r_{6}^{'}) = d (r_{5}^{'}, r_{6}^{'}) = 0$ and $sim (r_{4}^{'}, r_{5}^{'}) = sim (r_{4}^{'}, r_{6}^{'}) = sim (r_{5}^{'}, r_{6}^{'}) = 1 .$

Thus the three representations appear to be exactly the same.

In the case of the new proposed methodology the representations of the sequences are: $r_{4}^{''} = (0.5, 0, 0, 0, 0, 0.5, 0.5, 0, 0, 0.5, 0, 0.5),$ $r_{5}^{''} = (0.5, 0, 0, 0, 0, \frac{1}{3}, \frac{2}{3}, 0, 0, \frac{2}{3}, 0, \frac{1}{3}),$ and $r_{6}^{''} = (\frac{1}{2}, 0, 0, 0, 0, \frac{1}{1}, \frac{2}{5}, 0, 0, \frac{2}{5}, 0, \frac{1}{1}) .$

And the resulting distances and similarities are $d (r_{4}^{''}, r_{5}^{''}) = 0.2352941176,$ $sim (r_{4}^{''}, r_{5}^{''}) = 0.8666666667,$ $d (r_{4}^{''}, r_{6}^{''}) = 0.3428571429,$ $sim (r_{4}^{''}, r_{6}^{''}) = 0.7931034483,$ $d (r_{5}^{''}, r_{6}^{''}) = 0.4869565217,$ and $sim (r_{5}^{''}, r_{6}^{''}) = 0.6781609195 .$
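The numbers of Example 4 can be reproduced with exact rational arithmetic. The sketch below is self-contained; the function names are illustrative choices, and the components are ordered by base site and then by nucleotide (U, C, A, G), as in the vectors above.

```python
from fractions import Fraction

ALPHABET = "UCAG"


def positional_representation(sequence):
    """New representation: count of W at base site i divided by the sum
    of the (1-based) locations of the triplets carrying W at site i."""
    codons = [sequence[j:j + 3] for j in range(0, len(sequence), 3)]
    point = []
    for i in range(3):
        for w in ALPHABET:
            locs = [j for j, c in enumerate(codons, start=1) if c[i] == w]
            point.append(Fraction(len(locs), sum(locs)) if locs else Fraction(0))
    return tuple(point)


def ntv_distance(p, q):
    """Distance of Eq. (1)."""
    den = sum(max(a, b) for a, b in zip(p, q))
    return sum(abs(a - b) for a, b in zip(p, q)) / den if den else Fraction(0)


def similarity(p, q):
    """Degree of similarity of Eq. (2)."""
    den = sum((a + b) / 2 for a, b in zip(p, q))
    return sum(min(a, b) for a, b in zip(p, q)) / den if den else Fraction(1)


r4 = positional_representation("UACUCGUAC")
r5 = positional_representation("UACUACUCG")
r6 = positional_representation("UCGUACUAC")

for name, (p, q) in {"r4,r5": (r4, r5), "r4,r6": (r4, r6), "r5,r6": (r5, r6)}.items():
    d, s = ntv_distance(p, q), similarity(p, q)
    print(f"{name}: d = {d} = {float(d):.10f}, sim = {s} = {float(s):.10f}")
```

In exact form the distances are 4/17, 12/35 and 56/115, and the degrees of similarity are 13/15, 23/29 and 59/87.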

Again we see that in the case of Torres & Nieto the sequences appear to be exactly the same which does not reflect the underlying reality since differences exist in the location of codons. On the other hand in the frame of the new proposed representation one can differentiate among the three different sequences taking into account the location of the codons.

4 Conclusions

In the present paper we propose a new method of representation of genetic sequences as fuzzy sets in the I ¹² space, which extends the methods originally introduced by Sadegh-Zadeh and by Torres & Nieto. In this new representation the location of amino acids in the genetic sequence plays an important role in the resulting representation, along with the frequency of presence of the codons.

We demonstrate the utility of the new methodology by applying it to some simple but characteristic cases of polynucleotides, alongside the methodologies of Torres & Nieto and of Sadegh-Zadeh. The comparison of our results with the results of [66] and [78] shows that the new method gives a better image of genetic sequences as fuzzy sets in the I ¹² space, since it maintains the computational efficiency of the method of Torres & Nieto while at the same time making it possible to differentiate polynucleotides that appear to be the same in the previous formulation. Further studies are in progress in order to investigate in more detail the properties of these notions and their biological implications, as well as their combination with Chou's concept of pseudo amino acid composition, which also takes into account several physical properties related to codons.

References

Ajay

Parker

Ozel Abaan

Fuentes Fajardo

Margulies

2011

Accurate and comprehensive sequencing of personal genomes

Genome Research 21 1498 1505

Chen

Zhou

Tian

Zou

Cai

2006 a

Predicting protein structural class with pseudo amino acid composition and support vector machine fusion network

Analytical Biochemistry 357 116 121

Chen

Tian

Zou

Cai

2006 b

Using pseudo amino acid composition and support vector machine to predict protein structural class

Journal of Theoretical Biology 243 444 448

Chen

2007 a

Prediction of apoptosis proteins ubcellular location using improved hybrid approach and pseudo amino acid composition

Journal of Theoretical Biology 248 377 381

Chen

2007 b

Prediction of the subcellular location of apoptosis proteins

Journal of Theoretical Biology 245 775 783

Chen

Lin

Feng

Ding

Zuo

Chou

2012

iNuc-PhysChem: A sequence-based predictor for identifying nucleosomes via physicochemical properties

PLoS One 7 e47843

Chen

2013

Predicting membrane protein types by incorporating protein topology, domains, signal peptides, and physicochemical properties into the general form of Chou’s pseudo amino acid composition, ournal of Theoretical Biology

318 1 12

K. Chou, Prediction of protein cellular attributes using pseudoamino acid composition, ProteinsŮUStructure, Function, and Genetics 43 (2001), 246–255. (Erratum: Prediction of protein cellular attributes using pseudo amino acid composition, Proteins UŮ Structure, Function, and Genetics 44 p. 60).

Chou

2005

Using amphiphilic pseudo amino acid composition to predict enzyme subfamilyclasses

Bioinformatics 21 10 19

10.

Chou

Cai

2003

Predicting protein quaternary structure by pseudo amino acid composition

Proteins-Structure, Function, and Genetics 53 282 289

11.

Chou

Cai

2004

Predicting enzyme family class in a hybridization space

Protein Science 13 2857 2863

12.

Chou

Cai

2006

Prediction of protease types in a hybridization space

Biochem Biophys Res Commun 339 1015 1020

13.

Chou

Shen

2006

Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-nearest neighbor classifiers

J Proteome Res 5 1888 1897

14.

Chou

Shen

2007 a

Euk-mPLoc: A fusion classifier forlarge-scale eukaryotic protein subcellular location prediction by incorporating multiple sites

Journal of Proteome Research 6 1728 1734

15.

Chou

Shen

2007 b

Review: Recent progresses in protein subcellular location prediction

Analytical Biochemistry 370 1 16

16.

Chou

Shen

2007 c

Large-scale plant protein subcellular location prediction

Journal of Cellular Biochemistry 100 665 678

17.

Chou

Shen

2007 d

MemType-2L: A webserver for predicting membrane proteins and their types by incorporating evolution information through PseŰPSSM

Biochemical and Biophysical Research Communications 360 339 345

18.

Chou

Shen

2007 e

Signal-CF: A subsite-coupled and window-fusing approach for predicting signal peptides

Biochemical and Biophysical Research Communications 357 633 640

19.

Chou

Shen

2008

Cell-PLoc: A package of webservers for predicting subcellular localization of proteins in various organisms

Nature Protocols 3 153 162

20.

DasGupta

Jiang

Kannan

Sweedyk

1998

On the complexity and approximation of syntenic distance

Discrete Applied Mathematics 88 59 82

21.

De Luca

Termini

1972

A definition of a nonprobabilistic entropy in the setting of fuzzy sets theory

Inform and Control 20 301 312

22.

Ding

Zhang

Chou

2007

Prediction of protein structure classes with pseudo amino acid composition and fuzzy support vector machine network

Protein and Peptide Letters 14 811 815

23.

Dress

Lokot

2003

A simple proof of the triangle inequality for the NTV metric

Applied Mathematics Letters 16 809 813

24.

Dress

Lokot

Pustyl’nikov

2004

A new scaleinvariant Geometry of L₁ space

Applied Mathematics Letters 17 815 820

25.

2006

Prediction of protein submitochondria locations by hybridizing pseudo amino acid composition with various physicochemical features of segmented sequence

BMC Bioinformatics 7 5 18

26.

Jiao

2014

PseAAC-General: Fast building various modes of general form of ChouŠs pseudo-amino acid composition for large-scale protein datasets

International Journal of Molecular Sciences 15 3495 3506

27.

Engelking

1977

General Topology

Warszawa

28.

Fan

J-L

Y-L

2002

Some new fuzzy entropy formulas

Fuzzy Sets and Systems 128 277 284

29.

Fan

2012

Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou’s pseudo amino acid composition

Journal of Theoretical Biology 304 88 95

30.

Feng

Cai

Chou

2005

Boosting classifier for predicting protein domain structural class

Biochem Biophys Res Commun 334 213 217

31.

Feng

Chen

Lin

Chou

2013

iHSPPseRAAAC: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition

Anal Biochem 442 118 125

32.

Foster

Heath

Afzal

1999

Application of distance geometry to 3D visualization of sequence relationships

Bioinformatics 15 89 90

33.

Gonzaga-Jauregui

Lupski

Gibbs

2012

Human genome sequencing in health and disease

Annu Rev Med 63 35 61

34.

Gusev

Nemytikova

Chuzhanova

1999

On the complexity measures of genetic sequences

Bioinformatics 15 994 999

35.

Helgason

Jobe

1998

The fuzzy cube and causal efficacy: Representation of concomitant mechanisms in stroke

Neural Networks 11 549 555

36.

Jamshidi

Edwards

Fahland

Church

Palsson

2001

Dynamic simulation of the human red blood cell metabolic network

Bioinformatics 17 286 287

37.

Jiang

Lin

Zhang

2002

A general edit distance between RNA structures

Journal of Computational Biology 9 371 388

38.

Kandaswamy

Chou

Martinetz

Moller

Suganthan

Sridharan

Pugalenthi

2011

AFP-Pred: A random forest approach for predicting antifreeze proteins from sequence-derived properties

Journal of Theoretical Biology 270 56 62

39.

Kosko

1992

Neural Networks and Fuzzy Systems

Prentice-Hall

Englewood Cliffs, NJ

40.

Land

Hauser

Jun

S-R

Nookaew

Leuze

Ahn

T-H

Karpinets

Lund

Kora

Wassenaar

Poudel

Ussery

2015

Insights from 20 years of bacterial genome sequencing

Functional and Integrative Genomics 15 141 161

41.

Liben-Nowell

2001

On the structure of syntenic distance

Journal of Computational Biology 8 53 67

42.

Badger

Chen

Kwong

Kearney

Zhang

2001

An information-based sequence distance and its application to whole mitochondrial genome phylogeny

Bioinformatics 17 149 154

43.

Llabres

Rossello

2004

A new family of metrics for biopolymer contact structures

Computational Biology and Chemistry 28 21 37

44.

Lin

1997

Adaptive subsethood for radial basis fuzzy systems

Kosko

429 464

Fuzzy Engineering, Prentice-Hall

Upper Saddle River, NJ

45.

Lin

2007 a

Using pseudo amino acid composition to predict protein structural class: Approached by incorporating 400 dipeptide components

Journal of Computational Chemistry 28 1463 1466

46.

Lin

2007 b

Predicting conotoxin superfamily and family by using pseudo amino acid composition and modified Mahalanobis discriminant

Biochemical and Biophysical Research Communications 354 548 551

47.

Liu

Wang

Chou

2005

Low-frequency Fourier spectrum for predicting membrane protein types

Biochemical and Biophysical Research Communications 336 737 739

48.

Giulia

2005

Sublinear growth of information in DNA sequences

Bulletin of Mathematical Biology 67 737 759

49.

Mondal

Bhavna

MohanBabu

Ramakumar

2006

Pseudo amino acid composition and multi-class support vector machines approach for conotoxin superfamily classification

Journal of Theoretical Biology 243 252 260

50.

Morgenstern

2002

A simple and space-efficient fragment-chaining algorithm for alignment of DNA and protein sequences

Applied Mathematics Letters 15 11 16

51.

Moulton

Zuker

Steel

Pointon

Penny

2000

Metrics on RNA secondary structures

Journal of Computational Biology 7 277 292

52.

Mundra

Kumar

Jayaraman

Kulkarni

2007

Using pseudo amino acid composition to predict protein subnuclear localization: Approached with PSSM

Pattern Recognition Letters 28 1610 1615

53.

Nieto

Torres

Vazquez-Trasande

2003

A metric space to study differences between polynucleotides

Applied Mathematics Letters 16 1289 1294

54.

Nieto

Torres

2003

Midpoints for fuzzy sets and their application in medicine

Artificial Intelligence in Medicine 17 81 101

55.

Nieto

Torres

Georgiou

Karakasidis

2006

Fuzzy polynucleotide spaces and metrics

Bulletin of Mathematical Biology 68 703 725

56.

Paun

Rozenberg

Salomaa

1998

DNA Computing: New Computing Paradigms

Springer

Berlin

57.

Percus

2002

Mathematics of Genome Analysis

Cambridge University Press

Cambridge

58.

Qin

Zheng

Huang

2013

Locating apoptosis proteins by incorporating the signal peptide cleavage sites into the general form of Chou’s Pseudo amino acid composition

International Journal of Quantum Chemistry article in press

59.

Qiu

Xiao

Chou

2014

iRSpot-TNCPseAAC: Identify recombination spots with trinucleotide composition and pseudo amino acid components

International Journal of Molecular Sciences 15 1746 1766

60.

Sebastian

Sabu

Ramakrishnan

2010

Multi-fuzzy sets

Int Math Forum 50 2471 2476

61.

Sebastian

Sabu

Ramakrishnan

2011

Multi-fuzzy sets: An extension of fuzzy sets

Fuzzy Inf Eng 1 35 43

62.

Sebastian

Sabu

Ramakrishnan

2011

Multi-fuzzy topology

Int J Appl Math 24 117 129

63.

Sebastian

Sabu

Ramakrishnan

2011

Multi-fuzzy subgroups

Int J Contemp Math Sci 6 365 372

64.

Sebastian

Sabu

Ramakrishnan

2011

Multi-fuzzy extensions of functions

Advances in Adaptive Data Analysis 3 339 350

65.

Sadegh-Zadeh

1999

Fundamentals of clinical methodology: 3. Nosology

Artificial Intelligence in Medicine 17 87 108

66.

Sadegh-Zadeh

2000

Fuzzy genomes

Artificial Intelligence in Medicine 18 1 28

67.

Sadovsky Michael

2003

The method to compare nucleotide sequences based on the minimum entropy principle

Bulletin of Mathematical Biology 65 309 322

68.

Saha

Maulik

Bandyopadhyay

Plewczynski

2012

Fuzzy clustering of physicochemical and biochemical properties of amino acids

Amino Acids 43 583 594

69.

Shannon

1948

A mathematical theory of communication

The Bell System Technical Journal 27 379 423

70.

Shen

Chou

2005 a

Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo amino acid composition to predict membrane protein types

Biochemical and Biophysical Research Communications 334 288 292

71.

Shen

Chou

2005 b

Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition

Biochemical and Biophysical Research Communications 337 752 756

72.

Shen

Chou

2006

Ensemble classifier for protein fold pattern recognition

Bioinformatics 22 1717 1722

73.

Shen

Yang

Chou

2006

Fuzzy KNN for predicting membrane protein types from pseudo amino acid composition

Journal of Theoretical Biology 240 9 13

74.

Shen

Chou

2007 a

Hum-mPLoc: An ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple sites

Biochemical and Biophysical Research Communications 355 1006 1011

75.

Shen

Chou

2007 b

EzyPred: A top-down approach for predicting enzyme functional classes and subclasses

Biochemical and Biophysical Research Communications 364 53 59

76.

Shen

Chou

2007 c

Signal-3L: A 3-layer approach for predicting signal peptide

Biochemical and Biophysical Research Communications 363 297 303

77.

Tang

2000

Evaluation of some DNA cloning strategies

Computers Math Applic 39 43 48

78.

Torres

Nieto

2003

The fuzzy polynucleotide space: Basic properties

Bioinformatics 19 587 592

79.

Torres

Nieto

2006

Fuzzy logic in medicine and bioinformatics

Journal of Biomedicine and Biotechnology article ID 91908

80.

Wang

Yang

Chou

2006

Using stacked generalization to predict membrane protein types based on pseudo amino acid composition

Journal of Theoretical Biology 242 941 946

81.

Xiao

Shao

Huang

Chou

2006

Using pseudo amino acid composition to predict protein structural classes: Approached with complexity measure factor

Journal of Computational Chemistry 27 478 482

82.

Xiao

Wang

Chou

2011

GPCR-2L: Predicting G protein-coupled receptors and their types by hybridizing two different modes of pseudo amino acid compositions

Mol Biosyst 7 911 919

83.

Wen

Deng

Chou

2014

iNitro-Tyr: Prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition

PLoS ONE 9 e105018

84.

Zaus

1999

Crisp and Soft Computing with Hypercubical Calculus

Physica-Verlag

Heidelberg

85.

Zhao Si

2015

An overview of the prediction of protein DNA-binding sites

International Journal of Molecular Sciences 16 5194 5215

86.

Zheng

Wang

J-T

Liu

Chen

Jiang

S-W

2013

Advances in the techniques for the prediction of microRNA targets

International Journal of Molecular Sciences 14 8179 8187

87.

Zhou

Chen

Zou

2007

Using Chou’s amphiphilic pseudo amino acid composition and support vector machine for prediction of enzyme subfamily classes

Journal of Theoretical Biology 248 546 551

88.

Urban

2013

Whole-genome sequencing in pharmacogenetics

Pharmacogenomics 14 345 348
