Abstract
Bioinformatics is a relatively new discipline in which mathematics is applied to the analysis of genetic sequences. The analysis of the genetic material of living organisms, which consists of the nucleic acids DNA and RNA, is of great importance for diagnosis and taxonomy. In the present paper we propose a new methodology for the representation of genetic sequences as fuzzy sets in the $I^{12}$ space which can significantly improve the results of Sadegh-Zadeh and Torres & Nieto. An important characteristic of our proposed methodology is that the location of amino acids along the genetic sequence plays an important role, thus extending in a significant way the computational efficiency advantage of genetic sequence representation. We present some characteristic examples using the newly proposed methodology, in which we calculate the distance and similarity degree of given polynucleotides.
Introduction
The study of genetic sequences is of great importance in biology and medicine, with sequence analysis and taxonomy being two major fields of application of bioinformatics. In this context two basic strategies are commonly employed: a) sequence analysis, i.e. determination of the building blocks of a nucleic acid (nucleotides) and their order in the molecular chain, and b) sequence comparison, used to identify the degree of difference/similarity between polynucleotides, e.g. in order to identify similarity with known viruses.
DNA and RNA are made of codons, i.e. triplets XYZ in which each of X, Y and Z is one of four nucleotides: {T, C, A, G} in the case of DNA and {U, C, A, G} in the case of RNA (A=Adenine, C=Cytosine, G=Guanine, T=Thymine, U=Uracil). DNA sequencing methodologies are important for biology and medicine in a broad range of applications such as molecular cloning, breeding, finding pathogenic genes, and comparative and evolution studies. DNA and RNA are in general very lengthy chains, and with the rapidly increasing number of genome sequences DNA sequencing technologies should be fast and accurate [1]. A sequence similarity search often provides the first information about a new DNA or protein sequence. A search allows scientists to infer the function of a sequence from similar sequences [33]. Several applications concern the prediction of protein DNA-binding sites [85], since interactions between proteins and DNA play an important role in a large number of biological processes such as DNA replication, splicing, and repair. Identification of amino acid residues involved in DNA-binding sites is important for understanding the mechanisms of biological activities. Other applications concern bacterial sequencing [40] and microRNAs (miRNAs), which are small, non-coding, endogenous RNA molecules that play important roles in a variety of normal and diseased biological processes [86] as well as in pharmacogenetics [88]. There are several methodologies for sequence analysis, such as covariance discriminant [6], neural networks [30], support vector machine [31], random forest [38], nearest neighbor [12], K-nearest neighbor [13], and fuzzy K-nearest neighbor [82], to mention just a few.
Fuzzy methods have the advantage of incorporating into the model the uncertainties that exist in the data ([54], [79]). Sadegh-Zadeh [66] demonstrated that nucleic acids (DNA and RNA) can be treated as ordered fuzzy sets in a 12-dimensional space. In this framework the genetic code can be represented in a 12-dimensional space because a triplet codon XYZ has a 3 × 4 = 12 dimensional fuzzy code $(a_1, \ldots, a_{12})$ corresponding to a point in the 12-dimensional fuzzy polynucleotide space $[0, 1]^{12}$. Sadegh-Zadeh [66] and Nieto et al. (see [78], [53] and [54]) introduced the Fuzzy Polynucleotide Space (FPS) based on the principle of the fuzzy hypercube [39]. A polynucleotide consisting of a sequence of $k$ triplets XYZ is a point in the $I^{12 \times k}$ space, where $I = [0, 1]$. However, in [78] the authors mapped a polynucleotide onto the $I^{12}$ space by considering the frequencies of appearance of the nucleotides at the three base sites of a codon in the coding sequence. In that work, using a metric motivated by [44] and [66], the authors calculated distances between polynucleotides and applied their algorithm to the comparison of complete genomes (such as M. tuberculosis and E. coli). Further work has recently been performed using the same principle (see [55]), in which the influence of several metrics on the comparison of polynucleotides was examined. The advantages of this methodology, which in fact reduces the information to a point in $I^{12}$, are:
a) one can compare polynucleotides of very large length in a computationally very efficient way, since the whole information on the location of codons is transformed into frequencies of presence in a 12-dimensional space, and
b) one can apply the algorithm to compare polynucleotides of different lengths, as is the case for genomes of different organisms, since all chains are reduced to a 12-dimensional space.
Another very interesting approach to representing polynucleotides for comparison purposes is the concept of pseudo amino acid (pseAA) composition, which was introduced by Chou (see [8]). In this approach a significant effort has been made to use various digital numbers to represent the 20 amino acids in order to better reflect the sequence-order effects, taking into account physical properties. The pseAA composition was originally introduced to improve the prediction quality for protein subcellular localization and membrane protein type ([8]), as well as for enzyme functional class ([9]). This concept can be employed to represent a protein sequence with a discrete model without completely losing its sequence-order information ([14], [15]). Thus it is particularly useful for the analysis of a large number of complicated protein sequences by means of the taxonomic approach. This methodology has been widely used to study various protein attributes, such as protein structural class ([2, 3], [81], [45], [22]), protein subcellular localization ([19], [14, 15], [74]), protein subnuclear localization ([70, 71], [52]), protein submitochondrial localization ([25]), protein oligomer type ([10]), conotoxin super-family classification ([49], [46]), membrane protein type ([47], [70], [71], [80], [73], [16], [17], [7]), apoptosis protein subcellular localization ([4], [5], [58]), enzyme functional classification ([9], [11], [87], [75]), protein fold pattern ([72]), signal peptide ([18], [76]) and the prediction of mycobacterial protein subcellular locations ([29]). More recent work based on this principle includes [59], [58], [26], [83] and [68]. We point out that the metrics employed to calculate the difference/similarity play an important role in computational biology. Different metrics have been used to study secondary structures (see [51]) or biopolymer contact structures (see [43]).
The approach of Nieto et al. mentioned above, although very interesting from a computational point of view, has some drawbacks, since it does not take into account the detailed location of the amino acids along the polynucleotide chain. A vector defined in a discrete model may completely lose the sequence-order information. Thus it cannot distinguish two polynucleotides that have the same frequencies of appearance of the amino acids but at different locations along the polynucleotide chain. However, it is very important to be able to determine how close two genetic sequences are, since this has many important biological and medical implications (see [20], [32], [34], [37], [41] and [42]).
In the present paper we introduce a new methodology for the representation of genetic sequences in the $I^{12}$ space in which the location of the amino acids along the genome sequence is taken into account along with the frequency of presence. It is thus more effective in distinguishing polynucleotides that have the same number of amino acids located, however, at different positions. To this end we present some characteristic examples comparing the results obtained in [66] and [78] with those obtained using the methodology we propose in the present paper.
The structure of the paper is as follows. In Section 2, we present some known notions for fuzzy sets and the fuzzy hypercube. In Section 3, we give a new representation of genetic sequences in $I^{12}$ and compare the results with the results of [66] and [78]. Finally, in Section 4 the conclusions of the present work are summarized.
Fuzzy sets and fuzzy hypercube
Fuzzy sets
Let $X$ be a set. $A$ is a fuzzy subset of $X$ if $A = \{(x, \mu_A(x)) : x \in X\}$, where $\mu_A$ is a function of $X$ into $[0, 1] = I$, that is, $A$ is the set of all pairs $(x, \mu_A(x))$ such that $x \in X$ and $\mu_A(x)$ is the degree of its membership in $A$.
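In what follows, if $X = \{x_1, x_2, \ldots, x_n\}$ and $A$ is a fuzzy subset of $X$ with $\mu_A(x_i) = a_i$, $i = 1, 2, \ldots, n$, we write $A = (a_1, \ldots, a_n)$.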
For a given set $X = \{x_1, x_2, \ldots, x_n\}$, the set of all fuzzy subsets (of $X$) is precisely the unit hypercube $I^n = [0, 1]^n$, since any fuzzy subset $A$ determines a point $P \in I^n$ given by $P = (\mu_A(x_1), \ldots, \mu_A(x_n))$ (see [39]).
Also, any point $P = (a_1, \ldots, a_n) \in I^n$ generates a fuzzy subset $A$ of $X$ defined by the map $\mu_A : X \to [0, 1]$ such that $\mu_A(x_i) = a_i$, $i = 1, 2, \ldots, n$.
Nonfuzzy or crisp subsets of $X = \{x_1, \ldots, x_n\}$ are given by mappings $\mu_A : X \to \{0, 1\}$, and are located at the $2^n$ corners of the unit hypercube $I^n$.
Hypercubical calculus is developed in [84], and some applications of the fuzzy unit hypercube are given in [54], [65] and [35]. In this context a codon corresponds to a corner of the 12-dimensional unit hypercube $I^{12}$. Any element of $I^{12}$ may be viewed as a fuzzy codon.
DNA and RNA can be treated as a language written using an alphabet of letters. The role of the letters is played by several chemical compounds. In fact the alphabet for DNA is {T, C, A, G} while for RNA it is {U, C, A, G}, where A, C, G, T and U stand for Adenine, Cytosine, Guanine, Thymine and Uracil respectively. In this context, in the case of the RNA alphabet, if U is the first letter of the alphabet one codes it as (1, 0, 0, 0), that is, a four-dimensional multi-fuzzy membership value (see [60], [61], [62], [63], and [64]): 1 because the first letter U is present, 0 since the second letter C does not appear, 0 since the third letter A is not present and 0 since the fourth letter G does not appear. In a similar way C is represented as (0, 1, 0, 0), A as (0, 0, 1, 0) and G as (0, 0, 0, 1). So the codon UCG (serine) would be written in the $I^{12}$ hypercube as:
$$UCG = (1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1).$$
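The coding described above can be illustrated with a short Python sketch. The function names `encode_nucleotide` and `encode_codon` are hypothetical helpers introduced only for this illustration; the letter order U, C, A, G is the one used in the text.

```python
# Letter order follows the text: U, C, A, G.
ALPHABET = "UCAG"

def encode_nucleotide(letter):
    """Return the 4-dimensional 0/1 membership vector of one RNA letter."""
    return [1 if letter == symbol else 0 for symbol in ALPHABET]

def encode_codon(codon):
    """Concatenate the codes of the three letters into a corner of I^12."""
    if len(codon) != 3 or any(c not in ALPHABET for c in codon):
        raise ValueError("expected a triplet over the alphabet U, C, A, G")
    point = []
    for letter in codon:
        point.extend(encode_nucleotide(letter))
    return point

print(encode_codon("UCG"))  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
```

Every crisp codon obtained in this way has exactly three coordinates equal to 1, one in each block of four.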
NTV metric and similarity
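Consider the n-dimensional unit hypercube $I^n$. If $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$ are points of $I^n$, not both equal to $(0, \ldots, 0)$, then the distance between them is defined as
$$d(p, q) = \frac{\sum_{i=1}^{n} |p_i - q_i|}{\sum_{i=1}^{n} \max\{p_i, q_i\}}, \qquad (1)$$
and $d(p, q) = 0$ if $p = q = (0, \ldots, 0)$.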
The distance $d$ is motivated by [44] and [66]. It is known that $d$ is a metric [53], and it has already been employed in [78] and [55]. In [23] (see also [24]) it is proposed to call this metric the NTV metric.
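As an illustration, the following Python sketch computes the NTV distance between two points of the unit hypercube. It assumes the commonly quoted form of the metric, namely the sum of the absolute coordinate differences divided by the sum of the coordinate-wise maxima, with the distance between two zero vectors set to 0. The function name `ntv_distance` is a hypothetical helper.

```python
def ntv_distance(p, q):
    """NTV distance between two points of I^n (assumed form, see lead-in)."""
    if len(p) != len(q):
        raise ValueError("both points must have the same dimension")
    denominator = sum(max(a, b) for a, b in zip(p, q))
    if denominator == 0:
        # Both points are the zero vector; the distance is taken to be 0.
        return 0.0
    numerator = sum(abs(a - b) for a, b in zip(p, q))
    return numerator / denominator

# Crisp codes of the codons UCG and UCA, which differ only at the third base.
ucg = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
uca = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
print(ntv_distance(ucg, uca))  # 0.5
```

In this example the two codes differ in two coordinates and the maxima sum to 4, which gives 2/4 = 0.5.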
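Let $X = \{x_1, \ldots, x_n\}$ be a set and let $A = (a_1, \ldots, a_n)$, $B = (b_1, \ldots, b_n)$, where $a_i, b_i \in [0, 1]$, be two fuzzy sets of $X$. The degree of similarity between $A$ and $B$ (see [55]), denoted by $\mathrm{sim}(A, B)$, is defined as follows. If $A$ and $B$ are not both equal to $(0, \ldots, 0)$, then
$$\mathrm{sim}(A, B) = \frac{\sum_{i=1}^{n} \min\{a_i, b_i\}}{\sum_{i=1}^{n} \max\{a_i, b_i\}}. \qquad (2)$$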
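A minimal Python sketch of the degree of similarity is given below. It assumes that the similarity is the sum of the coordinate-wise minima divided by the sum of the coordinate-wise maxima; the value 1 returned for two zero vectors is an assumption made here for completeness. The function name `similarity` is a hypothetical helper.

```python
def similarity(a, b):
    """Degree of similarity of two fuzzy sets given as membership vectors."""
    if len(a) != len(b):
        raise ValueError("both fuzzy sets must have the same dimension")
    denominator = sum(max(x, y) for x, y in zip(a, b))
    if denominator == 0:
        # Assumption: two empty fuzzy sets are regarded as fully similar.
        return 1.0
    return sum(min(x, y) for x, y in zip(a, b)) / denominator

A = [0.5, 0.5, 0.0, 1.0]
B = [0.5, 0.0, 0.5, 1.0]
print(similarity(A, B))  # minima sum to 1.5, maxima sum to 2.5
```

Since $\max\{a, b\} - |a - b| = \min\{a, b\}$ for any two numbers, this quantity equals one minus the ratio of the summed absolute differences to the summed maxima.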
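Let $r = X_1 Y_1 Z_1 \ldots X_k Y_k Z_k$, where $X_i, Y_i, Z_i \in \{U, C, A, G\}$, be a genetic sequence with $k$ triplets. Hence the sequence has $3k$ letters (nucleotides).
Let $W \in \{U, C, A, G\}$, $i = 1, 2, 3$, and $j \in \{1, \ldots, k\}$. Then, by $W_i^j$ we denote the number which is defined as follows: $W_i^j = 1$ if $W$ appears at the $i$-coordinate of the $j$-triplet, otherwise $W_i^j = 0$. Also, by $W_i$ we denote the following number:
$$W_i = \sum_{j=1}^{k} W_i^j.$$
Nieto and Torres representation of genetic sequences in $I^{12}$
Following the methodology of [78] we calculate the frequencies (fractions) of the nucleotides at the three base sites in order to obtain their fuzzy representation in the $I^{12}$ hyperspace.
To compute the fractions of nucleotides at the three base sites of a triplet we divide $W_i$ by the number $k$ of triplets. So, we have:
$$\frac{W_i}{k} = \frac{1}{k} \sum_{j=1}^{k} W_i^j, \qquad W \in \{U, C, A, G\}, \; i = 1, 2, 3.$$
Finally, they write the genetic sequence $r$ as a point in the hypercube $I^{12}$ as follows:
$$r = \left( \frac{U_1}{k}, \frac{C_1}{k}, \frac{A_1}{k}, \frac{G_1}{k}, \frac{U_2}{k}, \frac{C_2}{k}, \frac{A_2}{k}, \frac{G_2}{k}, \frac{U_3}{k}, \frac{C_3}{k}, \frac{A_3}{k}, \frac{G_3}{k} \right).$$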
Here $U_1 = 2$, since U appears 2 times at the first position of the codons of the sequence,
$U_2 = 0$, since U does not appear at any second position of the codons of the sequence,
and $U_3 = 1$, since U appears 1 time at the third position of the codons of the sequence.
In the same way we obtain the corresponding numbers for C, A and G at the three base sites (see Table 1).
Following the methodology of Nieto and Torres we calculate the frequencies (fractions) of the nucleotides at the three base sites in order to obtain their fuzzy representation in the $I^{12}$ hyperspace. The resulting twelve fractions are given in Table 2.
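The frequency-based mapping can be sketched in Python as follows. This is only an illustration under stated assumptions: the coordinates are ordered as in the crisp code of a codon (the four letters U, C, A, G for the first base site, then for the second, then for the third), the function name `frequency_point` is hypothetical, and the input sequence `UCGAUU` is a made-up example rather than one of the sequences of the paper.

```python
ALPHABET = "UCAG"  # letter order used in the text

def frequency_point(sequence):
    """Map a sequence of k triplets to the point of I^12 whose coordinates
    are the fractions of each letter at each of the three base sites."""
    if len(sequence) == 0 or len(sequence) % 3 != 0:
        raise ValueError("the sequence must consist of whole triplets")
    k = len(sequence) // 3
    counts = [0] * 12
    for j in range(k):
        triplet = sequence[3 * j:3 * j + 3]
        for site, letter in enumerate(triplet):
            # str.index raises ValueError for letters outside the alphabet.
            counts[4 * site + ALPHABET.index(letter)] += 1
    return [count / k for count in counts]

# Hypothetical two-triplet sequence: UCG followed by AUU.
print(frequency_point("UCGAUU"))
# [0.5, 0.0, 0.5, 0.0, 0.5, 0.5, 0.0, 0.0, 0.5, 0.0, 0.0, 0.5]
```

Whatever the number of triplets, the result always has twelve coordinates, and the four coordinates belonging to one base site sum to 1.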
As a result, sequence $r_1$ would be written in the $I^{12}$ space as
A new representation of genetic sequences in $I^{12}$
Let $r = X_1 Y_1 Z_1 \ldots X_k Y_k Z_k$, where $X_i, Y_i, Z_i \in \{U, C, A, G\}$, be a genetic sequence, $W \in \{U, C, A, G\}$, $i = 1, 2, 3$, and $j \in \{1, \ldots, k\}$. In the new representation of a genetic sequence in $I^{12}$ we follow the two steps described below.
As a result, sequence $r$ would be written in the $I^{12}$ hypercube following the new methodology as
In the following we present some applications of the new methodology, calculating the distance between selected polynucleotides and their degree of similarity, and we compare the results with those obtained using the method of Torres & Nieto in order to stress the advantages of the new method. We note that a single prime corresponds to the NTV representation and a double prime to the new representation.
We now describe how one can represent a sequence in the framework of the newly proposed methodology using the number of nucleotides at the three base sites of a codon and the location of the triplets in the genetic sequence. We describe this method by applying it to the sequence $r_1$. We follow the two steps, and the results are summarized in Table 3.
As a result, sequence $r_1$ would be written in the $I^{12}$ hypercube following the new methodology as
The sequence $r_2$ has the characteristic that it differs from $r_3$ in the location of the triplets.
In the $I^{24}$ space these sequences are written:
Following the methodology of [78] we have the corresponding results in Tables 4 and 5.
As a result, sequence $r_2$ would be written in the $I^{12}$ space as
Now, following our new methodology for the sequence $r_2$, we present the corresponding results in Table 6.
As a result, sequence $r_2$ would be written in the $I^{12}$ hypercube as:
In a similar way, following the newly proposed methodology, sequence $r_3$ gives the corresponding results appearing in Table 7.
As a result, sequence $r_3$ would be written in the $I^{12}$ hypercube as:
Using Eq. (1) (respectively, Eq. (2)) we compute the distance (respectively, the degree of similarity) between the sequences $r_2$ and $r_3$ for the above three representations. So, we have:
In the case of the Sadegh-Zadeh representation:
In the case of the Torres & Nieto representation:
In the case of the new proposed representation:
The increase (respectively, decrease) in distance (respectively, in similarity) between polynucleotides is a quantitative measure of the number of differences in the chemical composition and sequential order of the polynucleotides (see [55]).
As we have seen, the calculated distances and similarities are different in the three cases. It is of interest to discuss the observed differences. In the case of the Sadegh-Zadeh representation the sequences appear to be completely different, which is not exactly the case since there is a certain similarity between some of their parts. In the case of the Torres & Nieto representation with the NTV metric they appear to be exactly the same, which is not true since they differ in the location of the amino acids (codons). However, in the framework of the newly proposed representation they have a degree of similarity which in our opinion represents reality more exactly, while at the same time we maintain the advantage of the $I^{12}$ space representation, which keeps the computational cost very low.
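The behaviour of the two earlier representations discussed above can be reproduced with a small Python sketch. The sequences `UCGAAU` and `AAUUCG` are hypothetical (they are not the sequences $r_2$ and $r_3$ of the paper) and were chosen so that the same two triplets appear in swapped order. All function names are illustrative, the distance is assumed to be the sum of absolute differences divided by the sum of maxima, and the similarity is taken as one minus that distance.

```python
ALPHABET = "UCAG"  # letter order used in the text

def triplets(sequence):
    """Split a sequence into codons; its length must be a multiple of 3."""
    if len(sequence) == 0 or len(sequence) % 3 != 0:
        raise ValueError("the sequence must consist of whole triplets")
    return [sequence[j:j + 3] for j in range(0, len(sequence), 3)]

def crisp_point(sequence):
    """Point of I^(12k): one 0/1 block of twelve coordinates per codon."""
    point = []
    for codon in triplets(sequence):
        for letter in codon:
            point.extend(1.0 if letter == s else 0.0 for s in ALPHABET)
    return point

def frequency_point(sequence):
    """Point of I^12: fraction of each letter at each base site."""
    codons = triplets(sequence)
    counts = [0] * 12
    for codon in codons:
        for site, letter in enumerate(codon):
            counts[4 * site + ALPHABET.index(letter)] += 1
    return [c / len(codons) for c in counts]

def ntv_distance(p, q):
    """Assumed form: sum of |p_i - q_i| over sum of max(p_i, q_i)."""
    denominator = sum(max(a, b) for a, b in zip(p, q))
    if denominator == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(p, q)) / denominator

first, second = "UCGAAU", "AAUUCG"  # hypothetical: same codons, swapped
d_crisp = ntv_distance(crisp_point(first), crisp_point(second))
d_freq = ntv_distance(frequency_point(first), frequency_point(second))
print(d_crisp, 1 - d_crisp)  # 1.0 0.0 in I^24
print(d_freq, 1 - d_freq)    # 0.0 1.0 in I^12
```

For this pair the position-wise comparison in $I^{24}$ sees no common coordinate at all, whereas the frequency representation cannot tell the two sequences apart.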
The sequences $r_4$, $r_5$ and $r_6$ have three triplets with the same frequency of presence of the amino acids but at different locations.
Following the methodology of Torres and Nieto [78], the corresponding results for $r_4$ appear in Tables 8 and 9.
As a result, following the methodology of Torres & Nieto, sequence $r_4$ can be written in the $I^{12}$ space as:
sequence $r_5$ as:
and sequence $r_6$ as:
This is a characteristic example of a case where many sequences correspond to the same representation in the framework of the Torres and Nieto representation, although they differ in the location of the amino acids along the polynucleotide chain.
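That a frequency-based point does not depend on the order of the triplets can be checked for every reordering at once. The sketch below uses three hypothetical triplets (not those of $r_4$, $r_5$ and $r_6$) and a hypothetical helper `frequency_point` that computes the fraction of each letter at each base site.

```python
from itertools import permutations

ALPHABET = "UCAG"  # letter order used in the text

def frequency_point(sequence):
    """Fractions of each letter at each of the three base sites."""
    k = len(sequence) // 3
    counts = [0] * 12
    for j in range(k):
        for site, letter in enumerate(sequence[3 * j:3 * j + 3]):
            counts[4 * site + ALPHABET.index(letter)] += 1
    return tuple(c / k for c in counts)

codons = ["UCG", "AAU", "GGC"]  # hypothetical triplets
sequences = ["".join(order) for order in permutations(codons)]
points = {frequency_point(s) for s in sequences}
print(len(sequences), len(points))  # 6 1
```

All six orderings of the three triplets are mapped to one and the same point of $I^{12}$, so a representation based on frequencies alone cannot separate them.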
Now, following our new methodology for sequence $r_4$, we obtain the corresponding results appearing in Table 10.
As a result, sequence $r_4$ would be written in the $I^{12}$ hypercube as
Similarly, for the sequence $r_5$ we have the corresponding results in Table 11.
As a result, sequence $r_5$ would be written in the $I^{12}$ hypercube as
Now, following our new methodology for the sequence $r_6$, we obtain the results appearing in Table 12.
As a result, sequence $r_6$ would be written in the $I^{12}$ hypercube as
We see that the three sequences have different representations which is the advantage of the new methodology.
Using Eq. (1) (respectively, Eq. (2)) we compute the distance (respectively, the degree of similarity) between the sequences $r_4$, $r_5$ and $r_6$ for the above representations. So, we have:
In the case of Torres & Nieto the representations of the sequences are the following:
And the resulting distances and similarities are
Thus the three representations appear to be exactly the same. In the case of the new proposed methodology the representations of the sequences are:
And the resulting distances and similarities are
Again we see that in the case of Torres & Nieto the sequences appear to be exactly the same, which does not reflect the underlying reality since differences exist in the location of the codons. On the other hand, in the framework of the newly proposed representation one can differentiate among the three sequences by taking into account the location of the codons.
Conclusions
In the present paper we propose a new method of representation of genetic sequences as fuzzy sets in the $I^{12}$ space which extends the methods originally introduced by Torres & Nieto and by Sadegh-Zadeh. In this new representation the location of the amino acids in the genetic sequence plays an important role in the resulting representation, along with the frequency of presence of the codons.
We demonstrate the utility of the new methodology by applying it to some simple but characteristic cases of polynucleotides, together with the methods of Torres & Nieto and of Sadegh-Zadeh. The comparison of our results with the results of [66] and [78] shows that the new method gives a better image of genetic sequences as fuzzy sets in the $I^{12}$ space, since it maintains the computational efficiency of the method of Torres & Nieto while at the same time making it possible to differentiate polynucleotides that appear to be the same in the previous formulation. Further studies are in progress in order to investigate in more detail the properties of these notions and their biological implications, as well as their combination with Chou's pseudo amino acid concept, which also takes into account several physical properties related to codons.
