An Efficient Approach to Explore and Discriminate Anomalous Regions in Bacterial Genomes Based on Maximum Entropy

Abstract

Recently, there has been an increase in the number of whole bacterial genomes sequenced, mainly due to the advancing of next-generation sequencing technologies. In face of this, there is a need to provide new analytical alternatives that can follow this advance. Given our current knowledge about the genomic plasticity of bacteria and that those genomic regions can uncover important features about this microorganism, our goal was to develop a fast methodology based on maximum entropy (ME) to guide the researcher to regions that could be prioritized during the analysis. This methodology was compared with other available methods. In addition, ME was applied to eight different bacterial genera. The methodology consists of two main steps: processing the nucleotide sequence and ME calculation. We applied ME to Xanthomonas axonopodis pv. citri 306 (XAC) and Xanthomonas campestris pv. campestris ATCC 33913 (XCC), both of which have their anomalous regions well documented. We then compared our results against those from Alien Hunter, HGT-DB, Islander, IslandPath, and SIGI-HMM. ME was shown to be superior in terms of efficiency and analysis duration. Besides, ME only needs the genome sequence in FASTA format as input. The proposed strategy based on ME is able to help in bacterial genome exploration. This is a simple and fast strategy for individual genomes in comparison with other available methods, without relying on previous annotation and alignments. This methodology can also be a new option in the early stages of analysis of newly sequenced bacterial genomes.

1. Introduction

Nowadays, efficient whole genome sequencing techniques yield a large volume of data at increasingly faster rates (Loman et al., 2012). This expansion of genomes available in public databases, in particular from the bacteria domain (Pagani et al., 2012), does not necessarily reflect scientific development. In addition, as discussed by Baker (2010), the advancement of next-generation sequencing leads to the necessity to rethink the bioinformatic analysis tools so that it can follow this progress by improving its performance in the shortest time possible. Therefore, novel and efficient whole-genome analysis strategies are needed in light of the rapidly growing volume of data available.

One way to improve genome analysis is to consider its organization in terms of local nucleotide distribution. More specifically, the distribution of oligonucleotides in bacterial chromosomes is relatively stable along a DNA strand and it is particular for each organism (Karlin and Burge, 1995; Bohlin and Skjerve, 2009). This information can be used to analyze the genome and may lead to important knowledge about the organism. This approach is particularly relevant due to bacterial genome plasticity, where atypical regions may consist of exchanged DNA acquired by horizontal gene transfer (HGT), bacteriophages, and transposable elements, which can bring some adaptive advantages to them (Casjens, 1998; Monteiro-Vitorello et al., 2005; Darmon and Leach, 2014).

Great effort has been spent in attempts to identify anomalous regions in prokaryotes. For instance, Garcia-Vallve et al. (2003) developed a database, HGT-DB, of putative horizontally transferred genes in complete bacterial genomes using codon usage and G + C content. Another important approach is the IslandPath, which corresponds to a web-accessible service to identify genomic islands (GIs) based on %G+C and dinucleotide bias (Hsiao et al., 2003). ISlander is another database of integrative islands identified from tRNA genes (Mantri and Williams, 2004). Later, Waack et al. (2006) proposed the software SIGI-HMM to predict GIs in prokaryotic genomes using hidden Markov models, and Vernikos and Parkhill (2006) introduced the method Interpolated Variable Order motifs, known also as Alien Hunter, with the same objective.

Therefore, knowing that anomalous genomic regions can provide specific and important features that distinguish closely related bacterial lineages and that each available approach used to identify them has different characteristics and methods, the challenge is improving analysis efficiency (EF) in terms of performance, time, and complexity. Thus, our aim was to develop a simple and effective approach based on maximum entropy (ME) to quickly reveal anomalous regions in bacterial genomes.

The entropy (Shannon, 1948) can be understood as a measure of the uncertainty of a system, in this case, the genome. ME was originally proposed by Jaynes (1957a, b) in statistical mechanics, which is obtained from the probability distribution derived by selecting the higher uncertainty, namely ME (Guiasu and Shenitzer, 1985).

It is important to note that the ME-based analysis sole requirement is the genome sequence in FASTA format as input. There is no need for annotations or alignment concepts. Besides, genome sequences already annotated can be used to assess which proteins are encoded for in each region.

2. Methods

The ME methodology stands out from currently available approaches due to its simplicity and reduced computational time, consisting of basically two main steps: sequence processing and ME calculation.

2.1. Processing the sequences

The genome (G) may be defined as a finite sequence containing the distribution of four bases [G = (A, C, G, T)]. The aim is to identify regions with varying oligonucleotide frequencies within the individual chromosome. The analysis is performed according to the size of the target region (2 or 5 kb, for example) and the size of the sliding window, which can be defined as 2, 4, or 6 nucleotides long.

A sliding window is applied to each target genomic region and this slides along the sequence one base at time, accounting for the frequency of each mono-, di-, or trinucleotide that appears adjacent to the other mono-, di-, or trinucleotide. The values for each region are stored in an adjacency matrix and these are used for ME calculation. Figure 1a shows an example of how this process is carried out.

FIG. 1.

Genomic sequence analysis according to the maximum entropy (ME) method. (a) Diagram illustrating how genomic sequences are processed and oligonucleotide frequency calculated. In this example, a six-nucleotide-long sliding window (underlined sequence) is applied to the target genomic region. The method calculates how many times the trinucleotide AAA appears adjacent to the subsequent trinucleotide (i.e., AAA, AAC, AAG, or AAT) by moving the sliding window along the DNA sequence one nucleotide at a time to include all possible combinations (nodes and edges). The process is repeated for each trinucleotide and values are displayed in an adjacency matrix. G = genomic sequence. (b) ME is calculated from frequencies obtained in the previous step (a). Each oligonucleotide sequence (six in the above example) is assigned a number representing how many times it appears in the genomic sequence. In this case, it generated 63 ME values, exemplifying, in the ME₁, AAA frequency represents H₁ and the sum of the other values represents H₂. In the ME₂, the sum of the AAA and AAC frequencies represents H₁ and the sum of the other values represents H₂, and so on. Note that t in equations corresponds to a different position of H₁ e H₂ in each round (ME₁, ME₂, etc.). At the end, we will have 63 ME values, where the higher value (ME_max) is chosen to represent the genomic region in analysis. This process is applied to all genomic regions of the bacterial genome and the ME values representing each one are displayed in a graphic.

2.2. ME calculation

The ME method is then applied to each adjacency matrix generated in the previous step (Processing the sequences). Thus, the method first assigns an ME value to each genomic coordinate, and then combines these values to give a signature ME pattern for each genomic region (Fig. 1b). According to Jaynes (1957a, b), ME is mathematically defined as follows: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} { H_1 } ( t ) = - \mathop \sum \limits_ { i = 0 } ^t { \left( { { \frac { h ( i ) } { \mathop \sum \limits_ { j = 0 } ^t { h ( j ) } } } \log \; ( 2 ) { \frac { h ( i ) } { \mathop \sum \limits_ { j = 0 } ^t { h ( j ) } } } } \right) } \tag { 1 } \end{align*} \end{document}

and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} { H_2 } ( t ) = - \mathop \sum \limits_ { i = t + 1 } ^ { { i_ { max } } } { \left( { { \frac { h ( i ) } { \mathop \sum \limits_ { j = t + 1 } ^ { { i_ { max } } } { h ( j ) } } } \log \; ( 2 ) { \frac { h ( i ) } { \mathop \sum \limits_ { j = t + 1 } ^ { { i_ { max } } } { h ( j ) } } } } \right) } \tag { 2 } \end{align*} \end{document}

where t corresponds to the position in the window in each round of analysis (ME₁, ME₂, etc) and h(i) is the frequency of all oligonucleotides present in the first part, i is the position of the oligonucleotide set, and h(j) is the total oligonucleotide frequency (Fig. 1b). The ultimate ME value is then calculated as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} M{E_t} = {H_1} + {H_2} \tag{3} \end{align*} \end{document}

Figure 1b shows a representation of this process where, at the end, ME is used to represent the nucleotide distribution of each genomic region.

Once ME values along the chromosome have been calculated, their distribution can be visualized graphically. It is important to emphasize that the smaller the window size, the greater the variation in the graphic resolution will be. With that in mind, we evaluated the system's performance using different target genomic window sizes (2, 5, 10, and 50 kb). Furthermore, for each tested genomic window size, we also evaluated different word sizes (2, 4, and 6 nucleotides) to assess the ability of the ME method to identify anomalous regions using each parameter.

Considering that we are looking for regions with the highest ME values (anomalous regions), values around and below the median can be considered more stable. In the present study, we selected to explore within the ME distribution, and ME values greater or equal to the third quartile (Q₃) are indicated as putative anomalous regions (PAR = ME ≥Q₃).

The results are displayed graphically as a line chart with the ME distribution along the genome, their coordinates, and the predicted anomalous regions (Fig. 2).

FIG. 2.

A simplified representation of the genomic sequence analysis methodology developed herein based on ME proposed to identify anomalous regions. The analysis consists of two main steps: a sequence processing step during which the target genome sequence, in FASTA format, is analyzed using a 2, 4, or 6-nucleotide-long sliding window, followed by the ME calculation step. The result is a graphic representation of the ME values' distribution along the target sequence. Sequences with ME values above the upper quartile (Q₃) were considered for further analyses. Besides, for genomes with annotation, Supplementary figures can be obtained. These include a list of the proteins encoding genes within the anomalous regions and information about the possible presence of ribosomal RNA-coding sequences within the anomalous regions.

2.3. Evaluating the proposed methodology

The following genomes were used to evaluate the proposed approach: Xanthomonas axonopodis pv. citri 306 (XAC) and Xanthomonas campestris pv. campestris ATCC 33913 (XCC). The genomes from these organisms have had their anomalous regions well characterized (Van Sluys et al., 2002; Lima et al., 2005, 2008). Thus, the coding DNA sequences (CDS) within their anomalous regions were defined as true positives by default and the other regions were defined as true negatives. From here, it was possible to calculate the sensitivity (% − SN) and specificity (% − SP) for each analysis and, consequently, the efficiency (% − EF: (SN + SP)/2) and F-score (%) (Powers, 2011). Furthermore, these genomes were also used to evaluate the competing methods (Alien Hunter, HGT-DB, Islander, IslandPath, and SIGI-HMM) and to compare their results against the proposed ME approach. All the analyses were performed on the same computer (Intel Core i3-540 Processor and 8 GB RAM) to illustrate the reduced requirements in terms of computational power and execution time to perform the experiments efficiently. The methods available only as database (HGT-DB and Islander) were performed in the web and results downloaded from its websites.

2.4. Applying the ME methodology to other chromosomes

For the ME approach evaluation, besides Xanthomonas, another seven bacterial genera were analyzed: Bradyrhizobium, Rhizobium, Sinorhizobium, Burkholderia, Escherichia, Haemophilus, and Pseudomonas. Next, to confirm that ME can discriminate anomalous regions in the chromosome, we randomly selected regions of low, median, and high ME values from one organism of each chosen species to verify which of these regions are shared within the same species and between species of the same genus. To this end, we performed BLASTn and we compared the coverage (%) of each selected region against the other organisms from the same genus (Altschul et al., 1990). The coverage corresponds to the percentage of the selected regions that are found in other bacterial genomes belonging to the same genus and species.

The list of bacterial species used can be found in Supplementary Figure S1. All the analyzed chromosomes were downloaded from NCBI database (ftp://ftp.ncbi.nlm.nih.gov/genomes).

3. Results and Discussion

Due to the ease with which prokaryotic genomes can be sequenced nowadays, the volume of newly generated data is on the rise at accelerated rates. To accompany this increase, several nucleotide sequence analysis tools have been developed, each with different objectives. The present study proposes the use of ME as an alternative approach to explore individual bacterial genomes, focusing on the identification of anomalous regions. The anomalous regions could contain distinctive features of organisms, thereby providing important information in a fast and effective way. Furthermore, it does not need any additional information a priori, for instance, the annotation and alignments, to perform the ME method.

It is important to clarify that ME is used to discriminate anomalous regions that do not necessarily correspond to HGT events, but regions that have a greater discrepancy in the frequency of oligonucleotides. To confirm whether some of these anomalous regions are indeed from HGT events, additional analyses are still warranted. ME may play an important role in the initial stages of this kind of investigation, being useful to prioritize regions to be explored in more detail.

The proposed approach consists of scanning the target genomic sequence using a predetermined sliding window size and calculating the frequency of di-, tetra-, or hexanucleotides within each window. ME is then calculated for each window, and the highest value is chosen to represent the respective region and the resulting ME distribution may be visualized graphically. ME values greater than or equal to Q₃ are considered to be PARs, as explained in the Methods section. To evaluate which would be the best parameters to analyze, we have applied ME to evaluate two genomes of Xanthomonas with different parameter combinations; the results are presented in Figure 3.

FIG. 3.

Evaluation of parameters (window size [kb] and word size [di, tetra, or hexanucleotide]) to be used in the identification of anomalous regions of bacterial genomes based on ME. (a) ME applied to X. axonopodis pv. citri 306; (b) ME applied to X. campestris pv. campestris ATCC 33913. SN, sensitivity; SP, specificity; EF, efficiency and F-score.

The correct identification of atypical regions in bacterial genomes naturally depends on the adopted method and the discriminant criteria because the result can be skewed according to the parameters' chosen settings. Because of this, different methods have been developed to provide researchers with analytical tools that will optimally highlight target genomic features of interest such as alien genetic sequences (Azad and Lawrence, 2011).

On the other hand, visual analysis of the line chart shows that the smaller the window size, the lower the resolution, thereby limiting the researcher's ability to explore and interact with the graph (Supplementary Figs. S2 and S3). Alternatively, iterative analysis cycles using different parameter settings may be performed to find which best fits the dataset. This could facilitate the analysis since with ME, we did not observe efficiency loss (Fig. 3).

Moreover, ME may be considered a fast in silico approach to the universal analysis of bacterial genomes. It spent ∼7 minutes to analyze each Xanthomonas (5 Mb) genome.

Furthermore, we also compared the ME results against other competing methods. A detailed description of these methods can be found in Table 1. Regarding the requirements, just ME and Alien Hunter are independent of annotation. When compared, ME was faster in the analysis of Xanthomonas genomes (Table 1). Another advantage of the ME method is that it will be available as a stand-alone package. IslandPath and SIGI-HMM are faster, but their performance is worse than that of the proposed approach (Fig. 4). Regarding databases (HGT-DB and Islander), the predictions had already been completed, then we only downloaded the results and calculated their efficiency.

FIG. 4.

Currently available methods' performance comparison with regard to predicting alien genes based on EF and F-score. (a) Methods applied to X. axonopodis pv. citri 306; (b) X. campestris pv. campestris ATCC 33913.

Table 1.

Features of the Methods Currently Available to Detect Anomalous Regions in Bacterial Genomes Compared with Maximum Entropy-Based Approaches

Method/software	Availability	Discriminant criterion	Measure	Input	Output	^a Time of analysis
Maximum entropy	Stand-alone package	di-, tetra-, or hexamer	Maximum entropy	FASTA format	Coord/Proteins	7 minutes
Alien_Hunter	Stand-alone package	Octamer frequency	IVOM + HMM	FASTA format	Coord	33 minutes
HGT-DB	Database	Codon usage and G + C content		—	—	—
Islander	Database	Integrative Island–tRNA		—	—	—
IslandPath	Stand-alone package and database	GC and dinucleotide bias		NCBI files (.ppt/.faa/.ffn)	Coords	34 seconds
SIGI-HMM	Stand-alone package	Codon usage	HMM	EMBL or GBK format	GFF/Coords	36 seconds

Mean time spent on analysis of each Xanthomonas genome.

Figure 4 displays a comparative analysis of the methods available with regard to their F-score and efficiency in predicting atypical regions in two Xanthomonas genomes. All analyses were performed on the same computer to standardize any potential computational power-related variables, thereby allowing us to assess the computational requirements of each method on a level ground.

Overall, the ME approach demonstrated a higher predictive efficiency (EF = 82% and EF = 91% in both cases) compared with the competing methods tested herein. Similarly, ME also showed superior results in terms of F-score (Fig. 4). The second most efficient method was Alien Hunter (EF = 54% and EF = 60%), which most closely resembles the proposed methodology, followed by the SIGI-HMM method (Fig. 4). Therefore, considering the performance and computational requirements in terms of predictive power and time, ME provides a better option for this type of analysis because of its ease of use and speed. Furthermore, all it requires is the raw genomic sequence in FASTA format. This means that the ME method may be applied to newly sequenced genomes to expedite the analysis and direct the researchers' attention toward genomic regions presenting higher entropy.

We used the ME methodology to analyze the genomes of XCC ATCC 33913 as well as seven other bacterial genomes to include Bradyrhizobium sp. ORS 278, Rhizobium etli CFN 42, Sinorhizobium meliloti SM11, Burkholderia pseudomallei 1106a, Escherichia coli 536, Haemophilus influenzae Rd KW20, and Pseudomonas aeruginosa DK2. From each chromosome, we randomly selected regions with low ME values, with ME values around the median, and regions with higher ME values to show that regions with higher ME are the least conserved. To this end, we compared the percentage coverage of each selected region between the members of the same species and within the same genus. This procedure was important to verify whether the high ME values correspond to the real possibility of finding characteristics not shared by all. The results can be found in Supplementary Figure S4.

The genus Xanthomonas has been much studied as it encompasses plant pathogenic bacteria of economic importance. Specifically, XCC ATCC 33913 causes black rot in crucifers, its genome comprises a circular chromosome of 5,076,187 bp and no plasmid (da Silva et al., 2002). The ME profile of XCC ATCC 33913 is displayed in the Figure 5a, each point in the line chart corresponds to genomic regions with 50 kb. As we can observe, the ME distribution along the chromosome showed much variation, which corroborates the fact that this genome is rich in transposable elements (da Silva et al., 2002). In addition, this genome has 35 laterally transferred GIs as already described by Lima et al. (2005, 2008), another factor that also explains this variation in entropy values.

FIG. 5.

ME applied to Xanthomonas campestris pv. campestris ATCC 33913 (representative of genus Xanthomonas). (a) ME applied to genome of X. campestris pv. campestris ATCC 33913, regions of 50 kb; (b) coverage of selected genomic regions' interspecies of genus Xanthomonas (13 organisms); (c) coverage of genomic regions' intraspecies of Xanthomonas campestris (four genomes). Information about these bacteria is available in Supplementary Figure S1.

The percentage coverage of selected regions with low, median, and high ME values within other member species of the genus Xanthomonas (Fig. 5b) had, as expected, more variation within the genus than within organisms of the same species Xanthomonas campestris (Fig. 5c). Therefore, we can conclude that regions with low and median ME values are even more conserved within these bacteria. On the other hand, the selected coordinates with the highest ME values correspond to the regions (38, 50, 60, 75, 79) with the greatest degree of variation and less coverage within the species Xanthomonas campestris, which contain, among other landmarks, GIs, transposable elements, and phages (Lima et al., 2005, 2008).

Moreover, it was possible to point out some regions in XCC ATCC 33913, which had high ME values due to the presence of ribosomal genes, as in the case of regions 92 and 99 (Fig. 5a). Because of this, regions that seem more interesting should be further checked since they may not necessarily correspond to the specific characteristics of the respective organism. This situation has been reported in other studies. They found that rDNAs have a specific signature, unlike the host genome, being identified as atypical (Dufraigne et al., 2005). In our methodology, this problem can be solved for annotated genomes where a warning is given. Otherwise, all regions of interest should be carefully checked.

The case was the same when we analyzed the ME profiles from Bradyrhizobium sp. ORS 278, R. etli CFN 42, S. meliloti SM11, B. pseudomallei 1106a, H. influenzae Rd KW20, and P. aeruginosa DK2, where the genomic regions with high ME values had either the lowest coverage or the greatest variation among members of the same species and the same genus (Supplementary Fig. S4).

Particularly in the case of E. coli 536, the ME distribution presented the greatest variation throughout the genome (Supplementary Fig. S4). Despite this, most of the regions with ME values around the median are the most conserved (regions 33, 70, 83, and 94). This observation corroborates reports by previous studies about the members of this species. Interestingly, E. coli is one of the most diverse species of bacteria, with an equivalent of just 20% of the genome being common to all strains (Lukjancenko et al., 2010).

With this brief overview, we showed that the ME-based methodology reported herein may be a starting point in the analysis of genomes, directing the researcher to anomalous regions of potential scientific interest. The computational tool is easily operated, only requiring the target sequence, annotated or not, to be input in FASTA format and the results are derived fast.

4. Conclusions

The experimental results indicate that the proposed methodology, based on ME, is a suitable technique for analysis of individual genomes. In addition, it was possible to discriminate anomalous regions with ease and speed, without requiring previous annotation and comparison with other genomes. The ME method showed better results than the competing methods tested in terms of performance and execution time.

Such strategy may guide researchers in identification of potentially interesting target genomic regions so that research time and effort may be focused on a streamlined exploration of such sequences and the acquisition of important insights from them. Besides, it is the first time that ME is used to analyze bacterial genome sequences. Furthermore, ME can also be a novel approach in the early stages of analysis of newly sequenced genomes.

Footnotes

Acknowledgment

This work was supported by grants Fundação Araucaria (F.M.L.) and CNPq (F.M.L. and M.A.Van Sluys) and FAPESP to M.A.Van Sluys. G.A.B.-C. is the recipient of a PhD scholarship from CAPES (Coordenação de Aperfeiçoamento do Pessoal de Nível Superior).

Author Disclosure Statement

No competing financial interests exist.

References

Altschul

, Gish

, Miller

, et al. 1990. Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

Azad

R.K.

, and Lawrence

J.G.

2011. Towards more robust methods of alien gene detection. Nucl. Acids Res. 39, 56.

Baker

2010. Next-generation sequencing: Adjusting to data overload. Nat. Methods, 7, 495–499.

Bohlin

, and Skjerve

2009. Examination of genome homogeneity in prokaryotes using genomic signatures. PLoS One, 4, 8113.

Casjens

1998. The diverse and dynamic structure of bacterial genomes. Annu. Rev. Genet., 32, 339–377.

da Silva

A.C.

, Ferro

J.A.

, Reinach

F.C.

, et al. 2002. Comparison of the genomes of two Xanthomonas pathogens with differing host specificities. Nature, 417, 459–463.

Darmon

, and Leach

D.R.

2014. Bacterial genome instability. Microbiol. Mol. Biol. Rev., 78, 1–39.

Dufraigne

, Fertil

, Lespinats

, et al. 2005. Detection and characterization of horizontal transfers in prokaryotes using genomic signature. Nucl. Acids Res. 33, 6.

Garcia-Vallve

, Guzman

, Montero

M.A.

, et al. 2003. HGT-DB: A database of putative horizontally transferred genes in prokaryotic complete genomes. Nucl. Acids Res., 31, 187–189.

10.

Guiasu

, and Shenitzer

1985. The principle of maximum entropy. Math. Intell., 7, 42–48.

11.

Hsiao

, Wan

, Jones

S.J.

, et al. 2003. IslandPath: Aiding detection of genomic islands in prokaryotes. Bioinformatics. 19, 418–420.

12.

Jaynes

E.T.

1957a. Information theory and statistical mechanics. Phys. Rev., 106, 620–630.

13.

Jaynes

E.T.

1957b. Information theory and statistical mechanics II. Phys. Rev., 108, 171–190.

14.

Karlin

, and Burge

1995. Dinucleotide relative abundance extremes: A genomic signature. Trends Genet. 11, 283–290.

15.

Lima

W.C.

, Paquola

A.C.

, Varani

A.M.

, et al. 2008. Laterally transferred genomic islands in Xanthomonadales related to pathogenicity and primary metabolism. FEMS Microbiol. Lett., 281, 87–97.

16.

Lima

W.C.

, Van Sluys

M.A.

, Menck

C.F.

2005. Non-gamma-proteobacteria gene islands contribute to the Xanthomonas genome. OMICS, 9, 160–172.

17.

Loman

, Constantinidou

, Chan

, et al. 2012. High-throughput bacterial genome sequencing: An embarrassment of choice, a world of opportunity. Nat. Rev. Microbiol., 10, 599–606.

18.

Lukjancenko

, Wassenaar

, Ussery

2010. Comparison of 61 sequenced Escherichia coli genomes. Microb. Ecol., 60, 708–720.

19.

Mantri

, and Williams

K.P.

2004. Islander: A database of integrative islands in prokaryotic genomes, the associated integrases and their DNA site specificities. Nucl. Acids Res., 32, 55–58.

20.

Monteiro-Vitorello

C.B.

, de Oliveira

M.C.

, Zerillo

M.M.

, et al. 2005. Xylella and Xanthomonas Mobil'omics. OMICS, 9, 146–159.

21.

Pagani

, Liolios

, Jansson

, et al. 2012. The genomes online database (GOLD) v.4: Status of genomic and metagenomic projects and their associated metadata. Nucl. Acids Res., 40, 571–579.

22.

Powers

D.M.W.

2011. Evaluation: From precision, recall and F-Measure to ROC, informedness, markedness & correlation. J. Mach. Learn. Tech., 2, 37–63.

23.

Shannon

C.E.

1948. A mathematical theory of communication. Bell. Syst. Tech. J., 27, 379–423.

24.

Van Sluys

M.A.

, Monteiro-Vitorello

C.B.

, Camargo

L.E.A.

, et al. 2002. Comparative genomic analysis of plant-associated bacteria. Annu. Rev. Phytopathol., 40, 169–189.

25.

Vernikos

G.S.

, and Parkhill

2006. Interpolated variable order motifs for identification of horizontally acquired DNA: Revisiting the Salmonella pathogenicity islands. Bioinformatics, 22, 2196–2203.

26.

Waack

, Keller

, Asper

, et al. 2006. Score-based prediction of genomic islands in prokaryotic genomes using hidden Markov models. BMC Bioinformatics, 7, 142.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.05 MB

0.11 MB

0.20 MB