Automation on the Generation of Genome-Scale Metabolic Models

Abstract

Nowadays, the reconstruction of genome-scale metabolic models is a nonautomated, interactive process based on decision making. This lengthy process usually requires a full year of one person's work in order to satisfactorily collect, analyze, and validate the list of all metabolic reactions present in a specific organism. In order to write this list, one has to go manually through a huge amount of genomic, metabolomic, and physiological information. Currently, there is no optimal algorithm that allows one to automatically go through all this information and generate the models taking into account the probabilistic criteria of unicity and completeness that a biologist would consider. This work presents the automation of a methodology for the reconstruction of genome-scale metabolic models for any organism. The methodology is the automated version of the steps implemented manually for the reconstruction of the genome-scale metabolic model of a photosynthetic organism, Synechocystis sp. PCC6803. The steps for the reconstruction are implemented in a computational platform (COPABI) that generates the models from the probabilistic algorithms that have been developed. To validate the robustness of the developed algorithm, the metabolic models of several organisms generated by the platform have been studied together with published models that have been manually curated. Network properties of the different models, such as connectivity and average shortest path, have been compared and analyzed.

1. Background

Since the second half of the twentieth century, the development of molecular biology has allowed fast advances in the understanding of the functions and working principles of cells and unicellular organisms at the molecular level. In particular, high-throughput experimental techniques for sequencing and for the analysis of genomic and proteomic information have given rise to rich web-based databases of biological information on thousands of organisms, from prokaryotic bacteria to complex organisms like birds and mammals.

One of the new research fields that emerges from this panorama is systems biology (Snoep et al., 2006): the bottom-up approach to quantitatively explaining the properties of biological systems from the modeling and simulation of the interactions and characteristics of its macromolecular components. From these systematic studies based on mathematical modeling and computational simulations, a new discipline appears: synthetic biology, which is focused on the design and construction of “a la carte” new biological entities with new biological functions (Heinemann and Panke, 2006). This new field of biotechnology lies in the limit between biology and engineering and aims at the partial design of modified organisms for different technological applications. For this purpose, rational design principles of engineering must be combined with the available biological information and biotechnological techniques. The huge amount of biological information and the complexity of the engineering principles and analysis tools make evident the need for a good computational platform that aids in the data mining and design development for artificial biological systems.

One of the cornerstones of systems biology is the reconstruction of genome-scale metabolic models. This means gathering information on all enzymatic reactions that take place in an organism, based on the genetic information available for this organism's genome. The current state of the art of this process requires the effort of a specialist over a long period (usually one year) in order to collect the available information from many different databases and the literature. Currently there are few software applications specially designed to help in this specific task. Moreover, the available software applications are restricted to a few specific tasks and do not cover all steps of the whole process, leaving huge gaps that must be filled manually by the researcher.

The genome-scale metabolic reconstruction is the starting point of many different lines of research and applications, like the determination of metabolic capacities or of protocols for an optimal growth strategy for some organism, and in particular, the search for potential sites for metabolic engineering (Oberhardt et al., 2009; Patil et al., 2004). The aim of metabolic engineering is the modification and/or introduction of biochemical reactions with technologies like recombinant DNA, to optimize the production of some metabolites of interest in order to redirect metabolic fluxes to new pathways or even extend the metabolic capacities of an organism for the production of new metabolites. The accomplishment of these objectives for a specific organism depends on a good reconstruction of its metabolism, from which one can study the structure of the metabolic network and the consequences of adding or deleting specific genes (Edwards et al., 2001).

In this context, several projects have been developed for the reconstruction of genome-scale metabolic models with different ends, like the production of fuel from cyanobacteria (Montagud et al., 2010) or yeast. Other examples are the genome-scale metabolic reconstructions of Burkholderia cenocepacia J2315 (Fang et al., 2011), for research on the treatment of patients suffering from cystic fibrosis; Rhodobacter sphaeroides (Imam, 2011), capable of producing hydrogen, polyhydroxybutyrate, and other biofuels; and Clostridium beijerinckii (Milne, 2011), capable of producing butanol.

Genome-scale metabolic models result from the integration of genomic, proteomic, and metabolomic information obtained at different experimental levels. The study of these theoretical reconstructions of cell metabolism allows researchers to investigate emergent phenomena in biology, like the feedback control loops that regulate the organisms and other aspects of metabolic and genetic transcription and regulation. The metabolic models constitute an important tool for the comprehension of an organism, its metabolic capacities, and prediction of its responses to different environmental and genetic changes. Moreover, they facilitate the development of strategies for the engineering of metabolic systems focused on the improvement of metabolic efficiency.

The process of generating the models starts by collecting all available information on the metabolome of a species, as well as all genes that code for the enzymes that catalyze each of the metabolic reactions taking place in the cell. Other aspects that must be taken into account are the coenzymes and cofactors needed for the enzymatic catalysis, the stoichiometry and reversibility of the reactions, and information on the biomass composition and metabolic regulation (Förster et al., 2003). Among all possible applications of a metabolic model, there is the possibility of evaluating projects for the production and optimization of a metabolite of interest (Oberhardt et al., 2009). If a model is satisfactorily constructed, it should allow a realistic simulation of the organism's metabolism when submitted to different environmental and genetic perturbations. This simulation would represent, with its natural restrictions, a virtual organism or an “in silico” cell to which one can apply different computational algorithms to explore possible flux distributions inside the cell under different environmental conditions and genetic configurations (Montagud et al., 2010). For the analysis of metabolic models, there are several computational tools and algorithms already developed (Edwards et al., 1999). Those include flux balance analysis (FBA) (Varma and Palsson, 1993; Edwards et al., 1999), minimization of metabolic adjustments (MOMA) (Segre et al., 2002), and metabolic flux analysis (MFA) (Schilling et al., 1999; Varma and Palsson, 1994), among others.

Currently, researchers who work on genome-scale metabolic reconstruction use different computational tools in order to accomplish different tasks. For example, in order to generate the list of biochemical reactions for a specific organism, the software Pathway Tools (Karp et al., 2002) is a common choice. This software allows the compilation of specific databases for proteins, enzymes, and metabolites of a given organism, and produces a rough draft of the network. Nevertheless, it might be said that this software obtains the list of reactions with no regard to associations between metabolic routes and the reactions, or to completeness and unicity criteria for such models. These drawbacks leave a considerable amount of work to be done manually by the biologist, who has to consider the reversibility of all reactions and the analysis of possible repeated reactions inside the model, as well as the inclusion of reactions studied for other organisms that complete specific metabolic pathways but do not show up in the database due to gaps in the genomic annotation for the organism at hand. Other specific software might be found with different computational algorithms implemented, like the OptGene software, which includes an evolutionary programming-based method to rapidly identify gene deletion strategies for optimization of a desired phenotypic objective function (Patil et al., 2005).

The lengthy work of reconstructing a genome-scale metabolic network and analyzing it would be much faster and greatly simplified if one could find all the algorithms and computational tools needed in the same software or platform. The aim of this work is to present a platform developed by our research group in order to automatically generate genome-scale metabolic models. With respect to other software that might be found for this purpose, our platform has the advantage that it automatically takes into account the criteria for unicity of the biochemical reactions and gives the biologist the possibility of automatically completing gapped metabolic pathways based on probabilistic criteria and on the comparison of the same metabolic route in different organisms. Moreover, the platform produces the generated metabolic network in different outputs: either as an SBML file or directly in the OptGene file format, which might be directly piped into other analysis software. For the generation of the OptGene file, the biologist is also given the possibility to choose the biomass composition among the metabolites appearing in the model and to directly fill in the restrictions for the flux balance analysis.

The work is divided as follows. In the next section, we explain the algorithm for the generation of genome-scale metabolic models implemented in a web-based platform. The algorithm automatically obtains information from the KEGG database for a specific organism and constructs from it the list of reactions in its metabolism. Criteria of unicity and completeness are taken into account in order to cope with different enzymes catalyzing the same reactions or to fill in missing reactions. Section 3 presents an analysis done with the models generated by the platform. The generated OptGene files are used as inputs in different algorithms to study network properties of the reconstructed metabolic models and compare them with manually reconstructed models taken from the literature. In the end, we present our conclusions, an overview, and future perspectives.

2. Methods

The algorithm for automatically generating metabolic models comprises several steps: the compilation of information from free-access biological databases; the interaction of the user with the platform in order to properly select the parameters for the probabilistic criteria and the choices for the biomass components and restrictions; and finally, the application of the unicity and completeness criteria and the production of the output.

2.1. Obtaining the biological information

In recent years, applications of biotechnology in different areas of science and technology have considerably increased, causing an exponential growth in the available information about the genetics, regulation processes, and metabolism of different organisms. Such information, obtained by different techniques with growing efficiency, becomes part of huge databases, many of which are free to access. This information, together with the vast published scientific work, puts in the hands of researchers a rich, ever-growing amount of data and information.

A starting point for the genome-scale metabolic reconstruction is to obtain the relevant information about the organism for which the model is going to be generated, namely the list of reactions, genes, metabolites, and enzymes present in the studied cell. This information is available from public free-access databases like Biocyc (Karp et al., 2005), KEGG (Kanehisa et al., 2008), Brenda (Chang et al., 2009), Uniprot (Uniprot, 2008), etc. Nevertheless, the lack of quality in some database entries is an inconvenience that one must amend: false positives, false negatives, and wrongly annotated objects may pose obstacles to the compilation of a meaningful and correct list of reactions (Weise et al., 2006). As a consequence, the reconstruction must be done under strict control of all the reactions, the biomass equation must be based on constituent molecules, and coherence and integrity of the network must be prerequisites for the generation of a useful, high-quality model (Feist et al., 2009).

The first objective of the project is to obtain the relevant raw information. For this aim, we have chosen to use the KEGG API web service offered by the KEGG database. This service allows one to access the KEGG system via SOAP/WSDL, which offers valuable tools for accessing the information available in the above-mentioned database. These tools are designed for searching cellular biochemical processes as well as for analyzing the universe of genes and completed genomic sequences of thousands of organisms. Users access the KEGG API server with the simple object access protocol (SOAP) technology over HTTP.

SOAP defines how two objects in different processes may communicate through the exchange of XML (extensible markup language) data for a variety of bioinformatic applications. With this protocol, an application running on a machine anywhere in the world can use algorithms, data, and resources stored on different servers (Snell et al., 2001). Web services description language (WSDL) is based on XML and provides the description of a web service, specifying the abstract interface through which a client may access the service and the details on how to use it (Christensen et al., 2001).

These technologies allowed us to access the KEGG API and to construct a web service client using Java (Horstmann, 2010) as the programming language and Netbeans (Myatt, 2008) 6.8 as the integrated development environment. In this way, we obtained the biological information and, by defining a directional data model that relates the biological elements, stored it in a database built in Postgres (Smith, 2010; Bourne, 2005; Altman, 2004), taking into account the relationships among the elements and their importance in the reconstruction of the genome-scale metabolic models. The relevant biological elements identified at this point are the organism, gene, pathway, enzyme, reaction, compound (metabolites and glycans), and the references. In Figure 1, we show a schema of how information is organized and related in the KEGG database, as well as the methods to obtain the relationships between the different elements.

FIG. 1.

Scheme of KEGG information with the WSDL method for obtaining information. The same methods that appear referencing compounds also exist for glycans (which are not shown in the figure).

2.2. Computational platform for the access of biological information (COPABI)

All the biological databases available on the internet offer a web application to display their information, as well as methods for exporting this information in different forms. The first aim of our project is to implement an application that allows biologists to consult the relevant information from the database and automatically generate metabolic models from it, where probabilistic criteria for unicity and completeness are taken into account in order to generate more accurate, higher-quality models in an efficient way. The platform resulting from the project is named COPABI (from computational platform for the access of biological information).

Every web application requires for its publication a web server, which is responsible for taking client requests and answering them. In the case of COPABI, the chosen web server application is Apache, an open-source code that works under any platform and has become one of the best web servers in terms of efficiency, functionality, and speed (Tong, 2008).

In order to select the tools for the implementation of COPABI, it has been taken into account that among the possible web applications there are two major groups: the languages that run from the side of the server, like PHP, and the ones that run from the side of the client, like HTML (Castro, 2006), Javascript (Stefanov, 2010), etc. In the case of COPABI, PHP has been chosen because it is an open-source and very popular programming language, especially in web development (Gutmans et al., 2004; Schlossnagle, 2007). As web application framework (WAF), we have used Codeigniter 1.7.3 (Upton, 2007). The aim of this framework is to aid in the development of code, from a rich sample of libraries, a simple interface, and a logical structure for accessing the libraries. Javascript has been used for interactivity of the pages and for validations and the completion of data in the search for biological information.

2.3. Formats for the output

Different analysis tools in systems biology make use of different file formats. Therefore, a good choice for the output file format will be determined by the uses and utility of the generated models. Some examples of file formats used by different databases are SBML (Systems Biology Markup Language) (Hucka et al., 2003), FASTA (Fast All) (LLC Books, 2010), and BLAST (Basic Local Alignment Search Tool) (Camacho et al., 2008); KEGG presents its information in a particular markup language called KGML (Kanehisa and Goto, 2000).

The lack of a standard makes it more difficult to use different software tools, which usually have specific formats for their inputs and outputs. Sometimes the user has to go through the lengthy work of converting the information from one format to another. One of the most versatile formats is certainly SBML, since it is a description language based on XML that can be used to represent models of different biological processes like metabolic networks, cellular signaling pathways, and genetic regulation networks, among others. A very useful file format specific to metabolic networks is the one that follows the input requirements of the OptGene software (Patil et al., 2005), also called BioOpt and used in the BioMet toolbox (www.sysbio.se/biomet).

For the COPABI platform, these two output file formats have been chosen: the standard SBML level 2, version 1, and the OptGene file format, which can be readily used for flux balance analysis.

2.4. Construction of the models

For the generation of metabolic models for any organism, the applied methodology follows the same steps used in the manual reconstruction of the first model of a photosynthetic organism, Synechocystis sp. PCC6803 (Montagud et al., 2010), also taking into account the probabilistic criteria followed in that article in order to deal with duplicated and missing reactions, which will be discussed later.

Regarding the list of reactions in the reconstruction of the metabolic model, before applying the probabilistic criteria one has to properly organize and identify the chemical reactions. Three issues should be noted here:

• First is the compilation of all chemical reactions of a particular pathway present in the organism. There are two kinds of reactions that one has to take into account: most reactions are catalyzed by enzymes and each enzyme has a code called EC (enzyme commission); each reaction in the model receives this number as its identifier. On the other hand, there are a few reactions in some pathways that are not catalyzed by enzymes; they are spontaneous reactions instead. These spontaneous chemical reactions receive as their identifier the name “non-enzymatic” and a number after it, indicating the order in which these reactions appeared.

• Next is related to enzymes that can catalyze different reactions of the same type. This usually happens because different substrates have very similar structure and can couple to the enzyme that performs its catalytic activity. In these cases, next to the EC number of the enzyme, in the identifier there's an extra letter (a,b,c, … ).

• In a last step, the reversibility (or irreversibility) of each reaction in a pathway is checked from KGML files for every pathway present in an organism.
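The identifier scheme described in the first two points of the list above can be sketched as follows. This is a minimal illustration, not the COPABI implementation: the function name `assign_identifiers`, the input layout, the exact spelling `non-enzymatic N`, and the rule that a letter is appended only when an enzyme catalyzes more than one reaction are all assumptions made for the example.

```python
from collections import defaultdict
from string import ascii_lowercase


def assign_identifiers(reactions):
    """Label reactions with EC-based identifiers (hypothetical helper).

    reactions: list of (ec_number_or_None, equation) in order of appearance.
    Returns a list of (identifier, equation).
    """
    # Count how many reactions each enzyme catalyzes.
    per_enzyme = defaultdict(int)
    for ec, _ in reactions:
        if ec is not None:
            per_enzyme[ec] += 1

    seen = defaultdict(int)   # reactions already labeled for each enzyme
    spontaneous = 0           # running number for non-enzymatic reactions
    labeled = []
    for ec, equation in reactions:
        if ec is None:
            # Spontaneous reaction: numbered in order of appearance.
            spontaneous += 1
            labeled.append((f"non-enzymatic {spontaneous}", equation))
        elif per_enzyme[ec] == 1:
            # Assumption: a single-reaction enzyme keeps the bare EC number.
            labeled.append((ec, equation))
        else:
            # Several reactions for one enzyme: append a, b, c, ...
            # (this sketch supports at most 26 reactions per enzyme).
            letter = ascii_lowercase[seen[ec]]
            seen[ec] += 1
            labeled.append((f"{ec}{letter}", equation))
    return labeled
```

For instance, two reactions of enzyme 1.2.1.12 would be labeled `1.2.1.12a` and `1.2.1.12b`, while a reaction without an enzyme would receive `non-enzymatic 1`.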

Example of output from Synechocystis sp. PCC6803 in OptGene format follows (the symbol # indicates comments, and we use it for specifying the name of the metabolic pathway):

-REACTIONS

# Glycolysis / Gluconeogenesis

1.2.4.1a: Pyruvate + Thiamin diphosphate -> CO2 + 2-(alpha-Hydroxyethyl)thiamine diphosphate

2.7.1.40a: ATP + Pyruvate -> ADP + Phosphoenolpyruvate

6.2.1.1a: ATP + CoA + Acetate -> Diphosphate + AMP + Acetyl-CoA

1.2.1.5a: H2O + NAD+ + Acetaldehyde <-> NADH + Acetate + H+

1.1.1.2: NADP+ + Ethanol <-> NADPH + H+ + Acetaldehyde

# Citrate cycle (TCA cycle)

1.1.1.42a: Oxalosuccinate <-> CO2 + 2-Oxoglutarate

1.1.1.37: NAD+ + (S)-Malate -> NADH + Oxaloacetate + H+

2.3.3.1: CoA + Citrate <-> H2O + Acetyl-CoA + Oxaloacetate

6.2.1.5a: ATP + CoA + Succinate <-> ADP + Orthophosphate + Succinyl-CoA
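A listing like the one above can be read back into a simple data structure. The following sketch is only an illustration based on the excerpt shown here, not a parser for the full OptGene format: the function name `parse_optgene_reactions` and the dictionary layout are hypothetical, stoichiometric coefficients are not interpreted, and metabolite names are assumed to be separated by ` + ` with surrounding spaces (so that names such as `NAD+` and `H+` survive).

```python
def parse_optgene_reactions(text):
    """Parse reaction lines of an OptGene-style listing (illustrative sketch).

    Returns a list of dicts with keys: id, pathway, substrates, products,
    reversible.
    """
    reactions = []
    pathway = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("-"):
            continue  # blank line or section marker such as -REACTIONS
        if line.startswith("#"):
            pathway = line[1:].strip()  # comment carrying the pathway name
            continue
        identifier, sep, equation = line.partition(": ")
        if not sep:
            continue  # not a reaction line
        # Test the reversible arrow first, because "->" is part of "<->".
        if " <-> " in equation:
            left, right = equation.split(" <-> ", 1)
            reversible = True
        elif " -> " in equation:
            left, right = equation.split(" -> ", 1)
            reversible = False
        else:
            continue
        reactions.append({
            "id": identifier,
            "pathway": pathway,
            "substrates": [m.strip() for m in left.split(" + ")],
            "products": [m.strip() for m in right.split(" + ")],
            "reversible": reversible,
        })
    return reactions
```

Splitting on the spaced separator rather than on a bare `+` is the design choice that keeps charged species intact.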

2.5. Probabilistic criteria

Until this point, the reconstructed model will have exactly the same information stored in the database. The next steps are the implementation of automatic algorithms that will take into account the probabilistic criteria in order to complete missing gaps in some metabolic pathways (completeness) or to exclude duplicated reactions (unicity).

2.5.1. Unicity

For the unicity criterion, the algorithm identifies reactions that appear more than once, together with their enzymes. Repeated reactions must be eliminated, and the criterion for choosing which copy to eliminate is as follows: the copy associated with the enzyme that appears less frequently in the model is kept, and the other copies are eliminated. As an example, consider the reactions:

1.2.1.12a: A + B <-> C + D

1.2.1.12b: G + E <-> K + L

1.2.1.12c: P + V -> Y

1.1.1.1: G + E <-> K + L

In the example, enzyme EC1.2.1.12 catalyzes three different reactions, one of which is the same as the reaction catalyzed by enzyme EC1.1.1.1. Following the algorithm criteria, reaction 1.2.1.12b is eliminated from the metabolic model.
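The rule illustrated by this example can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the platform's algorithm: the names `ec_of`, `canonical`, and `apply_unicity` are hypothetical, two reactions are considered equal when they have the same arrow and the same metabolites on each side (regardless of order), and ties between equally frequent enzymes are broken by order of appearance.

```python
from collections import Counter, defaultdict


def ec_of(identifier):
    """Drop the trailing letter suffix (a, b, c, ...) to recover the EC number."""
    return identifier.rstrip("abcdefghijklmnopqrstuvwxyz")


def canonical(equation):
    """Order-insensitive key for an equation such as 'G + E <-> K + L'."""
    arrow = " <-> " if " <-> " in equation else " -> "
    left, right = equation.split(arrow, 1)

    def side(text):
        return tuple(sorted(m.strip() for m in text.split(" + ")))

    return side(left), arrow.strip(), side(right)


def apply_unicity(reactions):
    """reactions: list of (identifier, equation) with unique identifiers.

    Keeps one copy of each repeated reaction: the copy whose enzyme
    appears least often in the model.
    """
    enzyme_count = Counter(ec_of(i) for i, _ in reactions)
    groups = defaultdict(list)
    for identifier, equation in reactions:
        groups[canonical(equation)].append(identifier)
    keep = set()
    for identifiers in groups.values():
        keep.add(min(identifiers, key=lambda i: enzyme_count[ec_of(i)]))
    return [(i, eq) for i, eq in reactions if i in keep]
```

Applied to the four reactions above, this sketch removes 1.2.1.12b and keeps 1.1.1.1. It does not detect a reversible reaction that is written in the opposite direction in its second copy.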

2.5.2. Completeness

The addition of new reactions to a metabolic model is associated with the comparison between the information available about the metabolic reactions in a determined pathway of a specific organism with a general pathway (theoretical one) generated from the compilation of all metabolic reactions present in “all” organisms in nature. In many cases, the genomes are not perfectly annotated and some genes are missing from the annotation. This comparison of pathways is a tool to help biologists identify these missing genes.

The reactions associated with gaps identified in a particular pathway are going to be added to the metabolic model if they satisfy the following criteria:

• The reactions have as final product a metabolite belonging to the biomass equation.

• The reactions present in the model in this pathway correspond to, at least, a determined percentage of the whole general pathway.

The COPABI interface allows the user to choose the metabolites that compose the biomass equation, as well as the percentage value used in the second criterion.
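One possible reading of the two criteria is sketched below. Several points are assumptions of this sketch rather than statements about COPABI: both criteria are required together, the first criterion is interpreted as "at least one product of the missing reaction belongs to the biomass set", and the function name `fill_gaps` and its input layout are hypothetical.

```python
def fill_gaps(model_reactions, reference_pathway, biomass, threshold_pct):
    """Select gap reactions to add for one pathway (illustrative sketch).

    model_reactions:   set of reaction ids annotated for the organism.
    reference_pathway: dict mapping every reaction id of the general
                       pathway to the set of its product metabolites.
    biomass:           set of metabolites chosen for the biomass equation.
    threshold_pct:     minimum percentage of the general pathway that must
                       already be present in the model.
    Returns the ids to add, each preceded by the marker symbol.
    """
    if not reference_pathway:
        return []
    present = model_reactions & set(reference_pathway)
    coverage = 100.0 * len(present) / len(reference_pathway)
    if coverage < threshold_pct:
        return []  # second criterion not met: the pathway is left as it is
    missing = sorted(set(reference_pathway) - present)
    # First criterion (as interpreted here): a product is a biomass metabolite.
    return ["·" + rid for rid in missing if reference_pathway[rid] & biomass]
```

With a threshold of 100%, a pathway passes the coverage test only when nothing is missing, so no reaction is ever added; lowering the threshold makes the gap filling progressively more permissive.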

The reactions added to the model under these criteria are placed at the end of the output after a comment (# not pres ! ∼∼IMPORTANT, following X reactions not in sequence!!∼∼), and their identifier is the corresponding EC number preceded by the symbol “·”.

3. Results and Discussion

For the validation of the metabolic models generated by the COPABI platform, we have compared the models generated for selected organisms with manually curated metabolic models taken from the literature. First, we analyze general properties of the generated models (number of metabolites, reactions, … ) and properties of the networks described by the metabolic models.

From the network point of view, each metabolite of a model can be thought of as a node, and each reaction represents links between the metabolites on the left-hand side and the metabolites on the right-hand side of the reaction equation. These links can be directed if one takes into account the direction of each reaction and the reversibility of the reactions, or undirected if one neglects this information.

In a first step of the analysis, our algorithms perform an automatic debugging of the metabolic models. This means looking for bad reactions: reactions without substrates or products (some transport reactions in the SBML files taken from the literature presented this issue), reactions in which the same metabolite appears as substrate and product, and reactions decoupled from the network, meaning that at least one substrate and one product of the reaction appear only in this reaction and nowhere else. All these bad reactions are excluded from the models before any calculation is done, because they add errors and uncertainties to the results.
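The three kinds of bad reactions described above can be filtered as in the following sketch. It is an illustration only: the function name `remove_bad_reactions` and the input layout are hypothetical, and the filter is applied in a single pass over the original model (it is an assumption that the exclusion is not repeated after the first removals).

```python
from collections import Counter


def remove_bad_reactions(reactions):
    """reactions: dict id -> (substrates, products), each a list of names.

    Returns a new dict without the reactions considered bad.
    """
    # Number of reactions in which each metabolite takes part.
    usage = Counter()
    for substrates, products in reactions.values():
        for metabolite in set(substrates) | set(products):
            usage[metabolite] += 1

    clean = {}
    for rid, (substrates, products) in reactions.items():
        if not substrates or not products:
            continue  # missing substrates or products
        if set(substrates) & set(products):
            continue  # same metabolite on both sides
        lonely_substrate = any(usage[m] == 1 for m in substrates)
        lonely_product = any(usage[m] == 1 for m in products)
        if lonely_substrate and lonely_product:
            continue  # decoupled from the rest of the network
        clean[rid] = (substrates, products)
    return clean
```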

Each metabolic model now represents a network, and as a first step, we have chosen to work with the directed version of it, meaning that the links connecting two metabolites have direction from the substrate to the product, and in reversible reactions, the pair of metabolites would have two links from opposite directions connecting them.
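The construction of the directed network can be sketched as follows, assuming each reaction is given as a triple of substrates, products, and a reversibility flag. The function name `build_directed_network` and the adjacency representation (a dictionary of successor sets) are choices made for this example only.

```python
def build_directed_network(reactions):
    """reactions: iterable of (substrates, products, reversible).

    Returns a dict mapping each metabolite to the set of metabolites it
    links to. Every metabolite appears as a key, even without outgoing links.
    """
    links = {}
    for substrates, products, reversible in reactions:
        for metabolite in (*substrates, *products):
            links.setdefault(metabolite, set())
        for s in substrates:
            for p in products:
                links[s].add(p)          # substrate -> product
                if reversible:
                    links[p].add(s)      # opposite link for reversible reactions
    return links
```

Using sets means that a pair of metabolites connected by several reactions in the same direction contributes a single link, which is a simplification of this sketch.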

For each organism, two versions of the metabolic model have been generated with two different values for the parameter appearing in the decision-making process for the completeness criteria, explained in the previous section. In one model, the parameter is chosen to be 100% and in the other 10%, two possible extremes. Taking this parameter to be 100% means that only reactions for enzymes that are annotated in the genome of an organism will be written to the model. On the other hand, if the parameter is taken to be 10% (a fairly low value), it means that if one out of ten reactions in a pathway is annotated in the genome, the generated model will have all reactions in this pathway. In the end, for each organism there are three models being studied: two automatically generated by the COPABI platform, and one manually constructed, taken from published works. The models taken from the literature correspond to the following organisms: Synechocystis sp. PCC6803 (Montagud et al., 2010), Synechococcus elongatus PCC7942 (Triana et al., in preparation), Burkholderia cenocepacia J2315 (Fang et al., 2011), Rhodobacter sphaeroides (Imam, 2011), Clostridium beijerinckii (Milne, 2011), Mycoplasma genitalium (Suthers et al., 2009), Lactobacillus plantarum (Teusink et al., 2006), Thermotoga maritima (Zhang et al., 2009), and Yersinia pestis (Navid and Almaas, 2009).

In Table 1, we show results for the general comparison of the models. As one can see from this table, although the networks usually have hundreds of different metabolites, two different metabolites are on average only about three steps apart from each other. As a consequence, the whole network should respond very quickly to changes in any of the metabolites' concentrations or to environmental perturbations. This closeness of the nodes in the network is known as small-world behavior and is a consequence of a property of the network connectivity known as scale-free distribution. Metabolic networks are known to follow a scale-free distribution for node connectivity, meaning that the number of nodes P with some number of connections x follows a power law: P(x) ∼ x^(-γ), where γ is usually a number between 2 and 3. From this law, one concludes that there are very few nodes with a large number of connections (these are called hubs), and most of the nodes have very few connections.

Table 1.

General Comparison

Org.	# Met.	# Reac.	% Rev.	% Irr.	ASP	σ_ASP	N_R	N_U
Syn_lit	803	893	34.49	65.51	3.51	1.15	494446	150363
Syn_10	707	718	37.74	62.26	3.2	0.90	355430	144419
Syn_100	656	640	36.40	63.60	3.29	0.94	295093	135243
Syf_lit	777	847	36.01	63.99	3.55	1.19	475612	128117
Syf_10	711	705	37.02	62.98	3.19	0.88	356066	149455
Syf_100	655	622	35.05	64.95	3.32	0.95	292390	136635
Cbe_lit	732	856	27.22	72.78	3.05	0.82	409910	125914
Cbe_10	752	808	40.22	59.78	3.21	0.88	412228	153276
Cbe_100	693	733	38.2	61.8	3.33	0.97	335276	144973
Tma_lit	583	612	41.67	58.33	3.19	0.96	242290	97599
Tma_10	566	614	46.09	53.91	3.06	0.83	250504	69852
Tma_100	489	517	44.1	55.9	3.24	0.91	183170	55951
Bcj_lit	792	847	27.63	72.37	3.04	0.83	523487	103777
Bcj_10	955	1018	37.03	62.97	3.25	0.88	632355	279670
Bcj_100	907	948	36.29	63.71	3.32	0.92	564967	257682
Mge_lit	342	262	40.08	59.92	3.00	0.99	83279	33685
Mge_10	268	254	48.82	51.18	2.89	0.82	54311	17513
Mge_100	116	104	55.77	44.23	3.42	1.25	11543	1913
Eco_lit	1034	1435	14.91	85.09	3.19	0.90	912770	156386
Eco_10	888	1017	39.53	60.47	3.19	0.84	576962	211582
Eco_100	846	968	38.95	61.05	3.26	0.87	525007	190709
Lpl_lit	513	526	31.37	68.63	2.97	0.82	221786	41383
Lpl_10	566	595	41.85	58.15	3.14	0.83	233640	86716
Lpl_100	492	512	41.8	58.2	3.23	0.88	173827	68237
Rsp_lit	788	863	64.31	35.69	2.74	0.68	593663	27281
Rsp_10	869	934	41.65	58.35	3.16	0.82	543333	211828
Rsp_100	827	873	40.21	59.79	3.24	0.87	485209	198720
Ypk_lit	817	948	29.85	70.15	3.04	0.85	339398	142238
Ypk_10	838	945	39.47	60.53	3.18	0.84	520075	182169
Ypk_100	779	891	39.62	60.38	3.25	0.89	444404	162437

In column # Met., one can see the number of different metabolites (or compounds) found in each model, and in column # Reac., the number of reactions present in each model (after excluding bad reactions). In columns % Rev. and % Irr., we show the percentage of the reactions that are reversible and irreversible, respectively. Column ASP has the calculated average shortest path (for each pair of metabolites in the model we have used Dijkstra's algorithm to calculate the shortest path connecting them in the network) and column σ_ASP has the standard deviation for the ASP calculation. Columns N_R and N_U show the number of pairs of metabolites for which a path connecting them was and was not found, respectively (one should note that the network is directed, so metabolites that have no link pointing in their direction cannot be reached from any other metabolite and are therefore either external metabolites that should be absorbed by the cell from the environment or badly incorporated into the model).
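The quantities in columns ASP, σ_ASP, N_R, and N_U can be computed as in the sketch below. It uses a breadth-first search, which gives the same shortest paths as Dijkstra's algorithm when every link has unit length; this substitution, the function name `shortest_path_stats`, the use of the population standard deviation, and the counting of ordered pairs of distinct metabolites are assumptions of the example. In Table 1, N_R + N_U for Syn_lit equals 803 squared, which suggests that the counting used for the table also includes the pair of each metabolite with itself.

```python
from collections import deque
from math import sqrt


def shortest_path_stats(links):
    """links: dict mapping each node to the set of nodes it points to.

    Returns (asp, sigma, n_reached, n_unreached) over ordered pairs of
    distinct nodes.
    """
    nodes = set(links)
    for successors in links.values():
        nodes |= set(successors)

    lengths = []
    for source in nodes:
        # Breadth-first search from the source node.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in links.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        lengths.extend(d for node, d in dist.items() if node != source)

    n_reached = len(lengths)
    n_unreached = len(nodes) * (len(nodes) - 1) - n_reached
    if not lengths:
        return 0.0, 0.0, 0, n_unreached
    mean = sum(lengths) / n_reached
    sigma = sqrt(sum((d - mean) ** 2 for d in lengths) / n_reached)
    return mean, sigma, n_reached, n_unreached
```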

When studying node connectivity, for each metabolite an algorithm counts the number of reactions in which it appears as a substrate (or as a product, in reversible reactions). In Figure 2, we show results for the three metabolic models of some organisms. The plots clearly show the tendency of the distribution to follow a power law.
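The counting described above, together with a rough estimate of the power-law exponent γ, can be sketched as follows. The function names are hypothetical, and the exponent is estimated here with an ordinary least-squares fit of log P(x) against log x; this simple fit is an assumption of the example and is not necessarily the procedure behind Figure 2.

```python
from collections import Counter
from math import log


def connectivity(reactions):
    """reactions: iterable of (substrates, products, reversible).

    Counts, for each metabolite, the reactions in which it is a substrate
    (or a product, when the reaction is reversible).
    """
    degree = Counter()
    for substrates, products, reversible in reactions:
        active = set(substrates)
        if reversible:
            active |= set(products)
        for metabolite in active:
            degree[metabolite] += 1
    return degree


def degree_distribution(degree):
    """Map each connectivity value x to the number of metabolites having it."""
    return Counter(degree.values())


def fit_power_law_exponent(distribution):
    """Estimate gamma in P(x) ~ x^(-gamma) from a {x: P(x)} mapping."""
    points = [(log(x), log(p)) for x, p in distribution.items() if x > 0 and p > 0]
    if len(points) < 2:
        raise ValueError("at least two distinct connectivity values are needed")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx  # gamma is minus the slope of the log-log line
```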

FIG. 2.

Connectivity distribution for the three metabolic models in different organisms.

As one can see, all metabolic networks are very similar when studying their global network properties. Therefore, in order to be able to differentiate the metabolic network of one organism from the network of another organism, one has to look into the details of the networks, namely specific metabolites and hubs that are particular for each organism.

In order to find these differences, we define a similarity parameter to measure the degree of similarity of two metabolic networks. Two criteria are taken into account in the definition of this parameter: first, the metabolites present in each metabolic network, and second, the degree (number of connections) of each metabolite to all others. Since important metabolites for one organism might be different from the essential metabolites in another, we would like to take into account the identity of the metabolites in the different connections and not only the number of connections. This step is rather tricky, because the metabolite names used in the metabolic models taken from the literature do not follow any standard, and the authors of each model chose different abbreviations and names for each compound. For some models, however, the authors have also made available the association of each compound name used in their models with a KEGG identifier. For these models, we were able to construct an algorithm that translates the metabolic model to the standard metabolite names used by KEGG, and therefore, we were also able to compare the metabolite identities.

3.1. Network comparison

Our goal here is to define a distance between two networks. We call it a distance in the sense that the bigger the value of this number, the more the two networks differ, while the smaller the number, the closer the networks are.

Given two metabolic networks, each one has a set of metabolites (let's call the set in each network by set A and set B). Among all metabolites in the two networks, there are three different sets: metabolites particular to network A, metabolites particular to network B, and metabolites common to the two networks:

\begin{align*}A \cup B = \underbrace{(A \cap \bar{B})}_{\rm Only \ in \ A} \cup \underbrace{(A \cap B)}_{\rm Common} \cup \underbrace{(\bar{A} \cap B)}_{\rm Only \ in \ B} \tag{1}\end{align*}

Now, let's consider the connections of the metabolites. Each metabolite i has n_i connections in total and n_αi connections only to metabolites in the set $A \cap \bar{B}$, n_βi connections only to metabolites in the set $B \cap \bar{A}$, and n_γi connections only to metabolites in the set $A \cap B$.

Let's define the number of metabolites in each set and the total number of connections inside each set:

\begin{align*}
N_\alpha & = \| A \cap \bar{B} \| \tag{2} \\
N_\beta & = \| B \cap \bar{A} \| \tag{3} \\
N_\gamma & = \| A \cap B \| \tag{4} \\
N_A & = \sum_{i \in A \cap \bar{B}} n_i \tag{5} \\
N_B & = \sum_{i \in B \cap \bar{A}} n_i \tag{6} \\
N_C & = \sum_{i \in A \cap B} n_i \tag{7}
\end{align*}

Here, ∥C∥ means the number of elements in the set C.

Now, for each set, let's sum the proportion of connections of each metabolite to metabolites inside the set, weighted by the inverse of the total number of connections and averaged for all metabolites:

\begin{align*}
p_{Ai} & = \frac{n_{\alpha i}}{n_i} \\
\alpha & = \frac{N_A}{N_\alpha} \sum_{i \in A \cap \bar{B}} \frac{1}{n_i} p_{Ai}
\end{align*}

Analogously, we define β and γ for the metabolites in the other two sets.
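Written out explicitly, and assuming that the two remaining coefficients mirror the definition of α term by term, they would read as follows. The symbols p_Bi and p_Ci are introduced here only for illustration and do not appear elsewhere in the text.

```latex
\begin{align*}
p_{Bi} & = \frac{n_{\beta i}}{n_i}, &
\beta & = \frac{N_B}{N_\beta} \sum_{i \in B \cap \bar{A}} \frac{1}{n_i} p_{Bi} \\
p_{Ci} & = \frac{n_{\gamma i}}{n_i}, &
\gamma & = \frac{N_C}{N_\gamma} \sum_{i \in A \cap B} \frac{1}{n_i} p_{Ci}
\end{align*}
```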

The distance between the two networks is defined as:

\begin{align*}{\rm dist} = \frac{\alpha + \beta}{2 \gamma}\end{align*}

For two identical networks, the sets of metabolites particular to each network are empty, so α and β are zero and dist = 0. For two networks that do not have a single metabolite in common, γ = 0, and so dist = ∞.
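A minimal Python sketch of the distance is given below. It rests on several assumptions that the text does not fix: each network is a dictionary mapping a metabolite identifier to the set of metabolites it is connected to (every metabolite appearing as a key), the connections of a metabolite are taken from the union of both networks, and the coefficient of an empty set is taken to be zero. The function names are hypothetical.

```python
import math


def network_distance(net_a, net_b):
    """Distance (alpha + beta) / (2 * gamma) between two metabolite networks."""
    mets_a, mets_b = set(net_a), set(net_b)
    only_a = mets_a - mets_b      # metabolites particular to network A
    only_b = mets_b - mets_a      # metabolites particular to network B
    common = mets_a & mets_b      # metabolites present in both networks

    # Assumption: n_i counts the connections found in either network.
    neighbours = {}
    for net in (net_a, net_b):
        for met, linked in net.items():
            neighbours.setdefault(met, set()).update(linked)

    def coefficient(subset):
        # (N_X / N_x) * sum_i (1 / n_i) * (n_xi / n_i); zero for an empty set
        if not subset:
            return 0.0
        total_connections = sum(len(neighbours[m]) for m in subset)
        acc = 0.0
        for m in subset:
            n_i = len(neighbours[m])
            if n_i == 0:
                continue  # isolated metabolite contributes nothing
            n_inside = len(neighbours[m] & subset)
            acc += n_inside / (n_i * n_i)
        return total_connections / len(subset) * acc

    alpha = coefficient(only_a)
    beta = coefficient(only_b)
    gamma = coefficient(common)
    if gamma == 0:
        return math.inf
    return (alpha + beta) / (2 * gamma)


# Hypothetical toy networks.
net_1 = {"x": {"w", "y"}, "w": {"x"}, "y": {"x", "z"}, "z": {"y"}}
net_2 = {"y": {"z"}, "z": {"y"}}
net_3 = {"c": {"d"}, "d": {"c"}}
```

With these toy networks, network_distance(net_2, net_2) gives 0 and network_distance(net_2, net_3) gives infinity, reproducing the two limiting cases stated above.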

To validate our models, we carried out this rough comparison, calculating the distance from each one of the automatically generated models to each model taken from the literature. In Table 2, we show results for the comparisons between the models taken from the literature and the automatically generated ones. As one can observe, the smallest value in each column is obtained when comparing the literature model with the automatically generated one for the corresponding organism.

Table 2.

Distance Comparison Between Automatically Generated and Manually Curated Models

Org	Syn_lit	Syf_lit	Cbe_lit	Rsp_lit	Ypk_lit	Tma_lit
Mge_10	1.246	1.254	0.401	0.695	1.003	1.541
Lpl_10	0.815	0.755	0.121	0.317	0.476	0.834
Syn_10	0.47	0.527	0.183	0.248	0.626	1.065
Syf_10	0.54	0.496	0.18	0.255	0.628	0.999
Cbe_10	0.697	0.699	0.076	0.212	0.413	0.814
Bcj_10	0.708	0.721	0.156	0.183	0.459	1.063
Eco_10	0.748	0.799	0.13	0.204	0.387	0.959
Tma_10	0.72	0.7	0.103	0.278	0.498	0.636
Rsp_10	0.735	0.741	0.157	0.138	0.549	1.103
Ypk_10	0.772	0.782	0.12	0.181	0.324	0.882

In boldface is the smallest number in each column, showing that the best comparison between models is obtained when comparing models of the same species.
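The column-wise minima can be checked directly from the values in Table 2. The short Python sketch below copies the table and finds, for each literature model, the automatically generated model at the smallest distance; the variable and function names are chosen only for this example.

```python
generated = ["Mge_10", "Lpl_10", "Syn_10", "Syf_10", "Cbe_10",
             "Bcj_10", "Eco_10", "Tma_10", "Rsp_10", "Ypk_10"]
literature = ["Syn_lit", "Syf_lit", "Cbe_lit", "Rsp_lit", "Ypk_lit", "Tma_lit"]

# Rows follow `generated`, columns follow `literature` (values from Table 2).
distances = [
    [1.246, 1.254, 0.401, 0.695, 1.003, 1.541],
    [0.815, 0.755, 0.121, 0.317, 0.476, 0.834],
    [0.47, 0.527, 0.183, 0.248, 0.626, 1.065],
    [0.54, 0.496, 0.18, 0.255, 0.628, 0.999],
    [0.697, 0.699, 0.076, 0.212, 0.413, 0.814],
    [0.708, 0.721, 0.156, 0.183, 0.459, 1.063],
    [0.748, 0.799, 0.13, 0.204, 0.387, 0.959],
    [0.72, 0.7, 0.103, 0.278, 0.498, 0.636],
    [0.735, 0.741, 0.157, 0.138, 0.549, 1.103],
    [0.772, 0.782, 0.12, 0.181, 0.324, 0.882],
]


def closest_generated(column):
    """Name of the generated model with the smallest distance in a column."""
    values = [row[column] for row in distances]
    return generated[values.index(min(values))]


best = {lit: closest_generated(j) for j, lit in enumerate(literature)}
```

For every literature model, the closest generated model carries the same organism code, which is the pattern described in the text.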

This is a rough comparison, since the identification of the metabolites' names is not perfect. Moreover, since the automatically generated models have not yet been used in flux balance analysis, there is no distinction between internal and external metabolites, and we used a version of these models where no biomass was defined. This introduces a lot of errors and uncertainties in the comparison made, because the models from the literature do have these features defined, and they appear, from the point of view of the comparing algorithm, as new and different metabolites for which there will be no counterpart in the automatically generated models. Despite these pitfalls, the features already contained in the models are enough to differentiate the various organisms when compared to models from the literature.

4. Conclusions

We presented here the metabolic models automatically generated by an algorithm developed by our research group. This algorithm, implemented in the COPABI platform, is capable of automatically downloading genomic, metabolomic, and proteomic information from the KEGG database, and constructing, from this information, a genome-scale metabolic model either in OptGene or XML (SBML) file formats. The tools presented here will soon be made available on the internet.

The metabolic models generated have been thoroughly analyzed by standard algorithms in order to calculate the average shortest path between nodes in the network and the connectivity distribution. Commonly observed features in these networks are small-world behavior and a scale-free distribution of the node degree.

After a general analysis of the global properties of the networks, we proceeded to compare the automatically generated models with manually curated models from published works. For this task, a distance parameter between metabolic networks has been defined. The comparison shows that the automatically generated models are consistent with manually constructed models found in the literature.

The reconstruction of genome-scale metabolic models is an important step in different areas of research related to systems biology. Nowadays, this is a lengthy and slow process that might take over a year to be completed for a single organism. The algorithm developed here will certainly speed up the process and help researchers gain access to reconstructed models for any organism in just a few days' time.

Acknowledgments

The authors would like to thank Alberto Conejero for his useful discussions. This work has been funded by MICINN TIN2009-12359 project ArtBioCom from the Spanish Ministerio de Educación y Ciencia.

Disclosure Statement

The authors declare that no competing financial interests exist.
