Abstract
Abstract
1. Introduction
Using a statistical framework similar to the one employed in this article, a recent work (Edwards et al., 2007) revisited several enhanced datasets of foraging patterns in wild animals and concluded that the distributions previously classified as power-law were better described by a Gamma distribution. The previous spurious results were attributed in part to the limited magnitude of the dataset (also typical of biological datasets) and to the graphical method used to assess the character of the distributions. It has been suggested (Khanin and Wit, 2006; Stumpf et al., 2007; Tanaka et al., 2005) that the characteristic observation that biological networks follow a power-law distribution may have been reached due to the same methodological shortcomings. Specifically, it has been argued that analyzing relatively small cellular networks, having only a few hundred to a few thousands data elements using the commonly employed frequency-degree or intensity plots, does not have sufficient power to differentiate among various network models having heavy-tailed distributions, and that the use of rank-degree (intensity) plots proves superior for this purpose (Khanin and Wit, 2006; Stumpf et al., 2007; Tanaka et al., 2005). Furthermore, a rigorous quantification of goodness-of-fit is also required to establish relatedness to various network models, and the quality of the underlying data set (i.e., the quality of network reconstruction) is critical for proper analysis.
1.1. The models and the MLE analytical framework
Assessment of the best model explaining a given data distribution has typically been done using simple linear regression methods. These methods are suitable for normal distribution functions, but not for highly skewed distribution functions. Essentially, the problem arises from the fact that skewed distributions are characterized by the scale of the tail, which forms most of the support for the distribution but that is barely populated (i.e., contains less than 10% of the data points). Because of this, simple least square fits of the probability density distribution computed via histogram methods are very poor estimators of the distribution parameters (Tanaka et al., 2005) due to the noisy poor sampling of the tails.
Therefore, it has been argued that density plots should not be used as a base for the fitting of these types of data, as several more reliable methods are available. A simple and better strategy is to use rank-plots as commonly used in engineering and economics (Tanaka et al., 2005). Logarithmic binning (Albert et al., 1999) has also been used as a more robust alternative, but it has been reported that this procedure fails to retrieve the value of the exponent as the slope of the graph for power-law distributions (Tanaka et al., 2005; Goldstein et al., 2004; Clauset et al., 2009). Finally, one could apply a logarithmic transformation to the data and fit the corresponding distribution function (De Fabritiis et al., 2003), thus avoiding the problem of skewed data from the outset. In this case, the appropriate transformation of the probability distribution has to be performed (for instance, a normal distribution for log-transformed, log-normally distributed data), a procedure that could be cumbersome for some distribution functions.
Below we focus on the discrimination of different models (see Section 2.2) for several types of molecular interaction and expression data sets by using probability-based techniques to perform comparative analyses. In order to have a good mathematical representation of the probability distribution, we use cumulative distribution functions (cdf ) that are directly related to rank-plots (Tanaka et al., 2005). In addition, instead of graphical-based estimation methods, we utilize maximum likelihood estimation (MLE see Section 2), which is not dependent on the graphical representation and allows us to perform a statistical test of the proposed models (Goldstein et al., 2004; Stumpf and Ingram, 2005; Hoogenboom et al., 2006; Clauset et al., 2009).
Based on these considerations, here we reexamine both the global topological connectivity distributions and absolute gene and protein expression value distributions in cellular networks, with a focus on the most completely reconstructed, highest quality data sets. We have tested the empirical distributions against several plausible models with heavy-tailed distribution. Note also that generative models in the context of biological networks have been proposed for power-law (Barabási and Albert, 1999; Qian et al., 2001; Rzhetsky and Gomez, 2001; Bhan et al., 2002; Pastor-Satorras et al., 2003) and broad-scale (Amaral et al., 2000) distributions, but not for the rest of the tested distributions. We examine the extent to which several types of distributions, some with plausible underlying generative models, can provide a similar probabilistic framework for the experimental data being analyzed. We show that, for the analyzed molecular interaction distributions, only the high-throughput protein-protein interaction data set and the out-degree distribution of the transcriptional regulatory network significantly fitted the power-law model. In the case of the distributions of global gene and protein expression values, only two mid-log cultures of Escherichia coli showed a statistically significant fit. At any rate, all experimental data sets having significant fits for the power-law distribution showed equally good fits for other distributions, which leads to inconclusive results for the support of a single specific distribution. We discuss the implications of this finding for the uniqueness of generative models solely from topology and activity data distributions.
2. Methods
For model classification, we have basically used the methods developed and implemented in Clauset et al. (2009). However, we provide here sufficient details about the mathematical foundations to make the analytical procedure clear.
2.1. Probability-based estimation method
The likelihood of a given probability distribution p
Here we use different model distributions with their corresponding set of parameters
2.2. Distribution functions estimation
Both continuous and discrete model distributions are used depending on the nature of the data analyzed in each case. As described in Section 2.2.1, we have tested seven models of probability distribution against the studied data. In turn, five model distributions were tested against the continuous data sets: power-law, log-normal, exponential, Weibull, and broad-scale. It is worth noting that only some of the distributions tested (e.g., power-law and broad-scale) have been put forward as distributions agreeing with plausible generative models in the biological context (Barabási and Albert, 1999; Amaral et al., 2000). The intention of this work is to expand the application of the methods to other a priori suitable distributions to give a more general scope to the study.
2.2.1. Distribution functions
Power-law model: A random variable X is said to follow a power-law distribution with index α when the scaling law of P[X > x] is power-law, such as P[X > x] = 1 − P[X ≤ x] ≈ cx−α for x → ∞ . We used the Pareto distribution for the continuous data: probability density function (pdf ):
Log-normal model: A random variable X is log-normally distributed if Y = log(X) follows a normal distribution. We have used the following distributions for the continuous case: pdf:
Poisson model: We have used the discrete Poisson distribution to model the discrete data sets using the following probability mass function (pmf):
Yule model: Again this is a discrete distribution that we have only used for discrete data sets. We used the following forms, pdf: p(k) = ρ
Exponential model: Exponential models have no biological relevance in the framework of our study. However, we have included it in some cases for methodological comparison as a representative model of a non fat-tailed distribution. For a random variable X distributed exponentially, we have used the following forms: pdf: p(x) = λe−λx, cdf: P[X ≤ x] = 1 − e−λx for the continuous data sets; we have taken the pmf:
Weibull model: Also referred to as stretched exponential, we have used pdf:
Broad-scale model: Broad-scale distribution functions, also referred to here as power-law with cut-off or power-law plus exponential, identify a class of distributions that are characterized by a scaling law p(x) ≈ ce−λxx−α for x → ∞ . Note that the broad-scale is very similar to the Gamma distribution. Both contain a shape and a scale parameter. The Gamma distribution in addition contains a normalizing constant. For the Gamma distribution, the pdf would be
2.3. Distribution comparison
To compare the feasibility of the different models with respect to the power-law distribution, we applied a likelihood ratio test, which compares the fits of two given competing distributions. Note that we are not comparing a model against the empirical distribution but two different distributions with each other. Strictly speaking, in this case, we compare different models with each other to indirectly assess how well the distributions explain the data. We have again applied the methods developed in Clauset et al. (2009) and Vuong (1989) to evaluate the normalized log-likelihood ratio (NLLR). If the NLLR has a positive value, the power-law model is supported, whereas the alternative distribution offers a better fit if NLLR has a negative value. In order to determine significant positive or negative values, we calculated the associated p-value. Given the different experimental design with respect to the one described in Section 2.1, we take here as confidence value the most commonly used p = 0.05 (Mayo and Cox, 2006). We search for differences on both sides of the distribution, and therefore we use p = 0.05 as level of significance.
The broad-scale distribution requires a special comment, as it is a nested function containing both power-law and exponential terms. In fact, if we compare the power-law distribution fit with the broad-scale distribution fit, the NLLR test will always be zero or negative, favoring the broad-scale, as it reproduces the power-law behavior plus an additional term. To solve this problem, we used a correction of the p-value calculated for the log-likelihood ratio (LLR) as described in Vuong (1989) and Clauset et al. (2009).
2.4. Data sets used
Molecular interaction maps were obtained from different sources: S. cerevisiae protein-protein interaction networks (Reguly et al., 2006), S. cerevisiae transcriptional regulatory network (MacIsaac et al., 2006), and E. coli metabolic networks (Feist et al., 2007). Yeast expression values were taken from Ghaemmaghami et al. (2003). Global mRNA expression data for E. coli cells were obtained from Beg et al. (2007) and Vázquez et al. (2008).
3. Results
3.1. Global topological organization of molecular interaction networks
The aim of this article was to test by means of a probabilistic approach the ability of heavy-tailed distribution functions to explain the experimental data distributions for given datasets. Due to the nature of the experimental data being considered here, we have differentiated between global interaction networks and global activity networks. Global interaction measurements refer to the number of interactions of a given molecule. Thus, measurements are counted as natural numbers and the studied models will be discrete. On the other hand, global activity interactions are defined in terms of cellular expression, measurements are expressed using positive real numbers, and the applied models will be continuous.
3.1.1. Global interaction networks
To assess the degree distribution of an undirected homotypic molecular interaction network we used data obtained from baker's yeast, S. cerevisiae, in Reguly et al. (2006), which is the most comprehensive information to date for a species-specific protein-protein interaction (PPI) network. We analyzed two data sets built using two different methodologies. The first data set was created from the combination of five different studies based on experimental high-throughput techniques (or HTP data) (Reguly et al., 2006). The second data set was curated manually from the literature (LC data set) and is assumed to be more accurate than the former (Reguly et al., 2006). After the removal of redundant and self-interactions (Reguly et al., 2006), we obtained a data set of 11,571 interactions between 4,474 proteins for the HTP data and 8,165 interactions between 2,689 proteins for the LC data. Table 1 shows that, after fitting the parameters using an MLE, only the HTP data set provides a statistically sufficient goodness-of-fit, based on the KS test. Whether or not the high rate of false-discovery interactions from yeast-two hybrid experiments in the HTP data set leads to the acceptance of the null hypothesis for the power-law model is an open issue for discussion (Huang et al., 2007). In fact, for the same network, if the data is retrieved manually from the literature (LC data set), the power-law model is rejected. Furthermore, for the case of HTP data, exponential and Poisson distributions behave significantly worse than the power-law model while for the LC data set, log-normal, Yule, Weibull, and broad-scale are better models than the power-law. Figure 1a shows the degree distribution of the LC network with its corresponding best fitted power-law model. An equivalent plot for the HTP data set is shown in Figure SI 1a (for Supplementary Material, see www.liebertonline.com).

The connectivity degree distribution of LC-derived S. cerevisiae protein-protein interaction (
The second column shows the p-value associated with the differences between the data and the power-law (PL) model. The next six columns correspond to the log-likelihood ratio tests comparing the PL model with other plausible models. Positive values support the PL model. The normalized log-likelihood ratio (NLLR) is used for non-nested functions, while the raw log-likelihood ratio (LLR) is used for the PL with exponential cutoff model. Here, p-values are associated with the differences between the two models. Significant p-values are given in bold.
The Supplementary Material includes the further results discussed in the text (for online Supplementary Material, see www.liebertonline.com).
We also analyzed the degree distribution of a directed heterotypic molecular interaction network, the transcriptional regulatory (TR) network of S. cerevisiae (MacIsaac et al., 2006), the edges of which represent interactions between transcription factors (TF) and gene regulatory regions, distinguishing out-degree from in-degree interactions. In total the data includes 99 TFs and 1,851 TF-regulated genes connected through 3,394 links. Table 1 clearly indicates that the degree distribution of out-degree interactions (TF → gene) provide a statistically significant fit for the power-law model. However, all other tested distributions except the Poisson fit the empirical data equally well which prevents us from establishing unequivocally the power-law distribution as the statistical model to describe the data. This result should be taken with caution as the sample size (N = 99) is really on the limit to be considered too small (Hart and Clark, 1999). For the case of in-degree interactions (number of TF affecting a given gene) the range of connectivities is too small to provide conclusive results, although it appears that all other tested models apart from the Poisson model offer significantly better fits than the power-law. A graphical representation of the out-degree distribution with their best-fitted power-law model is shown in Figure 1b. Figure SI 1c shows the corresponding plot for the in-degree distribution.
Finally, we assessed the recent high quality reconstruction of the E. coli metabolic network (Feist et al., 2007), in which substrates are connected to each other through edges that represent the actual metabolic reactions (Jeong et al., 2000) with a total of 2,381 reactions, of which 304 are exchange, 553 reversible and 1,524 irreversible reactions. We have studied separately the in-degree distribution with 1,657 nodes and 3,050 links and the out-degree distribution with 1,656 metabolites and 2,788 links. Both networks behave very similarly: both display significant differences with the power-law model. While the log-normal model fits the data equally as well as the power-law model, all other tested models behave significantly worse. The empirical distributions with their corresponding best-fitted power-law models are represented graphically in Figure SI 1e,f.
3.1.2. Clustering coefficients
To gain further insight into the topological organization of molecular interaction networks, we have analyzed the C(k) vs. k relationships for the HTP and LC reconstructed PPI networks of yeast (Reguly et al., 2006), the TR regulatory network of yeast (MacIsaac et al., 2006), and the metabolic network of E. coli (Feist et al., 2007). Figure SI 2 shows how the dependence of the clustering coefficient in the three types of molecular interaction networks is not uniform. While the E. coli metabolic network displays a distribution close to C(k) = k−1 (Fig. SI 2d), as previously described (Ravasz et al., 2002), the distribution observed in the other two networks is significantly less organized.
3.2. Global activity level distributions in molecular interaction networks
Next we assessed the distribution of activity levels achieved on the underlying network topology. First, we examined single time point snapshots for mRNA and protein expression in E. coli and in S. cerevisiae, respectively.
We used global gene expression data from E. coli cells grown in mid-log phase batch cultures with single carbon source media, using acetate, galactose, glucose, glycerol and maltose as individual carbon sources (Beg et al., 2007). Signals for 3,977 probes were examined for the expression level of the corresponding genes based on their hybridization intensity (Table 2). Activity distribution of acetate growing cells significantly fits the power-law model, in fact representing the best fit with the data in all cases analyzed in this study. Interestingly though the broad-scale model gives us a significantly much better fit. Maltose is another carbon source where gene expression distribution of E. coli shows significant fit with the power-law model. However, three other distributions—log-normal, Weibull, and broad-scale—provide significantly better fits than the power-law model. Of the two cases, in which the power-law model displays no significant differences from the empirical data but other models are also plausible, the classification of power-law character remains uncertain. None of the other mid-log cultures provide a significant fit with the power-law distribution. For glucose, for galactose, and for glycerol, log-normal, Weibull, and broad-scale are significantly better models than the power-law, while exponential is a significantly worse model. Worth noting is the graphical comparison of Figure 1c,d. The acetate distribution displays no statistical difference with the power-law model while the galactose distribution does differ from it. This important difference in the nature of the empirical data is not revealed from the inspection of the graphical representations but uniquely from the statistical test.
The Supplementary Material includes the other results discussed in the text (for online Supplementary Material, see www.liebertonline.com).
See Table 1 for explanation of table details.
We also examined the transcriptome state of E. coli cells grown in chemostat cultures at different dilution rates, representing various steady-state growth rates at different culture densities (Vázquez et al., 2008). We find that the general pattern for gene expression distribution is largely invariant in all different growth media and at different growth rates, but for the steady-state cultures none of the empirical data sets fit the power-law model significantly (Table SI 1 and Fig. SI 4). Curiously the broad-scale model outperforms for all steady-state cases. Also note that the log-normal and Weibull distributions display superior fits for dilutions 0.25 and 0.4. In all the cases, the exponential distribution fits the empirical data significantly worse than the power-law distribution.
Protein expression values in mid-log phase of S. cerevisiae cell cultures from 3,868 different strains expressing a single GFP or TAP tagged protein in a rich growth media (Ghaemmaghami et al., 2003) displayed a slightly different expression mode (Fig. SI 5), suggesting difference in mRNA and protein expression value distributions under these growth conditions. Yet again the empirical dataset shows significant differences with respect to the power-law model (Table SI 2). Comparing models, the power-law distribution fits significantly better than the exponential, although similarly to the log-normal and Weibull distributions. However the broad-scale model offers a significantly better fit that the power-law model.
Finally, to assess the transcriptome state of E. coli cells under the most complex growth conditions, we have examined the dynamical microarray profile of E. coli cells grown in batch culture in a mixed-substrate medium (Beg et al., 2007). At all sampled time points, the distribution of expression of the transcriptome was differing significantly from the power-law distribution (Table SI 3). In this mRNA expression data set the log-normal, Weibull and broad-scale models show better statistical fit with the experimental data at all time points, while the exponential distribution is also better for some of the cases. A graphical representation of the best fitting power-law distribution against the empirical data is shown in Figure SI 6.
4. Discussion
The structure and activity of intracellular molecular interaction networks are a basic tenet of life. Thus, it is of great importance to correctly identify their true structural and functional properties at the global, intermediate, and local levels. Heavy-tailed probability distributions have been used in the literature to explain the topological features of complex biological networks. In this article, we have analyzed to what extent, among others, three well-known heavy-tailed distributions (power-law, broad-scale and log-normal) can be uniquely used to represent the observed experimental data for protein interaction, transcription regulation, metabolic pathways and expression experiments. Our results demonstrate that the large-scale topology of the molecular interaction networks and the global mRNA and protein expression distributions examined here do not strictly follow power-law distributions. Moreover, none of the three heavy-tailed models tested had a universal agreement with the empirical data, even when using the highest quality data sets available. Distributions are evidently heavy-tailed, and for this type of data, MLE analyses prove superior to graphical methods for assessing different tested distributions. However, we could not assess any of the studied models with statistical reliability. Therefore the classification of the generative and evolutionary models of the studied networks remains to determined unequivocally. Our results could be explained by several non-exclusive arguments. First, our analyses could be affected by a biased acquisition of the empirical data, a systematic error that would affect, for instance, proteins which have been studied more extensively because they have a high biomedical interest. Also, such simple mathematical frameworks as considered here may not represent entirely the biological mechanisms at work, but also be the convolution of the physicochemical constraints of the cell (Beg et al., 2007). Finally, population-level evolutionary processes, such as activation of a foraging program upon extracellular substrate exhaustion, clearly affect the function of cellular networks (Beg et al., 2007) and are likely to influence the nature of the underlying molecular interactions as well.
Additional problems can have their origin in data acquisition. In some cases, the experimental data might be incomplete or noisy, especially for the HTP data, which derives partially from yeast two-hybrid experiments. There have been several efforts (Huang et al., 2007; Scholtens et al., 2008) to infer statistically the actual PPI network from such experiments; consequently our results refer to the data, not the actual network. Indeed, analysis of the reconstructed network from the literature, in a manually curated way, which is considered to be more correct, yields different results. The inference of the actual network is out of the scope of the current work.
We should also note that the topology of cellular networks can be viewed at various levels of complexity. For example, in the metabolic representation considered here, each metabolite that participates in an interconversion reaction is considered equal (Jeong et al., 2000). However, in alternative representations, molecules that only act as donor or acceptor (e.g., ATP or ADP) or that do not contribute a carbon or nitrogen atom to a metabolic reaction can be excluded (Arita, 2004), or they can be pre-classified according to their perceived biochemical role (Tanaka, 2005). Similarly, in TR networks, genes and their protein products are usually defined as common nodes with regulatory links among them being mediated by the binding of TFs to the promoter regions of the genes (Shen-Orr et al., 2002). At other times, however, genes and their protein products are considered as separate nodes (Yeger-Lotem et al., 2004). While certain properties are unaffected by alternative network representations, others are affected (Arita, 2004; Tanaka, 2005), thus potentially complicating the interpretation of subsequent analytical results.
Finally, our analyses also reveal highly similar, but dynamically regulated global mRNA and protein expression profiles, in which expression values are significantly variable. Thus, while the global function of cellular networks is expected to be greatly influenced by their underlying topology (Almaas et al., 2004; Ueda et al., 2004; Goh et al., 2001), mRNA or protein expression data reflect the dynamic physicochemical state of the cell, likely necessitating an explicit dynamical model for establishing a relationship between topology and expression.
Footnotes
Acknowledgments
We acknowledge A. Clauset and C.R. Shalizi for making publicly available the implementation of the methods. We thank R.V. Solé, R. Albert, P. Delicado, P. Rué, M. Dies, and S. Laurie for comments at different stages of the work. We also thank the constructive comments from two anonymous reviewers. A.L.G. de L. acknowledges Generalitat de Catalunya for FI and BE grants. G.D.F. acknowledges the Ramón y Cajal granting scheme. This research was supported in part by EC funded FP6 STREP projects BioBridge (contract number FP6-037909), QosCosGrid (contract number FP6-033883), VPH NoE (contract number FP7-223920), and MCINN (contract number CTQ2008-00755).
Disclosure Statement
No competing financial interests exist.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
