Sensitivity Analysis of Genome-Scale Metabolic Flux Prediction

Abstract

TRIMER, Transcription Regulation Integrated with MEtabolic Regulation, is a genome-scale modeling pipeline targeting metabolic engineering applications. Using TRIMER, regulated metabolic reactions can be effectively predicted by integrative modeling of metabolic reactions with a transcription factor-gene regulatory network (TRN), which is modeled through a Bayesian network (BN). In this article, we focus on sensitivity analysis of metabolic flux prediction for uncertainty quantification of BN structures for TRN modeling in TRIMER. We propose a computational strategy to construct the uncertainty class of TRN models based on the inferred regulatory order uncertainty given transcriptomic expression data. With that, we analyze the prediction sensitivity of the TRIMER pipeline for the metabolite yields of interest. The obtained sensitivity analyses can guide optimal experimental design (OED) to help acquire new data that can enhance TRN modeling and achieve specific metabolic engineering objectives, including metabolite yield alterations. We have performed small- and large-scale simulated experiments, demonstrating the effectiveness of our developed sensitivity analysis strategy for BN structure learning to quantify the edge importance in terms of metabolic flux prediction uncertainty reduction and its potential to effectively guide OED.

1. INTRODUCTION

Optimal experimental design (OED) and control for complex biological systems have significant impact on developing new computational strategies in systems and synthetic biology (Balsa-Canto et al, 2021; Zhao et al, 2020) for targeted biochemical overproduction that may benefit human society, for example, in different energy-related and pharmaceutical applications (Barrett et al, 2006; Bro et al, 2006; Esvelt and Wang, 2013; Haro and de Lorenzo, 2001; Lü et al, 2011; Luengo et al, 2003; Ohta et al, 1991). In particular, metabolic engineering, which genetically redesigns microbial strains by gene or reaction knockouts, aims to optimize corresponding biological processes with respect to the desired engineering objective(s).

Owing to the demanding experimental cost and time to test different microbial strains in vivo, computational methods have been developed for in silico prediction of useful knockout strategies for beneficial mutants. Mathematical models to systematically analyze genome-scale metabolic reaction networks have been developed to derive optimal intervention strategies to achieve the desired metabolic reaction fluxes, the turnover rates of the molecules through the corresponding metabolic pathways (Edwards and Palsson, 2000; Segre et al, 2002; Varma and Palsson, 1994).

However, many existing computational methods to obtain genetically engineered strains are based on genome-scale analysis at steady state, assuming static network models (Apaydin et al, 2017; Apaydin et al, 2016; Burgard et al, 2003; Edwards and Palsson, 2000; Ren et al, 2013; Segre et al, 2002; Shlomi et al, 2005; Varma and Palsson, 1994). Recent efforts have integrated genetic regulatory relationships involving transcription factors (TFs) that may regulate metabolic reactions to achieve more accurate and robust prediction of target metabolic behaviors under different conditions or contexts (Chandrasekaran and Price, 2010; Covert and Palsson, 2003; Covert et al, 2008; Machado and Herrgård, 2014; Motamedian et al, 2017; Reed, 2017; Shlomi et al, 2007; Yu and Blair, 2019).

To generalize these integrated hybrid models, we have developed Transcription Regulation Integrated with MEtabolic Regulation (TRIMER) (Niu et al, 2021) as a modeling pipeline targeting metabolic engineering applications. Using TRIMER, regulated metabolic reactions can be effectively predicted by integrative modeling of metabolic reactions with a TF-gene regulatory network (TRN), where the TRN is modeled through a Bayesian network (BN) inferred from transcriptomic expression data. We have demonstrated promising metabolic flux prediction performance in both simulated and real-world microbial mutant design applications considering transcription regulation in genome-scale metabolic prediction (Niu et al, 2022; Niu et al, 2021).

Although the existing efforts have demonstrated valid performance in selected model organisms with abundant data and careful manual curation, there has been little investigation of how model uncertainty, due to incomplete system knowledge and/or limited training data, may affect metabolic predictions. To the best of our knowledge, most of the existing works assume that the trained models are deterministic without considering potential model uncertainty. In contrast to traditional approaches, where model uncertainty is often represented by how well models fit the training data, we propose to analyze how model uncertainty affects the metabolic engineering performance directly.

In particular, we analyze metabolic flux prediction sensitivity with respect to the uncertainty class of BN structures for TRN modeling as one possible way of uncertainty quantification. Through a mathematical programming formulation of BN topological ordering, we construct uncertainty classes of BN models for TRN and analyze the metabolic prediction sensitivity of our TRIMER modeling pipeline. We evaluate the sensitivity of the TRIMER pipeline by comparing the ground-truth metabolite yield alterations with TF knockout mutations based on the BN uncertainty classes with the network edge space ranked by topological ordering.

To be specific, we simulate both gene expression and metabolic flux data from a predefined ground-truth TRN-regulated metabolic network model, infer models from the simulated data, construct uncertainty classes, and then check the prediction sensitivity by checking the correlation of the predicted and ground-truth metabolite yields. The obtained sensitivity analyses can provide useful guidance for model learning, calibration, and OED for metabolic engineering, allowing biologists to better understand metabolism under perturbation and to take advantage of high-throughput genetic engineering for desired microbial strains with reduced cost.

2. BACKGROUND

In Niu et al (2022, 2021), we have developed an integrated regulatory-metabolic hybrid network model and genome-scale metabolic analysis pipeline, TRIMER. TRIMER enables condition-dependent genome-scale metabolic behavior modeling and provides in silico predictions of metabolic engineering tasks, such as knockout phenotype and knockout flux predictions. As a hybrid network model, TRIMER has a metabolic network module that predicts metabolic fluxes based on the classic flux balance analysis (FBA) framework (Covert and Palsson, 2003; Edwards and Palsson, 2000; Lewis et al, 2010; Palsson, 2015; Varma and Palsson, 1994).

This well-known technique adopts linear programming with flux constraints introduced by metabolic regulation rules (i.e., a metabolic network) of a given organism, to model genome-scale steady-state metabolic status. For accurate and robust prediction of metabolic behaviors under different conditions, regulatory relationships, in the corresponding TF regulatory network (TRN) between TFs and their target genes, are integrated with the FBA formulation as additional transcriptional constraints over fluxes (Apaydin et al, 2017; Covert et al, 2008; Jensen et al, 2011; Ren et al, 2013; Shlomi et al, 2007; Shlomi et al, 2005).

Motivated by another genome-scale framework PROM (probabilistic regulation of metabolism) (Chandrasekaran and Price, 2010), TRIMER adopts a TRN module that integrates the a priori known TF-gene interaction annotations and available gene/transcriptomic expression profiles to learn a corresponding BN for probabilistically modeling the TRN. The conditional probabilities $\Pr(\text{gene(s)} \mid \text{TF(s)})$ can be inferred from the learned BN and used for the construction of regulatory constraints in the FBA formulation, predicting the metabolic behaviors under different conditions, for example, TF knockouts in metabolic engineering. A typical formulation that predicts the regulated metabolic fluxes is shown as follows:

where S is the stoichiometric matrix deduced from the given metabolic network, $\vec{v}$ is a real-valued vector representing metabolic fluxes, $α$ and $β$ can be considered as slack variables, $κ$ is a hyperparameter that controls the penalty for exceeding the flux bounds, and $lb'_{i}$ and $ub'_{i}$ are the regulatory lower and upper flux bounds computed based on $\Pr(\text{gene(s)} \mid \text{TF(s)})$ and flux variability analysis (Mahadevan and Schilling, 2003). Readers can refer to Niu et al (2022, 2021) for more details about flux bound construction and how TRIMER predicts the corresponding metabolic fluxes.

In practice, the BN is learned from a gene expression data set with a limited number of data points, giving rise to potential uncertainty in BN structures, the estimated conditional probabilities $\Pr(\text{gene(s)} \mid \text{TF(s)})$, and consequently, the metabolic flux predictions. In this article, we focus on how BN structure uncertainty impacts metabolic flux predictions and consider sensitivity analysis of metabolic flux predictions with respect to the learned BN structures as one type of uncertainty quantification. Our goal is to identify the most important edges in terms of reducing the metabolic flux prediction uncertainty, which can further guide the OED of BN modeling.

Although TRIMER is motivated by PROM, the adopted flux bounds in PROM are based on $\Pr(\text{gene} \mid \text{TF})$ estimated by the relative frequency of the corresponding gene-TF state pairs in the given TF-gene interaction annotations. As there is no BN modeling involved and network dependency is not considered in PROM, the conducted sensitivity analysis in PROM was on frequentist estimation of the conditional probabilities given individual TF knockout (Chandrasekaran and Price, 2010), which is different from the BN structure uncertainty quantification of metabolic flux predictions in TRIMER investigated in this article.

3. MATERIALS AND METHODS

In this section, we first provide a brief overview of our proposed BN structure sensitivity analysis strategy, followed by detailed descriptions of all the steps of the procedure.

3.1. Overview: sensitivity analysis of TRIMER

Sensitivity analysis of TRIMER is achieved by evaluating metabolic flux prediction performances based on the uncertainty classes of BNs modeling TRN. In particular, we focus on the uncertainty of BN structures that directly impacts metabolic flux prediction. To construct an uncertainty class, the key idea is to first infer the BN topological ordering learned from gene expression data. We then grow various BN structures from the inferred node ordering to construct uncertainty classes at different perturbation levels.

By statistical analysis of metabolic predictions with different uncertainty classes, we can better understand how TRN modeling may affect the final metabolic predictions in the TRIMER hybrid model. With such sensitivity analyses as a way of quantifying BN model uncertainty, practical guidance can be obtained to inform researchers on active learning and on calibrating the modular components in TRIMER through optimal experimental design, for example, by defining iterative structure updating policies for the BN to improve metabolic prediction.

To develop such a sensitivity analysis capability in TRIMER, we first implement a BN topological order search algorithm to infer the network node ordering of BNs given transcriptomic expression data and the search space of BN edges. The topological ordering of BNs determines the parenthood of nodes: an ancestor node must be of higher order than its descendant nodes. Therefore, given candidate node parent sets and a topological ordering of interest, the globally optimal consistent structure can be obtained. To measure the uncertainty of the corresponding topological ordering, ordering samples are obtained by bootstrapping the adopted order search algorithm (Friedman et al, 1999). Accordingly, the statistics of the bootstrapped ordering samples can be computed to help derive the probability distributions and, thereafter, to quantify uncertainty by constructing BN uncertainty classes.

The next question is how we may capture the potential model uncertainty from bootstrapped BN topological orderings and derive a rigorous mathematical framework to construct the classes of uncertain BN models for our proposed metabolic prediction sensitivity analysis. Following Xiao et al (2018), we adopt a mathematical programming formulation to rank the pairwise ordering of nodes or directed edges in the BN modeling TF-gene regulatory relationships. The formulation aims at identifying the critical regulatory relationships for which reducing the ordering uncertainty can help significantly improve the BN model fitting to the given training data set.

By solving this problem, all valid edges in the given edge space for the BN are ranked by their contributions to the uncertainty reduction, where the uncertainty is captured by the corresponding covariance matrix of ordering scores computed from bootstrapped ordering samples. We then construct BN uncertainty classes of different sizes, where edges with higher rankings are more likely to be sampled to form the allowed edge space from which the best BN structure consistent with the topological ordering is derived. Finally, transcription-regulated genome-scale metabolic predictions with the BNs in the corresponding uncertainty class can be made following the TRIMER pipeline to investigate prediction sensitivity. The details of the TRIMER analysis pipeline can be found in Niu et al (2022).

In summary, our proposed strategy mainly comprises three steps: (1) BN topological ordering, (2) uncertainty class construction, and (3) metabolic prediction sensitivity analysis. Figure 1 provides a high-level overview of the strategy, depicting the main workflow.

FIG. 1.

Schematic illustration of the proposed sensitivity analysis workflow: (1) gene expression data with the prior knowledge on regulatory interactions are used to infer the topological orderings of nodes in BNs. (2) The uncertainty class of BNs is constructed based on the uncertainty of the topological ordering due to incomplete knowledge and/or limited data. (3) TRIMER pipeline is used with the uncertainty class to analyze the metabolic flux prediction sensitivity. TRIMER, Transcription Regulation Integrated with MEtabolic Regulation.

3.2. Order-based Tabu search

In the TRIMER pipeline (Niu et al, 2021), BN structure learning is performed by fitting the given gene expression profiles D for the corresponding gene set $X = \{X_{n}, n \in [1, N]\}$ and the given TF-gene interaction list E, where X and E are interpreted in the BN as the set of network nodes and the search space of network edges, respectively. Uncertainty quantification directly by analyzing factorized conditional probability distributions $P(G \mid D)$ is challenging as it depends on the BN graph structure G, which is combinatorial. We hence propose a topological order-based uncertainty quantification strategy, for which we here first define the topological ordering-based score, with the uncertainty modeled by a Gaussian distribution as detailed in Section 3.4. To be specific, to learn the BN topological ordering, denoted by $≺$, our implementation follows the same idea as in Teyssier and Koller (2012), where an order-based heuristic search was proposed. For candidate BN structures, the score of a given ordering $≺$ can be defined as follows:

where $X_{i} \in X$ denotes a node with its parent node set $Pa_{≺}(X_{i}, E)$ consistent with the ordering $≺$ in the given edge space E, $y ≺ X_{i}$ indicates that the order of y is higher than that of $X_{i}$, d is the predefined upper bound on a node's in-degree, and $\text{score}(G_{≺}; D; E)$ is a decomposable score function used in BN structure learning, such as the Bayesian information criterion score or the Bayesian Dirichlet (BD) score. In our implementation, we adopt the BD score. To find the best ordering, we use a heuristic search algorithm, multi-restart Tabu search, originally proposed in James et al (2009), based on the decomposed node-wise score function $\text{score}'(\cdot; \cdot; \cdot)$. To perform the ordering search, we define a swap operator $\text{swap}(X_{t_{j}}, X_{t_{j+1}})$ over nodes with adjacent ordering in the t-th iteration as follows:

$(X_{t_{1}}, \dots, X_{t_{j}}, X_{t_{j+1}}, \dots, X_{t_{N}}) \to (X_{t_{1}}, \dots, X_{t_{j+1}}, X_{t_{j}}, \dots, X_{t_{N}}).$

The best swapping operation is selected among all $N - 1$ candidate successors in the t-th iteration. This simplified search procedure based on the swap operator significantly reduces the computational cost of ordering comparison. Supposing that $≺^{t}$ is changed to a new ordering $≺^{t+1}$ by $\text{swap}(X_{t_{j}}, X_{t_{j+1}})$, the delta-score of the induced ordering, that is, its score difference from the original ordering, depends only on the delta-scores of $X_{t_{j}}$ and $X_{t_{j+1}}$. In addition, the only new operators deduced from $≺^{t+1}$ are $\text{swap}(X_{t_{j}}, X_{t_{j+2}})$ and $\text{swap}(X_{t_{j-1}}, X_{t_{j+1}})$. In each iteration, we can find the optimal parent set for any node in $O(f_{max}) = O(N^{d})$, as there are $f_{max} = \binom{N}{d}$ possible parent sets per node with the maximum in-degree d (Teyssier and Koller, 2012). With c iterations, the time complexity of the algorithm is $O(c N^{d})$. The pseudocode of the implementation is provided in Algorithm 1.
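As a rough illustration of the order-based score and the adjacent-swap search described above, the following Python sketch may help. It is not the Algorithm 1 implementation: it uses a single start without restarts, a simple fixed-length tabu list, and a hypothetical toy local score (`toy_score`) in place of the BD score; all function names and the toy network are assumptions made for illustration. Orderings are represented as lists running from the highest-order node to the lowest.

```python
import itertools

def best_parent_score(node, predecessors, edge_space, local_score, d):
    """Best local score of `node` over parent sets of size <= d drawn from
    the predecessors that the edge space allows as parents."""
    candidates = [p for p in predecessors if (p, node) in edge_space]
    best = local_score(node, ())
    for k in range(1, min(d, len(candidates)) + 1):
        for parents in itertools.combinations(candidates, k):
            best = max(best, local_score(node, parents))
    return best

def tabu_order_search(order, edge_space, local_score, d, iters=20, tabu_len=3):
    """Adjacent-swap Tabu search over orderings (single start, no restarts)."""
    order = list(order)
    scores = [best_parent_score(x, order[:i], edge_space, local_score, d)
              for i, x in enumerate(order)]
    best_order, best_total = list(order), sum(scores)
    tabu = []
    for _ in range(iters):
        move = None
        for j in range(len(order) - 1):
            a, b = order[j], order[j + 1]
            if frozenset((a, b)) in tabu:
                continue
            # Only the two swapped nodes see a different predecessor set,
            # so the delta-score involves just these two node-wise scores.
            new_b = best_parent_score(b, order[:j], edge_space, local_score, d)
            new_a = best_parent_score(a, order[:j] + [b], edge_space, local_score, d)
            delta = new_a + new_b - scores[j] - scores[j + 1]
            if move is None or delta > move[0]:
                move = (delta, j, new_b, new_a)
        if move is None:  # every adjacent swap is currently tabu
            break
        _, j, new_b, new_a = move
        tabu = (tabu + [frozenset((order[j], order[j + 1]))])[-tabu_len:]
        order[j], order[j + 1] = order[j + 1], order[j]
        scores[j], scores[j + 1] = new_b, new_a
        if sum(scores) > best_total:
            best_order, best_total = list(order), sum(scores)
    return best_order, best_total

# Hypothetical toy problem: two TFs, two genes, three allowed edges.
TRUE_PARENTS = {"g1": {"tf1"}, "g2": {"tf1", "tf2"}}

def toy_score(node, parents):
    # Stand-in for a decomposable score: counts correctly recovered parents.
    return len(set(parents) & TRUE_PARENTS.get(node, set()))

edge_space = {("tf1", "g1"), ("tf1", "g2"), ("tf2", "g2")}
best_order, best_total = tabu_order_search(
    ["g1", "g2", "tf1", "tf2"], edge_space, toy_score, d=2)
print(best_order, best_total)
```

Starting from an ordering that places both genes above the TFs, the search moves the TFs ahead of their targets. A practical implementation would additionally cache node-wise scores across iterations, since only two new swap operators arise after each accepted swap.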

3.3. Order initialization

Initialization is crucial to guarantee satisfactory performance of heuristic search because exact optimality is not ensured, owing to both the constrained search space and the greedy nature of the search. In this article, the TF-gene interaction list is used not only to help define the edge space but also as prior knowledge of node ordering. For example, the order of a TF node cannot be lower than that of its regulated gene nodes. Otherwise, the TF cannot be a parent node of its regulated genes in the BN. In the following, we introduce an order initialization algorithm that extracts prior knowledge of node ordering from a given TF-gene interaction list.

In the extreme case, the edge space E defined by a given interaction list corresponds to a directed acyclic graph (DAG), for which the optimal BN topological ordering is just any ordering consistent with the graph structure. However, in practice, the edge space almost always corresponds to a directed graph with cycles. Although the optimal node ordering cannot be directly read off from the graph structure, it is still possible to obtain prior knowledge about it. For a directed graph $G = (X, E)$, its strongly connected components (SCCs), denoted as $C = \{C_{m} : m \in [1, M]\}$, are defined to be the maximal sets of nodes such that, for each set, every pair of nodes within the set is mutually reachable.

A graph of C can be denoted as $G^{c} = (C, E^{c})$, where an edge exists between two SCCs if there is at least one edge between two nodes belonging to the two SCCs, respectively. By the definition of the SCC, $G^{c}$ must be a DAG. Therefore, $G^{c}$ determines a component-wise ordering $≺^{c}$, which provides us with partial knowledge about how to initialize the node ordering $≺^{0}$ to inform the following order-based heuristic search algorithm. The globally optimal node-wise ordering $≺^{*}$ must be consistent with the ordering $≺^{c}$, while relative orders among nodes within the same SCC are still undetermined.

To obtain an appropriate initial node ordering $≺^{0}$, $G^{c}$ and the corresponding ordering $≺^{c}$ are first identified from $G = (X, E)$, where SCCs are identified by Tarjan's algorithm (Tarjan, 1972) with the computational complexity $O(|E| + |X|)$. Next, a node ordering $≺^{0}$ consistent with $≺^{c}$ can be found easily, where relative orders of nodes belonging to the same SCC are randomly generated. The pseudocode of the initialization algorithm is shown in Algorithm 2 and the corresponding workflow is illustrated in Figure 2.
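The initialization steps above can be sketched in Python as follows. This is a minimal illustration under our own naming (`tarjan_scc`, `initial_order`), not a reproduction of Algorithm 2. It relies on the known property that Tarjan's algorithm emits SCCs in reverse topological order of the component graph, so reversing its output yields a component-wise ordering from upstream to downstream. The recursive form is adequate for small graphs; a genome-scale edge space would likely call for an iterative variant to avoid recursion limits.

```python
import random

def tarjan_scc(nodes, edges):
    """Return SCCs in the order Tarjan's algorithm completes them,
    i.e., downstream (sink) components first."""
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
    index, low, stack, on_stack, sccs = {}, {}, [], set(), []

    def visit(u):
        index[u] = low[u] = len(index)
        stack.append(u)
        on_stack.add(u)
        for w in adj[u]:
            if w not in index:
                visit(w)
                low[u] = min(low[u], low[w])
            elif w in on_stack:
                low[u] = min(low[u], index[w])
        if low[u] == index[u]:  # u is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == u:
                    break
            sccs.append(comp)

    for n in nodes:
        if n not in index:
            visit(n)
    return sccs

def initial_order(nodes, edges, rng):
    """Node ordering (highest order first) consistent with the SCC ordering;
    nodes inside the same SCC are placed in random relative order."""
    order = []
    for comp in reversed(tarjan_scc(nodes, edges)):
        comp = list(comp)
        rng.shuffle(comp)
        order.extend(comp)
    return order

# Hypothetical edge space: tf1 and tf2 regulate each other (a cycle).
nodes = ["tf1", "tf2", "g1", "g2"]
edges = [("tf1", "tf2"), ("tf2", "tf1"), ("tf1", "g1"), ("tf2", "g2"), ("g1", "g2")]
order0 = initial_order(nodes, edges, random.Random(1))
print(order0)
```

In this toy example, the two mutually regulating TFs form one SCC and are placed first in random relative order, followed by `g1` and then `g2`.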

FIG. 2.

Illustration of the order initialization workflow. First, the DAG of SCCs is identified from the given directed graph; then a node-wise ordering consistent with the SCC ordering is randomly selected. DAG, directed acyclic graph; SCC, strongly connected component.

3.4. Semidefinite programming formulation for uncertainty class construction

To quantify the uncertainty of topological ordering, we use a multivariate Gaussian random vector $ϕ \in \mathbb{R}^{|X|}$, $ϕ \sim \mathcal{N}(μ, Λ^{-1})$, as the numerical score representation of an ordering $≺$, where each element represents the score of the corresponding node and the nodes of higher order are supposed to have larger scores in $ϕ$. By bootstrapping the previously described order-based search algorithm, we can estimate the corresponding distribution parameters of $\mathcal{N}(μ, Λ^{-1})$ from the bootstrapped samples of $ϕ$. Bootstrapping here means repeatedly perturbing D and applying the order search algorithm to the perturbed data sets to obtain a set of perturbed locally optimal orderings.
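The bootstrap estimation can be sketched as follows. Two assumptions are made here that the text does not fix: each ordering is converted to scores by assigning a node its reverse position (so higher-order nodes receive larger scores), and the data are perturbed by resampling rows with replacement. The order search is passed in as a callable; `toy_search` is a hypothetical stand-in used only to make the sketch runnable.

```python
import random

def order_to_scores(order, nodes):
    """Assumed score map: a node at position i (0 = highest order) in an
    ordering of n nodes gets score n - i."""
    pos = {x: i for i, x in enumerate(order)}
    return [len(order) - pos[x] for x in nodes]

def bootstrap_orderings(data, nodes, search_fn, n_boot, rng):
    """Resample rows with replacement and rerun the order search each time."""
    samples = []
    for _ in range(n_boot):
        resampled = [rng.choice(data) for _ in data]
        samples.append(order_to_scores(search_fn(resampled), nodes))
    return samples

def mean_and_cov(samples):
    """Sample mean and unbiased sample covariance of the score vectors."""
    m, n = len(samples), len(samples[0])
    mu = [sum(s[i] for s in samples) / m for i in range(n)]
    cov = [[sum((s[i] - mu[i]) * (s[j] - mu[j]) for s in samples) / (m - 1)
            for j in range(n)] for i in range(n)]
    return mu, cov

def toy_search(rows):
    # Hypothetical stand-in for the order search: the TF always comes first,
    # and the two genes are ordered by their mean value in the resampled data.
    m1 = sum(r[0] for r in rows) / len(rows)
    m2 = sum(r[1] for r in rows) / len(rows)
    return ["tf", "g1", "g2"] if m1 >= m2 else ["tf", "g2", "g1"]

rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(30)]
nodes = ["tf", "g1", "g2"]
samples = bootstrap_orderings(data, nodes, toy_search, n_boot=50, rng=rng)
mu, cov = mean_and_cov(samples)
print(mu, cov)
```

The precision matrix $Λ$ would then be obtained by inverting (a possibly regularized version of) the estimated covariance, which is omitted here.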

To establish the relationship between node-wise orders and edges, we associate each edge $E_{k} = (X_{i}, X_{j})$, $E_{k} \in E$, with a real-valued random variable $y_{k} \sim \mathcal{N}(ϕ_{i} - ϕ_{j}, γ^{-1})$ to represent the pairwise order difference, where $γ$ is a hyperparameter. Values of $y_{k}$ can be interpreted as the confidence of the corresponding pairwise ordering: the larger the value of $y_{k}$, the more confident we are in the ordering induced by edge $E_{k}$. As proposed in Xiao et al (2018), a matrix $B \in \{-1, 0, 1\}^{|E| \times |X|}$ can be used to collectively represent all edges in E, where for $E_{k} = (X_{i}, X_{j})$: $B_{k, l} = \begin{cases} 1, & \text{if } l = i \\ -1, & \text{if } l = j \\ 0, & \text{otherwise.} \end{cases}$
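To make the role of $B$ concrete, the short sketch below builds the matrix for a hypothetical three-node edge space and checks that each row of $Bϕ$ equals the score difference $ϕ_{i} - ϕ_{j}$ that serves as the mean of $y_{k}$. The function names and the example values are ours.

```python
def edge_matrix(edges, nodes):
    """Row k has +1 in the column of the parent X_i and -1 in the column
    of the child X_j of edge E_k = (X_i, X_j); all other entries are 0."""
    col = {x: i for i, x in enumerate(nodes)}
    B = [[0] * len(nodes) for _ in edges]
    for k, (parent, child) in enumerate(edges):
        B[k][col[parent]] = 1
        B[k][col[child]] = -1
    return B

def expected_differences(B, phi):
    """Matrix-vector product B @ phi, i.e., the mean of each y_k."""
    return [sum(b * p for b, p in zip(row, phi)) for row in B]

nodes = ["tf1", "g1", "g2"]
edges = [("tf1", "g1"), ("tf1", "g2"), ("g1", "g2")]
B = edge_matrix(edges, nodes)
phi = [3, 2, 1]  # illustrative ordering scores: tf1 highest, g2 lowest
print(B)
print(expected_differences(B, phi))
```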

In cases when only a subset of E is considered, the matrix $\text{Diag}(v) B$ is used as the corresponding matrix representation, where $v \in \{0, 1\}^{|E|}$ is a binary selection vector and $\text{Diag}(v)$ denotes the corresponding diagonal matrix. Therefore, $y \sim \mathcal{N}(\text{Diag}(v) B ϕ, Γ^{-1})$ represents the pairwise ordering confidence about all the edges of interest, where $Γ$ is a hyperparameter precision matrix (so that $Γ^{-1}$ is the covariance). It should be pointed out that $P(ϕ \mid μ, Λ^{-1})$ is a conjugate prior of $P(y \mid \text{Diag}(v) B ϕ, Γ^{-1})$ as they are both Gaussian. Therefore, it can be easily verified that (Xiao et al, 2018):

where $B^{*} = Q B$ and $Q^{T} Q$ is the Cholesky factorization of $Γ$ . It can be observed that $Λ'$ quantifies the effect of the edge set of interest over the uncertainty of ordering $ϕ$ . Therefore, a straightforward idea to rank edges is by their contribution to the uncertainty reduction. To be more specific, the ranking is achieved by solving a semidefinite programming (SDP) problem proposed in Xiao et al (2018), which is defined as follows:

In the above formulation, $λ_{1}(\cdot)$ denotes the smallest nonzero eigenvalue of the matrix, $b_{i} \in \{-1, 0, 1\}^{1 \times |X|}$ denotes a row vector corresponding to the $i$-th row of $B^{*}$, and $τ$ is a scalar representing the size of a selected edge subset. The values of the elements in the resulting vector v help identify the top $τ$ edges in terms of uncertainty reduction. By solving the SDP repeatedly for $|E|$ times with $τ$ increasing from 1 to $|E|$, the growing ordering of selected edges implies a ranking of edges in E. It should be pointed out that v is relaxed from binary to continuous values to guarantee the convexity of the SDP. This relaxation leads to potential interpretation ambiguity of the solution, as the number of nonzero elements in v can be higher than $τ$; however, we can still select the top $τ$ edges based on the magnitudes of the values in v.

To construct uncertainty classes of BN models for TRN in TRIMER, ranked edges are assumed to comply with a distribution defined as follows:

$p(i) = \frac{2}{|E| (|E| + 1)} (|E| - \text{rank}(i)),$ (4)

where i denotes the index of an edge in E and $\text{rank}(i)$ denotes its rank in terms of uncertainty reduction. We then construct a BN uncertainty class in the following way: using the distribution defined above, we first draw multiple edge sample sets of the same size, denoted as $S = \{S_{l} \mid l \in [1, L]\}$. For each sampled set $S_{l}$, the best BN structure is identified by $G^{S_{l}} = \operatorname{argmax}_{G_{≺^{μ}}} \text{score}(G_{≺^{μ}}; D; S_{l})$, where $≺^{μ}$ is deduced from the numerical mean $μ$ of the ordering samples.
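A small sketch of rank-based edge sampling under Equation (4) is given below. Two points are our assumptions rather than statements from the text: ranks are taken to start at 0 for the most important edge, which is the convention under which the probabilities in Equation (4) sum to one, and each edge set is drawn without replacement by sequential weighted draws.

```python
import random

def rank_probabilities(ranks):
    """Equation (4) with ranks assumed to run from 0 (top edge) to |E| - 1."""
    m = len(ranks)
    norm = 2.0 / (m * (m + 1))
    return [norm * (m - r) for r in ranks]

def sample_edge_set(edges, ranks, size, rng):
    """Draw `size` distinct edges; higher-ranked edges are more likely."""
    probs = rank_probabilities(ranks)
    pool = list(range(len(edges)))
    chosen = []
    for _ in range(size):
        weights = [probs[i] for i in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        pool.remove(pick)
        chosen.append(edges[pick])
    return chosen

# Hypothetical ranked edge space of four edges.
edges = [("tf1", "g1"), ("tf1", "g2"), ("tf2", "g2"), ("tf2", "g3")]
ranks = [0, 1, 2, 3]
probs = rank_probabilities(ranks)
rng = random.Random(0)
edge_sets = [sample_edge_set(edges, ranks, size=2, rng=rng) for _ in range(5)]
print(probs)
print(edge_sets)
```

For four edges, the sampling probabilities are 0.4, 0.3, 0.2, and 0.1 (up to floating-point rounding). Each sampled set would then define the allowed edge space for finding the best structure consistent with $≺^{μ}$.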

3.5. TRIMER as a simulator

As described in Niu et al (2021), TRIMER can serve as a simulator of gene expression and metabolic flux data. Given a ground-truth BN model for TRN with appropriate conditional probability tables for each node in the BN, gene expression data sets can be simulated by drawing samples from the distribution described by the BN. Moreover, conditional probabilities with respect to TFs and target genes can be inferred from the BN and used for constructing regulatory flux constraints for genome-scale metabolic predictions when they are integrated with the available metabolic reaction network model of the organism under study. Adding these new constraints into the corresponding FBA formulation for the metabolic network, condition-dependent metabolic states of the organism can be simulated and treated as the ground truth.

For sensitivity analysis, we simulate gene expression data for BN uncertainty class construction and ground-truth metabolic fluxes to estimate the biomass for different TF-knockout Escherichia coli strains as described in Niu et al (2021). We investigate the model uncertainty by computing the corresponding Pearson correlation coefficients (PCCs) between ground-truth biomass fluxes and the predicted fluxes by TRIMER based on BN uncertainty classes, of which the BN structures may significantly deviate from the ground-truth model that simulates the data.
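For reference, the PCC used as the evaluation metric can be computed as in the sketch below. The flux values are made-up placeholders, not results from the experiments; each inner list stands for the biomass predictions of one BN in an uncertainty class across TF-knockout conditions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Placeholder ground-truth biomass fluxes for four TF knockouts.
ground_truth = [0.90, 0.70, 0.40, 0.20]
# Placeholder predictions from three BNs in one uncertainty class.
class_predictions = [
    [0.85, 0.72, 0.35, 0.25],
    [0.80, 0.60, 0.50, 0.30],
    [0.40, 0.70, 0.60, 0.20],
]
pccs = [pearson(ground_truth, pred) for pred in class_predictions]
print(pccs)
```

The spread of the resulting PCC values within a class is what the box plots in the Results section summarize.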

4. RESULTS

In this section, we present the experimental results based on two simulated data sets to demonstrate the effectiveness of the proposed regulatory order-based sensitivity analysis strategy for TRIMER.

4.1. Sensitivity analysis results

In our experiments, simulated ground-truth TRIMER models are used to generate gene expression data as well as metabolic flux data (Niu et al, 2021), which are used as the training data sets for BN learning and the ground-truth metabolic flux predictions for sensitivity analysis and performance evaluation. We focus on biomass prediction under multiple TF-knockouts while the proposed methods can be used for other metabolite yield predictions based on the problems of interest.

We here use the same two simulated TRIMER models for E. coli, with iAF1260 (King et al, 2016) as their genome-scale metabolic network model, as described in Niu et al (2021). One model is based on a small-scale TRN and the other is based on a large genome-scale TRN, as detailed in the following subsections. More detailed descriptions of the TRIMER models, data sources, software requirements, hardware setups, as well as run-time statistics of each TRIMER component can be found in Niu et al (2022, 2021). All the reported experiments are implemented on a PC with an Intel i7 processor and 16GB RAM.

4.1.1. Small-scale model sensitivity analysis

For the small-scale model, its corresponding TRN contains 50 nodes (12 TFs and 38 regulated genes) with 118 randomly generated edges. It is assumed that regulated genes cannot be the parents of TFs in the BN model. The ground-truth metabolic fluxes in this model are simulated for all the 12 TF single-knockout conditions. Besides, we generate a gene expression data set of 1000 samples from the ground-truth BN model. In light of experimental results reported in Niu et al (2021), a data set of 1000 training samples is believed to be adequate to guarantee reasonable order learning to achieve desired metabolic flux prediction performance.

Then we bootstrap the Tabu order-based search 100 times over the data set, obtaining 100 ordering samples, where a single run of the Tabu order-based search typically takes 5–8 minutes. The covariance matrix of ordering scores based on bootstrapped ordering samples is shown in Figure 3a. As shown in the figure, the variances corresponding to TFs with indices from 1 to 12 are much smaller than the ones for regulated genes indexed from 13 to 50. This is reasonable as the regulated genes are mostly downstream and their orderings can be more uncertain compared with TFs.

FIG. 3.

Estimated covariance matrices of topological ordering scores based on bootstrapped samples for (a) small-scale and (b) large-scale network models to help rank the pairwise TF-gene orderings corresponding to edges in the BN uncertainty class. Warmer shadings indicate higher uncertainty regarding the pairwise orderings; hence, specifying the corresponding edges may help significantly reduce the BN model uncertainty. The matrices are then used to inform the SDP formulation to help quantify the BN edge importance. BN, Bayesian network; SDP, semidefinite programming; TF, transcription factor.

Next, edges in the predefined edge space based on the available prior knowledge for the TRN are ranked by their contribution to uncertainty reduction during the ordering posterior covariance updates through the mathematical programming formulation detailed in the Materials and Methods section. As the edge space is relatively small in this small-scale example, the edge ranking can be completed within a few seconds. Ten BN uncertainty classes, each of which contains 10 graph structures consistent with the ordering numerical mean, are constructed from sampled edge sets of sizes ranging from 10% to the whole edge space.

Finally, the TRIMER model with constructed BN uncertainty classes is built and biomass predictions are made following the TRIMER pipeline. In our experiments, we calculate PCCs between predictions and simulated ground-truth fluxes to evaluate TRIMER's performance under a specific BN configuration in the uncertainty class. We have also performed t-tests between the PCC values of metabolic flux predictions obtained based on the uncertainty class covering the whole edge space and the corresponding PCC values from the other uncertainty classes. The corresponding p-values are calculated to show the statistical significance of performance change. The final metabolic prediction performance by TRIMER with different BN uncertainty classes is illustrated in Figure 4a for this experiment.
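The comparison between two sets of PCC values can be sketched as below. The text does not state which t-test variant was used; this sketch assumes Welch's unequal-variance test and computes only the test statistic and the degrees of freedom, since the Python standard library provides no Student-t distribution for the p-value. The PCC values are illustrative placeholders.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    (an assumed variant; the source only says 't-tests')."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Placeholder PCC values: class covering the whole edge space vs. a sparser class.
pcc_full = [0.91, 0.93, 0.90, 0.94, 0.92]
pcc_sparse = [0.70, 0.85, 0.60, 0.88, 0.75]
t_stat, dof = welch_t(pcc_full, pcc_sparse)
print(t_stat, dof)
```

With SciPy available, `scipy.stats.ttest_ind(pcc_full, pcc_sparse, equal_var=False)` returns the same statistic together with a two-sided p-value.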

FIG. 4.

Box plots of PCC between the metabolic prediction by TRIMER with constructed BN uncertainty classes and the ground truth. Results are shown for (a) small-scale and (b) large-scale BN uncertainty classes as a function of the number of sampled edges and the average size of the BNs in the uncertainty classes. For both small- and large-scale experiments, we can see that both the prediction performance and prediction sensitivity of the uncertainty classes considering >70% of the edge space do not have significant difference from the uncertainty class considering the whole edge space. The p-values of the t-tests based on PCC values between the last uncertainty class covering the whole edge space and each of the other uncertainty classes are also shown in the plot to illustrate the statistical significance of the corresponding performance change, where the cases with p-values $< 0.05$ and $< 0.01$ are marked with the symbols * and **, respectively. The metabolic prediction sensitivity, illustrated by the quantile bars in the plot, decreases in general as the additional edges included in the edge space are less critical to achieve robust predictions. It can be observed that despite high sensitivity, TRIMER's performance can still be relatively high when we enforce sparser BN structures. This demonstrates the effectiveness of the proposed SDP for edge ranking as the most important edges found by the SDP are assigned with the highest sampling probabilities for uncertainty class construction. When considering these edges in constructing the uncertainty classes, metabolic flux prediction performance and sensitivity are relatively stable. PCC, Pearson correlation coefficient.

Increasing the size of edge sample sets from 10% to the whole edge space corresponds to decreasing the uncertainty of BN classes by extending the edge space. As the edge space is extended by the top regulatory relationships in terms of BN topological ordering uncertainty reduction, BN classes of high uncertainty can still maintain essential edges.
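The growth of the sampled edge sets can be sketched as follows. This is an illustration only: the edge scores are hypothetical stand-ins for the SDP-based ranking, and the weighted sampling without replacement uses the Efraimidis-Spirakis key method, which is a technique swapped in here for concreteness rather than the sampling procedure of the TRIMER pipeline.

```python
import random


def sample_edges(scores, fraction, rng):
    """Draw a fraction of the edge space without replacement.

    scores: dict mapping an edge (regulator, target) to a positive importance
            score; a higher score means a higher chance of being sampled.
    Uses Efraimidis-Spirakis keys u**(1/w) and keeps the k largest keys.
    """
    k = max(1, round(fraction * len(scores)))
    keyed = [(rng.random() ** (1.0 / w), edge) for edge, w in scores.items()]
    keyed.sort(reverse=True)
    return [edge for _, edge in keyed[:k]]


# Hypothetical edge scores for a toy regulatory network.
toy_scores = {
    ("tf1", "g1"): 5.0,
    ("tf1", "g2"): 3.0,
    ("tf2", "g2"): 1.0,
    ("tf2", "g3"): 0.5,
    ("tf3", "g1"): 0.2,
}
rng = random.Random(0)
# Edge sets growing from a small fraction up to the whole edge space.
edge_sets = {f: sample_edges(toy_scores, f, rng) for f in (0.2, 0.6, 1.0)}
```

With a fraction of 1.0 the sampled set coincides with the whole edge space, which corresponds to the last uncertainty class in Figure 4.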

To further evaluate the prediction sensitivity of BN models for the small-scale TRIMER model, we conduct an experiment to investigate how its performance varies with the size of the training gene expression data set used to construct the BN uncertainty classes. We have generated five expression data sets with sizes ranging from 200 to 1000. For each data set, we evaluate the performance of TRIMER with the uncertainty classes constructed from the corresponding sampled edge sets, whose sizes are fixed to half of the complete edge space. The experimental results are depicted in Figure 5, where we provide the t-test p-values based on the corresponding PCC values between the last condition and each of the other conditions.

FIG. 5.

Box plots of PCC between the metabolic prediction by TRIMER and the ground truth. Results are shown for uncertainty classes constructed based on simulated gene expression data sets of different sizes. As statistical support, the p-values of the t-tests between the PCCs in the last condition and each PCC set of the other conditions are also shown in the plot to illustrate the statistical significance of the corresponding performance change, with p-values $< 0.05$ and $< 0.01$ marked with the symbols * and **, respectively. Note that the prediction performance significantly improves when we have 400 training gene expression data points. After that, the prediction performance increases slowly with the number of training data points.

Overall, with an increasing number of training samples for BN learning, both the average prediction accuracy, measured by PCC, and the sensitivity, shown in the quantile bar plot, improve in general. The prediction performance may not improve much on average when we have >400 training samples. However, when investigating prediction sensitivity to the inherent model uncertainty with our regulatory order-based uncertainty class construction, our results indicate that more training samples may be needed to guarantee the desired robust predictions. In our experiment, we observe that 1000 gene expression samples are required to achieve accurate and stable predictions.

We conduct another experiment to further investigate how flux predictions may be affected by noise in the gene expression data used for BN training. We randomly flip the corresponding ON/OFF states of 10%, 20%, …, 50% of the genes in a simulated data set of 1000 samples, resulting in five perturbed gene expression data sets at different noise levels. For each data set, we evaluate the performance of TRIMER with the uncertainty classes constructed as previously described, based on the topological ordering derived from the corresponding perturbed gene expression data. For this experiment, we fix the sampled edge set size to half of the complete edge space.
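The perturbation step can be illustrated with a short sketch. It assumes that the binarized expression data are stored as rows of 0/1 states (samples by genes) and that each state is flipped independently with probability $\rho$, which is one reading of the flipping probability used in Figure 6; the function name and toy data are hypothetical.

```python
import random


def flip_states(data, rho, rng):
    """Return a copy of binary expression data with states flipped at rate rho.

    data: list of samples, each a list of 0/1 gene states (OFF/ON).
    rho:  probability of flipping each individual state (assumed independent).
    """
    return [[1 - s if rng.random() < rho else s for s in row] for row in data]


# Toy data set: 4 samples of 5 genes (made-up values).
toy_data = [
    [1, 0, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
]
rng = random.Random(0)
# One perturbed data set per noise level, mirroring the five levels in the text.
perturbed = {rho: flip_states(toy_data, rho, rng) for rho in (0.1, 0.2, 0.3, 0.4, 0.5)}
```

Each perturbed data set would then be passed through the same ordering search and uncertainty class construction as the unperturbed one.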

The experimental results are shown in Figure 6, where t-tests are performed between the uncertainty class constructed from the nonperturbed data set and each of the other uncertainty classes constructed from data sets perturbed at different levels. From the plot, the performance declines significantly when >30% of gene expression states are flipped, with the corresponding p-value of 0.0018 showing the statistical significance of the performance change. As the gene expression noise is reflected in the uncertainty classes of learned BN models, the observed trends of flux predictions at different gene expression noise levels are similar to those in the sensitivity analysis that perturbs BN structures directly at different levels.

FIG. 6.

Box plots of PCC between the metabolic flux predictions by TRIMER and the ground truth. Results are shown for the uncertainty classes constructed based on gene expression data perturbed at different levels $\rho$, the flipping probability to perturb gene expression states. The p-values of the t-tests performed between the PCCs of the uncertainty class constructed from the nonperturbed data set and each PCC set of the other uncertainty classes from data sets perturbed at different levels are also shown in the plot to illustrate the statistical significance of the corresponding performance change, with p-values $< 0.05$ and $< 0.01$ marked with the symbols * and **, respectively. Note that the prediction performance drops significantly at 30%.

All these experiments with this small-scale ground-truth model have demonstrated that our topological ordering-based sensitivity analysis strategy can appropriately identify important edges that directly impact our objective of reliably predicting metabolic fluxes in the TRIMER pipeline. When needed, it can support OED in the TRIMER pipeline, where the BN can be calibrated and updated iteratively based on the proposed sensitivity analysis. In contrast, blind random edge sampling may not give rise to an informative performance plot.

4.1.2. Large-scale model sensitivity analysis

For the large-scale model, we consider a genome-scale TRN with 1509 edges randomly selected from the edge space containing 3704 edges in the annotated interaction list for E. coli in EcoMAC (Carrera et al, 2014). When constructing the BN uncertainty classes in this experiment, we focus only on an edge subspace comprising 1533 edges, which connect genes directly regulating the reactions involving biomass production in the iAF1260 metabolic network model. We obtain 100 ordering samples by bootstrapping the order search over the simulated gene expression data set of 1000 samples, as described in the small-scale model experiment.
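To make the bootstrapped ordering statistics concrete, the sketch below resamples a data set with replacement and summarizes a collection of ordering samples by the mean and covariance of each gene's position. It assumes that ordering uncertainty is summarized through gene positions in the sampled orderings, which is one plausible reading rather than a confirmed detail of the pipeline. The order search itself is not implemented here, and all function names are hypothetical.

```python
import random
from statistics import mean


def bootstrap_resample(data, rng):
    """Resample the rows of a data set with replacement (same size)."""
    return [rng.choice(data) for _ in data]


def position_stats(orderings, genes):
    """Mean position of each gene and covariance of positions across samples.

    orderings: list of topological ordering samples, each a list of gene names.
    Returns (mu, cov), where mu[g] is the mean rank of gene g and
    cov[(g, h)] is the sample covariance of the ranks of g and h.
    """
    pos = {g: [o.index(g) for o in orderings] for g in genes}
    mu = {g: mean(pos[g]) for g in genes}
    n = len(orderings)
    cov = {
        (g, h): sum((a - mu[g]) * (b - mu[h]) for a, b in zip(pos[g], pos[h])) / (n - 1)
        for g in genes
        for h in genes
    }
    return mu, cov


# Toy ordering samples, standing in for the output of a bootstrapped order search.
toy_orderings = [["a", "b", "c"], ["a", "c", "b"], ["a", "b", "c"]]
mu, cov = position_stats(toy_orderings, ["a", "b", "c"])
```

In this toy example, gene "a" always comes first and therefore has zero variance, whereas "b" and "c" swap places in one sample and are negatively correlated in position.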

In this set of experiments, the cyclic graph deduced from the interaction list is close to an acyclic graph, with only 20 SCCs containing more than two nodes, for which one round of bootstrapping of the Tabu order-based search can be completed within 1–2 minutes. The corresponding covariance matrix is shown in Figure 3b. To construct the BN uncertainty classes, we first fix the BN structure for the genes that are not associated with biomass-related reactions to the optimal structure consistent with the derived mean topological ordering. The run-time of edge ranking for this larger graph is ∼10 minutes. We then sample edge sets of different sizes from the focused edge subspace. The corresponding BN edge space for the uncertainty classes also grows from containing 10% of the focused edge subspace to the complete space under consideration. Figure 4b illustrates the performance under uncertainty.
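The SCC count quoted above can be illustrated with a compact version of Tarjan's algorithm (Tarjan, 1972). This is a sketch on a toy graph: the adjacency-list representation and function name are assumptions, and the recursive form shown here may need an iterative rewrite or a raised recursion limit for genome-scale graphs.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm on a directed graph given as {node: [successors]}."""
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(component)

    # Include nodes that appear only as targets of edges.
    nodes = set(graph) | {w for succ in graph.values() for w in succ}
    for v in sorted(nodes):
        if v not in index:
            visit(v)
    return sccs


# Toy regulatory graph with one 3-node cycle and one 2-node cycle.
toy_graph = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": ["d"], "f": []}
components = strongly_connected_components(toy_graph)
n_large = sum(1 for c in components if len(c) > 2)  # SCCs with more than two nodes
```

Components of size one correspond to genes that are not part of any regulatory cycle, so a graph dominated by such components is close to acyclic in the sense used above.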

We observe trends similar to those in the small-scale model sensitivity analysis. In general, both metabolic prediction performance and sensitivity improve with the growing edge space. Owing to the integration of the EcoMAC interaction list as prior knowledge for defining the BN edge space, we can achieve satisfactory prediction performance, PCC > 0.95, when we cover >50% of the focused edges. In contrast, to achieve stable predictions, we may need to cover >70% of the defined edge space. In summary, all the experimental results again verify the effectiveness of our proposed method in quantifying the metabolic flux prediction uncertainty with respect to BN structures and identifying important edges that may directly affect metabolic predictions.

5. CONCLUSION

From our experimental results, it can be observed that when the BN models deviate more from the optimal BN by missing highly ranked critical edges, the prediction accuracy, measured by the correlation between ground-truth biomass fluxes and the fluxes predicted by the perturbed BN models, does decrease. More critically, when constructing such uncertainty model classes and investigating prediction sensitivity, our results indicate that we may need better prior knowledge and more training data to achieve both accurate and stable predictions. Our sensitivity analyses also indicate that reliable uncertainty quantification may require more data. Although many existing works have reported valid performance in selected experiments, there may still be a risk of overfitting, with predictions that do not generalize easily to slightly perturbed systems.

Our topological ordering-based sensitivity analysis also helps identify the set of edges, for which the corresponding uncertainty reduction can significantly help model prediction and improve sensitivity. Such a capability can lead to new uncertainty quantification formulations, which may enable OED strategies for active model learning (Zhao et al, 2021a; Zhao et al, 2021b; Zhao et al, 2021c) and more robust intervention strategies in metabolic engineering, which we leave for future research.

Footnotes

AUTHORS' CONTRIBUTIONS

Methodology, investigation, software, writing—original draft, and writing—review and editing by P.N. Investigation, validation, and writing—review and editing by M.J.S. Conceptualization, methodology, and writing—review and editing by S.H. Conceptualization, methodology, investigation, formal analysis, writing—review and editing, and funding acquisition by B.-J.Y. Conceptualization, investigation, formal analysis, writing—review and editing, and funding acquisition by E.R.D. Conceptualization, investigation, formal analysis, writing—review and editing, resources, and funding acquisition by F.J.A. Conceptualization, investigation, validation, writing—review and editing, funding acquisition, resources, and supervision by I.B. Conceptualization, methodology, investigation, formal analysis, writing—original draft, writing—review and editing, funding acquisition, resources, and supervision by X.Q.

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no conflicting financial interests.

FUNDING INFORMATION

P.N. and X.Q. are partially supported by the National Science Foundation under grant CCF-1553281. This study has been supported by the DOE Joint Genome Institute by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research, through contract DE-AC02-05CH11231 between Lawrence Berkeley National Laboratory and the U.S. Department of Energy. The presented material is based upon the study supported by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under contract number DE-0012704. Publication fees are supported by the U.S. Department of Energy, Office of Science, RadBio program, under Award KP1601011/FWP CC121.

References

1. Apaydin, Xu, Zeng, et al. Robust mutant strain design by pessimistic optimization. BMC Genom, 2017; 18(6):677.

2. Apaydin, Zeng, Qian. A reliable alternative of OptKnock for desirable mutant microbial strains. In: 2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI). IEEE; 2016; pp. 573–576.

3. Balsa-Canto, Bandiera, Menolascina. Optimal experimental design for systems and synthetic biology using AMIGO2. Methods Mol Biol, 2021; 2229:221–239.

4. Barrett, Kim, et al. Systems biology as a foundation for genome-scale synthetic biology. Curr Opin Biotechnol, 2006; 17(5):488–492.

5. Bro, Regenberg, Förster, et al. In silico aided metabolic engineering of Saccharomyces cerevisiae for improved bioethanol production. Metab Eng, 2006; 8(2):102–111.

6. Burgard, Pharkya, Maranas. OptKnock: A bilevel programming framework for identifying gene knockout strategies for microbial strain optimization. Biotechnol Bioeng, 2003; 84(6):647–657.

7. Carrera, Estrela, Luo, et al. An integrative, multi-scale, genome-wide model reveals the phenotypic landscape of Escherichia coli. Mol Syst Biol, 2014; 10(7):735.

8. Chandrasekaran, Price. Probabilistic integrative modeling of genome-scale metabolic and regulatory networks in Escherichia coli and Mycobacterium tuberculosis. Proc Natl Acad Sci, 2010; 107(41):17845–17850; doi: 10.1073/pnas.1005139107

9. Covert, Palsson. Constraints-based models: Regulation of gene expression reduces the steady-state solution space. J Theor Biol, 2003; 221(3):309–325; doi: 10.1006/jtbi.2003.3071

10. Covert, Xiao, Chen, et al. Integrating metabolic, transcriptional regulatory and signal transduction models in Escherichia coli. Bioinformatics, 2008; 24(18):2044–2050; doi: 10.1093/bioinformatics/btn352

11. Edwards, Palsson BØ. The Escherichia coli MG1655 in silico metabolic genotype: Its definition, characteristics, and capabilities. Proc Natl Acad Sci, 2000; 97(10):5528–5533.

12. Esvelt, Wang. Genome-scale engineering for systems and synthetic biology. Mol Syst Biol, 2013; 9(1):641.

13. Friedman, Goldszmidt, Wyner. Data analysis with Bayesian networks: A bootstrap approach. In: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence; 1999; pp. 196–205.

14. Haro M-A, de Lorenzo. Metabolic engineering of bacteria for environmental applications: Construction of Pseudomonas strains for biodegradation of 2-chlorotoluene. J Biotechnol, 2001; 85(2):103–113.

15. James, Rego, Glover. Multistart tabu search and diversification strategies for the quadratic assignment problem. IEEE Trans Syst Man Cybern Part A Syst Humans, 2009; 39(3):579–596.

16. Jensen, Lutz, Papin. TIGER: Toolbox for integrating genome-scale metabolic models, expression data, and transcriptional regulatory networks. BMC Syst Biol, 2011; 5:147; doi: 10.1186/1752-0509-5-147

17. King, Lu, Dräger, et al. BiGG Models: A platform for integrating, standardizing and sharing genome-scale models. Nucleic Acids Res, 2016; 44(D1):D515–D522.

18. Lewis, Hixson, Conrad, et al. Omic data from evolved E. coli are consistent with computed optimal growth from genome-scale models. Mol Syst Biol, 2010; 6(1):390; doi: 10.1038/msb.2010.47

19. Lü, Sheahan, Fu. Metabolic engineering of algae for fourth generation biofuels production. Energy Environ Sci, 2011; 4(7):2451–2466.

20. Luengo, Garcia, Sandoval, et al. Bioplastics from microorganisms. Curr Opin Microbiol, 2003; 6(3):251–260.

21. Machado, Herrgård. Systematic evaluation of methods for integration of transcriptomic data into constraint-based models of metabolism. PLoS Comput Biol, 2014; 10(4):e1003580; doi: 10.1371/journal.pcbi.1003580

22. Mahadevan, Schilling. The effects of alternate optimal solutions in constraint-based genome-scale metabolic models. Metab Eng, 2003; 5(4):264–276; doi: 10.1016/j.ymben.2003.09.002

23. Motamedian, Mohammadi, Shojaosadati, et al. TRFBA: An algorithm to integrate genome-scale metabolic and transcriptional regulatory networks with incorporation of expression data. Bioinformatics, 2017; 33(7):1057–1063.

24. Niu, Soto, Yoon B-J, et al. TRIMER: Transcription regulation integrated with metabolic regulation. iScience, 2021; 24(11):103218.

25. Niu, Soto, Yoon B-J, et al. Protocol for condition-dependent metabolite yield prediction using the TRIMER pipeline. STAR Protoc, 2022; 3(1):101184.

26. Ohta, Beall, Mejia, et al. Metabolic engineering of Klebsiella oxytoca M5A1 for ethanol production from xylose and glucose. Appl Environ Microbiol, 1991; 57(10):2810–2815.

27. Palsson. Systems Biology. Cambridge University Press; 2015.

28. Reed. Genome-scale metabolic modeling and its application to microbial communities. In: National Academies of Sciences, Engineering, and Medicine. 2017. The Chemistry of Microbiomes: Proceedings of a Seminar Series. Washington, DC: The National Academies Press; doi: https://doi.org/10.17226/24751.

29. Ren, Zeng, Qian. Adaptive bi-level programming for optimal gene knockouts for targeted overproduction under phenotypic constraints. BMC Bioinform, 2013; 14(S2):S17.

30. Segre, Vitkup, Church. Analysis of optimality in natural and perturbed metabolic networks. Proc Natl Acad Sci, 2002; 99(23):15112–15117.

31. Shlomi, Berkman, Ruppin. Regulatory on/off minimization of metabolic flux changes after genetic perturbations. Proc Natl Acad Sci, 2005; 102(21):7695–7700.

32. Shlomi, Eisenberg, Sharan, et al. A genome-scale computational study of the interplay between transcriptional regulation and metabolism. Mol Syst Biol, 2007; 3(1):101; doi: 10.1038/msb4100141

33. Tarjan. Depth-first search and linear graph algorithms. SIAM J Comput, 1972; 1(2):146–160.

34. Teyssier, Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. arXiv preprint, 2012; arXiv:1207.1429.

35. Varma, Palsson BØ. Metabolic flux balancing: Basic concepts, scientific and practical use. Biotechnology, 1994; 12(10):994–998.

36. Xiao, Jin, Liu, et al. Optimal expert knowledge elicitation for Bayesian network structure identification. IEEE Trans Autom Sci Eng, 2018; 15(3):1163–1177.

37. , Blair. Integration of probabilistic regulatory networks into constraint-based models of metabolism with applications to Alzheimer's disease. BMC Bioinform, 2019; 20(1):386.

38. Zhao, Dougherty, Yoon B-J, et al. Efficient active learning for Gaussian process classification by error reduction. In: 35th International Conference on Neural Information Processing Systems (NeurIPS); 2021a.

39. Zhao, Dougherty, Yoon B-J, et al. Bayesian active learning by soft mean objective cost of uncertainty. In: 24th International Conference on Artificial Intelligence and Statistics (AISTATS); 2021b.

40. Zhao, Dougherty, Yoon B-J, et al. Uncertainty-aware active learning for optimal Bayesian classifier. In: 9th International Conference on Learning Representations (ICLR); 2021c.

41. Zhao, Qian, Yoon B-J, et al. Model-based robust filtering and experimental design for stochastic differential equation systems. IEEE Trans Signal Process, 2020; 68:3849–3859.