Metabolic Pathway Prediction Using Non-Negative Matrix Factorization with Improved Precision

Abstract

Machine learning provides a probabilistic framework for metabolic pathway inference from genomic sequence information at different levels of complexity and completion. However, several challenges, including pathway feature engineering, multiple mapping of enzymatic reactions, and emergent or distributed metabolism within populations or communities of cells, can limit prediction performance. In this article, we present triUMPF (triple non-negative matrix factorization [NMF] with community detection for metabolic pathway inference), which combines three stages of NMF to capture myriad relationships between enzymes and pathways within a graph network. This is followed by community detection to extract a higher-order structure based on the clustering of vertices that share similar statistical properties. We evaluated triUMPF performance by using experimental datasets manifesting diverse multi-label properties, including Tier 1 genomes from the BioCyc collection of organismal Pathway/Genome Databases and low-complexity microbial communities. Resulting performance metrics equaled or exceeded those of other prediction methods on organismal genomes, with improved precision on multi-organismal datasets.

1. Introduction

Pathway reconstruction from genomic sequence information is an essential step in describing the metabolic potential of cells at the individual, population, and community levels of biological organization (Konwar et al., 2013; Hanson et al., 2014; Basher et al., 2020). Resulting pathway representations provide a foundation for defining regulatory processes, modeling metabolite flux, and engineering cells and cellular consortia for defined process outcomes (Hahn et al., 2016; Lawson et al., 2019). The integral nature of the pathway prediction problem has prompted both gene-centric approaches (e.g., mapping annotated proteins onto known pathways by sequence homology to a reference database) and heuristic or rule-based pathway-centric approaches, including PathoLogic (Karp et al., 2016) and MinPath (Ye and Doak, 2009). In parallel, the development of trusted sources of curated metabolic pathway information, including the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2017) and MetaCyc (Caspi et al., 2016b), provides training data for the design of more flexible machine learning (ML) algorithms for pathway inference. Although ML approaches have been adopted widely in metabolomics research (Carbonell et al., 2018; Toubiana et al., 2019), they have gained less traction when applied to predicting pathways directly from annotated gene lists.

Dale et al. (2010) conducted the first in-depth exploration of ML approaches for pathway prediction using Tier 1 (T1) organismal Pathway/Genome Databases (PGDBs) (Caspi et al., 2016a) from the BioCyc collection randomly divided into training and test sets. Features were developed based on rule-sets used by the PathoLogic algorithm in Pathway Tools to construct PGDBs (Karp et al., 2016). Resulting performance metrics indicated that standard ML approaches rivaled PathoLogic performance with the added benefit of probability scores (Dale et al., 2010). More recently, Basher et al. (2020) developed multi-label based on logistic regression for pathway prediction (mlLGPR), a multi-label classification approach that uses logistic regression and feature vectors inspired by the work of Dale et al. (2010) to predict metabolic pathways from genomic sequence information at different levels of complexity and completion.

Although mlLGPR performed effectively on organismal genomes, pathway prediction outcomes for multi-organismal datasets were suboptimal, due in part to missing or noisy feature information. In an effort to solve this problem, Basher and Hallam (2020) evaluated the use of representational learning methods to learn a neural embedding-based low-dimensional space of metabolic features based on a three-layered network architecture consisting of compounds, enzymes, and pathways. Learned feature vectors improved pathway prediction performance on organismal genomes and motivated the use of graphical models for multi-organismal feature engineering.

Here, we describe triple non-negative matrix factorization (NMF) with community detection for metabolic pathway inference (triUMPF), which combines three stages of NMF to capture relationships between enzymes and pathways within a network (Fu et al., 2019), followed by community detection to extract a higher-order network structure (Fortunato and Hric, 2016). NMF is a data reduction and exploration method in which the original and factorized matrices have non-negative elements, and the factorized matrices have reduced ranks or features (Fu et al., 2019). In contrast to other dimension reduction methods, such as principal component analysis (Bro and Smilde, 2014), NMF both reduces the number of features and preserves information needed to reconstruct the original data (Yang and Michailidis, 2015). This has important implications for noise-robust feature extraction from sparse matrices, including datasets associated with gene expression analysis and pathway prediction (Yang and Michailidis, 2015).

For pathway prediction, triUMPF uses three graphs, one representing associations between pathways and enzymes (P2E) indicated by enzyme commission (EC) numbers (Bairoch, 2000), one representing interactions between enzymes, and another representing interactions between pathways. The two interaction graphs adopt the subnetworks concept introduced in BiomeNet (Shafiei et al., 2014) and MetaNetSim (Jiao et al., 2013), where a subnetwork is a linked series of connected nodes (e.g., reactions and pathways). In the literature, a subnetwork is commonly referred to as a community (Rossi et al., 2020), which defines a set of densely connected nodes within a subnetwork. It is important to emphasize that unless otherwise indicated, the use of the term “community” in this work refers to a subnetwork community based on statistical properties of a network rather than a community of organisms. Community detection is performed on both interaction graphs (pathways and enzymes) to identify subnetworks among pathways.

We evaluated triUMPF's prediction performance in relation to other methods, including MinPath, PathoLogic, and mlLGPR, on a set of T1 PGDBs; low-complexity microbial communities, including symbiont genomes encoding distributed metabolic pathways for amino acid biosynthesis (McCutcheon and Von Dohlen, 2011); genomes used in the Critical Assessment of Metagenome Interpretation (CAMI) initiative (Sczyrba et al., 2017); and whole genome shotgun sequences from the Hawaii Ocean Time Series (HOTS) (Stewart et al., 2011). The evaluation followed information hierarchy-based benchmarks initially developed for mlLGPR, enabling a more robust comparison between pathway prediction methods (Basher et al., 2020).

2. Methods

In this section, we provide a general description of the triUMPF components presented in Figure 1. First, MetaCyc is used to: (i) extract three association matrices (Fig. 1a), one representing pathway–EC associations (P2E) indicated by EC numbers (McDonald et al., 2009), one representing interactions between enzymes (E2E), and another representing interactions between pathways (P2P); and (ii) automatically generate features corresponding to pathways and enzymes (or ECs) using pathway2vec (Basher and Hallam, 2020) (Fig. 1b). Then, triUMPF is trained in two phases: (i) decomposition of the pathway–EC association matrix (Fig. 1c), and (ii) subnetwork or community reconstruction while simultaneously learning optimal multi-label pathway parameters (Fig. 1d–f). We discuss these two phases below; the analytical expressions of triUMPF are given in Sections 5.1–5.3.

FIG. 1.

A workflow diagram showing the proposed triUMPF method. Initially, triUMPF takes the P2E information (a) to produce several low-rank matrices (c) while simultaneously detecting pathway and EC communities (d) given two interaction matrices, corresponding to P2P and E2E (a). For both steps (c, d), pathway and EC features obtained from the pathway2vec package (b) are utilized. Afterward, triUMPF iterates between updating community parameters (d) and optimizing multi-label parameters (e) with the use of training data (f). Once training is complete, the learned model (g) can be used to predict a set of pathways (i, j) from an organismal genome or multi-organismal dataset (h). E2E, EC-EC; EC, enzyme commission; P2E, pathway-EC association; P2P, pathway-pathway; triUMPF, triple non-negative matrix factorization with community detection for metabolic pathway inference.

2.1. Decomposing the pathway EC association matrix

Inspired by the idea of NMF, we decompose the P2E association matrix to recover low-dimensional latent factor matrices (Fu et al., 2019). Unlike previous applications of NMF to biological data (Natarajan and Dhillon, 2014), triUMPF incorporates constraints into the matrix decomposition process. Formally, let $M \in ℤ_{\geq 0}^{t \times r}$ be a non-negative matrix, where t is the number of pathways and r is the number of enzymatic reactions. Each row in $M$ corresponds to a pathway and each column represents an EC, such that $M_{i, j} = 1$ if EC j is in pathway i and 0 otherwise. Given $M$, standard NMF decomposes this matrix into two non-negative low-rank matrices, that is, $M \approx W H^{T}$, where $W \in ℝ^{t \times k}$ stores the latent factors for pathways, $H \in ℝ^{r \times k}$ stores the latent factors associated with ECs, and $k \in ℤ_{\geq 1}$ with $k ≪ t, r$. However, triUMPF extends standard NMF by leveraging features, obtained from pathway2vec (Basher and Hallam, 2020), that encode two kinds of interactions: (i) within ECs or pathways and (ii) between pathways and ECs. For more details about this step, see Section 5.2.1.
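To make the standard factorization $M \approx W H^{T}$ concrete, the following is a minimal sketch of plain NMF using the classical multiplicative update rules. It is not the constrained triUMPF decomposition of Section 5.2.1; the function name `nmf`, the toy matrix, and all parameter values are illustrative assumptions.

```python
import numpy as np

def nmf(M, k, n_iter=200, eps=1e-9, seed=0):
    """Plain NMF, M ~ W @ H.T, via multiplicative updates (illustrative only)."""
    rng = np.random.default_rng(seed)
    t, r = M.shape
    W = rng.random((t, k)) + eps   # latent factors for pathways (t x k)
    H = rng.random((r, k)) + eps   # latent factors for ECs (r x k)
    for _ in range(n_iter):
        # Element-wise updates keep W and H non-negative at every step
        W *= (M @ H) / (W @ (H.T @ H) + eps)
        H *= (M.T @ W) / (H @ (W.T @ W) + eps)
    return W, H

# Toy P2E matrix (assumed data): 4 pathways (rows) x 5 ECs (columns)
M = np.array([[1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 1, 1, 1]], dtype=float)

W, H = nmf(M, k=2)
error = np.linalg.norm(M - W @ H.T)   # Frobenius reconstruction error
```

With $k = 2$ the two latent factors roughly separate the two groups of pathways that share ECs, which is the kind of structure the decomposition phase is meant to expose.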

2.2. Community reconstruction and multi-label learning

The community detection problem (Li et al., 2019; Rossi et al., 2020) is the task of discovering distinct groups of nodes that are densely connected. During this phase, triUMPF performs community detection to guide the learning process for pathways using binary P2P ( $A \in ℤ_{\geq 0}^{t \times t}$ ) and E2E ( $B \in ℤ_{\geq 0}^{r \times r}$ ) association matrices, where each entry in these matrices is a binary value indicating an interaction among corresponding entities. However, $A$ and $B$ capture pairwise first-order proximity among their related entities; consequently, they are inadequate to fully characterize distant relationships among pathways or ECs (Rossi et al., 2020). Therefore, triUMPF utilizes higher-order proximity by using the following formula (Li et al., 2019):

$A^{prox} = \sum_{i = 1}^{l_{p}} ω_{i} A^{i}, \quad B^{prox} = \sum_{i = 1}^{l_{e}} γ_{i} B^{i}$ (1)

where $A^{prox}$ and $B^{prox}$ are polynomials of order $l_{p} \in ℤ_{> 0}$ and $l_{e} \in ℤ_{> 0}$, respectively, and $ω_{i} \in ℝ_{> 0}$ and $γ_{i} \in ℝ_{> 0}$ are the weights associated with each term. Using these higher-order matrices, triUMPF applies two NMFs to recover communities (Section 5.2.2). Then, triUMPF uses $W$ and $H$ from the decomposition phase (Section 2.1) and the detected communities to optimize multi-label pathway parameters in an iterative process (Section 5.2.3) until the maximum number of allowed iterations is reached. At the end, the trained model can be used to perform pathway prediction from an organismal genome or multi-organismal dataset with high precision due to constraints embedded in the P2E, P2P, and E2E association matrices.
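The higher-order proximity used in this phase can be sketched as a weighted sum of powers of an adjacency matrix, assuming the polynomial form given in Section 5.2.2. The helper name `higher_order_proximity`, the toy graph, and the weights below are illustrative assumptions rather than values used by triUMPF.

```python
import numpy as np

def higher_order_proximity(A, weights):
    """Return sum_i weights[i-1] * A^i for i = 1..len(weights)."""
    A = np.asarray(A, dtype=float)
    prox = np.zeros_like(A)
    power = np.eye(A.shape[0])
    for w in weights:
        power = power @ A      # A^i for the current order i
        prox += w * power
    return prox

# Toy path graph 0 - 1 - 2: nodes 0 and 2 have no direct (first-order) link
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])

# Order 2 with assumed weights: 1.0 for A and 0.5 for A^2
A_prox = higher_order_proximity(A, weights=[1.0, 0.5])
```

In this toy case the entry for nodes 0 and 2 becomes non-zero only through the second-order term, which illustrates how distant relationships missed by the first-order matrix enter the proximity matrix.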

3. Results

We evaluated triUMPF performance across multiple datasets spanning the genomic information hierarchy (Basher et al., 2020): (i) T1 golden datasets consisting of EcoCyc, HumanCyc, AraCyc, YeastCyc, LeishCyc, and TrypanoCyc; (ii) three Escherichia coli genomes composed of E. coli K-12 substr. MG1655 (TAX-511145), uropathogenic E. coli str. CFT073 (TAX-199310), and enterohemorrhagic E. coli O157:H7 str. EDL933 (TAX-155864); (iii) BioCyc (v20.5 T2 & 3) (Caspi et al., 2016a) composed of 9255 PGDBs with 1463 pathways constructed using Pathway Tools v21 (Karp et al., 2016); (iv) symbiont genomes of Moranella (GenBank NC-015735) and Tremblaya (GenBank NC-015736) encoding distributed metabolic pathways for amino acid biosynthesis (McCutcheon and Von Dohlen, 2011); (v) the CAMI initiative low-complexity dataset consisting of 40 genomes (Sczyrba et al., 2017); and (vi) whole genome shotgun sequences from the HOTS at 25 m, 75 m, 110 m (sunlit), and 500 m (dark) ocean depth intervals (Stewart et al., 2011). We used BioCyc v20.5 to train triUMPF, and the remaining datasets were used to report performance results. Because BioCyc v20.5 contains only 1463 trainable pathways, we applied pathway2vec with the RUST-norm (or “crt”) configuration to improve prediction (Section 5.4.3). Statistics for these datasets are summarized in Table 4.

For comparative analysis, triUMPF's performance on T1 golden datasets was compared with three pathway prediction methods: (i) MinPath version 1.2 (Ye and Doak, 2009), which uses integer programming to recover a conserved set of pathways from a list of enzymatic reactions; (ii) PathoLogic version 21 (Karp et al., 2016), which is a symbolic approach that uses a set of manually curated rules to predict pathways; and (iii) mlLGPR, which uses supervised multi-label classification and rich feature information to predict pathways from a list of enzymatic reactions (Basher et al., 2020). In addition to testing on T1 golden datasets, triUMPF performance was compared with PathoLogic on three E. coli genomes and with PathoLogic and mlLGPR on mealybug symbionts, CAMI low-complexity, and HOTS multi-organismal datasets. The following metrics were used to report the performance of pathway prediction algorithms: average precision, average recall, average F1 score (F1), and Hamming loss, as described in Basher et al. (2020). For experimental settings and additional tests, see Sections 5.4 and 5.5.
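As a rough illustration of the four metrics named above, the sketch below computes Hamming loss together with precision, recall, and F1 averaged over examples for binary label matrices. The example-wise averaging is an assumption made here for illustration; the exact definitions used in this work are those given in Basher et al. (2020). The function name and toy label matrices are hypothetical.

```python
import numpy as np

def multilabel_metrics(Y_true, Y_pred):
    """Hamming loss and example-averaged precision, recall, F1 for 0/1 labels."""
    Y_true = np.asarray(Y_true, dtype=bool)
    Y_pred = np.asarray(Y_pred, dtype=bool)
    tp = (Y_true & Y_pred).sum(axis=1).astype(float)  # true positives per example
    n_pred = Y_pred.sum(axis=1)                       # predicted pathways per example
    n_true = Y_true.sum(axis=1)                       # expected pathways per example
    # A score is defined as 0 when its denominator is 0 (assumed convention)
    precision = np.divide(tp, n_pred, out=np.zeros_like(tp), where=n_pred > 0)
    recall = np.divide(tp, n_true, out=np.zeros_like(tp), where=n_true > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(tp), where=denom > 0)
    return {
        "hamming_loss": float((Y_true != Y_pred).mean()),
        "precision": float(precision.mean()),
        "recall": float(recall.mean()),
        "f1": float(f1.mean()),
    }

# Two toy examples (rows) over four pathway labels (columns)
Y_true = [[1, 1, 0, 0], [0, 1, 1, 0]]
Y_pred = [[1, 0, 0, 0], [0, 1, 1, 1]]
scores = multilabel_metrics(Y_true, Y_pred)
```

Note that under this averaging the mean F1 is not the harmonic mean of the mean precision and mean recall, because F1 is computed per example before averaging.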

3.1. T1 golden data

As shown in Table 1, triUMPF achieved competitive average precision relative to the other methods, with its best performance on EcoCyc (0.8662). However, it underperformed on HumanCyc and AraCyc with respect to average F1 score, yielding 0.4703 and 0.4775, respectively (Table 5). Because BioCyc v20.5 contains only 1463 observed pathway labels (a subset of the 2526 MetaCyc pathways; see Section 3), triUMPF trained on these data (using features from pathway2vec; Basher and Hallam, 2020) cannot infer pathways outside the trainable set. This limitation translated into low average F1 scores on HumanCyc and AraCyc. A possible remedy is to train triUMPF on additional PGDBs containing more pathways. However, this would require building many more PGDBs from organismal genomes or using multiple versions of BioCyc data. A detailed analysis is left for future work.

Table 1.

Average Precision of Each Compared Algorithm on Six T1 Golden Datasets

Methods	EcoCyc	HumanCyc	AraCyc	YeastCyc	LeishCyc	TrypanoCyc
PathoLogic	$0.7230$	$\mathbf{0.6695}$	$0.7011$	$0.7194$	$\mathbf{0.4803}$	$\mathbf{0.5480}$
MinPath	$0.3490$	$0.3004$	$0.3806$	$0.2675$	$0.1758$	$0.2129$
mlLGPR	$0.6187$	$0.6686$	$0.7372$	$0.6480$	$0.4731$	$0.5455$
triUMPF	$\mathbf{0.8662}$	$0.6080$	$\mathbf{0.7377}$	$\mathbf{0.7273}$	$0.4161$	$0.4561$

Values in boldface represent the best performance score.

3.2. Three E. coli data

It should be recalled that community detection (Section 2.2) was used to guide the multi-label learning process. To demonstrate the influence of communities on pathway prediction, we compared pathways predicted for the T1 gold standard E. coli K-12 substr. MG1655 (TAX-511145), henceforth referred to as MG1655, using PathoLogic and triUMPF. Figure 8a shows the results, where both methods inferred 202 true-positive pathways (green-colored) in common out of 307 expected true-positive pathways (using EcoCyc as a common frame of reference). In addition, PathoLogic uniquely predicted 39 (magenta-colored) true-positive pathways whereas triUMPF uniquely predicted 16 true-positive pathways (purple-colored). This difference can be explained by the use of taxonomic pruning in PathoLogic, which improves the recovery of taxonomically constrained pathways and limits false-positive identification. When taxonomic pruning was enabled, PathoLogic inferred 79 false-positive pathways; when pruning was disabled, it inferred more than 170 false-positive pathways. In contrast, triUMPF, which does not use taxonomic feature information, inferred 84 false-positive pathways. This improvement over PathoLogic with pruning disabled reinforces the idea that pathway communities improve the precision of pathway prediction with limited impact on overall recall. Based on these results, it may be possible to train triUMPF on subsets of organismal genomes, resulting in more constrained pathway communities for pangenome analysis.

To further evaluate triUMPF performance on closely related organismal genomes, we performed pathway prediction on E. coli str. CFT073 (TAX-199310) and E. coli O157:H7 str. EDL933 (TAX-155864), and we compared results with the MG1655 reference strain (Welch et al., 2002). CFT073 and EDL933 are pathogens infecting the human urinary and gastrointestinal tracts, respectively. Previously, Welch et al. (2002) described extensive genomic mosaicism between these strains and MG1655, defining a core backbone of conserved metabolic genes interspersed with genomic islands encoding common pathogenic or niche-defining traits. Neither the CFT073 nor the EDL933 genome is represented in the BioCyc collection of organismal PGDBs. A total of 335 and 319 unique pathways were predicted by PathoLogic and triUMPF, respectively. The resulting pathway lists were used to perform a set-difference analysis with MG1655 (Fig. 2). Both methods predicted more than 200 pathways encoded by all 3 strains, including core pathways such as the TCA cycle (Fig. 8b, c). CFT073 and EDL933 were predicted to share a single common pathway (TCA cycle IV [2-oxoglutarate decarboxylase]) by triUMPF.

FIG. 2.

A three-way set difference analysis of pathways predicted for Escherichia coli K-12 substr. MG1655 (TAX-511145), E. coli str. CFT073 (TAX-199310), and E. coli O157:H7 str. EDL933 (TAX-155864) using (a) PathoLogic (taxonomic pruning) and (b) triUMPF.

However, this pathway variant has not been previously identified in E. coli and is likely a false-positive prediction based on a recognized taxonomic range. Both PathoLogic and triUMPF predicted the aerobactin biosynthesis pathway involved in siderophore production in CFT073 consistent with previous observations (Welch et al., 2002). Similarly, four pathways (e.g., l-isoleucine biosynthesis III and GDP-d-perosamine biosynthesis) unique to EDL933 were inferred by both methods.

Given the lack of cross-validation standards for CFT073 and EDL933, we were unable to determine which method inferred fewer false-positives across the complete set of predicted pathways. To constrain this problem on a subset of the data, we applied GapMind (Price et al., 2018) to analyze amino acid biosynthesis pathways encoded in MG1655, CFT073, and EDL933 genomes. GapMind is a web-based application developed for annotating amino acid biosynthesis pathways in prokaryotic microorganisms (bacteria and archaea), where each reconstructed pathway is supported by a confidence level. After excluding pathways that were not incorporated in the training set, a total of 102 pathways were identified across the 3 strains encompassing 18 amino acid biosynthesis pathways and 27 pathway variants with high confidence (Table 7). PathoLogic inferred 49 pathways identified across the 3 strains encompassing 15 amino acid biosynthesis pathways and 17 pathway variants, whereas triUMPF inferred 54 pathways identified across the 3 strains encompassing 16 amino acid biosynthesis pathways and 19 pathway variants including l-methionine biosynthesis in MG1655, CFT073, and EDL933 that were not predicted by PathoLogic. Neither method was able to predict l-tyrosine biosynthesis I (Fig. 10).
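The three-way set-difference analysis summarized in Figure 2 amounts to partitioning three predicted pathway sets into the seven regions of a Venn diagram. A minimal sketch follows; the function name and the placeholder pathway identifiers (`P1` to `P6`) are hypothetical and do not correspond to actual predictions.

```python
def three_way_partition(a, b, c):
    """Split three sets into the seven regions of a three-way Venn diagram."""
    return {
        "all_three": a & b & c,
        "a_b_only": (a & b) - c,
        "a_c_only": (a & c) - b,
        "b_c_only": (b & c) - a,
        "a_only": a - b - c,
        "b_only": b - a - c,
        "c_only": c - a - b,
    }

# Placeholder pathway sets for three strains (illustrative identifiers only)
mg1655 = {"P1", "P2", "P3"}
cft073 = {"P1", "P2", "P4", "P5"}
edl933 = {"P1", "P4", "P6"}

regions = three_way_partition(mg1655, cft073, edl933)
counts = {name: len(members) for name, members in regions.items()}
```

Here `all_three` plays the role of the shared core, and `b_c_only` corresponds to pathways shared by the two pathogenic strains but absent from the reference strain.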

3.3. Mealybug symbionts data

To evaluate triUMPF performance on distributed metabolic pathways, we used the reduced genomes of Moranella and Tremblaya (McCutcheon and Von Dohlen, 2011). Collectively, the two symbiont genomes encode intact biosynthesis pathways for nine essential amino acids. PathoLogic, mlLGPR, and triUMPF were used to predict pathways on individual symbiont genomes and a composite genome consisting of both, and resulting amino acid biosynthesis pathway distributions were determined (Fig. 3). Both triUMPF and PathoLogic predicted six of the expected amino acid biosynthesis pathways on the composite genome, whereas mlLGPR predicted eight pathways. The pathway for phenylalanine biosynthesis (l-phenylalanine biosynthesis I) was excluded from analysis, because the associated genes were reported to be missing during the open reading frame (ORF) prediction process. False positives were predicted for individual symbiont genomes in Moranella and Tremblaya using both methods, although pathway coverage was reduced in relation to the composite genome.

FIG. 3.

Comparative study of predicted pathways for symbiotic data between PathoLogic, mlLGPR, and triUMPF. The size of circles corresponds to the associated coverage information. mlLGPR, multi-label based on logistic regression for pathway prediction.

3.4. CAMI and HOTS data

To evaluate triUMPF's performance on more complex multi-organismal genomes, we used the CAMI low-complexity (Sczyrba et al., 2017) and HOTS datasets (Stewart et al., 2011), comparing resulting pathway predictions with both PathoLogic and mlLGPR. For CAMI low complexity, triUMPF achieved an average F1 score of 0.5864 in comparison to 0.4866 for mlLGPR, which is trained with more than 2500 labeled pathways (Table 2). Similar results were obtained for HOTS (Section 5.5.4). Among a subset of 180 selected water column pathways, PathoLogic and triUMPF predicted a total of 54 and 58 pathways, respectively, whereas mlLGPR inferred 62. From a real-world perspective, none of the methods predicted pathways for photosynthesis light reaction or pyruvate fermentation to (S)-acetoin, although both are expected to be prevalent in the water column. The absence of specific ECs associated with these pathways may limit rule-based or ML prediction. Indeed, closer inspection revealed that the enzyme catabolic acetolactate synthase was missing from the pyruvate fermentation to (S)-acetoin pathway; this enzyme is an essential rule encoded in PathoLogic and is represented as a feature in mlLGPR. Although this pathway was indexed to a community, triUMPF also did not predict its presence, constituting a false negative.

Table 2.

Predictive Performance of mlLGPR and triUMPF on Critical Assessment of Metagenome Interpretation Low Complexity Data

Metric	mlLGPR	triUMPF
Hamming loss (↓)	$0.0975$	$\mathbf{0.0436}$
Average precision score (↑)	$0.3570$	$\mathbf{0.7027}$
Average recall score (↑)	$\mathbf{0.7827}$	$0.5101$
Average F1 score (↑)	$0.4866$	$\mathbf{0.5864}$

For each performance metric, “↓” indicates the smaller score is better whereas “↑” indicates the higher score is better.

Values in boldface represent the best performance score.

4. Discussion and Conclusion

In this article, we introduced a novel ML approach for metabolic pathway inference that combines three stages of NMF to capture relationships between enzymes and pathways within a network, followed by community detection to extract higher-order network structure. First, a Pathway-EC association matrix ($M$), obtained from MetaCyc, is decomposed by using the NMF technique to learn a constrained form of the pathway and EC factors, capturing the microscopic structure of $M$. Then, we obtain the community structure (or mesoscopic structure) jointly from both the input datasets and two interaction matrices, pathway–pathway interaction and EC–EC interaction. Finally, the consensus relationships between the community structure and data, and between the learned factors from $M$ and the pathway label coefficients, are exploited to efficiently optimize metabolic pathway parameters.

We evaluated triUMPF performance by using a corpus of experimental datasets manifesting diverse multi-label properties, comparing pathway prediction outcomes with other prediction methods, including PathoLogic (Karp et al., 2016) and mlLGPR (Basher et al., 2020). During benchmarking, we realized that the BioCyc collection suffers from a class imbalance problem (He and Garcia, 2009) where some pathways occur infrequently across PGDBs. This results in a significant sensitivity loss on T1 golden data, where triUMPF tended to predict more frequently observed pathways while missing more infrequent pathways. One potential approach to solve this class imbalance problem is subsampling the most informative PGDBs for training, thereby reducing false positives (Lakshminarayanan et al., 2017). Despite the observed class imbalance problem, triUMPF improved pathway prediction precision without the need for taxonomic rules or EC features to constrain metabolic potential. From an ML perspective, this is a promising outcome considering that triUMPF was trained on a reduced number of pathways relative to mlLGPR. Future development efforts will explore subsampling approaches to improve sensitivity and the use of constrained taxonomic groups for pangenome and multi-organismal genome pathway inference.

5. Appendices

5.1. Appendix A1: definitions and problem formulation

Here, a vector is a column vector by default and is represented by a boldface lowercase letter (e.g., $x$), whereas matrices are represented by boldface uppercase letters (e.g., $X$). $X_{i}$ denotes the ith row of $X$, and $X_{i, j}$ denotes the $(i, j)$th entry of $X$, whereas, for a vector, $x_{i}$ denotes the ith entry of $x$. The transpose of $X$ is denoted as $X^{T}$, and its trace is written as $\mathrm{tr}(X)$. The Frobenius norm of $X \in ℝ^{n \times m}$ is defined as $\| X \|_{F} = \sqrt{\sum_{i = 1}^{n} \sum_{j = 1}^{m} X_{i, j}^{2}}$. An occasional superscript, as in $x^{(i)}$, indicates a sample index, a power, or the current epoch during a learning period. We use calligraphic letters to represent sets (e.g., $\mathcal{E}$), and we use the notation $| \cdot |$ to denote the cardinality of a given set. With these notations in mind, we introduce several concepts that are integral to the problem formulation.

Metabolic pathway inference from genomic sequence information at different levels of complexity and completion requires a trusted source of labeled pathway information in which the set of ordered reactions within and between cells is linked to substrates and products (compounds or metabolites). This information can be represented in graphs corresponding to reactome and pathway-level interactions. In this study, we use MetaCyc, a multi-organism member of the BioCyc collection of PGDBs as the trusted source for reactome and pathway information (Caspi et al., 2016a). MetaCyc contains only experimentally validated metabolic pathways across all domains of life. To simplify computational complexity, we consider the reaction and pathway graphs to be undirected.

Definition 1. Reaction Graph Topology. Let the reaction graph be represented by an undirected graph $G^{(rxn)} = \{\mathcal{C}, \mathcal{Z}^{(c)}\}$, where $\mathcal{C}$ is a set of c metabolites and $\mathcal{Z}^{(c)}$ represents $r'$ links between compounds. Each link indicates a reaction, derived from a set of biochemical reactions $\mathcal{R}$ of size $r'$. Then, the reaction graph topology is defined by a matrix $Ω^{(c)} \in ℤ_{\geq 0}^{r' \times c}$, where each entry $Ω_{i, j}^{(c)}$ is a binary value: 1 if compound j is a substrate or product in reaction i, and 0 if it is not involved in that reaction.

Definition 2. Pathway Graph Topology. Let $G^{(path)} = \{\mathcal{R}, \mathcal{Z}'^{(r)}\}$ be an undirected graph, where $\mathcal{R}$ is the set of reactions introduced in Definition 1 and $\mathcal{Z}'^{(r)}$ represents a set of $t'$ links between reactions. Then, the pathway graph topology is defined by a matrix $Ω^{(r)} \in ℤ_{\geq 0}^{t \times r'}$, where each entry $Ω_{i, j}^{(r)}$ is either 0 or a positive integer, corresponding to the absence or the frequency of reaction j in pathway i, respectively. Here, t is the number of pathways in a set $\mathcal{T}$.

Note that a reaction in $G^{(path)}$ may be annotated either as a spontaneous reaction or as an enzymatic reaction, that is, a reaction catalyzed by one or more enzymes and classified by an EC number (McDonald et al., 2009). In addition, a number of enzymes, referred to as promiscuous enzymes, can participate in more than one pathway. Given this information, we associate EC numbers to pathways and formulate three graphs, one representing associations between pathways and enzymes indicated by EC numbers, one representing interactions between enzymes, and another representing interactions between pathways.

Definition 3. Pathway-EC Association (P2E). Let $G'^{(path)} = \{\mathcal{E}, \mathcal{Z}^{(r)}\}$ be a subgraph of $G^{(path)}$, such that $\mathcal{E} \subset \mathcal{R}$ is a set of r enzymatic reactions with $r ≪ r'$. Then, the Pathway-EC association is defined as a matrix $M \in ℤ_{\geq 0}^{t \times r}$, where each row corresponds to a pathway and each column represents an EC, such that $M_{i, j} = 1$ if EC j is in pathway i and 0 otherwise.
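Definition 3 can be illustrated by building a small binary P2E matrix from a pathway-to-EC mapping. This is a sketch only: the helper name `build_p2e` and the placeholder pathway and EC identifiers are hypothetical, and the actual matrix in this work is derived from MetaCyc.

```python
import numpy as np

def build_p2e(pathway_to_ecs, pathways, ecs):
    """Binary matrix M (t x r): M[i, j] = 1 if EC j occurs in pathway i."""
    ec_index = {ec: j for j, ec in enumerate(ecs)}
    M = np.zeros((len(pathways), len(ecs)), dtype=int)
    for i, pathway in enumerate(pathways):
        for ec in pathway_to_ecs.get(pathway, []):
            M[i, ec_index[ec]] = 1
    return M

# Placeholder identifiers (not real MetaCyc pathways or EC numbers)
pathway_to_ecs = {
    "pathway_A": ["EC_1", "EC_2"],
    "pathway_B": ["EC_2", "EC_3"],
}
pathways = ["pathway_A", "pathway_B"]
ecs = ["EC_1", "EC_2", "EC_3"]

M = build_p2e(pathway_to_ecs, pathways, ecs)
```

The shared column for `EC_2` shows how one EC can map to several pathways, which is the multiple-mapping situation that makes the prediction problem multi-label.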

Typically, the association matrix $M$ is extremely sparse. Using reaction and pathway graph topology, we build interaction adjacency matrices as follows:

Definition 4. EC-EC Interaction (E2E). Given $G'^{(rxn)} \subset G^{(rxn)}$, we define an EC–EC interaction matrix $B \in ℤ_{\geq 0}^{r \times r}$ such that an entry $B_{i, j}$ is a binary value equal to 1 if the two ECs i and j share a compound, that is, if $Ω_{i, k}^{(c)} \land Ω_{j, k}^{(c)} = 1$ for some $k \in \mathcal{C}$, and 0 otherwise.
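Under Definition 4, the E2E matrix can be sketched as a thresholded product of the reaction-compound incidence matrix with its transpose. The function name and the toy incidence matrix are illustrative assumptions; note that, read literally, the definition also sets diagonal entries to 1 for any EC that involves at least one compound.

```python
import numpy as np

def ec_ec_interaction(omega_c):
    """B[i, j] = 1 if enzymatic reactions i and j share at least one compound."""
    omega_c = np.asarray(omega_c, dtype=int)
    shared = omega_c @ omega_c.T          # shared-compound counts per pair
    return (shared > 0).astype(int)

# Toy incidence matrix: 3 enzymatic reactions (rows) x 4 compounds (columns)
omega_c = np.array([[1, 1, 0, 0],
                    [0, 1, 1, 0],
                    [0, 0, 0, 1]])

B = ec_ec_interaction(omega_c)
```

Reactions 0 and 1 interact because both involve compound 1, whereas reaction 2 shares no compound with the others.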

Definition 5. Pathway–Pathway Interaction (P2P). Given $G^{(path)}$, we define a Pathway–Pathway interaction matrix $A \in ℤ_{\geq 0}^{t \times t}$ such that an entry $A_{i, j}$ is a binary value equal to 1 if there exists a reaction $k \in \mathcal{R}$ whose associated compounds are substrates or products in both pathways i and j, and 0 otherwise.
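One possible reading of Definition 5 is that two pathways interact when their reactions involve at least one common compound. The sketch below implements that reading by combining the pathway-reaction and reaction-compound matrices of Definitions 1 and 2; this interpretation, the function name, and the toy matrices are assumptions made for illustration.

```python
import numpy as np

def pathway_pathway_interaction(omega_r, omega_c):
    """A[i, j] = 1 if pathways i and j involve at least one common compound."""
    omega_r = np.asarray(omega_r, dtype=int)   # pathways x reactions (counts)
    omega_c = np.asarray(omega_c, dtype=int)   # reactions x compounds (binary)
    # Compounds touched by each pathway through any of its reactions
    pathway_compound = (omega_r @ omega_c > 0).astype(int)
    shared = pathway_compound @ pathway_compound.T
    return (shared > 0).astype(int)

omega_c = np.array([[1, 1, 0, 0],
                    [0, 1, 1, 0],
                    [0, 0, 0, 1]])
omega_r = np.array([[1, 1, 0],    # pathway 0 uses reactions 0 and 1
                    [0, 0, 1],    # pathway 1 uses reaction 2
                    [0, 1, 0]])   # pathway 2 uses reaction 1

A = pathway_pathway_interaction(omega_r, omega_c)
```

Pathways 0 and 2 interact through the compounds of reaction 1, while pathway 1 remains isolated apart from its diagonal entry.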

After determining relationships within each graph, we define a multi-label metabolic pathway dataset.

Definition 6. Multi-label Pathway Dataset (Basher et al., 2020). A general form of pathway dataset is characterized by $\mathcal{S} = \{(x^{(i)}, y^{(i)}) : 1 \leq i \leq n\}$ consisting of n examples, where $x^{(i)}$ is a vector indicating the abundance information corresponding to each enzymatic reaction. An enzymatic reaction, in turn, is denoted by e, which is an element of a set of enzymatic reactions $\mathcal{E} = \{e_{1}, e_{2}, \dots, e_{r}\}$, having r possible reactions. The abundance of an enzymatic reaction $e_{l}$ in example i, written $e_{l}^{(i)}$, is defined as $a_{l}^{(i)} \in ℝ_{\geq 0}$. The class label $y^{(i)} = [y_{1}^{(i)}, \dots, y_{t}^{(i)}] \in \{- 1, + 1\}^{t}$ is a pathway label vector of size t, where t is the total number of pathways, which are derived from a set of labeled metabolic pathways $\mathcal{Y}$. The matrix forms of $x^{(i)}$ and $y^{(i)}$ are symbolized as $X$ and $Y$, respectively.

The input space is assumed to be encoded as an r-dimensional feature vector and is symbolized as $\mathcal{X} = ℝ^{r}$. Further, each example in $\mathcal{S}$ is considered to be drawn independently and identically distributed from an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times 2^{| \mathcal{Y} |}$. Now we state the problem considered in this article.

Metabolic Pathway Prediction. Given: (i) a Pathway–EC matrix $M$, (ii) a pathway–pathway interaction matrix $A$, (iii) an EC–EC interaction matrix $B$, and (iv) a dataset $\mathcal{S}$, the goal is to efficiently reconstruct pathway labels for a hitherto unseen instance $x^{*}$.

5.2. Appendix A2: detailed description of triUMPF method

In this section, we provide a description of triUMPF components, presented in Figure 1 of the main manuscript, including: (i) decomposing the pathway–EC association matrix, (ii) subnetwork or community reconstruction, and (iii) the multi-label learning process.

5.2.1. Decomposing the pathway–EC association matrix

Given the non-negative $M$ , we formulate the following minimization objective function: $\begin{matrix} J^{f a c t} (W, H, U, V) = {min}_{W, H, U, V} | | M - W H^{T} | |_{F}^{2} + λ_{1} | | W - P U | |_{F}^{2} \\ + λ_{2} | | H - E V | |_{F}^{2} + λ_{3} | | U - V | |_{F}^{2} \\ + λ_{4} (| | W | |_{F}^{2} + | | H | |_{F}^{2} + | | U | |_{F}^{2} + | | V | |_{F}^{2}) \\ s . t . {W, H, U, V} \geq 0 \end{matrix}$ (2)

where $W \in ℛ^{t \times k}$ stores the latent factors for pathways whereas $H \in ℛ^{r \times k}$ , known as the basis matrix, can be thought of as latent factors associated with ECs and $k ≪ t, r$ and $λ_{*}$ are regularization hyperparameters. The leftmost term is the well-known squared loss function that penalizes the deviation of the estimated entries in both $W$ and $H$ from the true association matrix $M$ . The second term corresponds to the relative differences of latent matrix $W$ from the pathway features $P \in ℛ^{t \times m}$ , learned using pathway2vec framework, where the matrix $U \in ℛ^{m \times k}$ absorbs different scales of matrices $W$ and $P$ . Similarly, the third term indicates the squared loss of $H$ from $E \in ℛ^{r \times m}$ , which denotes the feature matrix of ECs, and their differences are captured by $V \in ℛ^{m \times k}$ . In the fourth term, we minimize the differences between factors $U$ and $V$ , capturing shared prominent features for the low dimensional coefficients.
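As a reading aid, the factorization objective in Equation (2) can be evaluated term by term. The following is a minimal NumPy sketch, not the authors' implementation; the function name `fact_objective`, the toy matrix sizes, and the random inputs are all assumptions made for illustration.

```python
import numpy as np

def fact_objective(M, W, H, U, V, P, E, lam1, lam2, lam3, lam4):
    """Evaluate the factorization objective of Equation (2) for given factors."""
    fro2 = lambda X: float(np.sum(X * X))  # squared Frobenius norm
    return (
        fro2(M - W @ H.T)                  # reconstruction of the pathway-EC matrix
        + lam1 * fro2(W - P @ U)           # tie pathway factors to pathway features
        + lam2 * fro2(H - E @ V)           # tie EC factors to EC features
        + lam3 * fro2(U - V)               # keep the two transformations close
        + lam4 * (fro2(W) + fro2(H) + fro2(U) + fro2(V))  # ridge penalty
    )

# Toy sizes (hypothetical): t pathways, r ECs, m feature dims, k components.
rng = np.random.default_rng(0)
t, r, m, k = 6, 8, 4, 3
M = (rng.random((t, r)) < 0.3).astype(float)
P, E = rng.random((t, m)), rng.random((r, m))
W, H = rng.random((t, k)), rng.random((r, k))
U, V = rng.random((m, k)), rng.random((m, k))
cost = fact_objective(M, W, H, U, V, P, E, 0.01, 0.01, 0.01, 0.01)
```

With all regularization weights set to zero and $M = W H^{T}$ exactly, the function returns 0, which is a quick sanity check on the shapes.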

5.2.2. Subnetwork or community reconstruction

Recalling from the main manuscript, the higher-order proximity of the two matrices $A$ and $B$ is defined according to the formula (Li et al., 2019): $A^{p r o x} = \sum_{i = 1}^{l_{p}} ω_{i} A^{i}, B^{p r o x} = \sum_{i = 1}^{l_{e}} γ_{i} B^{i}$ (3)
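The weighted sum of matrix powers in Equation (3) can be sketched in a few lines. This is an illustrative NumPy example under the assumption that one weight is supplied per order; the helper name `higher_order_proximity` and the toy graph are hypothetical.

```python
import numpy as np

def higher_order_proximity(A, order, weights):
    """Return sum_{i=1..order} weights[i-1] * A^i, where A^i is the i-th matrix power."""
    A = np.asarray(A, dtype=float)
    prox = np.zeros_like(A)
    power = np.eye(A.shape[0])
    for i in range(order):
        power = power @ A            # A^(i+1): counts walks of length i+1
        prox += weights[i] * power
    return prox

# Toy 3-node path graph 0 - 1 - 2 (hypothetical data).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
A_prox = higher_order_proximity(A, order=2, weights=[1.0, 0.1])
```

In this toy case the two end nodes, which share no direct link, receive a small second-order proximity of 0.1 through their common neighbor.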

Formally, let $T \in ℛ^{m \times p}$ be a non-negative community representation matrix of size p communities for pathways, where the jth column in $T_{:, j}$ denotes the representation of the community j. The pathway community indicator matrix is denoted by $C \in ℛ^{t \times p}$ conditioned on $t r (C^{T} C) = t$ , where each entry $C_{i, l}$ and $C_{j, l}$ encodes the probability that pathways i and j generates an edge belonging to a community l. The probability of i and j belonging to the same community can be assessed as: . A similar discussion follows for the non-negative representation matrix $R \in ℛ^{m \times v}$ and the EC community indicator matrix $K \in ℛ^{r \times v}$ of v communities, conditioned on $t r (K^{T} K) = r$ . Unfortunately, due to the constraints emphasized on $C$ and $K$ , it is not straightforward to analytically derive an expression; instead, we resort to a more tractable solution provided in Wang et al. (2017), and we relax the condition to be an orthogonal constraint, resulting in the following objective function: $\begin{matrix} J^{c o m m} (C, K) = {min}_{C, K} | | A^{p r o x} - P T C^{T} | |_{F}^{2} \\ + | | B^{p r o x} - E R K^{T} | |_{F}^{2} \\ + α | | C^{T} C - I | |_{F}^{2} + β | | K^{T} K - I | |_{F}^{2} \\ + λ_{5} (| | C | |_{F}^{2} + | | K | |_{F}^{2}) \\ s . t . {C, K} \geq 0 \end{matrix}$ (4)

where $I$ denotes an identity matrix, $λ_{5}$ is a regularization hyperparameter, whereas both $α$ and $β$ are positive hyperparameters. The values of $α$ and $β$ are usually set to a large number, for example, $1 0^{9}$ in this work, for adjusting the contribution of corresponding terms. The obtained communities in Equation (4) are directly linked to the underlying graph topologies, that is, $A^{p r o x}$ and $B^{p r o x}$ .

5.2.3. Multi-label learning process

We now bring together the NMF and community detection steps with multi-label classification for pathway prediction. The learning problem must balance between information in $M$ while being lenient toward the dataset $S$ , which should provide enough evidence to generate representations of communities among pathways and ECs, as suggested by $A^{p r o x}$ and $B^{p r o x}$ . We present a weight term $Θ \in ℛ^{t \times r}$ that enforces $X$ to be close enough to both $Y$ and $M$ . We also introduce two auxiliary terms $L \in ℛ^{n \times m}$ , which capture correlations between $X$ and $Y$ and $Z \in ℛ^{r \times r}$ , enforcing the pathway coefficients associated with $M$ and resulting in the following objective function: $\begin{matrix} J^{p a t h} (T, R, Θ, L, Z) = {min}_{T, R, Θ, L, Z} \sum_{i \in n} \sum_{k \in t} log (1 + e^{- y_{k}^{(i)} Θ_{k}^{T} x^{(i)}}) \\ + | | X - L R K^{T} | |_{F}^{2} + | | Y - L T C^{T} | |_{F}^{2} \\ + ρ | | Θ - Z H W^{T} | |_{F}^{2} \\ + λ_{5} (| | T | |_{F}^{2} + | | R | |_{F}^{2}) \\ + λ_{6} (| | Θ | |_{2, 1} + | | L | |_{F}^{2} + | | Z | |_{F}^{2}) \\ s . t . {T, R} \geq 0 \end{matrix}$ (5)

where $λ_{5}$ , $λ_{6}$ , and $ρ$ are regularization hyperparameters, and $| | . | |_{2, 1}$ represents the sum of the Euclidean norms of columns of a matrix introduced to emphasize sparseness. Notice that we do not restrict the terms $L$ and $Z$ to be non-negative. Both the second and the third terms in Equation (5) are needed to discover pathway and EC communities, that is, $C$ and $K$ , respectively.
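Two ingredients of Equation (5), the logistic loss and the $| | . | |_{2, 1}$ norm, can be sketched as follows. This is an illustrative NumPy example only; it assumes that $Θ$ is stored as a t × r array whose kth row plays the role of $Θ_{k}$ , and the function names and toy data are hypothetical.

```python
import numpy as np

def logistic_loss(Theta, X, Y):
    """sum_i sum_k log(1 + exp(-y_k^(i) * Theta_k . x^(i))); Theta is t x r."""
    margins = Y * (X @ Theta.T)                  # n x t matrix of signed scores
    return float(np.sum(np.logaddexp(0.0, -margins)))  # stable log(1 + e^{-z})

def l21_norm(Theta):
    """Sum of the Euclidean norms of the columns of Theta."""
    return float(np.sum(np.linalg.norm(Theta, axis=0)))

X = np.array([[1.0, 0.0], [0.0, 2.0]])           # n = 2 examples, r = 2 ECs (toy)
Y = np.array([[1.0, -1.0], [-1.0, 1.0]])         # t = 2 pathways, labels in {-1, +1}
Theta = np.zeros((2, 2))
loss_at_zero = logistic_loss(Theta, X, Y)        # every term equals log(2)
```

Because the $| | . | |_{2, 1}$ norm sums column norms, a column of $Θ$ that is entirely zero contributes nothing, which is how the penalty encourages whole columns to vanish.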

The Equations (2), (4), and (5) are jointly non-convex due to non-negative constraints on the original and the approximation factorized matrices, implying that the solutions to triUMPF are only unique up to scalings and rotations (Yang and Michailidis, 2015). Hence, we adopt an alternating optimization algorithm to solve each objective function simultaneously, which is provided in Section 5.3.

5.3. Appendix A3: optimization

In this section, we derive the optimization for triUMPF's objective function: $J = J^{f a c t} (W, H, U, V) + J^{c o m m} (C, K) + J^{p a t h} (T, R, Θ, Z, L)$ (6)

where $\begin{matrix} J^{f a c t} (W, H, U, V) = {min}_{W, H, U, V} | | M - W H^{T} | |_{F}^{2} + λ_{1} | | W - P U | |_{F}^{2} \\ + λ_{2} | | H - E V | |_{F}^{2} + λ_{3} | | U - V | |_{F}^{2} \\ + λ_{4} (| | W | |_{F}^{2} + | | H | |_{F}^{2} + | | U | |_{F}^{2} + | | V | |_{F}^{2}) \\ s . t . {W, H, U, V} \geq 0 \\ J^{c o m m} (C, K) = {min}_{C, K} | | A^{p r o x} - P T C^{T} | |_{F}^{2} + | | B^{p r o x} - E R K^{T} | |_{F}^{2} \\ + α | | C^{T} C - I | |_{F}^{2} + β | | K^{T} K - I | |_{F}^{2} \\ + λ_{5} (| | C | |_{F}^{2} + | | K | |_{F}^{2}) \\ s . t . {C, K} \geq 0 \\ J^{p a t h} (T, R, Θ, L, Z) = {min}_{T, R, Θ, L, Z} \sum_{i \in n} \sum_{k \in t} log (1 + e^{- y_{k}^{(i)} Θ_{k}^{T} x^{(i)}}) \\ + | | X - L R K^{T} | |_{F}^{2} + | | Y - L T C^{T} | |_{F}^{2} \\ + ρ | | Θ - Z H W^{T} | |_{F}^{2} + λ_{5} (| | T | |_{F}^{2} + | | R | |_{F}^{2}) \\ + λ_{6} (| | Θ | |_{2, 1} + | | L | |_{F}^{2} + | | Z | |_{F}^{2}) \\ s . t . {T, R} \geq 0 \end{matrix}$ (7)

The objective function in Equation (7) is non-convex due to multiple non-negative constraints. Numerous algorithms have been proposed to optimize the objective function, including alternating non-negative least squares (Kim and Park, 2007) and hierarchical alternating least squares (Cichocki et al., 2007). Here, we employ the original algorithm for NMF that was introduced in Lee and Seung (2001) and consists of simple multiplicative update rules (with auxiliary variables) that are based on the gradient descent technique (Gillis, 2020). Beginning with random positive initialization, element-wise updates of Equation (6) w.r.t $W$ , $H$ , $U$ , $V$ , $C$ , $K$ , $T$ , $R$ , $Θ$ , $Z$ , and $L$ at each iteration are applied until convergence. Gradient descent searches for a local minimum of the cost function by moving in the direction of steepest descent. By introducing Lagrangian multipliers (auxiliary variables), which are $ψ$ , $ϕ$ , $φ$ , ϱ, $ϖ$ , $ζ$ , $κ$ , and $ξ$ to enforce the constraints for $W$ , $H$ , $U$ , $V$ , $C$ , $T$ , $R$ , $K$ , respectively, Equation (7) can be reformulated as:

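$\begin{matrix} J^{f a c t} (W, H, U, V) = {min}_{W, H, U, V} t r ((M - W H^{T})^{T} (M - W H^{T})) \\ + λ_{1} t r ((W - P U)^{T} (W - P U)) + λ_{2} t r ((H - E V)^{T} (H - E V)) \\ + λ_{3} t r ((U - V)^{T} (U - V)) \\ + λ_{4} (t r (W^{T} W) + t r (H^{T} H) + t r (U^{T} U) + t r (V^{T} V)) \\ + t r (ψ W) + t r (ϕ H) + t r (φ U) + t r (ϱ V) \end{matrix}$ (8)

$\begin{matrix} J^{c o m m} (C, K) = {min}_{C, K} t r ((A^{p r o x} - P T C^{T})^{T} (A^{p r o x} - P T C^{T})) \\ + t r ((B^{p r o x} - E R K^{T})^{T} (B^{p r o x} - E R K^{T})) \\ + α t r ((C^{T} C - I)^{T} (C^{T} C - I)) + β t r ((K^{T} K - I)^{T} (K^{T} K - I)) \\ + λ_{5} (t r (C^{T} C) + t r (K^{T} K)) + t r (ϖ C) + t r (ξ K) \end{matrix}$ (9)

$\begin{matrix} J^{p a t h} (T, R, Θ, L, Z) = {min}_{T, R, Θ, L, Z} \sum_{i \in n} \sum_{k \in t} log (1 + e^{- y_{k}^{(i)} Θ_{k}^{T} x^{(i)}}) \\ + t r ((X - L R K^{T})^{T} (X - L R K^{T})) + t r ((Y - L T C^{T})^{T} (Y - L T C^{T})) \\ + ρ t r ((Θ - Z H W^{T})^{T} (Θ - Z H W^{T})) + λ_{5} (t r (T^{T} T) + t r (R^{T} R)) \\ + λ_{6} (| | Θ | |_{2, 1} + t r (L^{T} L) + t r (Z^{T} Z)) + t r (ζ T) + t r (κ R) \end{matrix}$ (10)

where $t r (.)$ denotes the trace of a matrix. Using the addition property of the transpose, $(X + Y)^{T} = X^{T} + Y^{T}$ , and its multiplication property, $(X Y)^{T} = Y^{T} X^{T}$ , we can expand the trace of the first term of Equation (8) as $t r ((M - W H^{T})^{T} (M - W H^{T})) = t r (M^{T} M) - t r (M^{T} W H^{T}) - t r (H W^{T} M) + t r (H W^{T} W H^{T})$ (11)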

By expanding the remaining terms in Equation (8) and using the linearity of the trace, $t r (X + Y) = t r (X) + t r (Y)$ , we obtain the following formula: $\begin{matrix} J^{f a c t} (W, H, U, V) & = {min}_{W, H, U, V} t r (M^{T} M) - t r (M^{T} W H^{T}) - t r (H W^{T} M) + t r (H W^{T} W H^{T}) \\ + λ_{1} (t r (W^{T} W) - t r (W^{T} P U) - t r (U^{T} P^{T} W) + t r (U^{T} P^{T} P U)) \\ + λ_{2} (t r (H^{T} H) - t r (H^{T} E V) - t r (V^{T} E^{T} H) + t r (V^{T} E^{T} E V)) \\ + λ_{3} (t r (U^{T} U) - 2 t r (U^{T} V) + t r (V^{T} V)) \\ + λ_{4} (t r (W^{T} W) + t r (H^{T} H) + t r (U^{T} U) + t r (V^{T} V)) \\ + t r (ψ W) + t r (ϕ H) + t r (φ U) + t r (ϱ V) \end{matrix}$ (12)

Similar to the process of getting Equation (12), we expand the Equation (9) as: $\begin{matrix} J^{c o m m} (C, K) & = {min}_{C, K} t r (A^{p r o x T} A^{p r o x}) - t r (A^{p r o x T} P T C^{T}) \\ - t r (C T^{T} P^{T} A^{p r o x}) + t r (C T^{T} P^{T} P T C^{T}) \\ + t r (B^{p r o x T} B^{p r o x}) - t r (B^{p r o x T} E R K^{T}) \\ - t r (K R^{T} E^{T} B^{p r o x}) + t r (K R^{T} E^{T} E R K^{T}) \\ + α (t r (C^{T} C C^{T} C) - 2 t r (C^{T} C) + t) \\ + β (t r (K^{T} K K^{T} K) - 2 t r (K^{T} K) + r) \\ + λ_{5} (t r (C^{T} C) + t r (K^{T} K)) + t r (ϖ C) + t r (ξ K) \end{matrix}$ (13)

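Expanding Equation (10), we obtain the following: $\begin{matrix} J^{p a t h} (T, R, Θ, L, Z) & = {min}_{T, R, Θ, L, Z} \sum_{i \in n} \sum_{k \in t} log (1 + e^{- y_{k}^{(i)} Θ_{k}^{T} x^{(i)}}) \\ + t r (X^{T} X) - t r (X^{T} L R K^{T}) - t r (K R^{T} L^{T} X) + t r (K R^{T} L^{T} L R K^{T}) \\ + t r (Y^{T} Y) - t r (Y^{T} L T C^{T}) - t r (C T^{T} L^{T} Y) + t r (C T^{T} L^{T} L T C^{T}) \\ + ρ (t r (Θ^{T} Θ) - t r (Θ^{T} Z H W^{T}) - t r (W H^{T} Z^{T} Θ) + t r (W H^{T} Z^{T} Z H W^{T})) \\ + λ_{5} (t r (T^{T} T) + t r (R^{T} R)) + λ_{6} (| | Θ | |_{2, 1} + t r (L^{T} L) + t r (Z^{T} Z)) \\ + t r (ζ T) + t r (κ R) \end{matrix}$ (14)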

As explained earlier, the objective functions in Equations (12–14) are not convex with respect to all parameters combined. Instead, as in NMF, $W$ , $H$ , $U$ , $V$ , $C$ , $K$ , $T$ , $R$ , $Θ$ , $L$ , and $Z$ are individually optimized in an iterative process, where we update one matrix at a time while keeping the remaining matrices fixed. This ensures convergence to a local minimum for each subproblem. This method is called block-coordinate descent. Hence, the update of parameters occurs in the following four alternating optimization steps for $J^{f a c t}$ : (i) the basis matrix $W$ , representing pathway factors; (ii) the latent coefficient matrix $H$ , representing EC factors; (iii) the linear transformation $U$ ; and (iv) the other linear transformation $V$ . For $J^{c o m m}$ , we alternate between the community indicator matrix $C$ for pathways and the other community indicator matrix $K$ for ECs. Finally, we optimize, alternately, the two community representation matrices $T$ and $R$ for pathways and ECs, respectively, the two auxiliary matrices $L$ and $Z$ , and the input weight matrix $Θ$ . The three objective functions, $J^{f a c t}$ , $J^{c o m m}$ , and $J^{p a t h}$ , are optimized simultaneously in a divide-and-conquer strategy. Detailed rules for updating all the variables are outlined next.

Update the basis matrix $W$ . To update the feature matrix $W$ , we fix $H$ , $U$ and $V$ . Then, the objective function in Equation (12) w.r.t $W$ is reduced to the following formula (after dropping the min operation):

\begin{matrix} J^{f a c t} (W) & = - t r (M^{T} W H^{T}) - t r (H W^{T} M) + t r (H W^{T} W H^{T}) \\ + λ_{1} (t r (W^{T} W) - t r (W^{T} P U) - t r (U^{T} P^{T} W)) \\ + λ_{4} t r (W^{T} W) + t r (ψ W) \end{matrix}

(15)

where $ψ$ is the Lagrange multiplier for the constraint $W \geq 0$ . For computing the gradient of this equation, we use the following properties with respect to $X$ :

\begin{matrix} \nabla_{X} t r (X^{T} X) = 2 X \\ \nabla_{X} t r (X Y) = Y^{T} \\ \nabla_{X} t r (X^{T} Y) = Y \\ \nabla_{X} t r (X^{T} Y X) = (Y + Y^{T}) X \\ \nabla_{X} t r (X Y X^{T}) = X (Y^{T} + Y) \\ \nabla_{X} t r (Y X Z) = Y^{T} Z^{T} \\ \nabla_{X} t r (Y X^{T} Z) = Z Y \end{matrix}

(16)

By setting the gradient of the cost function in Equation (15) w.r.t $W$ to 0, we have:

ψ = 2 M H - 2 W (H^{T} H + Q) + 2 λ_{1} P U

(17)

where $Q = (λ_{1} + λ_{4})$ . Following the Karush–Kuhn–Tucker (KKT) condition for the non-negativity of W, we have the following equation:

(M H + λ_{1} P U - W (H^{T} H + Q)) \circ W = 0

(18)

Given an initial value of W, the successive updating rule of W is:

W \leftarrow W \circ \frac{M H + λ_{1} P U}{W (H^{T} H + Q)}

(19)

The iterative update rules in Equation (19) are transformed into multiplicative update rules, which cannot generate negative elements since all values are positive and only multiplications and divisions are involved at each iteration (Lee and Seung, 1999).
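To make the multiplicative rule concrete, the following NumPy sketch applies Equation (19) repeatedly to random toy data while holding $H$ , $U$ , and $P$ fixed. It is an illustration, not the authors' code: the function name `update_W`, the small constant `eps` added to the denominator to avoid division by zero, and the treatment of the scalar $Q$ as $Q I$ are assumptions.

```python
import numpy as np

def update_W(M, W, H, P, U, lam1, lam4, eps=1e-12):
    """One multiplicative step of Equation (19); eps is an added safeguard."""
    k = H.shape[1]
    Q = (lam1 + lam4) * np.eye(k)          # scalar Q applied as Q * I
    numer = M @ H + lam1 * (P @ U)
    denom = W @ (H.T @ H + Q) + eps
    return W * numer / denom               # element-wise product and division

rng = np.random.default_rng(0)
t, r, m, k = 6, 8, 4, 3                    # toy sizes (hypothetical)
M = (rng.random((t, r)) < 0.3).astype(float)
P, U = rng.random((t, m)), rng.random((m, k))
W, H = rng.random((t, k)), rng.random((r, k))

def partial_cost(W):
    # Terms of Equation (15) that depend on W, written as squared norms.
    return (np.sum((M - W @ H.T) ** 2) + 0.01 * np.sum((W - P @ U) ** 2)
            + 0.01 * np.sum(W ** 2))

before = partial_cost(W)
for _ in range(50):
    W = update_W(M, W, H, P, U, lam1=0.01, lam4=0.01)
after = partial_cost(W)
```

Since every factor in the numerator and denominator is non-negative, the entries of `W` stay non-negative without any explicit projection step.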

Update the latent coefficient matrix $H$ . The feature matrix $H$ is updated as described earlier: $W$ , $U$ , and $V$ are fixed to obtain the objective function for Equation (12) w.r.t $H$ as:

\begin{matrix} J^{f a c t} (H) & = - t r (M^{T} W H^{T}) - t r (H W^{T} M) + t r (H W^{T} W H^{T}) \\ + λ_{2} (t r (H^{T} H) - t r (H^{T} E V) - t r (V^{T} E^{T} H)) \\ + λ_{4} t r (H^{T} H) + t r (ϕ H) \end{matrix}

(20)

Setting the derivative of the cost function in Equation (20) w.r.t $H$ to 0 and using the gradient properties in Equation (16), we obtain the following:

ϕ = 2 M^{T} W - 2 H (W^{T} W + Q) + 2 λ_{2} E V

(21)

where $Q = (λ_{2} + λ_{4})$ . With the KKT complementary condition for the non-negativity of H, we have:

(M^{T} W + λ_{2} E V - H (W^{T} W + Q)) \circ H = 0

(22)

The multiplicative update w.r.t parameter $H$ , after some algebraic manipulation, is:

H \leftarrow H \circ \frac{M^{T} W + λ_{2} E V}{H (W^{T} W + Q)}

(23)

Update the linear transformation $U$ . Suppose that $W$ , $H$ , and $V$ are fixed, then Equation (12) w.r.t $U$ is reduced to:

\begin{matrix} J^{f a c t} (U) = λ_{1} (- t r (W^{T} P U) - t r (U^{T} P^{T} W) + t r (U^{T} P^{T} P U)) \\ + λ_{3} (t r (U^{T} U) - 2 t r (U^{T} V)) + λ_{4} t r (U^{T} U) + t r (φ U) \end{matrix}

(24)

Then, we set the derivative of the formula just cited with respect to the transformation matrix $U$ to 0:

φ = 2 λ_{1} P^{T} W - 2 (λ_{1} P^{T} P + D) U + 2 λ_{3} V

(25)

where $D = (λ_{3} + λ_{4})$ . Formulating the equation just cited based on KKT conditions for the non-negativity of U results in:

(λ_{1} P^{T} W + λ_{3} V - (λ_{1} P^{T} P + D) U) \circ U = 0

(26)

Then, the parameter $U$ is updated according to:

U \leftarrow U \circ \frac{λ_{1} P^{T} W + λ_{3} V}{(λ_{1} P^{T} P + D) U}

(27)

Update the linear transformation $V$ . With $W$ , $H$ , and $U$ fixed, the transformation matrix $V$ is updated such that the error is minimized:

\begin{matrix} J^{f a c t} (V) & = λ_{2} (- t r (H^{T} E V) - t r (V^{T} E^{T} H) + t r (V^{T} E^{T} E V)) \\ + λ_{3} (- 2 t r (U^{T} V) + t r (V^{T} V)) + λ_{4} t r (V^{T} V) + t r (ϱ V) \end{matrix}

(28)

Setting the derivative of this error with respect to $V$ to 0 and after some manipulations, we have:

ϱ = 2 λ_{2} E^{T} H - 2 (λ_{2} E^{T} E + D) V + 2 λ_{3} U

(29)

where $D = (λ_{3} + λ_{4})$ . Following the KKT conditions for the non-negativity of V, we have:

(λ_{2} E^{T} H + λ_{3} U - (λ_{2} E^{T} E + D) V) \circ V = 0

(30)

As usual, the parameter $V$ is updated accordingly:

V \leftarrow V \circ \frac{λ_{2} E^{T} H + λ_{3} U}{(λ_{2} E^{T} E + D) V}

(31)

Update the community indicator matrix $C$ for pathways. In a similar process, we fix $K$ , and we update $C$ . The matrix $C$ is updated such that the error is minimized:

\begin{matrix} J (C) & = - t r (A^{p r o x T} P T C^{T}) - t r (C T^{T} P^{T} A^{p r o x}) + t r (C T^{T} P^{T} P T C^{T}) \\ + α (t r (C^{T} C C^{T} C) - 2 t r (C^{T} C)) + λ_{5} t r (C^{T} C) + t r (ϖ C) \\ - t r (Y^{T} L T C^{T}) - t r (C T^{T} L^{T} Y) + t r (C T^{T} L^{T} L T C^{T}) \end{matrix}

(32)

Setting the derivative of this error with respect to $C$ to 0, we have:

ϖ = 2 A^{p r o x T} P T + 2 Y^{T} L T + 4 α C - 2 C (T^{T} P^{T} P T + T^{T} L^{T} L T + 2 α C^{T} C + λ_{5})

(33)

Again, we follow the KKT conditions for the non-negativity of C:

(A^{p r o x T} P T + Y^{T} L T + 2 α C - C (T^{T} P^{T} P T + T^{T} L^{T} L T + 2 α C^{T} C + λ_{5})) \circ C = 0

(34)

The parameter $C$ is updated accordingly:

C \leftarrow C \circ \frac{A^{p r o x T} P T + Y^{T} L T + 2 α C}{C (T^{T} P^{T} P T + T^{T} L^{T} L T + 2 α C^{T} C + λ_{5})}

(35)

Update the community indicator matrix $K$ for ECs. Once the parameter $C$ is updated, we use it to update $K$ . The matrix $K$ is updated such that the error is minimized:

\begin{matrix} J (K) & = - t r (B^{p r o x T} E R K^{T}) - t r (K R^{T} E^{T} B^{p r o x}) + t r (K R^{T} E^{T} E R K^{T}) \\ + β (t r (K^{T} K K^{T} K) - 2 t r (K^{T} K)) + λ_{5} t r (K^{T} K) + t r (ξ K) \\ - t r (X^{T} L R K^{T}) - t r (K R^{T} L^{T} X) + t r (K R^{T} L^{T} L R K^{T}) \end{matrix}

(36)

Setting the derivative of this error with respect to $K$ to 0, we have:

ξ = 2 B^{p r o x T} E R + 2 X^{T} L R + 4 β K - 2 K (R^{T} E^{T} E R + R^{T} L^{T} L R + 2 β K^{T} K + λ_{5})

(37)

Using the KKT conditions for the non-negativity of K, we obtain:

(B^{p r o x T} E R + X^{T} L R + 2 β K - K (R^{T} E^{T} E R + R^{T} L^{T} L R + 2 β K^{T} K + λ_{5})) \circ K = 0

(38)

The parameter $K$ is updated accordingly:

K \leftarrow K \circ \frac{B^{p r o x T} E R + X^{T} L R + 2 β K}{K (R^{T} E^{T} E R + R^{T} L^{T} L R + 2 β K^{T} K + λ_{5})}

(39)

Update the community representation matrix $T$ for pathways. By fixing the parameters $C$ , $R$ , and $K$ , we update $T$ . The matrix $T$ is updated such that the error is minimized:

\begin{matrix} J (T) & = - t r (A^{p r o x T} P T C^{T}) - t r (C T^{T} P^{T} A^{p r o x}) \\ + t r (C T^{T} P^{T} P T C^{T}) - t r (Y^{T} L T C^{T}) \\ - t r (C T^{T} L^{T} Y) + t r (C T^{T} L^{T} L T C^{T}) \\ + λ_{5} t r (T^{T} T) + t r (ζ T) \end{matrix}

(40)

Setting the derivative of this error with respect to $T$ to 0, we have:

ζ = 2 P^{T} A^{p r o x} C + 2 L^{T} Y C - 2 (P^{T} C C^{T} P + λ_{5}) T - 2 L^{T} L T C^{T} C

(41)

Using the KKT conditions for the non-negativity of T, we obtain:

(P^{T} A^{p r o x} C + L^{T} Y C - (P^{T} C C^{T} P + λ_{5}) T - L^{T} L T C^{T} C) \circ T = 0

(42)

The parameter $T$ is updated accordingly:

T \leftarrow T \circ \frac{P^{T} A^{p r o x} C + L^{T} Y C}{(P^{T} C C^{T} P + λ_{5}) T + L^{T} L T C^{T} C}

(43)

Update the community representation matrix $R$ for EC features. By fixing the parameters $C$ , $T$ , and $K$ , we update $R$ . The matrix $R$ is updated such that the error is minimized:

\begin{matrix} J (R) & = - t r (B^{p r o x T} E R K^{T}) - t r (K R^{T} E^{T} B^{p r o x}) \\ + t r (K R^{T} E^{T} E R K^{T}) - t r (X^{T} L R K^{T}) \\ - t r (K R^{T} L^{T} X) + t r (K R^{T} L^{T} L R K^{T}) \\ + λ_{5} t r (R^{T} R) + t r (κ R) \end{matrix}

(44)

Setting the derivative of this error with respect to $R$ to 0, we have:

κ = 2 E^{T} B^{p r o x} K + 2 L^{T} X K - 2 (E^{T} K K^{T} E + λ_{5}) R - 2 L^{T} L R K^{T} K

(45)

Using the KKT conditions for the non-negativity of R, we obtain:

(E^{T} B^{p r o x} K + L^{T} X K - (E^{T} K K^{T} E + λ_{5}) R - L^{T} L R K^{T} K) \circ R = 0

(46)

The parameter $R$ is updated accordingly:

R \leftarrow R \circ \frac{E^{T} B^{p r o x} K + L^{T} X K}{(E^{T} K K^{T} E + λ_{5}) R + L^{T} L R K^{T} K}

(47)

Update the weight matrix $Θ$ . By fixing the other parameters, we update $Θ$ . The matrix $Θ$ is updated such that the error is minimized:

where $f (.)$ is a non-linear sigmoid function, that is, $f (x) = σ (x) = \frac{1}{1 + e^{- x}}$ . This choice can be generalized to any non-linear function. By transforming $X$ with $σ (.)$ and $Θ$ , our method enables pathway prediction. Setting the derivative of this error with respect to $Θ$ to 0, we have:

Due to the non-closed form of the equation just cited, we use the iterative gradient descent approach with a defined learning rate $η$ . Hence, the general update rule for $Θ$ becomes:

Θ^{i + 1} \leftarrow Θ^{i} - η \circ \nabla_{Θ} J^{p a t h} (Θ^{i})

(50)

Update the auxiliary matrix $L$ . By fixing the rest of parameters in $J^{p a t h}$ , the matrix $L$ is updated such that the error is minimized:

\begin{matrix} J^{p a t h} (L) & = - t r (X^{T} L R K^{T}) - t r (K R^{T} L^{T} X) + t r (K R^{T} L^{T} L R K^{T}) \\ - t r (Y^{T} L T C^{T}) - t r (C T^{T} L^{T} Y) \\ + t r (C T^{T} L^{T} L T C^{T}) + λ_{6} t r (L^{T} L) \end{matrix}

(51)

Taking the derivative of this error with respect to $L$ , we have:

\begin{matrix} \nabla_{L} J^{p a t h} (L) = 2 (L T C^{T} C T^{T} + L R K^{T} K R^{T} - Y C T^{T} - X K R^{T} + λ_{6} L) \end{matrix}

(52)

The parameter $L$ is updated accordingly:

L^{i + 1} \leftarrow L^{i} - η \circ \nabla_{L} J^{p a t h} (L^{i})

(53)
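The gradient in Equation (52) and the step in Equation (53) can be checked numerically on toy data. The sketch below is illustrative only; the function names, matrix sizes, random inputs, and the learning rate are assumptions, and the cost is the $L$-dependent part of $J^{p a t h}$ written with squared norms.

```python
import numpy as np

def cost_L(L, X, Y, T, R, C, K, lam6):
    """Terms of J^path that depend on L (Equation 51, plus constants)."""
    return (np.sum((X - L @ R @ K.T) ** 2) + np.sum((Y - L @ T @ C.T) ** 2)
            + lam6 * np.sum(L ** 2))

def grad_L(L, X, Y, T, R, C, K, lam6):
    """Gradient with respect to L, following Equation (52)."""
    TC = T @ C.T                          # m x t
    RK = R @ K.T                          # m x r
    return 2.0 * (L @ TC @ TC.T + L @ RK @ RK.T - Y @ TC.T - X @ RK.T + lam6 * L)

rng = np.random.default_rng(0)
n, r, t, m, p, v = 5, 7, 6, 4, 3, 3       # toy sizes (hypothetical)
X, Y = rng.random((n, r)), rng.choice([-1.0, 1.0], size=(n, t))
T, R = rng.random((m, p)), rng.random((m, v))
C, K = rng.random((t, p)), rng.random((r, v))
L = rng.standard_normal((n, m))           # L is not constrained to be non-negative

eta = 1e-3                                # hypothetical learning rate
history = [cost_L(L, X, Y, T, R, C, K, lam6=10.0)]
for _ in range(20):
    L = L - eta * grad_L(L, X, Y, T, R, C, K, lam6=10.0)   # Equation (53)
    history.append(cost_L(L, X, Y, T, R, C, K, lam6=10.0))
```

For a sufficiently small learning rate the recorded costs decrease at every step, since the subproblem in $L$ is a convex quadratic.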

Update the auxiliary matrix $Z$ . By fixing the rest of parameters in $J^{p a t h}$ , the matrix $Z$ is updated such that the error is minimized:

\begin{matrix} J^{p a t h} (Z) & = - ρ t r (Θ^{T} Z H W^{T}) - ρ t r (W H^{T} Z^{T} Θ) \\ + ρ t r (W H^{T} Z^{T} Z H W^{T}) + λ_{6} t r (Z^{T} Z) \end{matrix}

(54)

Taking the derivative of this error with respect to $Z$ , we have:

\begin{matrix} \nabla_{Z} J^{p a t h} (Z) = 2 (ρ Z H W^{T} W H^{T} - ρ Θ W H^{T} + λ_{6} Z) \end{matrix}

(55)

The parameter $Z$ is updated according to the gradient descent approach as:

Z^{i + 1} \leftarrow Z^{i} - η \circ \nabla_{Z} J^{p a t h} (Z^{i})

(56)

5.4. Appendix A4: experimental setup

In this section, we describe the experimental framework used to demonstrate triUMPF pathway prediction performance across multiple datasets spanning the genomic information hierarchy (Basher et al., 2020). All experimental tests were conducted on a Linux server by using 10 cores of Intel Xeon CPU E5-2650.

5.4.1. Association matrices

MetaCyc v21 (Caspi et al., 2016b) was used to obtain the three association matrices, P2E ( $M$ ), P2P ( $A$ ), and E2E ( $B$ ). Some of the properties for each matrix are summarized in Table 3. All three matrices are extremely sparse. For example, $M$ contains 2526 pathways, having an average of 4 EC associations per pathway, leaving more than 3600 columns with 0 values. These matrices will be utilized to obtain higher-order proximity (Section 5.5.1) and to analyze triUMPF's robustness (Section 5.5.2).

Table 3.

Characteristics of MetaCyc Database and the Three Association Matrices

	No. of EC	No. of compound	No. of pathway	$\| V \|$	$\| ℰ \|$
MetaCyc (uec)	6378	13689	2526	22593	33353
M	3650	—	2526	—	8576
A	—	—	2526	—	9938
B	3650	—	—	—	35629

MetaCyc (uec) denotes enzymatic reactions where links among enzymatic reactions are removed. The “—” indicates non-applicable operation.

EC, enzyme commission.

5.4.2. Description of datasets

We report the performance of triUMPF by using the following data: (i) T1 golden consisting of six PGDBs from the BioCyc collection (biocyc): EcoCyc (v21), HumanCyc (v19.5), AraCyc (v18.5), YeastCyc (v19.5), LeishCyc (v19.5), and TrypanoCyc (v18.5); (ii) three E. coli genomes consisting of E. coli K-12 substr. MG1655 (TAX-511145), E. coli str. CFT073 (TAX-199310), and E. coli O157:H7 str. EDL933 (TAX-155864) (Welch et al., 2002); (iii) BioCyc (v20.5 T2 & 3) (Caspi et al., 2016a) consisting of 9255 PGDBs with 1463 distinct pathways; (iv) reduced complexity of mealybug symbiont genomes from Moranella (GenBank NC-015735) and Tremblaya (GenBank NC-015736) encoding distributed metabolic pathways for amino acid biosynthesis (McCutcheon and Von Dohlen, 2011); (v) the CAMI initiative low complexity dataset (edwards.sdsu.edu/research/cami-challenge-datasets/), consisting of 40 genomes (Sczyrba et al., 2017); and (vi) whole genome shotgun sequences from the HOTS at 25 m, 75 m, 110 m (sunlit), and 500 m (dark) ocean depth intervals downloaded from the NCBI Sequence Read Archive under accession numbers SRX007372, SRX007369, SRX007370, and SRX007371 (Stewart et al., 2011). T1 PGDBs were refined to include only those pathways that cross-intersect with the MetaCyc database (v21) (Caspi et al., 2016b).

The detailed characteristics of the datasets are summarized in Table 4. For each dataset $S$ , we use $| S |$ and L( $S$ ) to represent the number of instances and pathway labels, respectively. In addition, we also present some characteristics of the multi-label datasets, which are denoted as:

Label cardinality [LCard $(S) = \frac{1}{n} \sum_{i = 1}^{i = n} \sum_{j = 1}^{j = t} ℐ [Y_{i, j} \neq - 1]$ ], where $ℐ$ is an indicator function. It denotes the average number of pathways in $S$ .

Label density [LDen $(S) = \frac{L C a r d (S)}{L (S)}$ ]. This is simply obtained through normalizing LCard( $S$ ) by the number of total pathways in $S$ .

Distinct labels [DL $(S)$ ]. This notation indicates the number of distinct pathways in $S$ .

Proportion of distinct labels [PDL $(S) = \frac{D L (S)}{| S |}$ ]. It represents the normalized version of DL( $S$ ), and it is obtained by dividing DL(.) with the number of instances in $S$ .
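The four label statistics above can be computed directly from a label matrix with entries in {-1, +1}. The following plain-Python sketch is illustrative; the function name `label_stats` and the toy label matrix are assumptions, and L( $S$ ) is taken to be the total count of positive labels, which is consistent with the ratios reported in Table 4.

```python
def label_stats(Y):
    """Y: list of label vectors with entries in {-1, +1}, one row per instance."""
    n = len(Y)
    positives = [sum(1 for y in row if y != -1) for row in Y]
    total = sum(positives)                       # L(S): number of pathway labels
    lcard = total / n                            # LCard(S): average labels per instance
    lden = lcard / total if total else 0.0       # LDen(S) = LCard(S) / L(S)
    dl = sum(1 for j in range(len(Y[0]))
             if any(row[j] != -1 for row in Y))  # DL(S): pathways present at least once
    pdl = dl / n                                 # PDL(S) = DL(S) / |S|
    return {"L": total, "LCard": lcard, "LDen": lden, "DL": dl, "PDL": pdl}

# Toy dataset with 2 instances and 4 candidate pathways (hypothetical).
Y = [[+1, -1, +1, -1],
     [+1, +1, -1, -1]]
stats = label_stats(Y)
```

For the toy matrix, each instance carries two pathways, three of the four pathways occur at least once, and the fourth never appears.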

Table 4.

Experimental Data Set Properties

Dataset	$\| S \|$	L( $S$ )	LCard( $S$ )	LDen( $S$ )	DL( $S$ )	PDL( $S$ )	R( $S$ )	RCard( $S$ )	RDen( $S$ )	DR( $S$ )	PDR( $S$ )	PLR( $S$ )	Domain
AraCyc	1	510	510	1	510	510	2182	2182	1	1034	1034	$0.2337$	Arabidopsis thaliana
EcoCyc	1	307	307	1	307	307	1134	1134	1	719	719	$0.2707$	Escherichia coli K-12 substr. MG1655
HumanCyc	1	279	279	1	279	279	1177	1177	1	693	693	$0.2370$	Homo sapiens
LeishCyc	1	87	87	1	87	87	363	363	1	292	292	$0.2397$	Leishmania major Friedlin
TrypanoCyc	1	175	175	1	175	175	743	743	1	512	512	$0.2355$	Trypanosoma brucei
YeastCyc	1	229	229	1	229	229	966	966	1	544	544	$0.2371$	Saccharomyces cerevisiae
Three E. coli	3	—	—	—	—	—	2353	$784.3333$	$0.3333$	634	$211.3333$	—	E. coli K-12 substr. MG1655 (TAX-511145), E. coli str. CFT073 (TAX-199310), and E. coli O157:H7 str. EDL933 (TAX-155864)
BioCyc	9255	1804003	$194.9220$	$0.0001$	1463	$0.1581$	8848714	$956.1009$	$0.0001$	2705	$0.2923$	$0.2039$	BioCyc version 20.5 (tier 2 and 3)
Symbiont	3	—	—	—	—	—	304	$101.3333$	$0.3333$	130	$43.3333$	—	Composed of Moranella and Tremblaya
CAMI	40	6261	$156.5250$	$0.0250$	674	$16.8500$	14269	$356.7250$	$0.0250$	1083	$27.0750$	$0.4388$	Simulated microbiomes of low complexity
HOTS	4	—	—	—	—	—	182675	$26096.4286$	$0.1429$	1442	$206.0000$	—	Metagenomic Hawaii Ocean Time-series (25 m, 75 m, 110 m, and 500 m)

The notations $| S |$ , L( $S$ ), LCard( $S$ ), LDen( $S$ ), DL( $S$ ), and PDL( $S$ ) represent: number of instances, number of pathway labels, pathway labels cardinality, pathway labels density, distinct pathway labels, and proportion of distinct pathway labels for $S$ , respectively. The notations R( $S$ ), RCard( $S$ ), RDen( $S$ ), DR( $S$ ), and PDR( $S$ ) have similar meanings for the enzymatic reactions $ℰ$ in $S$ . PLR( $S$ ) represents a ratio of L( $S$ ) to R( $S$ ). The last column denotes the domain of $S$ .

CAMI, Critical Assessment of Metagenome Interpretation; HOTS, Hawaii Ocean Time Series.

5.4.3. Pathway and enzymatic reaction features

triUMPF was trained by using BioCyc v20.5, which contains fewer than 1460 trainable pathways. To offset this limit, we applied pathway2vec (Basher and Hallam, 2020) by using the RUST-norm (or “crt”) module to obtain pathway and EC features, indicated by $P$ and $E$ , respectively, with the following settings: the number of memorized domains is 3, the explore and the in-out hyperparameters are 0.55 and 0.84, respectively, the number of sampled path instances is 100, the walk length is 100, the embedding dimension size is $m = 128$ , the neighborhood size is 5, the size of negative samples is 5, and the used configuration of MetaCyc is “uec,” indicating that links among ECs are trimmed.

After generating node features, we only apply EC features to concatenate each example i according to: $\begin{matrix} {\tilde{x}}^{(i)} = x^{(i)} ⨁ \frac{1}{r} x^{(i)} E \end{matrix}$ (57)

where $⨁$ indicates the vector concatenation operation, $E \in ℛ^{r \times m}$ corresponds to the feature matrix of ECs, and $m = 128$ . The addition of features results in a dimension of size $r + m$ , where $r = 3650$ . We expect that by incorporating enzymatic reactions features into the original r dimensional example $x^{(i)}$ , the modified ${\tilde{x}}^{(i)}$ summarizes informative characteristics, which are expected to be useful in the prediction task.
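The concatenation in Equation (57) amounts to appending an abundance-weighted average of EC embeddings to each example. A minimal NumPy sketch follows; the function name `augment_features` and the toy values are assumptions chosen so the arithmetic can be followed by hand.

```python
import numpy as np

def augment_features(X, E):
    """Append (1/r) * x E to each row x of X, as in Equation (57)."""
    r = E.shape[0]
    return np.hstack([X, (X @ E) / r])

X = np.array([[2.0, 0.0, 1.0]])        # abundances for r = 3 ECs (toy)
E = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [3.0, 3.0]])             # m = 2 embedding dimensions (toy)
X_tilde = augment_features(X, E)       # shape (1, r + m) = (1, 5)
```

Here $x E = [5, 3]$ , so the appended block is $[5/3, 1]$ and the augmented example has $r + m = 5$ entries.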

5.4.4. Parameter settings

For training, unless otherwise indicated, the learning rate was set to 0.0001, batch size to 50, number of epochs to 10, number of components $k = 100$ , and number of pathway and EC communities to $p = 90$ and $v = 100$ , respectively. The higher-order proximity for $A^{p r o x}$ and $B^{p r o x}$ (corresponding to the P2P and E2E matrices, respectively, in Section 5.4.1) was set to $l_{p} = 3$ and $l_{e} = 1$ and their associated weights were fixed as $ω = 0.1$ and $γ = 0.3$ , respectively. The $α$ and $β$ were fixed to $1 0^{9}$ . For the regularization hyperparameters $λ_{*}$ , we performed 10-fold cross-validation on MetaCyc and a subsample of BioCyc T2 & 3 data and found the settings $λ_{1 : 5} = 0.01$ , $λ_{6} = 10$ , and $ρ = 0.001$ to be optimum on golden T1 data.

5.5. Appendix A5: experimental results

Four tests were performed to benchmark the performance of triUMPF, including parameter sensitivity, network reconstruction, impact of $ρ$ , and metabolic pathway prediction.

5.5.1. Parameter sensitivity

The impact of seven hyperparameters ( $k, p, v, l_{p}, l_{e}$ , $ω$ , and $γ$ ) was evaluated in relation to matrix reconstruction costs for $M$ , $A^{p r o x}$ , and $B^{p r o x}$ . The reconstruction cost (or error) is the sum of mean squared errors incurred when the decomposed matrices are transformed back into their original form; a lower cost indicates that the decomposed low-dimensional matrices better capture the representations of the original matrix. We specifically evaluated the effects of varying the following parameters: (i) the number of components $k \in \{20, 50, 70, 90, 120\}$ , (ii) the community size of pathway $p \in \{20, 50, 70, 90, 100\}$ and EC $v \in \{20, 50, 70, 90, 100\}$ , (iii) the higher-order proximity $l_{p}$ and $l_{e} \in \{1, 2, 3\}$ , and (iv) weights of the polynomial order $ω$ and $γ \in \{0.1, 0.2, 0.3\}$ . We used the full matrix $M$ for each test; however, for community detection, we used BioCyc T2 & 3 data, which were divided into training (80%), validation (5%), and test sets (15%). The final costs for community detection are reported based on the test set after 10 successive trials. In addition, we contrast triUMPF with the standard NMF for monitoring the reconstruction costs of $M$ by varying k values. We emphasize that $M$ , $A^{p r o x}$ , and $B^{p r o x}$ were collected from MetaCyc (Section 5.4.1) and not from BioCyc T2 & 3 (Section 5.4.2).

Figure 4 shows the effect of rank k on triUMPF performance. In general, we observe steady performance with increasing k. Although this contrasts with standard NMF, where reconstruction cost decreases as the number of features increases, it is expected because, unlike standard NMF, triUMPF exploits two types of correlations to recover $M$ : (i) within ECs or pathways and (ii) betweenness interactions that serve as additional regularizers. As observed in Figure 4, higher k values result in improved outcomes. Consequently, we selected $k = 100$ for downstream testing.

FIG. 4.

Sensitivity of components k based on reconstruction cost.

For community detection, we observed optimal results with respect to pathway community size at $p = 20$ under parameter settings $k = 100$ and $v = 100$ , as shown in Figure 5a. However, because $A^{p r o x}$ is so sparse, we suggest that this low rank may not correspond to the optimum community size. As with all methods of community detection, triUMPF is sensitive to community size and requires empirical testing. Therefore, we tested settings between $p = 20$ and $p = 100$ and observed a decrease in performance under parameter settings $k = 100$ and $v = 100$ with $p = 90$ providing a balance between cost and increased community size. A similar result was observed for EC community size at $v = 100$ under parameter settings $p = 90$ and $k = 100$ in Figure 5b.

FIG. 5.

Sensitivity of community size and higher order proximity with weights based on reconstruction cost. (a) Pathway community p (k = 100, v = 100); (b) EC community v (k = 100, p = 90); (c) effect of $l_{p}$ ; (d) effect of $l_{e}$ .

Finally, we show the effect of changing polynomial orders and their weights on triUMPF performance. From Figure 5c, we see that reconstruction cost progressively increases with higher orders of $l_{p}$ for all three weights $ω$ . However, for the same reasons described earlier, we prefer longer distances with lower weights to preserve community structure, and remarkably, when $ω = 0.1$ , triUMPF performance was relatively stable after the second order. The same conclusion can be drawn for $l_{e}$ and its associated weights $γ$ in Figure 5d.

Based on these results, triUMPF performance is stable while minimizing cost under the following parameter settings: $k = 100$, $p > 90$, $e > 90$, $l_{p} = 3$, $ω = 0.1$, $l_{e} = 1$, and $γ = 0.3$. Therefore, we recommend these settings for both MetaCyc and BioCyc T2 & 3.

5.5.2. Network reconstruction

In this section, we explore the robustness of triUMPF when exposed to noise. Links were randomly removed from $M$, $A$, and $B$ according to $ε \in \{20\%, 40\%, 60\%, 80\%\}$. We used the partially linked matrices to refine parameters while comparing the reconstruction cost against the full association matrices $M$, $A$, and $B$. Specifically for $M$, we varied components of $M$ according to $k \in \{20, 50, 70, 90, 120\}$ along with $ε$. For all experiments, both MetaCyc and BioCyc T2 & 3 were applied for training by using hyperparameters described in Section 3.4 of the primary text.
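The link-removal protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `remove_links` and `reconstruction_cost` are hypothetical, the toy matrix is fully linked for simplicity, and symmetric adjacency matrices such as $A$ and $B$ would additionally need links removed in mirrored pairs.

```python
import numpy as np

def remove_links(M, eps, seed=0):
    """Return a copy of M in which a fraction eps of the nonzero entries
    (links) has been set to zero, chosen uniformly at random."""
    rng = np.random.default_rng(seed)
    M_noisy = np.array(M, dtype=float, copy=True)
    rows, cols = np.nonzero(M_noisy)
    n_drop = int(round(eps * rows.size))
    drop = rng.choice(rows.size, size=n_drop, replace=False)
    M_noisy[rows[drop], cols[drop]] = 0.0
    return M_noisy

def reconstruction_cost(M_full, M_hat):
    """Squared Frobenius distance between the full matrix and an estimate."""
    return float(np.linalg.norm(np.asarray(M_full) - np.asarray(M_hat), "fro") ** 2)

# Toy binary association matrix with 100 links.
M_toy = np.ones((10, 10))
for eps in (0.2, 0.4, 0.6, 0.8):
    M_noisy = remove_links(M_toy, eps)
    kept = int(np.count_nonzero(M_noisy))
    print(f"eps={eps:.1f}  links kept={kept}  cost vs full={reconstruction_cost(M_toy, M_noisy):.0f}")
```

In an actual experiment the perturbed matrix would be used for training and the reconstructed estimate, rather than the perturbed input itself, would be compared against the full matrix.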

Figure 6a indicates that by progressively increasing noise $ε$ to $M$, the reconstruction cost increases when k is low. As more features are incorporated, the cost at all noise levels steadily decreases up to $k = 100$. This tendency indicates that both pathway and EC features ($P$ and $E$) contain useful correlations that contribute to the resilience of triUMPF's performance when $M$ is perturbed.

FIG. 6.

Link prediction results by varying noise levels $ε \in \{20\%, 40\%, 60\%, 80\%\}$ based on reconstruction cost. (a) Effect of k; (b) EC links recovery; (c) pathway links recovery.

For $A^{prox}$ and $B^{prox}$, as shown in Figure 6b and c, the costs are reduced in the presence of noise, which is not surprising as the reconstruction of associated communities is constrained on both data and $A^{prox}$ and $B^{prox}$. These results are directly linked to the sparseness of both matrices, as previously described in Fortunato and Hric (2016). The pathway graph network indicates that many pathways constitute islands with no direct links, whereas some pathways are densely connected. For community detection, it is sufficient to group nodes that are densely connected, whereas links between communities can remain sparse. The same line of reasoning follows for the EC network.

5.5.3. Impact of $ρ$

Figure 7 shows an inverse relationship between $ρ$ and predictive performance on the T1 golden datasets as $ρ$ decreases, with performance reaching a plateau at $ρ = 0.001$. The hyperparameter $ρ$ in Equation (5) controls the amount of information propagation from $M$ to pathway label coefficients $Θ$. This suggests that, in practice, weaker constraints should be placed on $Θ$, while not neglecting the associations between EC numbers and pathways indicated in $M$.

FIG. 7.

Effect of $ρ$ based on average F1 score using golden datasets.

5.5.4. Metabolic pathway prediction

Here, we investigate the effectiveness of triUMPF for the pathway prediction task on (i) T1 golden data, (ii) three E. coli data, and (iii) HOTS.

5.5.4.1. T1 golden data

We compare the performance of triUMPF on six benchmark datasets, as described in Section 5.4.2, against the other pathway prediction algorithms using four evaluation metrics: Hamming loss, average precision, average recall, and average F1 score. As shown in Table 5, triUMPF achieved competitive performance against the other methods in terms of average precision.

Table 5.

Predictive Performance of Each Comparing Algorithm on Six Golden T1 Data

Methods	EcoCyc	HumanCyc	AraCyc	YeastCyc	LeishCyc	TrypanoCyc
Hamming loss ↓
PathoLogic	$0.0610$	$\mathbf{0.0633}$	$0.1188$	$\mathbf{0.0424}$	$\mathbf{0.0368}$	$\mathbf{0.0424}$
MinPath	$0.2257$	$0.2530$	$0.3266$	$0.2482$	$0.1615$	$0.2561$
mlLGPR	$0.0804$	$\mathbf{0.0633}$	$\mathbf{0.1069}$	$0.0550$	$0.0380$	$0.0590$
triUMPF	$\mathbf{0.0435}$	$0.0954$	$0.1560$	$0.0649$	$0.0443$	$0.0776$
Average precision ↑
PathoLogic	$0.7230$	$\mathbf{0.6695}$	$0.7011$	$0.7194$	$\mathbf{0.4803}$	$\mathbf{0.5480}$
MinPath	$0.3490$	$0.3004$	$0.3806$	$0.2675$	$0.1758$	$0.2129$
mlLGPR	$0.6187$	$0.6686$	$0.7372$	$0.6480$	$0.4731$	$0.5455$
triUMPF	$\mathbf{0.8662}$	$0.6080$	$\mathbf{0.7377}$	$\mathbf{0.7273}$	$0.4161$	$0.4561$
Average recall ↑
PathoLogic	$0.8078$	$0.8423$	$0.7176$	$0.8734$	$0.8391$	$0.7829$
MinPath	$\mathbf{0.9902}$	$\mathbf{0.9713}$	$\mathbf{0.9843}$	$\mathbf{1.0000}$	$\mathbf{1.0000}$	$\mathbf{1.0000}$
mlLGPR	$0.8827$	$0.8459$	$0.7314$	$0.8603$	$0.9080$	$0.8914$
triUMPF	$0.7590$	$0.3835$	$0.3529$	$0.3319$	$0.7126$	$0.6229$
Average F1 ↑
PathoLogic	$0.7631$	$0.7460$	$0.7093$	$\mathbf{0.7890}$	$0.6109$	$0.6447$
MinPath	$0.5161$	$0.4589$	$0.5489$	$0.4221$	$0.2990$	$0.3511$
mlLGPR	$0.7275$	$\mathbf{0.7468}$	$\mathbf{0.7343}$	$0.7392$	$\mathbf{0.6220}$	$\mathbf{0.6768}$
triUMPF	$\mathbf{0.8090}$	$0.4703$	$0.4775$	$0.4735$	$0.5254$	$0.5266$

For each performance metric, “↓” indicates that a smaller score is better, whereas “↑” indicates that a higher score is better.

Values in boldface represent the best performance score.
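The four metrics reported in Table 5 can be illustrated for a single organism with a short sketch. It assumes the standard definitions of Hamming loss, precision, recall, and F1 over a binary pathway label vector; the function names `multilabel_scores` and `f1_from` and the toy label vectors are hypothetical. The tabulated F1 values appear consistent with the harmonic mean of the tabulated precision and recall, which the last lines check for triUMPF on EcoCyc to within rounding.

```python
import numpy as np

def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0

def multilabel_scores(y_true, y_pred):
    """Hamming loss, precision, recall, and F1 for one binary label vector."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))    # pathways correctly predicted
    fp = int(np.sum(~y_true & y_pred))   # pathways predicted but not expected
    fn = int(np.sum(y_true & ~y_pred))   # expected pathways that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "hamming_loss": float(np.mean(y_true != y_pred)),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }

# Toy example: five candidate pathways, three truly present.
scores = multilabel_scores([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
print(scores)

# Consistency check against Table 5 (triUMPF, EcoCyc): precision 0.8662 and
# recall 0.7590 give an F1 close to the tabulated 0.8090.
f1_ecocyc = f1_from(0.8662, 0.7590)
print(round(f1_ecocyc, 3))
```

The same check can be repeated for other rows of the table; small differences in the last digit are expected because the tabulated precision and recall are themselves rounded.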

5.5.4.2. Three E. coli data

Figure 8 shows pathway communities observed for MG1655, CFT073, and EDL933 by using BioCyc T2 & 3, including MetaCyc in training. Table 6 shows the top 5 communities along with pathways that were predicted by triUMPF for MG1655. Figure 9 shows that PathoLogic was able to infer more than 90 additional pathways when taxonomic pruning was disabled. Table 7 summarizes GapMind (Price et al., 2018) results for MG1655, CFT073, and EDL933. Figure 10 shows the results for both PathoLogic with taxonomic pruning enabled and triUMPF. Without taxonomic pruning, PathoLogic predicted 56 pathways across the 3 strains encompassing 15 amino acid biosynthesis pathways and 20 pathway variants, including the l-proline biosynthesis II (from arginine) pathway that is known only in eukaryotes (Fig. 11), thereby increasing the number of false-positive pathway predictions.

FIG. 8.

Pathway community networks for related T1 and T3 organismal genomes. Pathway communities for (a) Escherichia coli K-12 substr. MG1655 (TAX-511145), (b) E. coli str. CFT073 (TAX-199310), and (c) E. coli O157:H7 str. EDL933 (TAX-155864) based on community detection. Nodes colored in dark gray indicate pathways predicted by PathoLogic; lime pathways predicted by triUMPF; salmon pathways predicted by both PathoLogic and triUMPF; red expected pathways not predicted by both PathoLogic and triUMPF; magenta expected pathways predicted only by PathoLogic; purple expected pathways predicted solely by triUMPF; and green expected pathways predicted by both PathoLogic and triUMPF. Light gray indicates pathways not expected to be encoded in either organismal genome. The node sizes reflect the degree of associations between pathways.

FIG. 9.

A three-way set analysis of predicted pathways for Escherichia coli K-12 substr. MG1655 (TAX-511145), E. coli str. CFT073 (TAX-199310), and E. coli O157:H7 str. EDL933 (TAX-155864) using PathoLogic (without taxonomic pruning).

FIG. 10.

Comparison of predicted pathways for Escherichia coli K-12 substr. MG1655 (TAX-511145), E. coli str. CFT073 (TAX-199310), and E. coli O157:H7 str. EDL933 (TAX-155864) datasets between PathoLogic (taxonomic pruning) and triUMPF. Red circles indicate that neither method predicted a specific pathway, whereas green circles indicate that both methods predicted a specific pathway. Lime circles indicate pathways predicted solely by triUMPF, and gray circles indicate pathways predicted solely by PathoLogic. The size of circles corresponds to the associated pathway coverage information.

FIG. 11.

Comparison of predicted pathways for Escherichia coli K-12 substr. MG1655 (TAX-511145), E. coli str. CFT073 (TAX-199310), and E. coli O157:H7 str. EDL933 (TAX-155864) datasets between PathoLogic (without taxonomic pruning) and triUMPF. Red circles indicate that neither method predicted a specific pathway, whereas green circles indicate that both methods predicted a specific pathway. Lime circles indicate pathways predicted solely by triUMPF, and gray circles indicate pathways predicted solely by PathoLogic. The size of circles corresponds to the associated pathway coverage information.

Table 6.

Top 5 Communities with Pathways Predicted by triUMPF for Escherichia coli K-12 Substr. MG1655 (TAX-511145)

Community index	MetaCyc pathway ID	MetaCyc pathway name	Status
67	PWY0-1182	Trehalose degradation II (trehalase)	True
	PWY-6910	Hydroxymethylpyrimidine salvage	True
	HOMOSER-THRESYN-PWY	l-Threonine biosynthesis	True
	PUTDEG-PWY	Putrescine degradation I	True
	PWY-6611	Adenine and adenosine salvage V	True
	FERMENTATION-PWY	Mixed acid fermentation	True
	ENTNER-DOUDOROFF-PWY	Entner-Doudoroff pathway I	True
34	ASPARAGINESYN-PWY	l-Asparagine biosynthesis II	True
	PWY-5340	Sulfate activation for sulfonation	True
	PWY-6618	Guanine and guanosine salvage III	True
	PWY0-1314	Fructose degradation	True
	PWY-7181	Pyrimidine deoxyribonucleosides degradation	True
	PWY0-1299	Arginine dependent acid resistance	True
	PWY0-42	2-Methylcitrate cycle I	True
9	NAGLIPASYN-PWY	Lipid-A-precursor biosynthesis (E. coli)	True
	PWY-7221	Guanosine ribonucleotides de novo biosynthesis	True
	KDOSYN-PWY	Kdo transfer to lipid IV $_{A}$ I (E. coli)	True
	PWY0-1309	Chitobiose degradation	True
	PPGPPMET-PWY	ppGpp biosynthesis	True
	PWY-6608	Guanosine nucleotides degradation III	True
	PWY-5656	Mannosylglycerate biosynthesis I	False
47	PLPSAL-PWY	Pyridoxal 5′-phosphate salvage I	True
	PWY0-1313	Acetate conversion to acetyl-CoA	True
	PYRUVDEHYD-PWY	Pyruvate decarboxylation to acetyl CoA	True
	PWY-4381	Fatty acid biosynthesis initiation (bacteria and plants)	True
	PWY0-662	PRPP biosynthesis	True
81	HISTSYN-PWY	l-Histidine biosynthesis	True
	PWY-6147	6-Hydroxymethyl-dihydropterin diphosphate biosynthesis I	True
	PWY-7176	UTP and CTP de novo biosynthesis	True
	PWY-6932	Selenate reduction	False

The last column indicates whether a pathway predicted for MG1655 (TAX-511145) is present in (True) or absent from (False, i.e., a false-positive pathway) the EcoCyc reference data.

CTP, cytidine-triphosphate; UTP, uridine-triphosphate.

Table 7.

Eighteen Amino Acid Biosynthesis Pathways and 27 Pathway Variants

Amino acid	MetaCyc pathway ID	MetaCyc pathway name
Arginine	ARGSYNBSUB-PWY	l-Arginine biosynthesis II (acetyl cycle)
	PWY-5154	l-Arginine biosynthesis III (via N-acetyl-l-citrulline)
	PWY-7400	l-Arginine biosynthesis IV (archaebacteria)
Asparagine	ASPARAGINE-BIOSYNTHESIS	l-Asparagine biosynthesis I
Asparagine	ASPARAGINESYN-PWY	l-Asparagine biosynthesis II
Chorismate	PWY-6163	Chorismate biosynthesis from 3-dehydroquinate
Cysteine	CYSTSYN-PWY	l-Cysteine biosynthesis I
Cysteine	PWY-6308	l-Cysteine biosynthesis II (tRNA-dependent)
Glutamine	GLNSYN-PWY	l-Glutamine biosynthesis I
Glycine	GLYSYN-PWY	Glycine biosynthesis I
Glycine	GLYSYN-THR-PWY	Glycine biosynthesis IV
Histidine	HISTSYN-PWY	l-Histidine biosynthesis
Isoleucine	ILEUSYN-PWY	l-Isoleucine biosynthesis I (from threonine)
Isoleucine	PWY-5104	l-Isoleucine biosynthesis IV
Leucine	LEUSYN-PWY	l-Leucine biosynthesis
Lysine	DAPLYSINESYN-PWY	l-Lysine biosynthesis I
	PWY-2941	l-Lysine biosynthesis II
	PWY-2942	l-Lysine biosynthesis III
Methionine	HOMOSER-METSYN-PWY	l-Methionine biosynthesis I
Methionine	PWY-702	l-Methionine biosynthesis II
Phenylalanine	PHESYN	l-Phenylalanine biosynthesis I
Proline	PROSYN-PWY	l-Proline biosynthesis I
Serine	SERSYN-PWY	l-Serine biosynthesis
Threonine	HOMOSER-THRESYN-PWY	l-Threonine biosynthesis
Tryptophan	TRPSYN-PWY	l-Tryptophan biosynthesis
Tyrosine	TYRSYN	l-Tyrosine biosynthesis I
Valine	VALSYN-PWY	l-Valine biosynthesis

5.5.4.3. HOTS water column

Here, we use triUMPF to infer a set of pathways from the HOTS water column spanning sunlit and dark ocean depth intervals comparing results with other prediction methods, including PathoLogic and mlLGPR. The results are presented in Figure 12.

FIG. 12.

Comparative study of predicted pathways for HOT DNA samples. The size of circles corresponds to the associated coverage information.

5.5.4.4. Availability of data and materials

The triUMPF source code is available under the MIT License on GitHub (hallamlab/triUMPF), and the repository includes detailed instructions for installing the software and executing all commands run to generate the results. The MetaCyc database can be obtained from metacyc.org. The T1 golden datasets can be downloaded from biocyc.org. The symbiotic Candidatus Moranella endobia and Candidatus Tremblaya princeps genomes can be downloaded from GenBank under accession numbers NC-015735 and NC-015736, whereas the simulated CAMI low complexity dataset can be obtained from edwards.sdsu.edu/research/cami-challenge-datasets. Unassembled whole genome shotgun DNA pyrosequences from HOTS (10 m, 75 m, 110 m, and 500 m) can be obtained from the NCBI Sequence Read Archive under accession numbers SRX007372, SRX007369, SRX007370, and SRX007371. The preprocessed datasets used in this article can be downloaded from zenodo.org/YNfvDehKhPY. The same Zenodo repository contains a pre-trained triUMPF model (“triUMPF.pkl”) using the configurations stated in Section 5.4.

Footnotes

Acknowledgments

The authors would like to thank Connor Morgan-Lang, Kishori Konwar, and Aria Hahn for lucid discussions on the function of the triUMPF model and all members of the Hallam Lab for helpful comments along the way.

Author Disclosure Statement

S.J.H. is a co-founder of Koonkie, Inc., a bioinformatics consulting company that designs and provides scalable algorithmic and data analytics solutions in the cloud.

Funding Information

This work was performed under the auspices of Genome Canada, Genome British Columbia, the Natural Sciences and Engineering Research Council (NSERC) of Canada, and Compute/Calcul Canada. A.R.M.A.B. and R.J.M. were supported by a UBC four-year doctoral fellowship (4YF) administered through the UBC Graduate Program in Bioinformatics.

References

Bairoch. 2000. The enzyme database in 2000. Nucleic Acids Res. 28, 304–305.

Basher A.R.M.A., and Hallam S.J. 2020. Leveraging heterogeneous network embedding for metabolic pathway prediction. Bioinformatics 37, 822–829.

Basher A.R.M.A., McLaughlin R.J., and Hallam S.J. 2020. Metabolic pathway inference using multi-label classification with rich pathway features. PLoS Comput. Biol. 16, e1008174.

Bro, and Smilde A.K. 2014. Principal component analysis. Anal. Methods 6, 2812–2831.

Carbonell, Wong, Swainston, et al. 2018. Selenzyme: Enzyme selection tool for pathway design. Bioinformatics 34, 2153–2154.

Caspi, Billington, Foerster, et al. 2016a. BioCyc: Online resource for genome and metabolic pathway analysis. FASEB J. 30(1 Suppl), lb192–lb192.

Caspi, Billington, Ferrer, et al. 2016b. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 44, D471–D480.

Cichocki, Zdunek, and Amari S.-i. 2007. Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization. International Conference on Independent Component Analysis and Signal Separation. pp. 169–176. Berlin, Heidelberg.

Dale J.M., Popescu, and Karp P.D. 2010. Machine learning methods for metabolic pathway prediction. BMC Bioinformatics 11, 1–14.

Fortunato, and Hric. 2016. Community detection in networks: A user guide. Phys. Rep. 659, 1–44.

Fu, Huang, Sidiropoulos N.D., et al. 2019. Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications. IEEE Signal Process. Mag. 36, 59–80.

Gillis. 2020. Nonnegative Matrix Factorization. SIAM–Society for Industrial and Applied Mathematics.

Hahn A.S., Konwar K.M., Louca, et al. 2016. The information science of microbial ecology. Curr. Opin. Microbiol. 31, 209–216.

Hanson N.W., Konwar K.M., Hawley A.K., et al. 2014. Metabolic pathways for the whole community. BMC Genomics 15, 1–14.

He, and Garcia E.A. 2009. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21, 1263–1284.

Jiao, Ye, and Tang. 2013. Probabilistic inference of biochemical reactions in microbial communities from metagenomic sequences. PLoS Comput. Biol. 9, e1002981.

Kanehisa, Furumichi, Tanabe, et al. 2017. KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361.

Karp P.D., Latendresse, Paley S.M., et al. 2016. Pathway Tools version 19.0 update: Software for pathway/genome informatics and systems biology. Brief. Bioinform. 17, 877–890.

Kim, and Park. 2007. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23, 1495–1502.

Konwar K.M., Hanson N.W., Pagé A.P., et al. 2013. MetaPathways: A modular pipeline for constructing pathway/genome databases from environmental sequence information. BMC Bioinformatics 14, 202.

Lakshminarayanan, Pritzel, and Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. Adv. Neural Inform. Process. Syst. 8, 6402–6413.

Lawson C.E., Harcombe W.R., Hatzenpichler, et al. 2019. Common principles and best practices for engineering microbiomes. Nat. Rev. Microbiol. 17, 725–741.

Lee D.D., and Seung H.S. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791.

Lee D.D., and Seung H.S. 2001. Algorithms for non-negative matrix factorization. Adv. Neural Inform. Process. Syst. 13, 556–562.

Li, Wang, Zhang, et al. 2019. Learning network embedding with community structural information. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. pp. 2937–2943. Macao.

McCutcheon J.P., and Von Dohlen C.D. 2011. An interdependent metabolic patchwork in the nested symbiosis of mealybugs. Curr. Biol. 21, 1366–1372.

McDonald A.G., Boyce, and Tipton K.F. 2009. ExplorEnz: The primary source of the IUBMB enzyme list. Nucleic Acids Res. 37(Suppl 1), D593–D597.

Natarajan, and Dhillon I.S. 2014. Inductive matrix completion for predicting gene–disease associations. Bioinformatics 30, i60–i68.

Price M.N., Zane G.M., Kuehl J.V., et al. 2018. Filling gaps in bacterial amino acid biosynthesis pathways with high-throughput genetics. PLoS Genet. 14, e1007147.

Rossi R.A., Jin, Kim, et al. 2020. On proximity and structural role-based embeddings in networks: Misconceptions, techniques, and applications. ACM Trans. Knowl. Discov. Data 14, 1–37.

Sczyrba, Hofmann, Belmann, et al. 2017. Critical assessment of metagenome interpretation—A benchmark of metagenomics software. Nat. Methods 14, 1063–1071.

Shafiei, Dunn, Chipman, et al. 2014. BiomeNet: A Bayesian model for inference of metabolic divergence among microbial communities. PLoS Comput. Biol. 10, e1003918.

Stewart F.J., Sharma A.K., Bryant J.A., et al. 2011. Community transcriptomics reveals universal patterns of protein sequence conservation in natural microbial communities. Genome Biol. 12, 1–24.

Toubiana, Puzis, Wen, et al. 2019. Combined network analysis and machine learning allows the prediction of metabolic pathways from tomato metabolomics data. Commun. Biol. 2, 1–13.

Wang, Cui, Wang, et al. 2017. Community preserving network embedding. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. pp. 203–209. Melbourne, Australia.

Welch R.A., Burland, Plunkett, et al. 2002. Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc. Natl. Acad. Sci. U.S.A. 99, 17020–17024.

Yang, and Michailidis. 2015. A non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data. Bioinformatics 32, 1–8.

Ye, and Doak T.G. 2009. A parsimony approach to biological pathway reconstruction/inference for genomes and metagenomes. PLoS Comput. Biol. 5, e1000465.
