Abstract
Chronic liver disease (CLD) is a significant planetary health burden. CLD includes a broad range of liver pathologies from different causes, for example, hepatitis B virus infection, fatty liver disease, hepatocellular carcinoma, and nonalcoholic fatty liver disease or the metabolic associated fatty liver disease. Biomarker and diagnostic discovery, and new molecular targets for precision treatments are timely and sorely needed in CLD. In this context, multi-omics data integration is increasingly being facilitated by artificial intelligence (AI) and attendant digital transformation of systems science. While the digital transformation of multi-omics integrative analyses is still in its infancy, there are noteworthy prospects, hope, and challenges for diagnostic and therapeutic innovation in CLD. This expert review aims at the emerging knowledge frontiers as well as gaps in multi-omics data integration at bulk tissue levels, and those including single cell-level data, gut microbiome data, and finally, those incorporating tissue-specific information. We refer to AI and related digital transformation of the CLD research and development field whenever possible. This review of the emerging frontiers at the intersection of systems science and digital transformation informs future roadmaps to bridge digital technology discovery and clinical omics applications to benefit planetary health and patients with CLD.
Introduction
Chronic liver disease (CLD) is a major global health burden with 1.1 million deaths annually (Asrani et al, 2019). CLD consists of a wide range of liver pathologies with different causes. The recent digital transformation of systems science offers hope for multi-omics data integration, and by extension, for precision and personalized medicine in CLD.
The two most common etiologies for CLD are hepatitis B virus (HBV) infection and fatty liver disease. Among cancers, hepatocellular carcinoma (HCC) is the most common type of liver cancer, and one of the leading causes of cancer-related deaths worldwide with a dismal average 5-year survival rate of less than 32% (Asrani et al, 2019; Global Burden of Disease Liver Cancer Collaboration et al, 2017). HCC displays a high level of heterogeneity owing to etiology (e.g., HBV, hepatitis C virus, nonalcoholic fatty liver disease [NAFLD], alcohol related, etc.), biological and multi-omics variations, population-based differences, and environmental exposures (Lee, 2015). The complexity and heterogeneity in HCC are roadblocks for the development of effective and precision health interventions for HCC, resulting in a limited selection of treatment modalities.
NAFLD is yet another common CLD worldwide, with a global prevalence of about 25%, which is predicted to rise to about 30% by 2030 (Younossi et al, 2019). Because many metabolic and systemic factors, such as obesity, hypertension, hyperlipidemia, hyperglycemia, and others, are involved in the pathogenesis of NAFLD, it was renamed metabolic associated fatty liver disease (Eslam et al, 2020).
Artificial intelligence (AI) and related digital technologies, as featured in OMICS, have been transforming systems science research and clinical applications (Gulfidan et al, 2021; Lin and Wu, 2022; Ozer et al, 2020). AI applications in image analysis, including radiology, ultrasound, and histopathology for CLD, have been reviewed elsewhere in hepatology (Lupsor-Platon et al, 2021; Nam et al, 2022; Zhou et al, 2019), and in relation to electronic health record data. This expert review addresses knowledge gaps in multi-omics data integration at bulk tissue levels, and those including single cell-level data, gut microbiome data, and finally, those incorporating tissue-specific information, referring to AI and related digital transformation in the field whenever possible.
Integrative Analysis Using Multi-Omics Data from Bulk Tissues
The Cancer Genome Atlas (TCGA) Research Network performed the first large-scale multi-omics analysis of HCC. This included whole exome sequencing and DNA copy number analyses for 363 HCC cases; moreover, in the core set of 196 patients, further analyses, including DNA methylation, mRNA expression, microRNA (miRNA) expression, and protein expression were carried out (Cancer Genome Atlas Research Network, 2017). They identified 26 genes as significantly mutated genes (SMGs). TERT promoter mutations were the most common, with 44%, 87 out of 196 patients analyzed, carrying the mutation. Other common SMGs included the tumor suppressor genes TP53 (31%), AXIN1 (8%), and RB1 (4%), the Wnt pathway oncogene CTNNB1 (27%), and the chromatin remodeling genes ARID1A (7%), ARID2 (5%), and BAP1 (5%).
Another interesting finding was that 84% (37 of 44) of HBV-infected HCC cases harbored HBV DNA integration into the host genome with evidence of RNA sequencing (RNA-seq) reads from fusion transcripts. Finally, the authors used a joint multivariate regression approach to integrate multi-omics data, including DNA copy number, DNA methylation, mRNA expression, miRNA expression, and Reverse-Phase Protein Array data. They were able to cluster the HCC cases into three integrated clusters: iCluster 1, 2, and 3.
HCC in iCluster 1 showed high expression of miR-181a (a lipid metabolism regulator) and epigenetic silencing of miR-122, as well as overexpression of proliferation marker genes such as MYBL2, PLK1, and MKI67. The authors subsequently built a class prediction model based on the 200 most variably expressed genes across the three iClusters and found that iCluster 1 predicted poor survival in three independent datasets, suggesting the power of integrative analysis (Cancer Genome Atlas Research Network, 2017).
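Selecting the most variably expressed genes, as in the class prediction model described above, can be illustrated with a minimal sketch. The function name `top_variable_genes`, the use of population variance as the variability measure, and all expression values are assumptions made for illustration; they are not taken from the TCGA analysis, which may have ranked genes differently.

```python
from statistics import pvariance


def top_variable_genes(expression, k):
    """Return the k gene names with the largest variance across samples.

    expression: dict mapping gene name -> list of expression values,
    one value per sample (hypothetical input layout).
    """
    variances = {gene: pvariance(values) for gene, values in expression.items()}
    # Rank by variance, highest first; ties are broken alphabetically
    # so that the result is deterministic.
    ranked = sorted(variances, key=lambda gene: (-variances[gene], gene))
    return ranked[:k]


# Invented toy values (NOT TCGA data); only the gene symbols come from the review.
toy_expression = {
    "MYBL2": [1.0, 9.0, 2.0, 8.0],
    "PLK1": [4.0, 5.0, 4.0, 5.0],
    "MKI67": [0.0, 6.0, 1.0, 7.0],
    "TP53": [3.0, 3.0, 3.0, 3.0],
}

selected = top_variable_genes(toy_expression, k=2)
print(selected)
```

In a real analysis, k would be 200 and the input would be a normalized expression matrix; the selected genes would then feed a classifier rather than being printed.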
Chaudhary et al (2018) integrated multi-omics data from TCGA, including RNA-seq, miRNA sequencing, and methylation data, using a deep learning (DL) approach to predict the prognosis of HCC patients. They identified two HCC subtypes with significant survival differences (p = 7.13e-6). The more aggressive subtype was associated with frequent TP53 inactivation mutations, higher expression of the stemness genes KRT19 and EPCAM and the tumor marker BIRC5, and activated Wnt and Akt signaling pathways.
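A survival difference between two subtypes is commonly assessed with a log-rank test. The sketch below assumes that test for illustration only; the review does not state which test produced the reported p value. The function name `logrank_test` and all follow-up times are hypothetical.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test.

    times_*: follow-up times; events_*: 1 if the event (death) was observed,
    0 if the observation was censored. Returns (chi-square, p value).
    """
    samples = [(t, e, 0) for t, e in zip(times_a, events_a)]
    samples += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in samples if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk = [s for s in samples if s[0] >= t]
        n = len(at_risk)                                   # total at risk
        n_a = sum(1 for s in at_risk if s[2] == 0)         # group A at risk
        d = sum(1 for s in at_risk if s[0] == t and s[1])  # events at time t
        d_a = sum(1 for s in at_risk if s[0] == t and s[1] and s[2] == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the event count in group A.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        raise ValueError("variance is zero; the groups never overlap at risk")
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented follow-up data in months (not from any cited study).
chi2, p_value = logrank_test(
    [2, 3, 4, 5, 6], [1, 1, 1, 1, 1],       # hypothetical aggressive subtype
    [8, 9, 10, 11, 12], [1, 1, 0, 0, 0],    # hypothetical indolent subtype
)
print(round(chi2, 2), p_value)

chi2_same, p_same = logrank_test([1, 2, 3], [1, 1, 1], [1, 2, 3], [1, 1, 1])
```

Identical groups give a statistic of zero and a p value of one, which is a quick sanity check on the implementation.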
The most striking observation was that the DL model solely relied on the omics data without using clinical characteristics, unlike most previously identified prediction models. Adding clinical information did not further improve the performance of the model. The DL-based prediction model was robust against two external datasets from China and three additional datasets from Japan, Hawaii, and continental USA, and against different data types (RNA-seq, mRNA microarray, miRNA array, and DNA methylation data).
The robustness of the observations made by Chaudhary et al (2018) and their DL-based model can be explained in part by their use of an autoencoder as the dimension reduction method. An autoencoder is a type of artificial neural network that learns a compact representation of unlabeled data through unsupervised learning. Chaudhary et al also evaluated Principal Component Analysis as an alternative dimension reduction method and found that the autoencoder-based approach performed better. They also showed that bypassing the feature reduction steps and using individual Cox proportional hazard-based models did not perform as well.
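The idea of an autoencoder as a dimension reduction method can be sketched with a deliberately tiny example: a linear network that compresses two correlated features into one latent value and reconstructs them. This is an illustrative toy under stated assumptions, not the published model; the architecture, learning rate, number of epochs, and data are all hypothetical.

```python
def train_linear_autoencoder(data, lr=0.05, epochs=500):
    """Fit a 2 -> 1 -> 2 linear autoencoder by full-batch gradient descent.

    Encoder: z = w . x ; decoder: x_hat = v * z.
    Loss: mean squared reconstruction error.
    """
    w = [0.1, 0.1]  # encoder weights (fixed start for reproducibility)
    v = [0.1, 0.1]  # decoder weights
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_v = [0.0, 0.0]
        for x in data:
            z = w[0] * x[0] + w[1] * x[1]            # latent code
            r = [v[0] * z - x[0], v[1] * z - x[1]]   # reconstruction residual
            rv = r[0] * v[0] + r[1] * v[1]
            for j in range(2):
                grad_v[j] += 2 * r[j] * z / n
                grad_w[j] += 2 * rv * x[j] / n
        w = [w[j] - lr * grad_w[j] for j in range(2)]
        v = [v[j] - lr * grad_v[j] for j in range(2)]
    return w, v


def reconstruction_error(data, w, v):
    """Mean squared error between the inputs and their reconstructions."""
    total = 0.0
    for x in data:
        z = w[0] * x[0] + w[1] * x[1]
        total += (v[0] * z - x[0]) ** 2 + (v[1] * z - x[1]) ** 2
    return total / len(data)


# Invented toy samples with two strongly correlated features.
toy_data = [(-1.0, -2.05), (-0.5, -0.95), (0.0, 0.0), (0.5, 1.05), (1.0, 1.95)]

error_before = reconstruction_error(toy_data, [0.1, 0.1], [0.1, 0.1])
w, v = train_linear_autoencoder(toy_data)
error_after = reconstruction_error(toy_data, w, v)
print(error_before, error_after)
```

A linear autoencoder such as this one can only recover the subspace that Principal Component Analysis would find, so any advantage over Principal Component Analysis has to come from nonlinear layers, which this sketch omits.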
Shen et al (2021) identified two multi-omics molecular subtypes (iC1 and iC2) for HCC after integration of copy number variation (CNV), DNA methylation, and mRNA expression data from 363 patients; the iC1 subtype had a worse prognosis than the iC2 subtype. In addition, they estimated immune infiltration in HCC using the Tumor Immune Estimation Resource (TIMER) website (https://cistrome.shinyapps.io/timer/) (Li et al, 2017), covering six immune cell types: B cells, CD4+ T cells, CD8+ T cells, neutrophils, macrophages, and dendritic cells. The infiltration levels of all six cell types were significantly higher in the iC1 subtype than in the iC2 subtype. Two genes overexpressed in HCC, ANXA2 and CHAF1B, showed significant correlation with HCC prognosis.
Integrative Analysis Using Multi-Omics Data from Single-Cell Analysis
Cavalli et al (2020) performed single-nuclei RNA-seq for 4282 single nuclei of a liver sample, and identified seven major cell types: hepatocytes (HCs), specialized sinusoidal endothelial cells, Kupffer cells (KCs), active and inactive hepatic stellate cells, cholangiocytes, intrahepatic immune cells (NK-like/T/B cells), and a small percentage of cells of probably nervous/arterial origin. They then integrated long-range HiCap interactions and proteomics data from bulk experiments to infer gene-specific mRNA-to-protein conversion factors and to assign specific enhancer/promoter interactions to specific liver cell populations. Finally, they focused on the analysis of the enzyme dihydropyrimidine dehydrogenase (DPYD), which is encoded by the DPYD gene.
DPYD is involved in the first step of pyrimidine catabolism in the liver. A deficit in DPYD increases the risk of toxicity, including severe bone marrow depression and neurotoxicity, from the common fluoropyrimidine chemotherapeutics 5-fluorouracil and capecitabine, the latter being used to treat breast, colon, rectal, stomach, esophageal, HCC, and pancreatic cancers. By integrated omics analysis, they were able to localize the expression of DPYD to both HCs and KCs, suggesting that both hepatocytes and liver macrophages are involved in pyrimidine catabolism.
By single-cell full-length RNA-seq technology, Sun et al (2021) analyzed the transcriptomes of 113 single circulating tumor cells (CTCs) from four vascular sites (hepatic vein, peripheral artery, peripheral vein, and portal vein) in HCC patients. They showed that the transcriptional dynamics of CTCs were associated with stress response, cell cycle, and immune evasion signaling. Chemokine CCL5, which is regulated by p38-MAX signaling, recruits regulatory T cells (Tregs) to facilitate immune escape and metastatic seeding of CTCs. The study showed the importance of spatial heterogeneity and revealed an immune escape mechanism of CTCs (Sun et al, 2021). Overall, patients with Treg-high/CCL5+ CTC-high status had a worse overall survival (OS) than the remaining patients.
Integrative Analysis Using Multi-Omics Data, Including Gut Microbiomes
The gut microbiome comprises over one trillion microorganisms living in a symbiotic relationship with the host. These microorganisms play important roles in CLDs through the microbiome/gut/liver axis. Oh et al (2020) compared stool microbiomes across 163 well-characterized participants encompassing non-NAFLD controls, NAFLD cirrhosis patients, and their first-degree relatives. By integrating shotgun metagenomic and untargeted metabolomic profiles and applying the random forest machine learning algorithm and differential abundance analysis, they identified a metagenomic signature with 19 discriminatory species and a metabolomic signature with 17 metabolites, both with good performance in detecting NAFLD cirrhosis (area under the receiver operating characteristic curve of 0.91).
They further showed that combining the 19-species metagenomic signature with the host's age and serum albumin level could identify cirrhosis independent of disease etiology and of host ethnicity, in cohorts from different geographical regions. Adding the host's serum aspartate aminotransferase level allowed the differential diagnosis of cirrhosis from earlier stages of fibrosis (Oh et al, 2020).
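The area under the receiver operating characteristic curve quoted for such classifiers has a simple interpretation: it is the probability that a randomly chosen case receives a higher score than a randomly chosen control. The sketch below computes it directly from that definition. The function name `roc_auc`, the labels, and the scores are invented for illustration and do not come from any cited study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for cases (e.g., cirrhosis), 0 for controls.
    scores: classifier outputs, higher meaning more likely a case.
    Tied case/control pairs count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for three cases and three controls.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(toy_labels, toy_scores)
print(round(auc, 3))
```

Here eight of the nine case/control pairs are ranked correctly, so the value is 8/9. The pairwise loop is quadratic in sample size, which is acceptable for a sketch but not for large cohorts.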
Behary et al (2021) applied metagenomic and metabolomic analysis for the gut microbiomes in patients with NAFLD-related cirrhosis, with or without HCC. They found that there were characteristic changes in microbiome compositions in patients with NAFLD-HCC. For example, at the family level, they found expansion of Enterobacteriaceae in NAFLD-HCC compared with NAFLD cirrhosis (p = 0.033) and non-NAFLD controls (p = 0.025), and significantly enriched bacterial species Bacteroides caecimuris (p < 0.0001) and Veillonella parvula (p = 0.002) in NAFLD-HCC compared with NAFLD cirrhosis and non-NAFLD control.
They also identified that the gene function in the microbiota changed toward capacity for short-chain fatty acid production from dietary fiber along with HCC development (Behary et al, 2021). For example, pyruvate carboxylase (pycA), which is responsible for the production of oxaloacetate from pyruvate, and two genes related to acetate synthesis (phosphate acetyltransferase; pta) and butyrate/acetylphosphate synthesis (phosphate butyryltransferase; ptb) were all overexpressed in NAFLD-HCC compared with NAFLD cirrhosis and non-NAFLD control (Behary et al, 2021).
In addition, they showed that only the extracts from the NAFLD-HCC microbiota, but not from the control groups, modulated the peripheral immune response by eliciting a T cell immunosuppressive phenotype, with expansion of Tregs and reduction of CD8+ T cells.
Another study reported on the potential of the gut microbiota as a biomarker for HCC in Chinese HBV-related cirrhotic patients (Ren et al, 2019). Using a random forest model, they identified 30 microbial markers for differentiating early HCC from non-HCC samples, with an area under the curve of 0.8. They found that the butyrate-producing genera were decreased, while lipopolysaccharide (LPS)-generating genera were increased in early HCC versus controls. Phylum Actinobacteria was increased, and 13 genera, including Gemmiger, Parabacteroides, and Paraprevotella, were enriched in early HCC versus cirrhosis. Phylum Verrucomicrobia and 12 genera, including Alistipes, Phascolarctobacterium, and Ruminococcus, were decreased in early HCC versus controls (Ren et al, 2019).
Zheng et al (2020) compared the gut microbiota of hepatitis (n = 24), liver cirrhosis (LC) (n = 24), HCC (n = 75), and healthy controls (n = 20). The HCC group contained 52 LC-HCC and 23 nonliver cirrhosis-induced HCC (NLC-HCC). They found that the fecal microbial diversity was significantly increased in HCC compared with LC. The abundance of butyrate-producing bacteria was decreased, whereas LPS-producing bacteria increased in LC-HCC patients. The phyla Fusobacteria and Proteobacteria were significantly increased whereas the phylum Tenericutes was significantly decreased in LC compared with HCC. Thirteen genera were associated with the HCC tumor size and three genera (Enterococcus, Limnobacter, and Phyllobacterium) could be used as biomarkers for HCC diagnosis.
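Microbial diversity within a sample is often summarized with the Shannon index. The sketch below assumes that index for illustration; the review does not say which diversity measure was used in the study above. The function name `shannon_index` and the taxon counts are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with nonzero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Invented read counts per taxon for two hypothetical stool samples.
even_sample = [25, 25, 25, 25]      # four equally abundant taxa
dominated_sample = [97, 1, 1, 1]    # one taxon dominates

h_even = shannon_index(even_sample)
h_dominated = shannon_index(dominated_sample)
print(round(h_even, 3), round(h_dominated, 3))
```

An even community reaches the maximum value ln(4) for four taxa, while a community dominated by one taxon scores close to zero.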
Pinero et al (2019) compared the gut microbiome in cirrhotic patients with and without HCC and found a threefold increase of Erysipelotrichaceae and an increased Bacteroides/Prevotella ratio, but a fivefold decrease in the Leuconostocaceae family and a fivefold decrease of the Fusobacterium genus in patients with HCC compared with those without HCC. Using a random forest model, they were able to correctly classify HCC cases using differentially abundant taxa, with an area under the receiver operating characteristic curve of 0.75. The differential pattern observed between HCC and non-HCC patients was linked to inflammation with activation of nucleotide-binding and oligomerization domain-like receptor pathways (Pinero et al, 2019).
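Fold changes such as those quoted above are typically computed on relative abundances rather than raw read counts. The following sketch shows the arithmetic under that assumption. The helper names are hypothetical, and the counts are invented and deliberately chosen to reproduce a threefold increase and a fivefold decrease; they are not data from the cited study.

```python
def relative_abundance(counts):
    """Convert raw read counts per taxon into fractions of the sample total."""
    total = sum(counts.values())
    return {taxon: count / total for taxon, count in counts.items()}


def mean_abundance(samples, taxon):
    """Mean relative abundance of one taxon across a group of samples."""
    return sum(relative_abundance(s)[taxon] for s in samples) / len(samples)


def fold_change(case_samples, control_samples, taxon):
    """Ratio of mean relative abundance in cases versus controls."""
    return mean_abundance(case_samples, taxon) / mean_abundance(control_samples, taxon)


# Invented counts; "other" pools all remaining taxa.
hcc_samples = [
    {"Erysipelotrichaceae": 30, "Leuconostocaceae": 2, "other": 68},
    {"Erysipelotrichaceae": 30, "Leuconostocaceae": 2, "other": 68},
]
non_hcc_samples = [
    {"Erysipelotrichaceae": 10, "Leuconostocaceae": 10, "other": 80},
    {"Erysipelotrichaceae": 10, "Leuconostocaceae": 10, "other": 80},
]

fc_up = fold_change(hcc_samples, non_hcc_samples, "Erysipelotrichaceae")
fc_down = fold_change(hcc_samples, non_hcc_samples, "Leuconostocaceae")
print(round(fc_up, 2), round(fc_down, 2))
```

A fold change of 0.2 corresponds to a fivefold decrease. Real analyses usually add a statistical test and a correction for compositional effects, which this sketch leaves out.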
An analysis of the gut microbiota and mediators of bile acid (BA) signaling by an integrative study of both stool and serum samples from patients with nonalcoholic steatohepatitis (NASH) non-HCC and NASH-HCC and healthy controls identified altered microbiota diversity and BA signaling in cirrhotic and noncirrhotic NASH-HCC (Sydor et al, 2020). Changes were found in several bacteria involved in the BA metabolism, including Bacteroides and Lactobacilli. In particular, Sydor et al (2020) showed that total BA, primary conjugated BA, and the individual BA, including glycine-conjugated cholic acid, taurine-conjugated cholic acid, glycine-conjugated chenodeoxycholic acid, and taurine-conjugated chenodeoxycholic acid were associated with the abundance of Lactobacillus during disease progression of NASH fibrosis.
Another noteworthy study improved the iNetModels to iNetModels 2.0 as an interactive visualization and database of multi-omics data and Multi-Omics Biological Networks (Arif et al, 2021). Furthermore, it included not only omics data from human tissues but also data from the oral and gut microbiome (Arif et al, 2021).
Integrating Multi-Omics from Multiple Tissues
Another level of multi-omics integration involves approaches to integrate multi-omics data from multiple tissues to evaluate the tissue specificity/enrichment of putative biomarkers to achieve better specificity in diagnostic discovery. In addition, by integrating with tissue specificity/enrichment across all tissues, one can identify tissue-specific drug targets that show minimal toxicity to other tissues.
Tissue-enriched genes (TEGs) or tissue-specific genes (TSGs) show a highly enriched or specific expression pattern in one tissue, with no or very low expression in the remaining tissues. Integrating data from multiple tissues allows identification of TEGs or TSGs, for which alterations in expression and in the associated proteins often provide better diagnostic or prognostic value than genes that are expressed in multiple tissues (Hood et al, 2004; Huang et al, 2015).
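One widely used way to quantify how tissue-specific a gene is across a panel of tissues is the tau index, which ranges from 0 (uniform expression) to 1 (expression in a single tissue). The sketch below assumes that metric for illustration; the review does not specify which metric the resources it discusses rely on. The function name `tau_index` and the expression values are hypothetical.

```python
def tau_index(expression_by_tissue):
    """Tau tissue-specificity index.

    tau = sum(1 - x_i / x_max) / (n - 1), where x_i is the expression in
    tissue i, x_max is the highest expression, and n is the tissue count.
    """
    values = list(expression_by_tissue.values())
    n = len(values)
    if n < 2:
        raise ValueError("need at least two tissues")
    x_max = max(values)
    if x_max <= 0:
        raise ValueError("gene is not expressed in any tissue")
    return sum(1 - x / x_max for x in values) / (n - 1)


# Invented expression values for two hypothetical genes.
liver_specific = {"liver": 100.0, "muscle": 0.0, "adipose": 0.0, "brain": 0.0}
housekeeping = {"liver": 50.0, "muscle": 50.0, "adipose": 50.0, "brain": 50.0}

tau_specific = tau_index(liver_specific)
tau_uniform = tau_index(housekeeping)
print(tau_specific, tau_uniform)
```

Genes above a chosen tau cutoff would be treated as candidates for tissue-enriched or tissue-specific status; the cutoff itself is a design choice and is not given in the text.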
Several tissue-specific databases were constructed in the past using expressed sequence tags (ESTs) or microarray data, but they are of limited sensitivity due to the limited coverage of EST or microarray data (Liu et al, 2008). More recently, resources that include RNA-seq data for TEG analysis have been published. For example, Kim et al (2018) built a Tissue-specific Gene DataBase (TissGDB in cancer) (http://zhaobioinfo.org/TissGDB) by curating 2461 TSGs across 22 tissue types, and conducted gene expression, somatic mutation, and prognostic marker-based analyses using 28 cancer types from TCGA. Supplementary Table S1 displays a list of liver-specific genes from the analysis using TissGDB.
Furthermore, by integrating other data, including genomics data such as CNV and fusion transcript analysis, coexpressed protein interaction network, survival outcomes, including OS and relapse-free survival, they were able to correlate TSGs with cancer type-specific isoform expression, with fusion events with oncogenes or tumor suppressor genes, and with prognosis for survival time.
A database was built on tissue and cancer-specific biological networks, named TCSBN, by integrating biological networks, including GEnome-scale metabolic Models, transcriptional regulatory networks, protein–protein interaction networks, signaling networks, and coexpression networks (CNs) (Lee et al, 2018). They then added the tissue component to build a tissue-specific integrated network for liver, muscle, and adipose tissues, and generated human CNs for 46 normal tissues and 17 types of cancers. The database is available at http://inetmodels.com.
We used MBL-associated serine protease 2 (MASP2), a liver-specific gene, as an example and searched for its CNs in normal tissues and in liver cancer on the iNetModels website. We found that the CNs differ dramatically between normal tissue and liver cancer, as shown in Figure 1. For example, among the positively correlated CNs, only four genes were common to the two CNs: haptoglobin (HP), hemopexin (HPX), serpin family C member 1 (SERPINC1), and serpin family G member 1 (SERPING1).

Figure 1. The difference in the CNs between normal tissues and hepatocellular carcinoma.
In the CNs for liver cancer, 20 new genes with positive correlation to MASP2 were identified (Table 1). From the switch of the genes in the CNs from normal tissues to liver cancers (Table 1), we can see a switch from mostly complement components in normal tissues to genes with novel functions such as the tumor protein p63-regulated 1-like (TPRG1L) and the solute carrier family 27 member 5 (SLC27A5). The integrated analysis allows us to gain better understanding of the MASP2 network in liver cancer.
Table 1. New Coexpression Networks Found Only in Liver Cancer But Not in the Coexpression Networks for Normal Tissues
Table footnotes: p value; adjusted p value.
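The comparison of coexpression partners between normal tissue and liver cancer described above can be sketched as follows: compute correlations with a hub gene in each condition, keep the positively correlated partners, and compare the two sets. This is an illustrative sketch only. The Pearson-based definition of a partner, the 0.7 threshold, the helper names, and all expression values are assumptions; they do not describe how iNetModels builds its networks.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def positive_partners(expression, hub, threshold=0.7):
    """Genes whose correlation with the hub gene is at least the threshold."""
    return {
        gene
        for gene, values in expression.items()
        if gene != hub and pearson(expression[hub], values) >= threshold
    }


# Invented expression values across five hypothetical samples per condition.
normal = {
    "MASP2": [1, 2, 3, 4, 5],
    "HP": [2, 4, 6, 8, 10],
    "HPX": [1, 3, 2, 5, 4],
    "SLC27A5": [5, 1, 4, 2, 3],
}
tumor = {
    "MASP2": [5, 3, 4, 1, 2],
    "HP": [10, 6, 8, 2, 4],
    "HPX": [4, 2, 5, 1, 3],
    "SLC27A5": [9, 7, 8, 5, 6],
}

normal_partners = positive_partners(normal, "MASP2")
tumor_partners = positive_partners(tumor, "MASP2")
shared = normal_partners & tumor_partners   # partners in both conditions
gained = tumor_partners - normal_partners   # partners found only in cancer
print(sorted(shared), sorted(gained))
```

With these toy values, HP and HPX are partners of MASP2 in both conditions, whereas SLC27A5 appears only in the tumor set, mirroring the kind of switch reported in the text.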
Conclusions
CLD is one of the most significant planetary health burdens. Novel biomarkers, diagnostics, and molecular targets for precision medicine are timely and much needed in this context. Multi-omics data integration is increasingly being aided by AI, machine learning, and digital transformation of omics systems science. While multi-omics integrative analyses with AI and related technologies are still in their infancy, there are promising examples as well as challenges in this context for diagnostic and therapeutic innovation in CLD. This review and analysis of the emerging frontiers at the intersection of systems science and digital transformation inform future roadmaps to bridge digital technology discovery and clinical omics applications to benefit planetary health and patients with CLD.
Footnotes
References
Supplementary Material
