Multiomics Integration at Single-Cell Resolution Using Bayesian Networks: A Case Study in Hepatocellular Carcinoma

Abstract

Multiomics data integration is one of the leading frontiers of complex disease research and integrative biology. Advances in single-cell sequencing technologies add another crucial dimension to multiomics research, because they allow multiple omics layers to be measured and integrated in the same cell. In this study, we report multiomics data integration at single-cell resolution using Bayesian networks (BNs) in a case study of hepatocellular carcinoma (HCC). A BN encodes the conditional dependencies and independencies among variables as a graphical model with an accompanying joint probability distribution. RNA-seq and reduced representation bisulfite sequencing data were analyzed separately, and copy number variations were estimated with a hidden Markov model method. Several BN models were constructed to reveal causal and associational relationships among the omics layers. The approach was subjected to a validation study using an independent data set. We show the heterogeneity of the multiple cellular layers of HCC at single-cell omics resolution by identifying best-fitted BN models for 295 genes. We also provide novel insights into the multiomics mechanistic relationships of the human leukocyte antigen class I genes in HCC. To the best of our knowledge, this is the first study to focus on integrating omics data using a machine learning algorithm, BNs, at single-cell resolution in a case study of HCC.

Introduction

Understanding complex human diseases demands unpacking their biology at different molecular levels. Multiomics integration studies can offer deeper insights into the mechanisms of heterogeneity in complex diseases than single-omics analyses (Hasin et al, 2017; Lin et al, 2022; Yan et al, 2018). A multiomics approach is therefore essential to evaluate and decipher pathological mechanisms in cancer research.

Cancer has many characteristics that can be studied, including, but not limited to, copy number variations (CNVs), gene expression dysregulation, and epigenetic aberration (Esteller, 2007; Hanahan and Weinberg, 2011). Many studies have shown that these characteristics play a major role in tumoral biological processes such as cellular differentiation, migration, proliferation, and metastasis (Colagrande et al, 2016; Hanahan and Weinberg, 2011; Marrero et al, 2010). Previous studies focused on these features mainly in bulk samples (Chen et al, 2021; Chen et al, 2020; Shen et al, 2021; Woo et al, 2017).

On the other hand, single-cell studies are preferred because of the heterogeneity between cells: they make it easier to discover mechanisms that are masked when large numbers of cells are examined together (Dohmen et al, 2022; Liu et al, 2022; Song et al, 2022). Besides transcriptomics, single-cell studies have started to focus on other types of omics, such as genomic variation and epigenomics (Heo et al, 2022; Ono et al, 2021; Song et al, 2022; Velazquez-Villarreal et al, 2020). At the same time, studying these omics individually makes it challenging to discover biomarkers for complex diseases like cancer. Joint analysis of several omics layers to elucidate these complex biological processes has only started in recent years (Chen et al, 2021; Gutierrez-Arcelus et al, 2013; Ono et al, 2021).

Because omics data are complex, integration at the level and scale of multiomics is quite challenging (Huang et al, 2017). Many theoretical methods and novel algorithms have therefore been developed for multiomics data integration, including clustering (Dongfang Wang, 2016; Kiselev et al, 2017; Tini et al, 2019), feature selection (El-Manzalawy et al, 2018; Townes et al, 2019), cell type recognition (Cadwell et al, 2017; Xu and Su, 2015), dissection of gene regulatory networks (Petralia et al, 2015; Tong et al, 2022; Zarayeneh et al, 2017), and partial least squares regression models (Drouard et al, 2022).

This study proposes Bayesian networks (BNs) as a multiomics integration method. BNs are graphical probabilistic models that represent the joint probability distribution in a factorized way (Neuberg, 2003; Pearl, 2000). A BN is a directed acyclic graph (DAG) composed of nodes and edges, where nodes represent variables and edges represent direct dependency relationships between these variables (Pearl, 2000; Puga et al, 2005). The DAG encodes a set of conditional independence assumptions between the variables (Neuberg, 2003; Pearl, 2000). The joint probability is decomposed as a product of local probabilities, where the local probability of each variable is conditional on its parents in the graph. The local probability distribution can take any form, but a multinomial distribution is usually used for discrete variables. The graphical structure of a BN can be learned from data by either inferring the most likely DAG or discovering the conditional dependencies in the data (Pearl, 2000).
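
As a compact illustration of this factorization, the joint distribution of a BN over n variables can be written as follows. The symbols X_i and Pa(X_i) are notation introduced here for illustration and do not appear in the original text.

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

Here Pa(X_i) denotes the set of parents of X_i in the DAG; a node without parents contributes its marginal probability P(X_i).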

In addition to having simple and explainable graphical structures, BNs can be understood without deep mathematical knowledge. They have also shown good performance with small sample sizes (Pollino and Henderson, 2010). These features make BNs a preferable machine learning model for molecular data. BNs were first used in genetics to analyze causation between plasma apoE levels and single nucleotide polymorphisms (SNPs) in the APOE gene (Rodin and Boerwinkle, 2005). Another study used BNs to find causal relationships among significant expression quantitative trait loci, methylation quantitative trait loci, and expression quantitative trait methylation from different cell types at the bulk level (Gutierrez-Arcelus et al, 2013). A further study suggested, using real genome data and simulated data, that BNs can complement conventional Mendelian randomization (Howey et al, 2020). However, these studies did not focus on disease complexity, and multiomics integration was not their primary purpose.

This study proposes a novel approach that uses BNs for single-cell multiomics integration. For this purpose, a published dataset of 25 hepatocellular carcinoma (HCC) single-cell sequencing samples was retrieved (Hou et al, 2016). Gene expression quantifications, DNA methylation levels, and estimated CNVs were used to build three alternative BN models. Our results show the heterogeneity of these omics in cancer cells and highlight molecular mechanisms that change from gene to gene. We dissect this heterogeneity by showing that different genes in the same pathway can follow different BN models. Our results also reveal a picture of causality between the omics layers of the human leukocyte antigen (HLA) class I genes HLA-A, HLA-B, and HLA-C in HCC.

Materials and Methods

Data source and workflow

A dataset of 25 HCC single-cell sequencing samples (Hou et al, 2016) was acquired from the NCBI Gene Expression Omnibus under accession number GSE65364 using the SRA Toolkit. The dataset consists of single-cell RNA-seq and single-cell reduced representation bisulfite sequencing (RRBS) reads for each sample. First, we analyzed each omics data type of these cells separately. Then, we constructed three alternative BN models of the three-way association involving CNV states, gene expression quantifications, and DNA methylation levels. For model comparison, the Akaike information criterion (AIC) and Bayesian information criterion (BIC) were used to choose the best model for each gene. Genes with a verified BN model were further investigated using the cBioPortal platform (Fig. 1).

FIG. 1.

The research workflow. A flow chart illustrating the steps of the study. First, each omics layer was analyzed on its own. CNVs were then estimated from the methylome data. Next, three alternative BN models were constructed from the data of the three omics layers. These models were applied to all protein-coding genes, and the best model for each gene was chosen according to AIC and BIC scores. The cBioPortal platform was used for enrichment of genes with validated BN models. AIC, Akaike information criterion; BIC, Bayesian information criterion; BN, Bayesian network; CNV, copy number variation.

Sequencing data pre-processing

The quality of raw sequencing data was checked using FastQC (Andrews, 2015). Then, low-quality reads (<20 phred score) and adaptor contaminations were trimmed. For RNA-seq data, Trimmomatic was used with PE SLIDINGWINDOW:5:20 ILLUMINACLIP:NEBNEXT-PE:2:30:10 LEADING:20 options (Bolger et al, 2014). For RRBS data, Trim-Galore with rrbs and paired options was used (Krueger et al, 2021).

Gene expression quantifications

Filtered RNA sequencing reads were aligned to the human reference genome (hg19, UCSC release) using TopHat, and gene expression was quantified using the Cufflinks pipeline with default parameters (Trapnell et al, 2012). We measured the expression level of protein-coding genes only, using the RefSeq gene list (20,203 genes) (Pruitt et al, 2012). The Cufflinks pipeline reports gene expression levels in fragments per kilobase of transcript per million mapped reads (FPKM). FPKM values were then grouped into three categories for downstream analysis: low (0.2–10), moderate (10–1000), and high (>1000).
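
The binning described above can be sketched as follows. This is a minimal Python illustration, not the authors' code: the function name categorize_fpkm is hypothetical, and the handling of boundary values (exactly 10 or 1000) and of values below 0.2 (returned as None, that is, treated as not expressed) is an assumption, since the text does not specify it.

```python
def categorize_fpkm(fpkm):
    """Map an FPKM value to the expression category used for the BN models.

    Assumptions (not stated in the text): values below 0.2 are treated as
    not expressed and return None; 10 and 1000 fall in the moderate bin.
    """
    if fpkm < 0.2:
        return None
    if fpkm < 10:
        return "low"
    if fpkm <= 1000:
        return "moderate"
    return "high"


# Example: three hypothetical FPKM values for one gene across cells
categories = [categorize_fpkm(v) for v in (5.0, 500.0, 1500.0)]
```

Discretizing in this way lets each node of the network be modeled with a multinomial distribution over a small number of states.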

Calculation of DNA methylation level

Filtered RRBS reads were aligned to the C-to-T and G-to-A converted human reference genome (hg19, UCSC release), and cytosine methylation calling was done using Bismark with default parameters (Krueger and Andrews, 2011). The methylation level of each CpG site was calculated as a beta value, β = M/(M + U), where M and U denote the methylated and unmethylated signal intensities at that site (Bibikova et al, 2009). Only CpG sites with a read depth of ≥3 were included in the downstream analysis (Hou et al, 2016). The methylation level of each gene was calculated as the mean of the beta values from the transcription start site to the transcription end site of that gene. For downstream analysis, methylation levels were then categorized as hypomethylated (<0.2), neutral (0.2–0.8), or hypermethylated (>0.8).
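
The gene-level methylation calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the study's script: the function names gene_methylation and categorize_methylation are hypothetical, M and U are taken to be methylated and unmethylated read counts per CpG site, and the treatment of values exactly equal to 0.2 or 0.8 as neutral is an assumption.

```python
def gene_methylation(sites, min_depth=3):
    """Mean beta value over the CpG sites of one gene.

    sites: iterable of (methylated_reads, unmethylated_reads) pairs for the
    CpG sites between the transcription start and end sites.
    Sites with read depth below min_depth are skipped, as in the text.
    Returns None when no site passes the depth filter.
    """
    betas = [m / (m + u) for m, u in sites if (m + u) >= min_depth]
    if not betas:
        return None
    return sum(betas) / len(betas)


def categorize_methylation(beta):
    """Map a gene-level beta value to the category used for the BN models."""
    if beta < 0.2:
        return "hypomethylated"
    if beta <= 0.8:
        return "neutral"
    return "hypermethylated"


# Hypothetical gene with three CpG sites; the last one fails the depth filter
level = gene_methylation([(3, 0), (9, 1), (1, 0)])
state = categorize_methylation(level)
```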

CNV estimation from RRBS data

The CNV was estimated using the HMMcopy R package at a resolution of 500 kb (Lai et al, 2022). HMMcopy first computes the read counts from the aligned reads (BAM), the GC content of the reference genome, and the mappability of the reference sequence. It then corrects the read counts for GC content and mappability after eliminating outlier bins and bins with zero reads or zero GC content. Thereafter, it applies a hidden Markov model to the normalized values to predict the copy number of each segment. Finally, HMMcopy generates a table containing the genomic segment, CNV state, and mean copy number. The CNV states are scores from 1 to 6: (1) homozygous deletion, (2) heterozygous deletion, (3) neutral, (4) increased copy number, (5) heterozygous duplication, and (6) homozygous duplication. The state values were used in the downstream analysis.

Causal BN analysis

FPKM scores, beta values, and CNV states of each gene across all samples were combined into a matrix for constructing BN models. In other words, we built one matrix per gene, composed of the three omics data types from all samples. Because BN models cannot deal with constant features, we selected only the genes that showed variance in all three omics. We constructed three alternative BN models to explore the causal relationships between omics. The “CEM” model assumes a serial connection in which CNV affects gene expression, and gene expression affects DNA methylation (Fig. 2A). The “CME” model assumes a serial connection in which CNV affects DNA methylation, and DNA methylation affects gene expression (Fig. 2B). The “INDEP” model assumes a diverging connection in which CNV affects DNA methylation and gene expression independently (Fig. 2C).
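
To make the three candidate structures concrete, the following minimal Python sketch encodes each model as a map from node to parents and scores it with the maximized multinomial log-likelihood. The study itself used an R package for this step, so this is not the authors' code. The names MODELS, log_likelihood, n_params, and bic, the toy rows, the parameter count based on observed levels, and the classic BIC form −2 log L + k log n (lower is better) are all assumptions made for illustration; the score scaling used by a given BN library may differ.

```python
import math
from collections import Counter

# Parent sets of the three candidate DAGs
# (C = CNV state, E = expression category, M = methylation category).
MODELS = {
    "CEM": {"C": (), "E": ("C",), "M": ("E",)},
    "CME": {"C": (), "M": ("C",), "E": ("M",)},
    "INDEP": {"C": (), "E": ("C",), "M": ("C",)},
}


def log_likelihood(rows, parents):
    """Log-likelihood of a discrete BN at its maximum likelihood estimates."""
    total = 0.0
    for node, pa in parents.items():
        joint = Counter((tuple(r[p] for p in pa), r[node]) for r in rows)
        marginal = Counter(tuple(r[p] for p in pa) for r in rows)
        for (pa_values, _), n in joint.items():
            # MLE of P(node | parents) is the observed conditional frequency
            total += n * math.log(n / marginal[pa_values])
    return total


def n_params(rows, parents):
    """Number of free multinomial parameters, using the observed levels."""
    levels = {v: len({r[v] for r in rows}) for v in parents}
    k = 0
    for node, pa in parents.items():
        configs = 1
        for p in pa:
            configs *= levels[p]
        k += (levels[node] - 1) * configs
    return k


def bic(rows, parents):
    """Classic BIC, -2 log L + k log n; lower values indicate a better model."""
    return -2.0 * log_likelihood(rows, parents) + n_params(rows, parents) * math.log(len(rows))


# Toy data for one hypothetical gene: one row per cell
rows = (
    [{"C": 2, "E": "low", "M": "hyper"}] * 3
    + [{"C": 2, "E": "high", "M": "hypo"}] * 1
    + [{"C": 3, "E": "high", "M": "hypo"}] * 3
    + [{"C": 3, "E": "low", "M": "hyper"}] * 1
)
scores = {name: bic(rows, parents) for name, parents in MODELS.items()}
```

In this toy data expression and methylation determine each other exactly, so the two serial models tie and both score better than INDEP; real data are needed to separate CEM from CME.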

FIG. 2.

The three alternative BN models in the study. (A) “CEM” model assumes that there is a serial connection when CNV affects gene expression, and gene expression affects DNA methylation. (B) “CME” model assumes that CNV affects DNA methylation, and DNA methylation affects gene expression in a serial connection. (C) “INDEP” model assumes that CNV affects both DNA methylation and gene expression independently in a diverging connection. CEM, CNV-Expression-Methylation Model; CME, CNV-Methylation-Expression Model; INDEP, Independent Model.

The bnlearn R package was used for constructing, fitting, and comparing BN models (Scutari, 2010). Model parameters were fitted using maximum likelihood estimation (MLE). At this point, each gene had three fitted models. We compared the three models using AIC and BIC scores to choose the best BN model for each gene. Using AIC scores, we considered a gene to have a best-fitted model only if its best model was at least 10 times more likely than the second-best model. Only 23 genes met this criterion.
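
The relative-likelihood criterion can be illustrated with a short Python sketch. It assumes the standard form exp((AIC_second − AIC_lowest)/2), which reproduces the values reported in Table 1 up to rounding. The function name relative_likelihood is hypothetical, and the code is not taken from the study's scripts.

```python
import math


def relative_likelihood(aic_scores):
    """Return (best_model, ratio) for a dict of model name -> AIC score.

    The best model is the one with the lowest score; the ratio expresses how
    many times more likely it is than the model with the second-lowest score.
    """
    ranked = sorted(aic_scores.items(), key=lambda item: item[1])
    (best_name, lowest), (_, second_lowest) = ranked[0], ranked[1]
    return best_name, math.exp((second_lowest - lowest) / 2.0)


# AIC scores of MGAT4C as listed in Table 1
best, ratio = relative_likelihood({"INDEP": -22.358, "CME": -22.358, "CEM": -35.559})
# best is "CEM"; ratio is roughly 735, in line with the tabulated 735.530
keep_gene = ratio >= 10
```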

Using BIC scores, on the other hand, we defined the genes with a best-fitted model by strength of evidence. The strength of evidence was graded according to the BIC difference (ΔBIC) between the best and second-best models: 0–2, weak evidence; 2–6, positive evidence; 6–10, strong evidence; and >10, very strong evidence (Lorah and Womack, 2019). We considered only the genes with positive evidence or higher (ΔBIC >2). This resulted in 295 genes.
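
The ΔBIC grading can be written as a small lookup. The sketch below is a hedged Python illustration: the function name evidence_strength is hypothetical, and assigning a value that falls exactly on a boundary (2, 6, or 10) to the lower category is an assumption.

```python
def evidence_strength(delta_bic):
    """Grade the BIC difference between the best and second-best models."""
    if delta_bic > 10:
        return "very strong"
    if delta_bic > 6:
        return "strong"
    if delta_bic > 2:
        return "positive"
    return "weak"


def has_best_fitted_model(delta_bic):
    """A gene is retained when the evidence is positive or higher."""
    return delta_bic > 2


# ΔBIC values reported in the text for the HLA class I genes
grades = {gene: evidence_strength(d)
          for gene, d in {"HLA-A": 2.357, "HLA-B": 7.943, "HLA-C": 5.064}.items()}
```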

Functional analysis

After defining the genes with a best BN model, we conducted gene ontology (GO) enrichment analysis using the R package clusterProfiler (Yu et al, 2012). We then investigated whether these genes had been reported in any HCC study before. To explore these genes and their role in HCC, we used the cBioPortal platform, one of the most comprehensive platforms, which includes different databases of cancer studies (Cerami et al, 2012).

Validation

This method was subjected to a validation study using an independent set of HCC data archived in NCBI database (PRJNA762641). The data consists of 33 samples of bulk RNA-seq and whole genome bisulfite sequencing (WGBS) (Huang et al, 2021). Of the 33 samples, we chose 10 males with the most similar age to our primary HCC single-cell data.

Results

Omics data analysis

From RNA sequencing data, the gene expression values of 5851 genes were calculated (Supplementary Table S1). From RRBS data, DNA methylation levels of 13,165 genes were calculated (Supplementary Table S2). Sample 16 had the highest number of detected genes (15,818), while sample 17 had the lowest (8226). Sample 17 was discarded from the subsequent analysis because its low number of detected genes could reduce the number of genes shared across samples in the downstream analysis (Supplementary Table S2). For CNV estimation, we used the state scores from the HMM output. The states ranged from 1 to 6, with the following meanings: (1) homozygous deletion, (2) heterozygous deletion, (3) neutral, (4) increased copy number, (5) heterozygous duplication, and (6) homozygous duplication.

Across all samples, an amplification in chromosome 7 and q arm of chromosome 1 and a deletion in chromosome 8 were detected. All these CNVs were previously reported in different studies (Supplementary Fig. S1) (Guan et al, 2000; Xu et al, 2015; Zhou et al, 2017).

Global view of the modeling and causality analysis

BNs can be constructed in either of two ways. After the variables (nodes) are defined, the relationships (edges) between them can be learned from the data using constraint-based or score-based methods. Alternatively, the nodes and the directions of the edges between them can be defined manually as a DAG, and how well that DAG represents the data is then checked. In this study, we created three alternative models by manually defining each model's nodes and edges. Because the BN models are meant to represent causal relationships between the variables, we reasoned that neither DNA methylation nor transcription could cause variation in genomic copy number. We therefore kept the models biologically plausible by placing CNV as the first (starting) node in all models.

The three alternative models are as follows. The “CEM” model represents a serial connection among CNV-gene expression-DNA methylation (Fig. 2A). The “CME” model represents a serial connection among CNV-DNA methylation-gene expression (Fig. 2B). The “INDEP” model represents a diverging connection in which CNV is linked to DNA methylation and to gene expression independently (Fig. 2C).

The three alternative BN models were constructed and applied to 1900 protein-coding genes. After fitting the models using MLE, we evaluated them by two score-based criteria, AIC and BIC. For AIC, we used the relative likelihood to compare the goodness of fit of the BN model with the lowest AIC to that of the model with the second-lowest AIC. We then chose the genes whose best model had a relative likelihood ≥10 compared with the second-best model. For BIC, we took the difference (ΔBIC) between the lowest and second-lowest BIC scores. The strength of the model was assessed as described in the Materials and Methods section.

According to AIC scores, we detected 23 genes to follow different BN models: 17 CME, 5 CEM, and 1 INDEP (Table 1), whereas we detected 295 genes to follow different BN models: 199 CME, 62 CEM, and 34 INDEP, according to BIC scores (Supplementary Table S3).

Table 1.

Akaike Information Criterion Scores for the Genes (23 Genes) That Have Best Fitted to a Bayesian Network Model with Relative Likelihood ≥10 to the Second Best Model

Gene	INDEP	CME	CEM	Lowest model	Second lowest model	Relative Likelihood
MGAT4C	−22.358	−22.358	−35.559	−35.559	−22.358	735.530
TF	−37.081	−48.346	−37.060	−48.346	−37.081	279.327
TIMM44	−45.712	−55.407	−46.172	−55.407	−46.172	101.260
FBLN1	−46.488	−37.538	−36.983	−46.488	−37.538	87.772
UBR4	−45.888	−53.754	−45.888	−53.754	−45.888	51.040
ZNF695	−35.190	−34.458	−43.021	−43.021	−35.190	50.193
ANKS1B	−27.478	−27.789	−35.333	−35.333	−27.789	43.463
A1BG	−48.647	−57.111	−49.579	−57.111	−49.579	43.191
HLA.B	−53.750	−61.104	−52.870	−61.104	−53.750	39.528
C3	−43.249	−50.508	−43.232	−50.508	−43.249	37.700
DSTN	−39.918	−46.743	−39.345	−46.743	−39.918	30.340
ANP32B	−31.597	−38.096	−31.962	−38.096	−31.962	21.472
ULK1	−38.546	−39.344	−45.416	−45.416	−39.344	20.820
ALDH2	−42.749	−48.762	−42.237	−48.762	−42.749	20.216
PRCC	−54.568	−60.139	−53.755	−60.139	−54.568	16.210
COPZ1	−45.843	−51.342	−46.312	−51.342	−46.312	12.365
MAP3K6	−32.942	−37.924	−32.487	−37.924	−32.942	12.076
GNG4	−41.899	−46.786	−39.503	−46.786	−41.899	11.509
GNS	−43.765	−48.624	−43.765	−48.624	−43.765	11.354
PPP2R5A	−57.359	−62.132	−55.661	−62.132	−57.359	10.877
GON4L	−51.230	−55.903	−50.210	−55.903	−51.230	10.346
SLC26A11	−33.199	−37.928	−33.266	−37.928	−33.266	10.287
SMYD3	−52.628	−52.375	−57.277	−57.277	−52.628	10.220

Bold values are the best fitted BN models for each significant gene.

CEM, CNV-Expression-Methylation Model; CME, CNV-Methylation-Expression Model; INDEP, Independent Model.

For further analysis, we split the genes into two groups: group 1 contains the genes with a best-fitted BN model according to AIC scores with relative likelihood ≥10 (23 genes), and group 2 contains the genes with a valid model according to BIC scores (295 genes). cBioPortal results showed that all significant genes in groups 1 and 2 had been reported at least once in HCC studies (Supplementary cBioportal links). Furthermore, cBioPortal matched several previously reported group 2 genes to signaling pathways that are significant in HCC (Table 3). The matched genes are ARRDC1 (CEM/ΔBIC = 2.497) in the NOTCH pathway, AXIN2 (INDEP/ΔBIC = 2.381) and TLE2 (CME/ΔBIC = 2.401) in the WNT pathway, MGA (CME/ΔBIC = 2.417) and MYC (INDEP/ΔBIC = 3.648) in the MYC pathway, and MAP2K1 (CME/ΔBIC = 2.316) in the RTK-RAS signaling pathway (Supplementary Figs. S2, S3, S4 and S5).

Comparing the genes in each BN model, we observed that the DNA methylation rate in the gene body is higher for genes in the INDEP model, whereas the methylation rate in the promoter region is the same in all models (Fig. 3A). Furthermore, the mean gene expression in each model differs from that in the others (Fig. 3B). Genes of each model were found to be located in regions with moderate CNV states: 2 (heterozygous deletion), 3 (neutral), and 4 (increased copy number) (Fig. 3C). Most of these genes are located on chromosomes 1, 12, 16, 19, and 20, whereas no gene was found on chromosomes 11, 14, 18, and X (Fig. 3D).

FIG. 3.

Genomic, epigenomic, and transcriptomic profile of genes in each model. (A) Methylation rate in the promoter and gene body of genes in each model. INDEP model genes have the highest rate in the gene body region, while genes of all models have the same methylation rate in the promoter region. (B) Box plot of normalized gene expression in each model, showing that the average gene expression of genes in each model differs from that in the other models (Student's t-test, p < 0.05). (C) CNV states of each model's genes: state 3 (neutral) is the most frequent in the CME model and state 4 (increased copy number) in the INDEP and CEM models, while state 2 (heterozygous deletion) is the least frequent in all models. (D) Human ideogram highlighting the chromosomal location of genes in each model.

On the other hand, functional enrichment analysis showed that CME model genes were enriched in several GO terms, such as intrinsic and integral components of organelle membrane (p = 0.0211), regulation of protein catabolic process and vesicle fusions (p = 0.0158); CEM model genes were enriched in molecular adaptor activity (p = 0.0301), RNA polymerase complex and core enzyme binding (p = 0.0149); and INDEP model genes were enriched in sister chromatid adhesion and segregation (p = 0.0283) (Fig. 4). The dysfunction of sister chromatid segregation was shown to have a significant role in promoting the progression of HCC (Sun et al, 2018).

FIG. 4.

Bar plot of GO enrichment analysis of genes in each model. (A) BN model of CME, (B) BN model of CEM, and (C) BN model of INDEP. GO, gene ontology.

Validation of the BN model

The data used in the primary analysis were generated by the “scTrio-seq” method, which can simultaneously profile the transcriptome and DNA methylome of the same cell. Finding other data generated by the same method was challenging. Thus, we validated our analysis using another type of dataset: bulk RNA-seq and WGBS data from HCC patients. After following the same analysis, we could construct the same BN models for only 42 genes. Of these, 20 (48%) follow the same model as in our primary analysis, including the HLA-A gene with the CME model (Supplementary Table S4). Because of the differences between platforms (bulk and single cell) and between RRBS and WGBS data, the validation data did not provide enough information to compute the models for all genes. Still, we detected a clear tendency toward some models.

Discussion

Multiomics data integration is one of the leading frontiers of complex disease research and integrative biology. Defining CNV, DNA methylation, and gene expression relationships in cancer cells can help improve the knowledge of how these different molecular layers contribute to the heterogeneity of this disease, which in turn can improve our understanding of the etiology of cancer (Kong et al, 2020). In this study, we use BNs as a multiomics machine learning method to explore the causal relationship between CNV, gene expression, and DNA methylation in HCC at the single-cell level.

We constructed three alternative BN models, assuming that the genomic component (CNV) drives the association with gene expression and DNA methylation. The relationship between gene expression and DNA methylation can be in either direction. The results showed that the CME model, in which CNV affects DNA methylation and DNA methylation in turn affects gene expression, is the most abundant model (best model for 199 genes), followed by the CEM model, in which CNV affects gene expression and gene expression in turn affects DNA methylation (best model for 62 genes), and finally the INDEP model, in which CNV affects DNA methylation and gene expression independently (best model for 34 genes).

Gutierrez-Arcelus et al (2013) and Díez-Villanueva et al (2021) used BN models similar to ours, but with SNPs instead of CNVs as the genomic component of the models. Gutierrez-Arcelus et al (2013) also assumed that these BN models show the passive and active roles of methylation in gene expression. Following their assumptions, we can say that the CME model shows the active role, while the CEM and INDEP models show the passive role of DNA methylation. Our results suggest that the expression of most genes is under the active impact of DNA methylation in HCC cells.

GO enrichment analysis showed that the genes of each model were enriched in different GO terms. GO terms enriched for CME model genes were related to vesicle fusion, cristae formation, and components of the organelle membrane. For the CEM model, GO terms were related to molecular activities that accompany binding, bridging, and complex scaffold activity. For the INDEP model, GO terms were related to sister chromatid adhesion in meiosis and the meiotic cell cycle. The enrichment analysis thus revealed that the genes of each model function in related GO terms that differ from those of the other models. In turn, this can help us infer the molecular function of a gene from the BN model that it follows.

In addition, tumor cell heterogeneity influences cancer progression, metastasis, and mortality rates (Fidler, 2016; Lawson et al, 2018). Integrating these three omics has shown that different genes in the same pathway can follow different BN models, which is consistent with this heterogeneity. HCC is a highly metastatic type of cancer that is difficult to diagnose early and has high mortality rates (Liu et al, 2016; Waly Raphael et al, 2012). We detected specific genes that were matched to pathways with significant roles in tumorigenesis, survival, and tumor progression, such as the NOTCH and RTK-RAS signaling pathways (Supplementary Figs. S4 and S5) (Huang et al, 2019; Moon and Ro, 2021; Strazzabosco and Fabris, 2012; Yang and Liu, 2017).

NOTCH signaling pathway, one of the conserved pathways in mammals, is responsible for promoting proliferative signaling during neurogenesis and cell fate determination (Zhang et al, 2017; Zhu et al, 2021). NOTCH signaling pathway, despite its unclear effect, is found to be dysregulated in HCC (Huang et al, 2019; Strazzabosco and Fabris, 2012; Villanueva et al, 2012).

Moreover, genes with different BN models, AXIN2 (INDEP) and TLE2 (CME), were matched to the WNT signaling pathway (Supplementary Fig. S2). The WNT signaling pathway controls multiple cellular processes that lead to the development, survival, differentiation, and metastasis of HCC (He and Tang, 2020; Liu et al, 2016; Shanbhogue et al, 2011). Also, MGA (CME/ΔBIC = 2.417) and MYC (INDEP/ΔBIC = 3.648) were matched to the MYC pathway (Supplementary Fig. S3). MYC pathways, which regulate many cellular processes such as proliferation, differentiation, and apoptosis, have been related to the pathogenicity of HCC (Frau et al, 2010; Lin et al, 2010; Zimonjic and Popescu, 2012). These genes with different BN models therefore show how the omics layers act on each other and how this effect is heterogeneous in cancer cells, affecting major signaling pathways in HCC.

Most importantly, the HLA class I genes were found to follow the same BN model according to BIC scores: HLA-A (ΔBIC = 2.357), HLA-B (ΔBIC = 7.943), and HLA-C (ΔBIC = 5.064) (Supplementary Table S1). HLA class I molecules, which are present on the surface of tumor cells, are recognized by cytotoxic T lymphocytes. Investigating HLA class I helps to predict the risk of developing cancer and its prognosis (Kaufman et al, 1995; Mizuki et al, 1997; Speetjens et al, 2008). Studies of HLA class I expression in cancer have reached contradictory conclusions. Some associated it with a bad prognosis, while others reported a worse prognosis or no association with cancer prognosis (Hanagiri et al, 2013; Kaneko et al, 2011; Madjd et al, 2005; Ramnath et al, 2006).

Our results offer a new perspective on how HLA class I might act in HCC by integrating different omics. The proposed BN method has the potential to provide a deeper understanding of heterogeneity in complex diseases, as it reveals different connections for each scenario. The results of the BN method we applied to single-cell sequencing data did not fully overlap with those from the bulk validation data, but there were some notable hits, such as HLA-A following the CME model. Differences in sequencing technology between the primary and validation datasets might explain this: WGBS was used for the methylome in the validation data, whereas the primary dataset contains RRBS data.

To the best of our knowledge, this is the first study to focus on integrating omics using a machine learning algorithm, BNs, at the single-cell resolution, using a case study of HCC. With these findings that promise multiomics single-cell data integration, the BN approach could be one of the standard single-cell multiomics integration approaches in the future.

Limitations

We used single-cell data generated by one of the first single-cell sequencing technologies. We chose these data so that the BN models could be applied to multiomics measurements from the same single cell, but the dataset comes with many limitations, such as the low data quality of some cells. In addition, validation with another dataset was difficult because similar datasets generated in the same way are lacking. Our analysis was also limited to protein-coding genes and to DNA methylation in promoter and gene body regions, ignoring regulatory regions such as enhancers. Nevertheless, the results are promising and suggest that BN models are suitable for integrating omics data.

Footnotes

Acknowledgments

The computational calculations reported in this article were partially performed at TUBITAK ULAKBIM, High Performance and Grid Computing Centre (TRUBA resources). This work is not published previously, except for the master thesis of the first author (M.J.) completed at the Department of Bioinformatics, Hacettepe University, Turkey.

Code Availability

All analyses were performed using bash scripting codes and R, and are available in github at

Authors' Contributions

I.Y. conducted this study and supervised this work. M.J. performed the analyses. All authors read and approved the final version of the article.

Author Disclosure Statement

The authors declare they have no conflicting financial interests.

Funding Information

This study has been funded by L'Oréal-UNESCO “For Women in Science,” national award received by Dr. İdil Yet in 2020.
