Effective Integration of Single-Cell Multi-Omics Data Using Improved Network-Based Integrative Clustering with Multigraph Regularization

Abstract

Integrating different omics data makes it possible to study cellular heterogeneity at the level of transcriptional regulation from different molecular layers, which helps identify cell types and reveal the pathogenesis of Alzheimer’s disease (AD) from two perspectives. However, such integration algorithms face challenges including high data noise, high dimensionality, and computational complexity. In this study, multigraph regularization constraints were introduced into the network-based integrative clustering (NIC) algorithm, yielding MGR-NIC, to remove redundant features and preserve the geometric structure underlying the data by fusing two types of data (snRNA-seq and snATAC-seq) from glial cells of AD samples. The effectiveness of MGR-NIC was validated on both simulated datasets and real datasets derived from various tissues. MGR-NIC improves clustering accuracy by selecting features that better represent the dataset’s structure. Its clustering results are strongly consistent with the cell type labels published with the DLPFC dataset, whereas the NIC algorithm often produces overlapping clusters on the same dataset. We compared MGR-NIC with state-of-the-art algorithms, including NIC, scAI, Multi-Omics Factor Analysis v2 (MOFA+), and JSNMF. Across these comparisons, MGR-NIC was the most stable and reliable method, indicating robustness across different datasets and consistent, accurate results.

1. INTRODUCTION

Nowadays, single-cell multi-omics datasets offer unprecedented opportunities for cell clustering from multiple perspectives. Joint analysis using multi-omics data can provide insights into intermolecular regulation and causal relationships, and characterize intercellular heterogeneity from various perspectives. Alzheimer’s disease (AD) is a complex neurodegenerative disorder, and the precise elucidation of its pathogenic mechanisms is crucial for research and treatment. The limitations of current data clustering methods have driven continuous optimization and innovation in multi-omics data analysis algorithms tailored for AD. The innovation and application of multi-omics algorithms will provide powerful tools for AD research.

The main differences among current clustering algorithms (Dey et al., 2021; Dong et al., 2021; Li et al., 2021) lie in how cellular features are extracted and how clusters are selected. K-means (Grün et al., 2015) and its variant Clustering through Imputation and Dimensionality Reduction (CIDR) cluster cells by repeatedly updating the centers of similar cell samples; however, these algorithms struggle to handle the heterogeneity of the data properly (Lin et al., 2017). Multi-omics factor analysis (MOFA) ignores relationships in multi-omics data (Argelaguet et al., 2018). Multi-omics factor analysis v2 (MOFA+) reconstructs a low-dimensional representation of the data for clustering (Argelaguet et al., 2020), but it is unsuitable for sparse or noisy data and may give poorer results. scAI is an effective and advanced tool for characterizing many types of measurements and dissecting the cellular heterogeneity of the data (Jin et al., 2020). However, scAI is less suitable for large-scale datasets because of its longer computation times and higher computational resource requirements.

The NIC algorithm circumvents the drawbacks of the above algorithms (Li et al., 2022). NIC is designed to handle large-scale datasets and, by leveraging efficient computational techniques, can process them within reasonable timeframes. NIC automatically constructs a cell similarity network for each data type based on adaptive learning. The data are reduced in dimension using an orthogonal projection matrix, and the cell similarity network is then decomposed using nonnegative matrix factorization (NMF) to obtain the cellular features of each data type (Wu et al., 2022). Finally, NIC uses the shared features of cells to identify cell types and extract genes. Despite these strengths, the NIC algorithm is sensitive to the input data: outliers or noise in the dataset may significantly affect the clustering results and make them unstable. Given a data matrix, the NMF step can find two low-rank matrices, but these may ignore potentially helpful information about the data space (Li et al., 2022). Building on the NIC algorithm, we add graph regularization constraints to extract feature information from the data. By leveraging the connections and similarity between samples, the high-dimensional geometric structure is captured in the factor matrices and noise is reduced. We propose a comprehensive single-cell analysis method based on multi-graph regularization and network-based integrative clustering (MGR-NIC) to identify the cell types in the data, which better preserves the structure and relationships of the data.

Experiments on six multi-omics datasets demonstrate that MGR-NIC achieves more consistent accuracy compared to other state-of-the-art multi-omics clustering algorithms.

2. METHODS AND MATERIALS

2.1. Workflow of this study

This study consists of two parts, and the workflow is shown in Figure 1. The first part is our proposed MGR-NIC algorithm, which consists of three main steps: adaptive graph learning, integrative analysis, and cell clustering (Fig. 1A). The integrative analysis step of the NIC algorithm relies on NMF. However, NMF lacks proper regularization of the basis matrix, so features are not easily retained (Salahian et al., 2023). When the dataset is very sparse, or when certain features have few co-occurrence relations, NMF may fail to capture the pattern and structure of the dataset well, and important feature information may be missing from the decomposition. To address these issues, as shown by the red GR marking in Figure 1A, we add two constraints, $γ\,\mathrm{tr}((B^{[l]})' L^{[l]} B^{[l]})$ and $β\,\mathrm{tr}((H^{[l]})' L^{[l]} H^{[l]})$, on the basis matrices, which improves the overall clustering performance.

FIG. 1.

MGR-NIC algorithm overview. (A) MGR-NIC flowchart. (B) Comparison of the clustering results. MGR-NIC, multi-graph regularization and network-based integrative clustering algorithm.

The second part evaluates the proposed MGR-NIC algorithm through a range of comparative experiments (Fig. 1B). Parameter sensitivity analyses were conducted on datasets of AD and control samples to validate the stability of the algorithm [Figure 1B (a)]. The visualization capabilities of the five algorithms are compared through data visualization [Figure 1B (b)]. Sankey diagrams, which track how samples move between clusters, are used to compare the results before and after the algorithm improvement [Figure 1B (c)]. To assess overall clustering performance, six clustering evaluation metrics, accuracy (ACC), adjusted Rand Index (ARI), F1-score (F), recall (R), precision (P), and Purity, were used to verify the performance of MGR-NIC across different omics datasets and algorithms [Figure 1B (d) and (e)].

The algorithms proposed in this paper are useful for integrating various types of data, such as genomics, transcriptomics, proteomics, and epigenetics, including scRNA-seq, CITE-seq, scATAC-seq, and others. Researchers can integrate gene expression data and protein expression data for tumor sample classification and the discovery of associations between genes and proteins. On the other hand, integrating genomic data and metabolomic data can help reveal the associations between metabolic abnormalities and gene mutations, expanding our understanding of disease mechanisms. This multi-omics integration approach contributes to a deeper understanding of the complexity of biological systems and advances biomedical research.

After the completion of multi-omics data clustering, subsequent analyses can be performed on different clusters for cell type identification, gene regulatory network analysis, and pathway analysis, which lays a good foundation for exploring cellular interactions and pathogenic mechanisms.

2.2. Methods

The inputs are the normalized RNA-seq matrix $X^{[1]} \in R^{d_{1} \times n}$ and the ATAC-seq matrix $X^{[2]} \in R^{d_{2} \times n}$ obtained from the same cells, where $d_{1}$, $d_{2}$, and $n$ are the numbers of genes, peaks (regions), and cells, respectively. The adaptive graph learning component learns a consensus graph that represents the similarity between cell samples and produces similarity matrices $S^{[l]} = {(s_{ij}^{[l]})}_{n \times n}$, in which $s_{ij}^{[l]}$ denotes the similarity between cells $i$ and $j$. The integrative analysis employs NMF to decompose each matrix $S^{[l]}$ into a modality-specific basis matrix $B^{[l]}$ and a feature matrix $F$, learning a common feature matrix for both datasets. MGR-NIC then uses these cell features to identify cell types.

The integrative analysis step uses NMF. NMF is an unsupervised learning technique commonly used for recognizing biological modular networks and for cell clustering (Lee and Seung, 1999). In short, NMF decomposes a large nonnegative matrix into two smaller nonnegative matrices.

Given a dataset in matrix form $X$, in which each entry is nonnegative, NMF decomposes it into two matrices as follows $\min_{W, H} ‖ X - WH ‖_{F}^{2}$ (1)

where $W, H \geq 0$, $W$ is the basis matrix, $H$ is the coefficient matrix, and $‖ \cdot ‖_{F}$ denotes the Frobenius norm of the matrix. To extract cell features, NMF is applied to the two similarity matrices, which gives Equation (2) $O (G, B, F) = \sum_{l = 1}^{2} ‖ S^{[l]} - B^{[l]} F ‖^{2}$ (2)
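As an illustration of Equation (1), the following minimal sketch factorizes a small nonnegative matrix with the classical multiplicative updates of Lee and Seung. The function names `nmf` and `frobenius_loss`, the toy matrix, the iteration count, and the small constant `eps` are assumptions made for this example and are not part of the NIC or MGR-NIC implementation.

```python
import numpy as np

def frobenius_loss(X, W, H):
    """Squared Frobenius reconstruction error ||X - WH||_F^2 from Equation (1)."""
    return float(np.linalg.norm(X - W @ H, "fro") ** 2)

def nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative X (m x n) by W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))  # random nonnegative starting point
    H = rng.random((k, n))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy data (not from the paper): a nonnegative matrix of rank at most 4.
rng = np.random.default_rng(1)
X = rng.random((20, 4)) @ rng.random((4, 30))
W_short, H_short = nmf(X, k=4, n_iter=1)
W, H = nmf(X, k=4, n_iter=200)
print(frobenius_loss(X, W_short, H_short), frobenius_loss(X, W, H))
```

Because both runs start from the same random seed, the second loss printed should be lower than the first, reflecting the non-increasing behavior of these updates.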

Projecting a multilayered network into a common subspace is a data representation method that integrates multiple layers, allowing information from different levels to be combined into a consistent representation of the data (Ma et al., 2019). The NIC algorithm considers the association between the two data types and uses this approach to find the cellular features they share.

NIC uses matrices $M^{[l]}$ of the higher-order topological index PMI (Li et al., 2018). The similarity matrices satisfy $0 \leq S^{[l]} \leq 1$, and α is the parameter of the regularization term. In summary, NIC is formulated as $\min O = \underset{S^{[l]}}{\arg\min} \sum_{i, j} ‖ P^{[l]} x_{\cdot i}^{[l]} - P^{[l]} x_{\cdot j}^{[l]} ‖^{2} s_{ij}^{[l]} + ‖ S^{[l]} - B^{[l]} F ‖^{2} + α ‖ S^{[l]} ‖^{2} + ‖ M^{[l]} - H^{[l]} F ‖^{2}$ (3) $\text{s.t. } P^{[l]} {(P^{[l]})}^{'} = I, \; B^{[l]} \geq 0, \; H^{[l]} \geq 0, \; F \geq 0 .$

However, NMF may be sensitive to differences in data quality or in the algorithm used (Kim and Park, 2007). To address these problems, this study enforces graph regularization (GR) on the basis matrix in each modality.

Graph Regularizer. Under certain assumptions, if two sample points are close in high-dimensional space, they will also remain close in low-dimensional space (Shu et al., 2022). The distance measure is given in Equation (4) $d (v_{i}, v_{j}) = ‖ v_{i} - v_{j} ‖^{2}$ (4)

If two sample points are connected, an edge is generated between them. The weight of this edge is calculated with a heat kernel as follows $W_{ij} = \begin{cases} e^{- \frac{‖ v_{i} - v_{j} ‖^{2}}{2 σ^{2}}}, & \text{if } v_{i} \text{ and } v_{j} \text{ are connected} \\ 0, & \text{otherwise} \end{cases}$ (5)

Here σ is the bandwidth parameter of the heat kernel. If a pair of cells has similar expression profiles, the two cells are strongly connected in the network, and vice versa. Thus, the graph regularization term is $\begin{aligned} R &= \frac{1}{2} \sum_{i, j = 1}^{n} ‖ v_{\cdot i} - v_{\cdot j} ‖^{2} W_{ij} \\ &= \sum_{i = 1}^{n} v_{\cdot i}^{T} v_{\cdot i} D_{ii} - \sum_{i, j = 1}^{n} v_{\cdot i}^{T} v_{\cdot j} W_{ij} \\ &= Tr (VD V^{T}) - Tr (VW V^{T}) = Tr (VL V^{T}) \end{aligned}$ (6)

$L^{[l]}$ is the graph Laplacian matrix, $L^{[l]} = D^{[l]} - W^{[l]}$, where $D^{[l]}$ is a diagonal matrix with entries $D_{ii} = \sum_{j} W_{ij}$. The NIC algorithm uses adaptive learning to construct cellular similarity networks. However, most studies, such as Dai et al. (2020), place graph regularization on the shared basis matrix or coefficient matrix when integrating multi-omics data.
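The identity behind Equation (6), that the weighted sum of pairwise distances equals the trace form with the graph Laplacian, can be checked numerically. The sketch below assumes a symmetric k-nearest-neighbor rule for deciding which points are "connected"; that rule, the function names, and the random toy data are assumptions for illustration only.

```python
import numpy as np

def pairwise_sq_dists(V):
    """Squared Euclidean distances between the columns of V (d x n)."""
    sq = np.sum(V ** 2, axis=0)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * V.T @ V, 0.0)

def heat_kernel_graph(V, sigma=1.0, n_neighbors=5):
    """Symmetric k-nearest-neighbor graph with heat-kernel edge weights."""
    n = V.shape[1]
    d2 = pairwise_sq_dists(V)
    masked = d2 + np.diag(np.full(n, np.inf))  # exclude self-matches
    nearest = np.argsort(masked, axis=1)[:, :n_neighbors]
    connected = np.zeros((n, n), dtype=bool)
    connected[np.repeat(np.arange(n), n_neighbors), nearest.ravel()] = True
    connected |= connected.T  # an edge exists if either end selects the other
    return np.where(connected, np.exp(-d2 / (2.0 * sigma ** 2)), 0.0)

# Toy data (not from the paper): 40 samples in 3 dimensions, one per column.
rng = np.random.default_rng(0)
V = rng.normal(size=(3, 40))
W = heat_kernel_graph(V)
D = np.diag(W.sum(axis=1))  # degree matrix
L = D - W                   # graph Laplacian

pairwise_form = 0.5 * np.sum(pairwise_sq_dists(V) * W)
trace_form = np.trace(V @ L @ V.T)
print(pairwise_form, trace_form)
```

The two printed values should agree up to floating-point error, which is what allows the regularizer to be written compactly as a trace.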

This paper further extends the application of adaptive learning. Some attributes of the shared feature matrix are obtained through transformations or combinations of the basis matrices, so potential structural information in the data is underutilized (Li et al., 2022). To let MGR-NIC retain valid information about the data space and make full use of the local structural information in the data, we did not regularize the shared feature matrix; instead, we constructed graph regularization terms for both basis matrices in the data space, as shown in Equations (7) and (8) $\frac{1}{2} \sum_{i, j = 1}^{n} ‖ b_{i \cdot} - b_{j \cdot} ‖^{2} W_{ij} = Tr (B^{'} LB)$ (7) $\frac{1}{2} \sum_{i, j = 1}^{n} ‖ h_{i \cdot} - h_{j \cdot} ‖^{2} W_{ij} = Tr (H^{'} LH)$ (8) where $b_{i \cdot}$ and $h_{i \cdot}$ denote the $i$-th rows of $B$ and $H$.

Equation (3) can be reformulated as $\min O = \underset{S^{[l]}}{\arg\min} \sum_{i, j} ‖ P^{[l]} x_{\cdot i}^{[l]} - P^{[l]} x_{\cdot j}^{[l]} ‖^{2} s_{ij}^{[l]} + ‖ S^{[l]} - B^{[l]} F ‖^{2} + α ‖ S^{[l]} ‖^{2} + ‖ M^{[l]} - H^{[l]} F ‖^{2} + γ \, tr ({(B^{[l]})}^{'} L^{[l]} B^{[l]}) + β \, tr ({(H^{[l]})}^{'} L^{[l]} H^{[l]})$ (9) $\text{s.t. } P^{[l]} {(P^{[l]})}^{'} = I, \; B^{[l]} \geq 0, \; H^{[l]} \geq 0, \; F \geq 0 ,$ where the parameters γ and β determine the importance of the Laplacian regularization. The Laplacian matrix $L^{[l]}$ introduces the topology of the graph to constrain the decomposition, which captures geometric structure information in the data spaces of the two basis matrices (Cai et al., 2011). Applying the constraints $γ \, tr ({(B^{[l]})}^{'} L^{[l]} B^{[l]})$ and $β \, tr ({(H^{[l]})}^{'} L^{[l]} H^{[l]})$ ensures that the basis matrices preserve or enhance the connectivity between data samples. This helps similar cell samples stay tightly grouped in the learned representation, so that the clustering results better characterize the data.
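To make the role of each term in Equation (9) concrete, the following sketch evaluates the objective for a single modality and reports the terms separately. All inputs are random toy matrices, the function name `mgr_nic_objective` is hypothetical, and building the Laplacian from an arbitrary symmetric weight matrix is an assumption; the paper's own implementation may construct these quantities differently.

```python
import numpy as np

def mgr_nic_objective(X, P, S, B, H, F, M, L, alpha, gamma, beta):
    """Evaluate the terms of Equation (9) for one modality l."""
    Z = P @ X  # projected data, one column per cell
    sq = np.sum(Z ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z.T @ Z  # ||P x_i - P x_j||^2
    terms = {
        "smoothness": np.sum(d2 * S),
        "fit_S": np.linalg.norm(S - B @ F) ** 2,
        "ridge_S": alpha * np.linalg.norm(S) ** 2,
        "fit_M": np.linalg.norm(M - H @ F) ** 2,
        "graph_B": gamma * np.trace(B.T @ L @ B),
        "graph_H": beta * np.trace(H.T @ L @ H),
    }
    return float(sum(terms.values())), terms

# Toy inputs (not from the paper): d features, n cells, p projected dims, k factors.
rng = np.random.default_rng(0)
d, n, p, k = 6, 10, 3, 2
X = rng.random((d, n))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
P = Q[:p]  # rows are orthonormal, so P @ P.T = I
S = rng.random((n, n)); S = (S + S.T) / 2
M = rng.random((n, n))
Wg = S.copy(); np.fill_diagonal(Wg, 0.0)  # assumed graph weights
L = np.diag(Wg.sum(axis=1)) - Wg
B, H, F = rng.random((n, k)), rng.random((n, k)), rng.random((k, n))

nic_total, _ = mgr_nic_objective(X, P, S, B, H, F, M, L, alpha=2, gamma=0, beta=0)
total, terms = mgr_nic_objective(X, P, S, B, H, F, M, L, alpha=2, gamma=1, beta=1)
print(nic_total, total, terms["graph_B"], terms["graph_H"])
```

Setting γ = β = 0 recovers the value of Equation (3), so the difference between the two totals is exactly the contribution of the two graph regularization terms.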

2.3. Optimization and update rules

Because Equation (9) is non-convex, we use alternating update rules that optimize one variable at a time while the others are held fixed, and iterate until convergence. Let $θ^{[l]}$, $Ψ^{[l]}$, and $Φ$ be the Lagrange multipliers for the constraints $B^{[l]} \geq 0$, $H^{[l]} \geq 0$, and $F \geq 0$, respectively. The subproblems for $P^{[l]}$ and $S^{[l]}$ are the same as in the original NIC formulation, as shown in Equations (10) and (11). $P^{[l]} = \underset{P^{[l]} {(P^{[l]})}^{'} = I}{\arg\min} \sum_{i, j} ‖ P^{[l]} x_{\cdot i}^{[l]} - P^{[l]} x_{\cdot j}^{[l]} ‖^{2} s_{ij}^{[l]}$ (10) $S^{[l]} = \underset{S^{[l]}}{\arg\min} \sum_{i, j} ‖ P^{[l]} x_{\cdot i}^{[l]} - P^{[l]} x_{\cdot j}^{[l]} ‖^{2} s_{ij}^{[l]} + ‖ S^{[l]} - B^{[l]} F ‖^{2} + α ‖ S^{[l]} ‖^{2}$ (11)

By fixing $S^{[l]}$ and $F$, the MGR-NIC subproblem for $B^{[l]}$ is formulated as $B^{[l]} = \underset{B^{[l]}}{\arg\min} ‖ S^{[l]} - B^{[l]} F ‖^{2} + Tr (θ^{[l]} {(B^{[l]})}^{'}) + γ \, tr ({(B^{[l]})}^{'} L^{[l]} B^{[l]})$ (12)

By fixing $F$, the MGR-NIC subproblem for $H^{[l]}$ is formulated as $H^{[l]} = \underset{H^{[l]}}{\arg\min} ‖ M^{[l]} - H^{[l]} F ‖^{2} + β \, tr ({(H^{[l]})}^{'} L^{[l]} H^{[l]}) + Tr (Ψ^{[l]} {(H^{[l]})}^{'})$ (13)

By fixing $S^{[l]}$, $B^{[l]}$, and $H^{[l]}$, the MGR-NIC subproblem for $F$ is formulated as $F = \underset{F}{\arg\min} \sum_{l} (‖ S^{[l]} - B^{[l]} F ‖^{2} + ‖ M^{[l]} - H^{[l]} F ‖^{2}) + Tr (Φ F^{'})$ (14)

The terms $γ \, tr ({(B^{[l]})}^{'} L^{[l]} B^{[l]})$ and $β \, tr ({(H^{[l]})}^{'} L^{[l]} H^{[l]})$ each depend on a single variable, so they do not affect the updates of the other variables. We therefore optimize $B^{[l]}$, $H^{[l]}$, and $F$ in turn while fixing the other variables. The update rule for $B^{[l]}$, with multiplication and division taken element-wise, is obtained as $B^{[l]} \leftarrow B^{[l]} \odot \frac{S^{[l]} F^{'} + γ W^{[l]} B^{[l]}}{B^{[l]} F F^{'} + γ D^{[l]} B^{[l]}}$ (15)
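For readers who want the intermediate step, the update for $B^{[l]}$ can be recovered from Equation (12) through the Karush-Kuhn-Tucker conditions. The derivation below is a sketch of the standard argument used for graph-regularized NMF (Cai et al., 2011), with the modality superscript $[l]$ dropped for readability; it is an assumed reconstruction rather than a derivation reproduced from the original work.

```latex
% Lagrangian of the B-subproblem in Equation (12), with L = D - W
\mathcal{L}(B) = \| S - BF \|^{2} + \gamma \, \mathrm{tr}(B' L B) + \mathrm{tr}(\theta B')

% Stationarity with respect to B
\frac{\partial \mathcal{L}}{\partial B}
  = -2 S F' + 2 B F F' + 2 \gamma (D - W) B + \theta = 0

% Complementary slackness, \theta_{ij} B_{ij} = 0, gives
\left( - S F' + B F F' + \gamma D B - \gamma W B \right)_{ij} B_{ij} = 0

% Moving the negative terms to the numerator yields the multiplicative rule
B_{ij} \leftarrow B_{ij} \,
  \frac{\left( S F' + \gamma W B \right)_{ij}}
       {\left( B F F' + \gamma D B \right)_{ij}}
```

The same argument applied to Equation (13) gives the rule for $H^{[l]}$ with $M^{[l]}$, β, and $H^{[l]}$ in place of $S^{[l]}$, γ, and $B^{[l]}$.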

The update rule for $H^{[l]}$ is obtained as $H^{[l]} \leftarrow H^{[l]} \odot \frac{M^{[l]} F^{'} + β W^{[l]} H^{[l]}}{H^{[l]} F F^{'} + β D^{[l]} H^{[l]}}$ (16)

Similarly, by fixing $S^{[l]}$, $B^{[l]}$, and $H^{[l]}$, the update rule for the shared matrix $F$ is obtained as $F \leftarrow F \odot \frac{\sum_{l} {(B^{[l]})}^{'} S^{[l]} + \sum_{l} {(H^{[l]})}^{'} M^{[l]}}{\sum_{l} {(B^{[l]})}^{'} B^{[l]} F + \sum_{l} {(H^{[l]})}^{'} H^{[l]} F}$ (17)
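The alternating updates for $B^{[l]}$, $H^{[l]}$, and $F$ can be sketched in a few lines of NumPy. The code below uses the multiplicative forms implied by the gradients of Equations (12) to (14), with $W^{[l]} H^{[l]}$ and $D^{[l]} H^{[l]}$ in the $H^{[l]}$ update and a single shared $F$. The function names, the random toy inputs, the graph weights, and the constant `eps` are assumptions for illustration; the updates of $P^{[l]}$ and $S^{[l]}$ are omitted.

```python
import numpy as np

def factor_loss(S, M, Wg, B, H, F, gamma, beta):
    """Part of the objective that depends on B, H, and F (lists over modalities)."""
    total = 0.0
    for l in range(len(S)):
        L = np.diag(Wg[l].sum(axis=1)) - Wg[l]  # graph Laplacian D - W
        total += np.linalg.norm(S[l] - B[l] @ F) ** 2
        total += np.linalg.norm(M[l] - H[l] @ F) ** 2
        total += gamma * np.trace(B[l].T @ L @ B[l])
        total += beta * np.trace(H[l].T @ L @ H[l])
    return float(total)

def update_factors(S, M, Wg, B, H, F, gamma, beta, eps=1e-10):
    """One round of multiplicative updates; B, H, and F are modified in place."""
    for l in range(len(S)):
        Dg = np.diag(Wg[l].sum(axis=1))
        B[l] *= (S[l] @ F.T + gamma * Wg[l] @ B[l]) / (
            B[l] @ F @ F.T + gamma * Dg @ B[l] + eps)
        H[l] *= (M[l] @ F.T + beta * Wg[l] @ H[l]) / (
            H[l] @ F @ F.T + beta * Dg @ H[l] + eps)
    num = sum(B[l].T @ S[l] + H[l].T @ M[l] for l in range(len(S)))
    den = sum(B[l].T @ B[l] + H[l].T @ H[l] for l in range(len(S))) @ F + eps
    F *= num / den
    return B, H, F

# Toy inputs (not from the paper): two modalities, n cells, k factors.
rng = np.random.default_rng(0)
n, k = 30, 3
S, M, Wg, B, H = [], [], [], [], []
for _ in range(2):
    A = rng.random((n, n))
    sym = (A + A.T) / 2
    S.append(sym)
    M.append(rng.random((n, n)))
    G = sym.copy(); np.fill_diagonal(G, 0.0)
    Wg.append(G)
    B.append(rng.random((n, k)))
    H.append(rng.random((n, k)))
F = rng.random((k, n))

loss_start = factor_loss(S, M, Wg, B, H, F, gamma=1.0, beta=1.0)
for _ in range(100):
    update_factors(S, M, Wg, B, H, F, gamma=1.0, beta=1.0)
loss_end = factor_loss(S, M, Wg, B, H, F, gamma=1.0, beta=1.0)
print(loss_start, loss_end)
```

All factors remain nonnegative because each update multiplies by a ratio of nonnegative quantities, and the loss after the updates should be lower than at the random start.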

2.4. Data preprocessing

The AD dataset was downloaded from the National Center for Biotechnology Information (NCBI) database under accession number GSE214979. It contains gene expression and chromatin accessibility profiles for 105,332 cell samples. The original study (Anderson et al., 2023) normalized and reduced the dimensionality of the two data types using Seurat (Hao et al., 2021) and Signac (Stuart et al., 2021), respectively. This paper applies a similar procedure to the data from the source database. Cells with <300 genes, >3,000 genes, or >5% mitochondrial genes were filtered out. The ATAC data were processed similarly, and the intersection of the cells retained in both data types was taken. A total of 5,531 filtered cell samples were selected for analysis. After data normalization, 6,000 highly variable genes were selected from the gene expression data with the variance stabilizing transformation method, and 12,000 genomic regions were selected from the chromatin accessibility data. This yields the dorsolateral prefrontal cortex (DLPFC) dataset used in this paper.
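The quality-control thresholds described above can be expressed as a simple mask over cells. The sketch below is written in plain NumPy rather than Seurat or Signac, and it assumes that the mitochondrial criterion refers to the fraction of counts from mitochondrial genes. The function name, the toy count matrix, and the reduced thresholds used in the toy call are hypothetical.

```python
import numpy as np

def qc_keep_mask(counts, is_mito, min_genes=300, max_genes=3000, max_mito_frac=0.05):
    """Return a boolean mask of cells to keep.

    counts  : genes x cells raw count matrix
    is_mito : boolean vector marking mitochondrial genes
    """
    n_genes = (counts > 0).sum(axis=0)  # detected genes per cell
    total = counts.sum(axis=0)          # total counts per cell
    mito_frac = np.divide(
        counts[is_mito].sum(axis=0), total,
        out=np.zeros(counts.shape[1]), where=total > 0)
    return (n_genes >= min_genes) & (n_genes <= max_genes) & (mito_frac <= max_mito_frac)

# Toy example (not real data): 4 genes x 3 cells, with tiny thresholds.
counts = np.array([[5, 0, 1],
                   [3, 0, 1],
                   [0, 2, 1],
                   [1, 0, 1]])
is_mito = np.array([False, False, False, True])
keep = qc_keep_mask(counts, is_mito, min_genes=2, max_genes=3, max_mito_frac=0.2)

# Keep only cells that pass filtering in both modalities (hypothetical barcodes).
rna_cells = np.array(["c1", "c2", "c3"])
atac_cells = np.array(["c2", "c3", "c4"])
shared = np.intersect1d(rna_cells, atac_cells)
print(keep, shared)
```

With the default arguments, the function applies the 300-gene, 3,000-gene, and 5% thresholds stated in the text.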

2.5. Evaluation metrics

We used six clustering evaluation metrics to validate the performance of our MGR-NIC algorithm: ACC, ARI, F, R, P, and Purity as shown in Figures 6 and 7. For the evaluation of the two simulated datasets, we employed multiple metrics including ACC, ARI, F, R, P, and Purity (Fig. 6). In the case of the four real datasets, we utilized ACC, ARI, P, and Purity as evaluation metrics (Fig. 7). The six indicators and the mathematical meaning of each indicator are shown below.

ACC is the percentage of cells whose predicted label matches the true cell type. Here $n$ is the number of cells, $c_{j}^{*}$ is the true cell type label of cell $j$, $c_{j}$ is the label assigned by the algorithm, and $δ (x, y)$ equals 1 if $x = y$ and 0 otherwise. ARI (Hubert and Arabie, 1985) was used to determine the validity of the clustering results, where $n$ represents the total number of single-cell samples, $n_{q}$ and $n_{t}$ are the numbers of cells in predicted cluster $q$ and true cell type $t$, and $n_{qt}$ is the number of cells shared by $q$ and $t$ (Wu et al., 2022). Accuracy and ARI are defined as $Accuracy = \frac{1}{n} \sum_{j = 1}^{n} δ (c_{j}, c_{j}^{*})$ (18) $ARI = \frac{\sum_{q, t} \binom{n_{qt}}{2} - [\sum_{q} \binom{n_{q}}{2} \sum_{t} \binom{n_{t}}{2}] / \binom{n}{2}}{\frac{1}{2} [\sum_{q} \binom{n_{q}}{2} + \sum_{t} \binom{n_{t}}{2}] - [\sum_{q} \binom{n_{q}}{2} \sum_{t} \binom{n_{t}}{2}] / \binom{n}{2}}$ (19)
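Equations (18) and (19) translate directly into code. The sketch below uses only the Python standard library; the function names and toy labels are hypothetical. It assumes that predicted cluster labels have already been matched to true cell type names before accuracy is computed, a step the text does not describe, and that the partitions are not both trivial (otherwise the ARI denominator is zero).

```python
from collections import Counter
from math import comb

def accuracy(pred, true):
    """Equation (18): fraction of cells whose label equals the true label."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def adjusted_rand_index(pred, true):
    """Equation (19), computed from the contingency counts n_qt, n_q, n_t."""
    n = len(true)
    sum_qt = sum(comb(v, 2) for v in Counter(zip(pred, true)).values())
    sum_q = sum(comb(v, 2) for v in Counter(pred).values())
    sum_t = sum(comb(v, 2) for v in Counter(true).values())
    expected = sum_q * sum_t / comb(n, 2)
    max_index = 0.5 * (sum_q + sum_t)
    return (sum_qt - expected) / (max_index - expected)

# Toy labels (not from the paper).
true = [0, 0, 1, 1]
print(accuracy([0, 0, 1, 0], true))             # 0.75
print(adjusted_rand_index([1, 1, 0, 0], true))  # 1.0: same partition, renamed labels
print(adjusted_rand_index([0, 1, 0, 1], true))  # -0.5: worse than chance agreement
```

The second call illustrates that ARI, unlike accuracy, does not depend on how the clusters are numbered.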

Purity measures how homogeneous the clusters are with respect to the true cell types. Here $K_{i}$ is the $i$-th cluster, $n_{i}$ is its size, $c$ is the number of clusters, and $P (K_{i})$ is the fraction of cells in $K_{i}$ that belong to its most frequent true cell type. Purity is computed as follows: $Purity = \sum_{i = 1}^{c} \frac{n_{i}}{n} P (K_{i})$ (20)
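A short sketch of Equation (20) follows. It assumes that $P(K_i)$ is the share of the most frequent true cell type within cluster $K_i$, which is the usual definition of purity; the function name and toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def purity(pred, true):
    """Equation (20): size-weighted share of the dominant true type per cluster."""
    clusters = defaultdict(Counter)
    for p, t in zip(pred, true):
        clusters[p][t] += 1  # count true types inside each predicted cluster
    n = len(true)
    score = 0.0
    for type_counts in clusters.values():
        n_i = sum(type_counts.values())          # cluster size
        p_k = max(type_counts.values()) / n_i    # dominant-type fraction P(K_i)
        score += (n_i / n) * p_k
    return score

# Toy labels (not from the paper): one cell of type 1 falls into cluster 0.
pred = [0, 0, 0, 1, 1, 1]
true = [0, 0, 1, 1, 1, 1]
print(purity(pred, true))  # 5/6
```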

The other three metrics for evaluating clustering performance are defined below: $Precision = \frac{TP}{TP + FP},$ (21) $Recall = \frac{TP}{TP + FN},$ (22) $F = \frac{2}{1 / Precision + 1 / Recall}$ (23) where TP denotes the number of true positives, FP the number of false positives, and FN the number of false negatives (Dai et al., 2020). Six metrics were used to assess the comprehensive performance and accuracy of the clustering results. The closer the values of the six metrics are to 1, the higher the quality of the clustering.
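The text does not state how TP, FP, and FN are counted for a clustering. One common convention, assumed here, counts pairs of cells: a pair is a true positive if the two cells share both a predicted cluster and a true cell type. The sketch below implements Equations (21) to (23) under that assumption; the function names and toy labels are hypothetical, and the quadratic loop is only suitable for small examples.

```python
from itertools import combinations

def pair_counts(pred, true):
    """Count TP, FP, FN over all pairs of cells (pair-counting convention)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true)), 2):
        same_pred = pred[i] == pred[j]
        same_true = true[i] == true[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    return tp, fp, fn

def precision_recall_f(pred, true):
    """Equations (21)-(23) from the pair counts."""
    tp, fp, fn = pair_counts(pred, true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 / (1 / precision + 1 / recall)
    return precision, recall, f_score

# Toy labels (not from the paper).
true = [0, 0, 1, 1]
pred = [0, 0, 0, 1]
print(pair_counts(pred, true))         # (1, 2, 1)
print(precision_recall_f(pred, true))  # approximately (0.333, 0.5, 0.4)
```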

2.6. Selection of parameters

Our proposed MGR-NIC method incorporates four parameters: k, α, γ, and β. Parameter α controls the regularization term, k represents the number of features, γ and β are the graph regularization parameters. In particular, we varied the values of the two graph regularization parameters, while keeping the other parameters fixed for the datasets. Based on previous studies (Wu et al., 2022), it was observed that when α ≥ 2, NIC realizes favorable trade-offs. The algorithm performs best when the number of features is close to the number of cell types. To determine the optimal values for γ and β, we employed a grid search method and tested them within the range of [0, 1, 2, 3, 4, 5].
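The grid search over γ and β can be sketched as an exhaustive loop over the 36 combinations. In the code below, `score_fn` stands in for a full MGR-NIC run followed by an accuracy computation; it, the function name `grid_search`, and the placeholder scoring function `toy_score` are hypothetical and serve only to make the example runnable.

```python
from itertools import product

def grid_search(score_fn, values=(0, 1, 2, 3, 4, 5)):
    """Evaluate score_fn(gamma, beta) on the full grid and return the best pair."""
    results = {(g, b): score_fn(g, b) for g, b in product(values, repeat=2)}
    best = max(results, key=results.get)
    return best, results

def toy_score(gamma, beta):
    """Placeholder score with a known maximum at gamma=4, beta=5 (not real data)."""
    return 1.0 - 0.01 * ((gamma - 4) ** 2 + (beta - 5) ** 2)

best, results = grid_search(toy_score)
print(best, len(results))  # (4, 5) 36
```

In practice, `score_fn` would fit the model with the other parameters (k and α) held fixed, as described above, and return a metric such as ACC.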

2.7. Cell clustering

To demonstrate the clustering performance of the algorithm intuitively, we employed t-distributed stochastic neighbor embedding (t-SNE) to embed the dataset in two dimensions and visualize the clusters. The cell type labels provided in the dataset were used in the analysis.

3. RESULTS AND DISCUSSION

3.1. Data source

Simulated data 1 and simulated data 2 were generated using the MOSim package (Carlos et al., 2018) based on the same samples, simulating two single-cell datasets.

For the real datasets, we used publicly available multi-omics data from 12,000 human peripheral blood mononuclear cells (PBMC), obtained from the Seurat website (https://satijalab.org/seurat/articles/atacseq_integration_vignette). In each real multi-omics dataset, the two input modalities were collected simultaneously from the same cells. The mouse brain and mouse kidney data are provided in the literature (Ma et al., 2022). CAR is the sci-CAR mouse kidney dataset (Cao et al., 2018). The mouse brain dataset (Zhu et al., 2019) profiles gene expression together with one of five histone modifications in the same cell; we selected one histone modification, H3K4me3, for evaluation. The statistical information of the datasets used for the algorithm is shown in Table 1.

Table 1.
Statistical Information of Six Datasets

Dataset	Cells	Cell types	Organ	Omics
Simulated data 1	500	4	N/A	RNA+DNase-seq
Simulated data 2	500	5	N/A	RNA+DNase-seq
PBMC	10,412	19	Blood	RNA+ATAC
Mouse brain	2,684	20	Mouse brain	RNA+H3K4me3
CAR	4,753	14	Mouse kidney	RNA+ATAC
DLPFC	5,531	8	DLPFC	RNA+ATAC

PBMC, peripheral blood mononuclear cells.

3.2. Parameter sensitivity analysis

The DLPFC dataset includes AD cell samples and control (Ctrl) cell samples. In this paper, the 5,531 cell samples were divided into AD and Ctrl groups. Parameter experiments were performed on each group separately, using the MGR-NIC algorithm for joint multi-omics clustering and parameter sensitivity analysis. Parameter stability can be evaluated by analyzing the algorithm’s performance across various parameter settings. In this study, the algorithm was run with a diverse set of parameter combinations to monitor the fluctuations in its clustering accuracy. If the clustering accuracy remains stable and follows a consistent trend across parameter configurations, the algorithm can be considered robust to its parameters.

In Figure 2, the darker the color of a bar, the more accurate the clustering. Regardless of the variations in parameters γ or β, the clustering accuracy of the algorithm remains nearly constant in both the AD group (Fig. 2A) and the Ctrl group (Fig. 2B). Therefore, the clustering performance of the MGR-NIC algorithm is stable.

FIG. 2.

Experimental results of parameter sensitivity. (A) sensitivity of parameter γ/β of the AD dataset. (B) sensitivity of parameter γ/β of the Ctrl dataset. AD, Alzheimer’s disease.

To demonstrate the robustness of the MGR-NIC algorithm, this paper selects multiple datasets for validation. Because simulated datasets are generated synthetically from predefined models or rules, whereas real datasets exhibit more complex characteristics, we conducted parameter sensitivity experiments on two simulated datasets and four real datasets. Adjusting the algorithm parameters across these six datasets reveals the relationship between parameter configurations and accuracy. When γ ≥ 4 and β ≥ 4, the MGR-NIC algorithm exhibits stable and superior clustering performance on most multi-omics datasets, as shown in Table 2.

Table 2.

The Parameters of All Datasets

Dataset\Parameter	γ	β
Simulated data 1	2/3/4/5	1/2/3/4/5
Simulated data 2	1/2/3/4/5	1/2/3/4/5
PBMC	4/5	1/2/3/4/5
Mouse brain	1/4/5	5
CAR	5	5
DLPFC	1/2/3/4/5	1/2/3/4/5

3.3. Cell-clustering results

GSE214979 is a multi-omics dataset related to AD. The dataset has eight different cluster labels that represent various cell types. To compare the clustering results of the other four algorithms with those of the improved MGR-NIC algorithm more intuitively, we used t-SNE to visualize the DLPFC and PBMC datasets. scAI and MGR-NIC were run with the same number of iterations (100), with other parameters left at their defaults. Both JSNMF and MGR-NIC use adaptive graph learning; for a fair comparison, they were run with the same number of iterations (100) and the same number of neighbors (5), with other parameters left at their defaults.

As shown in Figure 3, for the DLPFC dataset, scAI and JSNMF distinguish cell populations clearly, and MGR-NIC performs equally well. The NIC algorithm has difficulty distinguishing similar cell types, because a single cell cluster contains many different cell types. In contrast, the MGR-NIC algorithm separates clusters 6 and 8, which the NIC algorithm mixes. Compared with the NIC algorithm, MGR-NIC improves the separation of cell clusters, in particular by significantly reducing the mixing of clusters 1, 6, 7, and 8.

FIG. 3.

The t-SNE visualization of cells from DLPFC data (RNA+ATAC). t-SNE, t-distributed stochastic neighbor embedding.

The NIC algorithm heavily mixes four cell types (1, 6, 7, 8), whereas MGR-NIC mixes in only a small number of cells from other types and still differentiates the cell types. As depicted in Figure 3, the clustering performance of the MGR-NIC algorithm surpasses that of NIC, indicating that it captures useful feature information overlooked by the previous algorithm and thereby describes the cell types more accurately.

From Figure 4, it can be observed that for the PBMC dataset, scAI struggles to differentiate the data into distinct clusters under the same number of iterations. While MOFA+ performs clustering on the data, it fails to adhere to the principle where similar cell types should be closer and dissimilar ones should be distant. In comparison to the previous two algorithms, JSNMF exhibits significant differences in cell clustering, especially in clusters 2, 5, 10, and 16. However, in the clustering of clusters 1, 3, 4, 9, 11, and 14, MGR-NIC and NIC demonstrate clear advantages, with MGR-NIC particularly excelling in distinguishing cell types. Therefore, it can be inferred that MGR-NIC exhibits the strongest ability to differentiate cell populations in the PBMC dataset.

FIG. 4.

The t-SNE visualization of cells from PBMC data (RNA+ATAC). PBMC, peripheral blood mononuclear cells.

To observe the difference in clustering performance between MGR-NIC and NIC on the DLPFC dataset, this study employs Sankey diagrams for a visual comparison. The left sides of Figures 5A and 5B show the true cell type labels provided with the original data, which are used to validate the algorithm performance. Taken together, the two diagrams indicate that the clustering results of the MGR-NIC algorithm closely match the true assignments, with only a few cells assigned to a different class. Figure 5A illustrates the shortcomings of NIC in clustering the DLPFC multi-omics data, with significant crossover and mixing in most of its clusters.

FIG. 5.

The Sankey diagram compares the clustering results of the NIC and the MGR-NIC. MGR-NIC, multi-graph regularization and network-based integrative clustering algorithm.

In contrast, Figure 5B demonstrates that the clustering results of MGR-NIC align well with the cell samples provided in the original study, with only a small number of cells classified into other types.
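The quantity a Sankey diagram draws is the number of cells flowing from each true cell type to each predicted cluster. A minimal sketch of that cross-tabulation is given below, using only the Python standard library; the function names and the toy labels (including the cell type names) are hypothetical and do not come from the DLPFC data.

```python
from collections import Counter

def label_flows(true_labels, pred_labels):
    """Count cells for every (true cell type, predicted cluster) pair."""
    return Counter(zip(true_labels, pred_labels))

def dominant_flow_fraction(flows):
    """Share of all cells that follow the largest flow out of their true type."""
    best = {}
    for (true_type, _), count in flows.items():
        best[true_type] = max(best.get(true_type, 0), count)
    return sum(best.values()) / sum(flows.values())

# Toy labels (not from the paper).
true_labels = ["astro", "astro", "astro", "micro", "micro", "oligo"]
pred_labels = [1, 1, 2, 2, 2, 3]
flows = label_flows(true_labels, pred_labels)
print(flows[("astro", 1)], flows[("astro", 2)])  # 2 1
print(dominant_flow_fraction(flows))             # 5/6
```

A diagram with little crossover corresponds to a dominant-flow fraction close to 1, while heavy mixing spreads each true type across several clusters and lowers it.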

3.4. Comparison of algorithms

We evaluated the clustering performance of MGR-NIC on a total of six multi-omics datasets, which included two simulated datasets and four real multi-omics datasets. For a comprehensive comparison, we compared our algorithm with four established single-cell multi-omics data integration methods, namely NIC, scAI, MOFA+, and JSNMF.

scAI is a method specifically designed for the integrative analysis of parallel single-cell multi-omics data. The MOFA+ algorithm provides a low-dimensional representation of cells but does not include a built-in clustering method. Therefore, we clustered the samples based on the values of the latent factors obtained from the MOFA+ analysis. JSNMF is a robust algorithm designed to reduce dimensionality efficiently and uncover synergistic relationships in multi-omics data. For both scAI and MOFA+, we used the default parameters.

By including these established methods in our evaluation, we aimed to provide a comprehensive and fair comparison of MGR-NIC clustering performance against the state-of-the-art approaches across these diverse datasets.

As depicted in Figure 6, our proposed MGR-NIC algorithm performs very well on the two simulated datasets. This indicates that even when NIC clustering is already accurate, MGR-NIC can further improve the results. For instance, on simulated data 1, the ACC value increases from 0.99 to 0.998, while the ARI value improves from 0.970 to 0.994. These results are comparable to those of state-of-the-art multi-omics clustering algorithms.

FIG. 6.

Evaluating clustering performance on two simulated multi-omics datasets. (A) Comparison of four methods in simulated data 1. (B) Comparison of four methods in simulated data 2.

While simulated datasets can be used to validate the performance of the algorithm, we expect MGR-NIC to perform equally well on real multi-omics data, and we validated this using four multi-omics datasets from different tissues.

The clustering effectiveness of MGR-NIC exhibited substantial improvement compared to the original NIC algorithm. In Figure 7A, the ACC of MGR-NIC on the DLPFC dataset is 0.933, whereas that of NIC is 0.678; MGR-NIC also shows a clear advantage over scAI, MOFA+, and JSNMF. MGR-NIC outperformed the other methods on the CAR dataset as well, with an ACC of 0.615, while JSNMF, the next-best method, reached 0.587. The ARI, Purity, and P scores show a similar pattern.

FIG. 7.

Clustering performance was evaluated on four real multi-omics datasets. (A) ACC values for the four algorithms. (B) P values for the four algorithms. (C) ARI values for the four algorithms. (D) Purity values for the four algorithms. ACC, accuracy; ARI, adjusted Rand Index.

The ACC, ARI, P, and Purity scores of the various algorithms for cell clustering on real multi-omics datasets are presented in Figure 7. As illustrated in Figure 7, MGR-NIC demonstrates superior performance in comparison with the four other methods. Across diverse datasets, MGR-NIC consistently ranks within the top two in terms of accuracy. Moreover, these results demonstrate that MGR-NIC accurately identifies cell clusters even in large-scale datasets, indicating that the constrained optimization strategy is robust to varying dataset sizes.

4. CONCLUSION

The joint analysis of multi-omics data is transforming our understanding of neurodegenerative diseases such as AD. However, computational methods for the comprehensive analysis of single-cell transcriptome and epigenome profiles remain limited and ignore the geometric structure of the data space. To address this limitation, this study proposes MGR-NIC, a multi-omics analysis method that combines multigraph regularization with network-based integrative clustering.

First, MGR-NIC learns a cell similarity map from each of the two omics data types obtained from the same samples. Second, regularization constraints imposed on the basis matrices of both the similarity matrix and the PMI matrix improve feature extraction, strengthening the tendency of similar cells to aggregate under the graph structure constraints. Third, the two visualization methods show that the MGR-NIC algorithm learns useful information about the data. To further validate MGR-NIC, we compared it on six multi-omics datasets with leading algorithms for multi-omics data integration, including NIC, scAI, MOFA+, and JSNMF. The experiments consistently show that MGR-NIC is superior to NIC and MOFA+ on the same multi-omics data. Its clustering accuracy was higher than that of NIC on both the simulated and the real multi-omics datasets. On the mouse brain and peripheral blood datasets, MGR-NIC outperformed NIC, scAI, and MOFA+ across multiple clustering evaluation metrics, and on the DLPFC dataset it also outperformed JSNMF. These results highlight the strong performance of the proposed algorithm and its discriminative ability.
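The effect of a graph structure constraint can be illustrated with the generic graph Laplacian penalty used in graph-regularized matrix factorization. The sketch below is not the MGR-NIC objective, which constrains the basis matrices of the similarity and PMI matrices; it only shows, under stated assumptions, why such a penalty favors embeddings in which neighboring cells stay together. The simulated "rna" and "atac" matrices, the choice of k, the equal weights, and all function names are hypothetical.

```python
import numpy as np


def knn_graph(X, k=5):
    """Symmetric binary k-nearest-neighbour graph over the rows (cells) of X."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbour
    nearest = np.argsort(d2, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(W, W.T)  # keep an edge if either cell selects the other


def graph_laplacian(W):
    """Unnormalized Laplacian L = D - W."""
    return np.diag(W.sum(axis=1)) - W


def multigraph_penalty(H, laplacians, weights):
    """Weighted sum over graphs of Tr(H^T L H).

    For one graph, Tr(H^T L H) = 0.5 * sum_ij W_ij * ||h_i - h_j||^2,
    where h_i is the embedding (row of H) of cell i.
    """
    return sum(w * np.trace(H.T @ L @ H) for w, L in zip(weights, laplacians))


# Hypothetical data: 40 cells in two well-separated groups, seen in two modalities.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
centers = 10.0 * labels[:, None]
rna = centers + rng.normal(size=(40, 5))
atac = centers + rng.normal(size=(40, 8))

laplacians = [graph_laplacian(knn_graph(M)) for M in (rna, atac)]
weights = [0.5, 0.5]  # assumed equal weighting of the two graphs

H_matched = np.eye(2)[labels]          # embedding that follows the groups
H_mixed = np.eye(2)[np.arange(40) % 2]  # embedding that splits neighbours apart

penalty_matched = multigraph_penalty(H_matched, laplacians, weights)
penalty_mixed = multigraph_penalty(H_mixed, laplacians, weights)
print(penalty_matched, penalty_mixed)
```

An embedding that respects the neighborhood structure of both modalities incurs a lower penalty than one that separates neighboring cells, so adding such a term to a factorization objective pulls similar cells toward the same cluster.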

A major advantage of MGR-NIC is its flexibility in handling multiple types of input data. Nevertheless, considerable room for improvement remains, both in fusing additional data modalities and in further increasing clustering accuracy.

Footnotes

AUTHORS’ CONTRIBUTIONS

S.Q.Z. and W.K.: Data curation, software, methodology, writing—original draft and writing—review and editing. W.K.: Formal analysis, funding acquisition. S.Q.W. and K.L.: Resources, supervision, and project administration. K.W.: Investigation, methodology. G.W. and Y.L.Y.: Supervision, validation.

DATA AVAILABILITY STATEMENT

Data will be made available on request. The source code is provided on the website ().

AUTHOR DISCLOSURE STATEMENT

The authors declare no competing interests.

FUNDING INFORMATION

This work was supported by the Natural Science Foundation of Shanghai (No. 18ZR1417200).
