Deep Structure-Enhanced Cell Clustering Model for Single-Cell RNA Sequencing Data

Abstract

Recently, deep cell clustering, which employs deep neural networks to learn cell representation for clustering purposes, has attracted increasing research interests. Traditional deep cell clustering models for single-cell RNA sequencing data rely only on the cell’s internal features for learning the representation and suffer from the insufficient problem of representation learning. In this article, we introduce a deep structural enhanced network for cell clustering, namely, Deep Structure-Enhanced Cell Clustering (scDSEC). The scDSEC model uses the internal features of the cells as a foundation and enhances them by incorporating the external structural semantics of the cells. An integrated reinforcement enhancement strategy is designed, in which a complete cell representation, captured by fusing cell internal information and external information, and an enhanced cell internal representation, captured with the help of complete cell representation, are learned in a layer-by-layer reinforcement manner. Experimental results show that the scDSEC model outperforms various existing mainstream deep cell clustering algorithms in terms of performance.

Keywords

cell representation deep cell clustering single-cell RNA sequencing data structural representation

1. INTRODUCTION

Single-cell RNA sequencing (scRNA-seq) technology (Yu et al., 2022, 2024) is a revolutionary advancement in the biomedical field in recent years, enabling high-throughput sequencing of gene expression at the single-cell level, thereby revealing complex biological processes and intercellular heterogeneity. Unlike traditional RNA sequencing, which can only obtain average expression information from a population of cells, scRNA-seq allows for in-depth analysis of the characteristics and functions of individual cells, showcasing immense potential in fields such as cancer research, immunology, developmental biology, and neuroscience. For instance, scRNA-seq can identify cell types within the tumor microenvironment, analyze the functional states of immune cells, and track the differentiation trajectories of cells during embryonic development. Clustering, as an essential step in analyzing scRNA-seq data, helps identify different cell types or states by grouping similar cells together. Therefore, researching clustering methods for scRNA-seq data not only aids in better understanding the diversity and complexity among cells but also holds broad application prospects and significant importance.

Early cell clustering methods, such as K-means (Hartigan and Wong, 1979) and hierarchical clustering (Nielsen and Nielsen, 2016), were able to capture patterns and structures within scRNA-seq data to some extent. In addition, graph-based methods such as (Leiden Traag et al. and Louvain Ghosh et al., 2018) clustering are commonly used. These methods construct a k-nearest neighbor (kNN) graph of cells based on gene expression similarities to identify clusters. Overall, these traditional approaches provide important tools for understanding cellular heterogeneity. However, traditional clustering methods perform poorly on the high-dimensional scRNA-seq data, are easily affected by the sparsity of scRNA-seq data, and are sensitive to noise and outliers, significantly diminishing the accuracy and stability of cell clustering. For example, noise may lead to incorrect assignment of cells to different clusters, while sparsity may obscure actual intercellular differences.

In recent years, with the development of deep learning, deep cell clustering methods (Chen et al., 2020; Tian et al., 2019) have gradually become a research hotspot due to their excellent clustering performance on high-dimensional and sparse scRNA-seq data. These methods refine clustering iteratively by using auxiliary target distributions to learn highly reliable assignments, resulting in better cell clustering outcomes. However, traditional deep cell clustering models typically rely solely on the inherent features of each cell data sample to learn its internal representation. This approach mainly considers the internal features of cells, but due to the limited gene length of each cell, the descriptions often only encompass part of the distinguishing features, potentially overlooking other key characteristics. Thus, relying solely on internal cell features can lead to incomplete cell representations, causing similar cells to be incorrectly deemed dissimilar due to differing feature sets. This incomplete cell representation makes it challenging for traditional methods to accurately reflect the true relationships between cells, thereby affecting the accuracy of deep cell clustering.

In real scRNA-seq data, in addition to the internal features of the cells, there is also useful external information that can be discovered outside of each cell. For instance, neighboring cells, which refer to cells similar in gene expression profiles, are considered external structural information that typically contains meaningful supplementary features missing from the internal features of the cells themselves. Therefore, it is beneficial to improve the internal representations of cells by utilizing such external structural information. Although some traditional clustering methods, such as Leiden, utilize KNN graphs to group similar cells, they primarily rely on these graphs for clustering assignments and do not explicitly leverage external structural information to enhance internal cell representations.

To address these issues, this article designs a Deep Structure-Enhanced Cell Clustering (scDSEC) model for scRNA-seq data, which employs a structural enhancement network to learn cell partitions. Specifically, the scDSEC model uses the internal features of the cells as a foundation and enhances them by incorporating the external structural semantics of the cells. For each cell, an autoencoder (AE) (Hinton and Salakhutdinov, 2006) is used to capture its internal representation by mapping the original high-dimensional internal cell features to a low-dimensional representation space. A graph convolutional network (GCN) (Kipf and Welling, 2017) model is employed to learn the external representation of each cell, as it has shown superior results in representing scRNA-seq data as graphs and encoding external structural information for each cell sample. Both the AE and GCN models can capture the latent information of cells at different levels through their multilayer network architectures.

To utilize external information to enhance internal representations of cells, a common strategy is to assume that each cell representation is independent, learning these different cell representations separately and directly combining these learning results through a conventional fusion process (Bai et al., 2021; Shi et al., 2020). However, in reality, the internal and external representations of cells are not independent but are interrelated. A well-learned external representation can be used to emphasize, supplement, and eliminate ambiguities in internal cell features, which in turn helps to improve internal cell representations. Therefore, utilizing external structural information to enhance internal representations during the learning process is beneficial. The scDSEC model designs an integrated reinforcement enhancement strategy for cell representation learning, capable of simultaneously learning enhanced internal cell representations and integrated cell representations layer by layer. By integrating the integrated representations learned from internal and external representations, it captures the complete semantic features of both the internal content and external structure of cells. Subsequently, enhanced internal cell representations are learned based on the integrated representations of cells. The enhanced internal representations and integrated representations of cells are reinforced layer by layer, allowing the cell information learned in the previous stage to be used to further enhance the subsequent representation learning process. Additionally, the scDSEC model also designs a joint objective function to simultaneously achieve the learning of cell cluster partitioning and optimize the cell representation learning network of the scDSEC model.

The main contributions of this article can be summarized as follows:

•
This article designs a Deep Structure-Enhanced Cell Clustering (scDSEC) model for scRNA-seq data, employing a structural enhancement network to learn cell partitions. Specifically, the scDSEC model utilizes the internal features of the cells as a foundation and enhances them using the external structural information of the cells.
•
This article proposes an integrated reinforcement enhancement strategy for cell representation learning. The strategy progressively fuses internal and external representations to form a more comprehensive and discriminative cell embedding. Enhanced internal representations are refined under the guidance of integrated representations, and both are jointly optimized in a layer-by-layer reinforcement manner to achieve more robust and biologically meaningful clustering results.
•
This article conducts extensive experiments on real scRNA-seq datasets, comparing the proposed scDSEC model with several state-of-the-art clustering models. The experimental results demonstrate that the scDSEC model is effective and significantly improves cell clustering performance.

2. RELATED WORK

2.1. Deep cell clustering methods based on internal cell representations

In recent years, deep cell clustering methods for analyzing scRNA-seq data have received widespread attention due to their powerful capabilities in cell representation learning. For example, scDeepCluster (Chen et al., 2020) combines Deep Clustering Network (Yang et al., 2017) and Deep Embedded Clustering (Xie et al., 2016) to simultaneously learn latent feature representations and clustering assignments. The model-based deep learning approach (scziDesk) (Tian et al., 2019) employs a weighted soft K-means algorithm in the latent space to achieve high-quality cell clustering. The deep multi-constraint soft clustering model for single-cell RNA-seq data via zero-inflated autoencoder embedding (scMCKC) (He et al., 2023) introduce cell-level compactness constraints by considering the associations among similar cells. The denoising adaptive deep clustering with self-attention mechanism (scDASFK) (Su et al., 2023) incorporates cell similarity information into clustering methods and uses deep denoising network modules to clean the data. Overall, these methods integrate feature representation learning and clustering assignments to enhance final clustering performance. However, they only consider gene expression information without explicitly describing the relationships and structural information between cells, neglecting the relationships among cells.

2.2. Deep cell clustering methods based on external structural representations

In recent years, deep graph clustering methods have significantly addressed the aforementioned issues by learning latent low-dimensional representations while considering the global structure of the entire graph. To this end, many researchers have proposed various deep graph embedding clustering methods to learn the structural information and associations between cells. For instance, the single-cell model-based deep graph embedding clustering (scTAG ) (Yu et al., 2022) and the deep graph clustering model for single-cell transcriptomics data based on unsupervised Gene-Cell Collective representation learning and Optimal Transport (scGCOT) (Yu et al., 2024) utilize graph neural networks to encode both the topological relationships between cells and the attributes of individual cells, thereby enhancing clustering accuracy. Overall, existing state-of-the-art graph embedding clustering methods extract cell associations to better understand gene expression. However, they only consider external structural information, neglecting the internal gene expression information of the cells themselves.

Recently, some studies have begun to incorporate both internal and external information to enhance the clustering process. The Structural Deep Clustering Network (SDCN) (Bo et al., 2020) integrates structural information into deep clustering by designing a transfer operator that passes the representations learned by an AE to corresponding GCN layers. Building upon SDCN, the Deep Fusion Clustering Network (DFCN) (Tu et al., 2021) further refines this approach by dynamically combining the structural representation to enrich the semantic representation at the final layer. Despite the success of these models, both SDCN and DFCN are not ideally suited for cell clustering tasks. This limitation arises because these models were developed primarily to leverage GCN-based structural information for clustering. Consequently, the structural representation derived from GCNs, which often includes significant noise for cell data, serves as the base representation. Additionally, the importance of gene features, which has been well recognized in prior literature, is not fully explored and only plays a supplementary role in the SDCN and DFCN models. As a result, these models are more effective for tasks that emphasize structural data processing. In contrast, the scDSEC model proposed in this article improves cell clustering performance by using the internal features of cells as the primary foundation and incorporating external structural information to enhance clustering outcomes.

3. PROPOSED MODEL

The overall framework of the proposed scDSEC model consists of four modules: The cell data processor, the structurally enhanced internal cell representation encoder, the Zero-Inflated Negative Binomial (ZINB)-based cell representation decoder, and the cell clustering module. First, for the scRNA-seq dataset, the scDSEC model normalizes the scRNA-seq data through the cell data processor and applies PCA for dimensionality reduction on the processed data. The structural-enhanced internal cell representation encoder uses external structural information to enhance the internal representations of cells at different levels, layer by layer, through an integrated reinforcement enhancement strategy. Next, the ZINB-based cell representation decoder takes the enhanced internal representations outputted by the internal cell representation enhancement learning module as input, reconstructs the cell data based on the ZINB (Miao et al., 2018; Risso et al., 2018) distribution, and reconstructs the cell structure graph using an inner product decoder. This design choice is motivated by previous findings that the distribution of scRNA-seq gene expression counts can be well approximated by a ZINB model (Yu et al., 2022). Finally, the cell clustering module optimizes the learning of enhanced cell representations through a joint objective function and performs the clustering of cell groups. The overall structure of the model is illustrated in Figure 1, which will be detailed in the following sections covering the four modules.

FIG. 1.

The framework of the proposed model (scDSEC). scDSEC, Deep Structure-Enhanced Cell Clustering.

3.1. Cell data processor

The cell data processor takes the scRNA-seq gene expression matrix as input, where X_ij represents the expression count of gene j (1 ≤ j ≤ O) in cell i (1 ≤ i ≤ N). The first step is to filter out genes that are expressed in less than 1% of the cells or not expressed at all. Considering that the count scRNA-seq data in the matrix is discrete and varies greatly in size, the normalization is defined as follows:

N (X_{i j}) = \ln (m (X) \frac{X_{i j}}{\sum_{o} X_{i o}}),

(1)

where m (X) represents the median of the total expression values of the cells. According to Equation (1), the discrete scRNA-seq data are smoothed and rescaled by a natural logarithmic transformation. After normalization, the scDSEC model selects the top t highly variable genes ranked by normalized dispersion values calculated using the Scanpy package, identifying genes with high information content.

The scDSEC model constructs the cell structure graph A from the normalized expression matrix of the selected top t highly variable genes. To improve computational efficiency and reduce noise, PCA is first applied to obtain a lower-dimensional representation of the cells. Euclidean distance is then used to compute pairwise similarities between cells in this feature space, and the KNN algorithm is applied to identify the k nearest neighbors for each cell. Specifically, if node a is among the k nearest neighbors of node b, an edge is established between them. The resulting graph A represents the cell–cell neighborhood structure used in subsequent graph learning.

3.2. Structural-enhanced internal cell representation encoder

Due to the outstanding performance of graph convolutional neural networks (GCNs) in structural representation learning, the scDSEC model utilizes a multilayer GCN to learn external structural representations of cells in the structural-enhanced internal cell representation encoder. The specific design of the external structural representation learning network is as follows:

Z^{(l)} = σ_{g} ({\tilde{D}}^{- \frac{1}{2}} \tilde{A} {\tilde{D}}^{- \frac{1}{2}} Z^{(l - 1)} W^{(l)})

(2)

\tilde{A} = A + I

(3)

{\tilde{D}}_{i i} = \sum_{t = 1}^{N} {\tilde{A}}_{i t}

(4)

where

σ_{g}

is the activation function, W⁽ ^l ⁾ is the weight matrix of the l-th encoding layer, j is the index of the cell samples, and N is the number of cells. I denotes the identity diagonal matrix of A, and

\tilde{D}

is the degree matrix of

\tilde{A}

, following standard GCN formulation as in SDCN (Bo et al., 2020) and other classical methods. This design ensures stable feature propagation and preserves self-information for each node. It is worth noting that the input to the first layer of the GCN is the cell data X.

To learn enhanced internal representations of cells, the scDSEC model designs an integrated reinforcement enhancement strategy, which utilizes external structural information to enhance internal representations during the learning process, capable of simultaneously learning enhanced internal cell representations and integrated cell representations layer by layer. By integrating the integrated representations learned from internal and external representations, it captures the complete features of both internal content and external structure of cells. Subsequently, enhanced internal cell representations are learned based on the integrated representations of cells. The enhanced internal representations and integrated representations of cells are reinforced layer by layer, allowing the cell information learned in the previous stage to be used to further enhance the subsequent representation learning process. The specific design of the cell internal representation learning network is as follows:

H^{(l)} = σ_{e} (W_{e}^{(l)} F^{(l - 1)} + b_{e}^{(l)})

(5)

F^{(l)} = α H^{(l)} + (1 - α) Z^{(l)}

(6)

where σ is the activation function, and

W_{e}^{(l)}

and

b_{e}^{(l)}

are the weight matrix and bias matrix of the l-th layer in the encoder, respectively.

F^{(l)}

denotes the integrated representations of cell data. α is a balance coefficient factor that adjusts the contribution of the integrated representations

F^{(l)}

and the structural-enhanced cell internal representation

H^{(l)}

. Since the input of the first layer is the cell data X, F⁽⁰⁾ = X. The encoding of the last layer (i.e., the L-th layer) in this module, representing the output of the structural-enhanced cell internal representation, can be expressed as follows:

H^{(L)} = F^{(L)}

(7)

3.3. ZINB-based cell representation decoder

To better guide the learning of cell representations, the scDSEC model designs a ZINB-based cell representation decoder. As shown in Figure 1, the ZINB-based cell representation decoder first decodes the enhanced cell internal representations H⁽ ^L ⁾ output by the cell internal representation enhancement learning module, aiming to reconstruct both the cell data X and the cell structure graph A. The specific network structure is as follows:

D^{(l)} = σ_{d} (W_{d}^{(l)} D^{(l - 1)} + b_{d}^{(l)})

(8)

A^{'} = σ (H^{(L)} H^{{(L)}^{T}})

(9)

where σ and σ_d are the activation functions, and

W_{d}^{(l)}

and

b_{d}^{(l)}

are the weight matrix and bias matrix in the l-th layer of the decoding module, respectively. Since the input of the first layer in this module is the enhanced internal cell representation H⁽ ^L ⁾, the first decoding layer can be expressed as follows:

D^{(1)} = σ_{d} (W_{d}^{(1)} H^{(L)} + b_{d}^{(1)})

(10)

The last layer of the decoding network can be expressed as follows:

X^{'} = σ_{d} (W_{d}^{(L)} D^{(L - 1)} + b_{d}^{(L)})

(11)

where

X^{'}

is the reconstructed cell data.

Given that scRNA-seq gene expression count matrices typically exhibit three characteristics, namely, (1) discreteness, (2) variance greater than the mean, and (3) many zero entries, these features can be modeled by a ZINB distribution. Therefore, to preserve the global structure of cell data X, the scDSEC model utilizes ZINB distribution as the target, combining it with the reconstructed cell data to better capture the structure of scRNA-seq data. Formally, ZINB is represented by the mean parameter μ and the dispersion parameter θ of the negative binomial distribution, along with an additional coefficient π (representing the probability weight of dropout events in scRNA-seq data):

ZINB (X | μ, θ, π) = π δ_{0} (X) + (1 - π) N B (X | μ, θ)

(12)

NB (X | μ, θ) = \frac{Γ (X + θ)}{X! Γ (θ)} {(\frac{θ}{θ + μ})}^{θ} {(\frac{μ}{θ + μ})}^{X}

(13)

where μ and θ represent the mean and dispersion of the negative binomial distribution, and π represents the weight of the zero part. To capture the structural characteristics of scRNA-seq data, we assume that the reconstructed

X^{'}

also follows a ZINB distribution. The three parameters μ, θ, and π of the ZINB distribution are estimated through three parallel fully connected output layers:

μ = \exp (W_{μ} X^{'})

(14)

θ = \exp (W_{θ} X^{'})

(15)

π = sigmoid (W_{π} X^{'})

(16)

where W_μ, W_θ, and W_π are the respective weights of the parameter estimators. The choice of activation function depends on the range and definition of the parameters: Since π represents the weight of zeros, its value ranges between [0, 1], so we use the sigmoid function to estimate it. Given the nonnegative nature of parameters μ and θ, we use the exponential function to estimate them.

3.4. Cell clustering module

The cell clustering module involves two tasks. The first task is to perform cell clustering using the enhanced internal cell representations H⁽ ^L ⁾ output by the structural-enhanced internal cell representation encoder. The second task is to optimize the cell representation learning network in the scDSEC model. Thus, the cell clustering module designs a joint objective function to accomplish both tasks simultaneously. This objective function incorporates four guiding pieces of information: clustering loss, the reconstruction loss of cell data, the reconstruction loss of cell data based on ZINB, and the reconstruction loss of the cell structure graph.

To learn the cell cluster assignments, given the i-th cell sample and the j-th cluster center, we use the Student’s t-distribution Van der Maaten and Hinton, 2008 to measure the similarity between the enhanced internal cell representation H⁽ ^L ⁾ and the cluster center v_j and let Q = q_ij represent the distribution of the cell clustering result. Therefore, similar to the approach in the literature, the clustering loss of the cells is calculated as follows:

L_{clu} = K L (P ‖ Q)

(17)

where Q and P are the clustering result distribution and target distribution, respectively.

The reconstruction loss of the cell data based on ZINB is calculated using the negative log-likelihood of the ZINB distribution, allowing the model to learn neural network layer parameters that best capture the scRNA-seq data’s characteristics. Thus, the loss function for this part is defined as follows:

L_{ZINB} = - \log ZINB (X | μ, θ, π)

(18)

The reconstruction loss of the cell representation is calculated as follows:

L_{re} = MSE (X, X^{'})

(19)

where MSE (X, X') is mean squared error.

The reconstruction loss of the cell structure graph is calculated as follows:

L_{graph} = BCE (A, A^{'})

(20)

where BCE (A, A') is cross-entropy.

Combining the above losses, the overall loss function of the scDSEC model is expressed as follows:

L_{total} = L_{re} + β_{1} L_{clu} + β_{2} L_{ZINB} + β_{3} L_{graph}

(21)

where β₁, β₂, and β₃ are hyperparameters to balance the clustering loss, the reconstruction loss of cell data based on ZINB, and the reconstruction loss of the cell structure graph. The training process of the scDSEC model is shown in Algorithm 1. Notably, the maximum number of iterations M is set based on preliminary experiments to ensure convergence. In practice, the model typically converges well before reaching M iterations.

4. EXPERIMENTS

4.1. Experimental datasets

We perform our experiments using four real scRNA-seq datasets. These datasets include a variety of species, such as mice and humans, covering different types of cells, including lung cells and trachea cells. The statistics for these datasets are presented in Table 1.

Table 1.
The Statistics of the Single-Cell RNA Sequencing -Seq Datasets

Dataset Cell Gene Class Reference

Muraro 2122 19046 9 (Muraro et al., 2016)

QS Lung 1676 23341 11 (Schaum et al., 2018)

QS Diaphragm 870 23341 5 (Schaum et al., 2018)

QS Trachea 1350 19992 4 (Schaum et al., (2018)

Dataset	Cell	Gene	Class	Reference
Muraro	2122	19046	9	(Muraro et al., 2016)
QS Lung	1676	23341	11	(Schaum et al., 2018)
QS Diaphragm	870	23341	5	(Schaum et al., 2018)
QS Trachea	1350	19992	4	(Schaum et al., (2018)

4.2. Baseline methods

We selected six state-of-the-art methods for comparison. Those compared baselines can be categorized into four groups as follows: (1) Traditional cell clustering method: Leiden (standard Scanpy pipeline); (2) deep cell clustering methods via internal representation learning: scEMC (Clustering method via skip aggregation network) (Hu et al., 2024); scMCKC (He et al., 2023), scDASFK (Su et al., 2023), and scVI (Deep generative modeling for single-cell transcriptomics) (Lopez et al., 2018); (3) deep cell clustering methods via external representation learning: scGCOT (Yu et al., 2024) and scTAG (Yu et al., 2022); and (4) deep clustering methods via internal and external representation learning: DFCN (Tu et al., 2021) and SDCN (Bo et al., 2020).

4.3. Evaluation metrics and experimental setup

To assess the performance of the clustering tasks, this chapter employs two widely used evaluation metrics: Normalized Mutual Information (NMI) (Estévez et al., 2009) and Average Rand Index (ARI) (Xia et al., 2014). Higher values for each metric indicate superior performance.

All experiments were conducted using the PyTorch platform and an NVIDIA 4060 TI GPU. In the proposed scDSEC, we set the dimension of the latent layers to d-500-500-2000-10, with the nearest neighbor parameter E = 50 and the number of highly variable genes set to 1000. β₁, β₂, and β₃ were set to 1. α was set to 0.5. The model was trained for 50, and optimization was performed using the Adam algorithm with a learning rate of 1e-5. For all baseline methods, we used their official implementations with default parameters unless otherwise specified. All methods were executed under the same computational environment to ensure fair comparison. To account for variability, we report the average results over 10 independent runs.

4.4. Analysis of clustering results

To evaluate the effectiveness of the scDSEC model, this experiment compares and analyzes its clustering results against several baseline models across four scRNA-seq datasets. The clustering results are summarized in Table 2, with the following key observations:

Table 2.
Clustering Results of Compared Experiment

QS_trachea Muraro QS_diaphragm QS_lung

Methods NMI ARI NMI ARI NMI ARI NMI ARI

Leiden 55.74 20.68 79.88 65.26 79.43 67.94 72.20 35.25

scVI 53.88 20.30 73.89 46.87 71.08 42.48 70.73 31.93

scEMC 68.23 60.36 82.95 86.07 72.00 66.54 79.78 83.68

scDASFK 8.05 4.51 70.95 62.67 39.38 28.29 39.57 17.86

scGCOT 69.64 63.76 82.04 87.57 91.41 93.91 79.92 72.17

scTAG 52.85 66.52 78.48 74.28 93.94 96.81 79.79 66.45

SDCN 45.55 68.58 54.11 60.40 19.30 14.94 32.74 20.13

DFCN 50.26 41.80 81.98 68.38 94.06 95.62 68.50 49.39

scDSEC 74.73 81.92 87.85 92.63 95.17 97.13 80.38 85.58

	QS_trachea	Muraro	QS_diaphragm	QS_lung
Leiden	55.74	20.68	79.88	65.26	79.43	67.94	72.20	35.25
scVI	53.88	20.30	73.89	46.87	71.08	42.48	70.73	31.93
scEMC	68.23	60.36	82.95	86.07	72.00	66.54	79.78	83.68
scDASFK	8.05	4.51	70.95	62.67	39.38	28.29	39.57	17.86
scGCOT	69.64	63.76	82.04	87.57	91.41	93.91	79.92	72.17
scTAG	52.85	66.52	78.48	74.28	93.94	96.81	79.79	66.45
SDCN	45.55	68.58	54.11	60.40	19.30	14.94	32.74	20.13
DFCN	50.26	41.80	81.98	68.38	94.06	95.62	68.50	49.39
scDSEC	74.73	81.92	87.85	92.63	95.17	97.13	80.38	85.58

The bold values indicate the best performance.

•

For each evaluation metric, scDSEC consistently outperformed all baseline models across all scRNA-seq datasets, achieving the highest clustering performance. These results strongly validate the effectiveness of the scDSEC model. The findings indicate that scDSEC successfully learns structurally enhanced internal representations, ensuring that the cell representations effectively capture both deep features and structural information, which in turn improves the accuracy and performance of cell clustering.

•

A comparison of various methods reveals that deep cell clustering techniques using external representation learning generally outperform those relying on internal representation learning. This advantage stems from the fact that internal representation learning methods depend solely on scRNA-seq data features, overlooking the richer graph representations learned through cell graphs and failing to incorporate external structural information. In contrast, external representation learning methods effectively capture more comprehensive graph-based information. Consequently, deep cell clustering methods using external representation learning demonstrate superior performance when dealing with complex scRNA-seq data, offering a more accurate representation of the underlying cell structure.

•

Additionally, deep clustering methods that leverage both internal and external representation learning typically perform well across various scRNA-seq datasets. This is because these methods can effectively learn both internal and external representations, leading to improved clustering outcomes. The scDSEC model outperforms both the DFCN and SDCN models due to its ability to learn structurally enhanced representations of cells, which integrate the structural information into the internal representation of the scRNA-seq data, resulting in higher clustering accuracy. Furthermore, scDSEC employs a reconstruction loss based on ZINB, which is particularly well-suited for modeling scRNA-seq data.

4.5. Ablation analysis

4.5. 1 Architecture ablation

To investigate the contributions of the major architectural components in scDSEC, we conducted a series of architecture ablation experiments by removing or modifying specific modules. The following variants were evaluated: (1) removing the reinforcement enhancement mechanism and replacing the layer-wise fusion strategy with a single-layer fusion (w/o RE), (2) removing the AE encoder component (w/o AE), and (3) removing the GCN component (w/o GCN).

The quantitative results on four datasets are summarized in Figure 2. It can be observed that removing any component leads to a degradation in clustering performance compared with the scDSEC model, indicating that each module contributes positively to the overall framework. Interestingly, removing the AE encoder leads to the worst performance on all datasets. This observation indicates that the latent representations learned by the AE encoder contain richer and more discriminative biological information.

FIG. 2.

Architecture ablation results for scDSEC on four datasets.

4.5. 2 Analysis of the joint objective function

In this experiment, we performed auxiliary tests across all datasets to evaluate the individual contributions of each loss function to overall clustering performance. Specifically, AVG represents the average performance across all datasets, $L_{re}$ represents the reconstruction loss of cell data, $L_{clu}$ represents the clustering loss, $L_{ZINB}$ represents the reconstruction loss of cell data using ZINB, and $L_{graph}$ represents the reconstruction loss of the cell structure graph. The results are summarized in Table 3. From these results, we can observe the following:

Table 3.
Clustering Results of Different Loss Function

Methods AVG

$L_{re}$ $L_{graph}$ $L_{clu}$ $L_{ZINB}$ NMI ARI

✓ ✗ ✗ ✗ 78.77 77.21

✗ ✓ ✗ ✗ 77.87 73.75

✗ ✗ ✓ ✗ 76.57 71.27

✗ ✗ ✗ ✓ 75.61 71.43

✓ ✓ ✗ ✗ 76.44 69.73

✓ ✓ ✓ ✗ 78.34 74.99

✓ ✓ ✓ ✓ 84.53 89.32

Methods	AVG
✓	✗	✗	✗	78.77	77.21
✗	✓	✗	✗	77.87	73.75
✗	✗	✓	✗	76.57	71.27
✗	✗	✗	✓	75.61	71.43
✓	✓	✗	✗	76.44	69.73
✓	✓	✓	✗	78.34	74.99
✓	✓	✓	✓	84.53	89.32

The bold values indicate the best performance.

•

The performance of scDSEC surpasses that of all its variants, attributed to its more effective clustering optimization achieved through the joint objective function. The interaction among the four loss functions within this joint objective function enhances their collective impact, leading to a significant improvement in clustering performance. This underscores the effectiveness of employing a joint objective function.

•

Moreover, without the reconstruction loss of cell data using ZINB $L$ ZINB, scDSEC does not achieve high clustering accuracy. This is because the ZINB model is particularly well-suited for scRNA-seq data, which is known for its high sparsity, dropout events, and overdispersion, all of which the ZINB model effectively manages. This highlights the importance of the reconstruction loss using ZINB $L$ ZINB in further enhancing clustering performance.

4.6. Analysis of the number of different genes

In single-cell data analysis, highly variable genes are essential for capturing valuable biological insights and identifying cell types. To evaluate the impact of the number of selected highly variable genes, we applied scDSEC to real scRNA-seq datasets, varying the gene counts from 300 to 10,000. Figure 3 presents the NMI and ARI values for all datasets, using 300, 500, 1000, 2000, 5000, and 10,000 highly variable genes, respectively. The results demonstrate that selecting 1000 highly variable genes yields the highest performance, while using only 300 genes results in significantly poorer outcomes. Additionally, as the number of selected genes exceeds 1000, performance shows a slight decline, likely due to the noise introduced by incorporating additional genes. Consequently, to optimize computational resources, we set the number of selected highly variable genes in the model to 1000.

FIG. 3.

(a) The clustering results of the NMI values with different number of genes. (b) The clustering results of the ARI values with different number of genes. ARI, Average Rand Index; NMI, Normalized Mutual Information.

4.7. Analysis of the number of neighbors

The number of neighbors K is a critical parameter in constructing the KNN graph to capture external relationships between cells. To evaluate the impact of K on clustering performance, we varied K ∈ {5, 10, 30, 50, 100, 200} for all scRNA-seq datasets. Figure 4 presents the NMI and ARI values on all datasets for the six cases with the scDSEC. From Figure 4, the cell clustering performance improves rapidly with an increase in K up to a certain point, where it reaches an optimal value. However, when K becomes excessively large, the performance begins to decline slightly. This degradation may be attributed to the introduction of noise and redundant information from distant or less relevant neighbors, which can blur cluster boundaries and weaken feature discrimination. To maintain optimal clustering performance, we set K within a reasonable range for all experiments.

FIG. 4.

(a) The clustering results of the NMI values with different number of neighbors K. (b) The clustering results of the ARI values with different number of neighbors K.

4.8. Analysis of the parameter α

To study the effect of balance coefficient α, we set α = {0.0, 0.1, 0.3, 0.5, 0.7, 0.9.1.0} on all scRNA-seq datasets in detail. Figure 5 tabulates the average NMI and ARI values on all scRNA-seq datasets for the seven cases with the scDSEC. Note that α = 0 means that the learned cell representation only contains the external representation, and α = 1 represents that the scDSEC only uses the internal representation.

FIG. 5.

The clustering results of the average NMI and ARI values with the parameter α.

From Figure 5, all metrics reach their highest performance when the parameter α = 0.5, indicating that both internal and external representations are equally critical for optimal results. This suggests that the improvements in scDSEC largely stem from enhanced cell representation. Another interesting observation is that when α = 0.1 and α = 0.9, even a slight fusion of internal or external representations improves cell clustering performance and helps mitigate the oversmoothing issue.

4.9. Biological validation

To further assess the biological relevance and interpretability of the representations learned by scDSEC, we conducted a qualitative biological validation on multiple representative scRNA-seq datasets. Specifically, we compared the two-dimensional visualizations of cells in both the raw expression space and the scDSEC representation space, with each point colored according to its ground-truth biological cell type annotation. As illustrated in Figure 6, the left panels show the cell distributions in the raw expression space, while the right panels present those in the corresponding scDSEC representation space.

FIG. 6.

Biological Validation of scDSEC representations across multiple datasets. For each dataset, the left panel shows the t-SNE visualization of cells in the raw expression space, while the right panel shows the t-SNE visualization of the corresponding scDSEC representation space, with points colored by ground-truth biological cell type annotations.

Across all datasets (Muraro, QS Trachea, QS Lung, and QS Diaphragm), the embeddings generated by scDSEC exhibit substantially clearer separation and more compact clustering boundaries compared with the raw expression space. For instance, distinct pancreatic cell types such as alpha, beta, and delta cells in the Muraro dataset, as well as epithelial, endothelial, and mesenchymal cell populations in the QS Trachea datasets, are well separated in the scDSEC representation space. These results demonstrate that scDSEC effectively captures biologically meaningful structures and enhances the discriminability among different cell types, validating the biological interpretability of the learned representations.

To quantify these improvements, we applied k-means clustering to both the raw expression space and the scDSEC representation space and computed the Average Rand Index (ARI) and Silhouette scores. As shown in Table 4, the scDSEC representations consistently achieve higher ARI and Silhouette scores across all datasets. These results indicate that scDSEC not only improves clustering accuracy with respect to ground-truth cell types but also enhances cluster compactness and separation. Together with Figure 6, these findings demonstrate that scDSEC effectively captures biologically meaningful structures and improves the interpretability of the learned representations.

Table 4.

Quantitative Evaluation of Deep Structure-Enhanced Cell Clustering Representations Using Average Rand Index and Silhouette Scores

	ARI (%)		Silhouette (%)
Dataset	Raw	scDSEC	Raw	scDSEC
Muraro	73.89	92.83	20.29	93.40
QS trachea	58.57	87.22	10.35	84.16
QS lung	61.28	68.06	13.47	83.33
QS diaphragm	92.09	97.66	14.14	97.78

4.10. Analysis of scalability and efficiency

With the rapid advancement of scRNA-seq technologies, the number of sequenced cells has grown dramatically, giving rise to large-scale datasets that pose significant computational challenges. Therefore, assessing the scalability and computational efficiency of clustering algorithms is crucial for evaluating their practical applicability to modern single-cell studies.

To this end, we extended our scalability analysis of scDSEC by incorporating two large-scale scRNA-seq datasets: Cao 2020 Intestine and Cao 2020 Pancreas, which contain approximately 51,650 and 45,653 cells annotated with 12 and 14 clusters, respectively. During experimentation, we found that among the base-line methods, only Leiden, scVI, and scDSEC could be successfully executed on these large-scale datasets within our computational environment. Other deep learning-based clustering methods either failed to converge or encountered out-of-memory errors when processing datasets larger than 40k cells. Figure 7 presents the clustering performance (NMI and ARI), the runtime and memory consumption of scDSEC, Leiden, and scVI on these two large datasets.

FIG. 7.

Comparison of clustering performance and efficiency on large-scale scRNA-seq datasets. scRNA-seq, single-cell RNA sequencing.

The results demonstrate that scDSEC achieves the best overall clustering performance, maintaining both high NMI and ARI values across datasets while preserving computational feasibility. Although scDSEC introduces a moderate increase in memory usage due to its dual-representation enhancement mechanism, it remains significantly faster than scVI and well within acceptable computational limits. These findings confirm that scDSEC effectively scales to large single-cell datasets while delivering superior clustering accuracy and reasonable computational cost.

4.11. Analysis of robustness and generalization

To demonstrate the robustness and generalization of the proposed scDSEC model, we conducted additional experiments on several widely used benchmark datasets, including PBMC and Baron (Table 5). The Baron dataset contains pancreatic cells collected from four human donors and two mouse donors. These samples were grouped by species to construct two separate datasets, namely, Baron (human) and Baron (mouse), allowing us to evaluate model robustness across different biological sources. Since the PBMC dataset is large, and under our experimental environment, only scVI and Leiden can be executed on the full dataset. To enable fair comparisons with all baseline methods, we randomly sampled 10% of the cells from each cell type for our experiments.

Table 5.
Summary of Benchmark Datasets

Dataset Cells Gene Class Donors Reference

PBMC 6857 20,387 10 — (Min et al., 2024)

Baron (human) 8569 13,808 14 4 (Liu et al., 2024)

Baron (mouse) 1886 12,629 13 2 (Liu et al., 2024)

Dataset	Cells	Gene	Class	Donors	Reference
PBMC	6857	20,387	10	—	(Min et al., 2024)
Baron (human)	8569	13,808	14	4	(Liu et al., 2024)
Baron (mouse)	1886	12,629	13	2	(Liu et al., 2024)

Table 6 reports the clustering performance of different methods on these datasets in terms of ARI and NMI. The results demonstrate that scDSEC consistently achieves competitive or superior clustering performance compared with baseline methods across the additional benchmark datasets, confirming its robustness and generalization capability.

Table 6.

Clustering Results on Benchmark Datasets

Methods	PBMC		Baron_human		Baron_mouse
Methods	NMI	ARI	NMI	ARI	NMI	ARI
Leiden	54.40	30.70	79.34	55.56	72.39	44.99
scVI	51.72	20.89	64.22	23.71	64.83	22.83
scEMC	47.12	31.03	76.68	79.15	68.30	63.70
scDASFK	35.00	32.40	55.36	41.84	50.74	32.66
scGCOT	57.66	39.88	63.94	41.94	70.83	72.13
scTAG	17.97	14.46	44.17	19.12	47.52	26.81
SDCN	21.86	6.77	37.30	27.45	29.20	10.82
DFCN	60.70	41.46	76.15	85.29	72.33	71.47
scDSEC	67.77	52.93	80.35	86.53	73.87	82.25

The bold values indicate the best performance.

To further illustrate the clustering structure and batch mixing characteristics, we visualized the clustering results on the Baron datasets using t-SNE. As shown in Figure 8, cells belonging to the same cell type form compact clusters, while cells from different donors are well mixed. These results indicate that the proposed method effectively captures the underlying biological structure while mitigating batch effects across different donors.

FIG. 8.

Visualization of clustering results on the Baron datasets.

5. CONCLUSION

This article introduces a novel deep structure-enhanced cell clustering model for scRNA-seq data, named scDSEC. The model uses the internal features of cells as its foundation and enhances these features by integrating external structural information. A unique reinforcement enhancement strategy is employed, wherein a comprehensive cell representation-formed by merging internal and external information-and an improved internal cell representation, refined with the aid of the complete cell representation, are learned in a progressive, layer-by-layer manner. Experimental results on multiple scRNA-seq datasets validate the effectiveness of scDSEC.

An interesting direction for future research involves exploring how the scDSEC model can be adapted to handle noisy cell features, which remains a significant challenge in cell analysis. In addition, we plan to extend our framework to automatically determine the optimal number of clusters in a data-driven manner, which could further improve the model’s robustness and biological interpretability. Moreover, investigating alternative distributional assumptions is an interesting and promising direction for future research, as it may reveal how different count models affect clustering performance and generalization.

AUTHORS’ CONTRIBUTIONS

M/Y.: Investigation, resources, writing—original draft, and funding acquisition. L.R.: Validation, conceptualization, methodology, writing—review and editing, funding acquisition, and supervision. Both authors have read and approved the final article.

Footnotes

ACKNOWLEDGMENTS

This work is supported by Guizhou Province Basic Research Program (Natural Science) General Project No. MS[2025]338, Project of Guizhou Light Industry Technical College No. 25QY17.

DISCLOSURE STATEMENT

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

FUNDING INFORMATION

Guizhou Province Basic Research Program (Natural Science) General Project (No. MS[2025]338); Project of Guizhou Light Industry Technical College (No. 25QY17).

References

Bai

, Huang

, Chen

, et al. Deep multi-view document clustering with enhanced semantic embedding. Information Sciences 2021;564:273–287.

, Wang

, Shi

, et al. Structural deep clustering network. In Proceedings of The Web Conference 2020, 1400–1410, 2020.

Chen

, Wang

, Zhai

, et al. Deep soft k-means clustering with self-training for single-cell rna sequence data. NAR Genom Bioinform 2020;2(2):lqaa039.

Estévez

, Tesmer

, Perez

, et al. Normalized mutual information feature selection. IEEE Trans Neural Netw 2009;20(2):189–201.

Ghosh

, Halappanavar

, Tumeo

, et al. Distributed louvain algorithm for graph community detection. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 885–895, 2018. doi: 10.1109/IPDPS.2018.00098

Hartigan

, Wong

. Algorithm as 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. series c (Applied Statistics) 1979;28(1):100–108.

, Chen

, Tu

, et al. Deep multi-constraint soft clustering analysis for single-cell rna-seq data via zero-inflated autoencoder embedding. IEEE/ACM Trans Comput Biol Bioinform 2023;20(3):2254–2265.

Hinton

, Salakhutdinov

. Reducing the dimensionality of data with neural networks. Science 2006;313(5786):504–507.

, Liang

, Dong

, et al. Effective multi-modal clustering method via skip aggregation network for parallel scrna-seq and scatac-seq data. Brief Bioinform 2024;25(2):bbae102.

10.

Kipf

, Welling

. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations , 2017.

11.

Liu

, Liang

, Wang

, et al. sclega: An attention-based deep clustering method with a tendency for low expression of genes on single-cell rna-seq data. Brief Bioinform 2024;25(5):bbae371.

12.

Lopez

, Regier

, Cole

, et al. Deep generative modeling for single-cell transcriptomics. Nat Methods 2018;15(12):1053–1058.

13.

Maaten

, Hinton

. Visualizing data using t-sne. Journal of Machine Learning Research 2008;9(11).

14.

Miao

, Deng

, Wang

, et al. Desingle for detecting three types of differential expression in single-cell rna-seq data. Bioinformatics 2018;34(18):3223–3224.

15.

Min

, Wang

, Zhu

, et al. scasdc: Attention enhanced structural deep clustering for single-cell rna-seq data. 2024.

16.

Muraro

, Dharmadhikari

, Grün

, et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst 2016;3(4):385–394.e3.

17.

Nielsen

, Nielsen

. Hierarchical clustering. Introduction to HPC with MPI for Data Science, 195–211, 2016.

18.

Risso

, Perraudeau

, Gribkova

, et al. A general and flexible method for signal extraction from single-cell rna-seq data. Nat Commun 2018;9(1):284.

19.

Schaum

, Karkanias

, Neff

, et al. Single-cell transcriptomics of 20 mouse organs creates a tabula muris: The tabula muris consortium. Nature 2018;562(7727):367.

20.

Shi

, Nie

, Wang

, et al. Auto-weighted multi-view clustering via spectral embedding. Neurocomputing (Amst) 2020;399:369–379.

21.

, Lin

, Wang

, et al. Denoising adaptive deep clustering with self-attention mechanism on single-cell sequencing data. Brief Bioinform 2023;24(2):bbad021.

22.

Tian

, Wan

, Song

, et al. Clustering single-cell rna-seq data with a model-based deep learning approach. Nat Mach Intell 2019;1(4):191–198.

23.

, Zhou

, Liu

, et al. Deep fusion clustering network. In Proceedings of the AAAI Conference on Artificial Intelligence 2021;35(11):9978–9987.

24.

Xia

, Pan

, Du

, et al. Robust multi-view spectral clustering via low-rank and sparse decomposition. In Proceedings of the AAAI Conference on Artificial Intelligence 2014;28(1).

25.

Xie

, Girshick

, Farhadi

. Unsupervised deep embedding for clustering analysis. In International conference on machine learning, 478–487. PMLR, 2016.

26.

Yang

, Fu

, Sidiropoulos

, et al. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In international conference on machine learning, 3861–3870. PMLR, 2017.

27.

, Chen

, Gao

, et al. Unsupervised gene-cell collective representation learning with optimal transport. In Proceedings of the AAAI Conference on Artificial Intelligence 2024;38(1):356–364.

28.

, Lu

, Wang

, et al. Zinb-based graph embedding autoencoder for single-cell rna-seq interpretations. In. AAAI 2022;36(4):4671–4679.

Methods				AVG
$L_{re}$	$L_{graph}$	$L_{clu}$	$L_{ZINB}$	NMI	ARI
✓	✗	✗	✗	78.77	77.21
✗	✓	✗	✗	77.87	73.75
✗	✗	✓	✗	76.57	71.27
✗	✗	✗	✓	75.61	71.43
✓	✓	✗	✗	76.44	69.73
✓	✓	✓	✗	78.34	74.99
✓	✓	✓	✓	84.53	89.32

Deep Structure-Enhanced Cell Clustering Model for Single-Cell RNA Sequencing Data

Abstract

Keywords

1. INTRODUCTION

2.1. Deep cell clustering methods based on internal cell representations

2.2. Deep cell clustering methods based on external structural representations

3. PROPOSED MODEL

4.1. Experimental datasets

Table 1. The Statistics of the Single-Cell RNA Sequencing -Seq Datasets Dataset Cell Gene Class Reference Muraro 2122 19046 9 (Muraro et al., 2016) QS Lung 1676 23341 11 (Schaum et al., 2018) QS Diaphragm 870 23341 5 (Schaum et al., 2018) QS Trachea 1350 19992 4 (Schaum et al., (2018)

4.3. Evaluation metrics and experimental setup

4.4. Analysis of clustering results

4.5. 1 Architecture ablation

Table 3. Clustering Results of Different Loss Function Methods AVG L re L graph L clu L ZINB NMI ARI ✓ ✗ ✗ ✗ 78.77 77.21 ✗ ✓ ✗ ✗ 77.87 73.75 ✗ ✗ ✓ ✗ 76.57 71.27 ✗ ✗ ✗ ✓ 75.61 71.43 ✓ ✓ ✗ ✗ 76.44 69.73 ✓ ✓ ✓ ✗ 78.34 74.99 ✓ ✓ ✓ ✓ 84.53 89.32

Table 5. Summary of Benchmark Datasets Dataset Cells Gene Class Donors Reference PBMC 6857 20,387 10 — (Min et al., 2024) Baron (human) 8569 13,808 14 4 (Liu et al., 2024) Baron (mouse) 1886 12,629 13 2 (Liu et al., 2024)

AUTHORS’ CONTRIBUTIONS

Footnotes

ACKNOWLEDGMENTS

DISCLOSURE STATEMENT

FUNDING INFORMATION

References

Table 1.
The Statistics of the Single-Cell RNA Sequencing -Seq Datasets

Dataset Cell Gene Class Reference

Muraro 2122 19046 9 (Muraro et al., 2016)

QS Lung 1676 23341 11 (Schaum et al., 2018)

QS Diaphragm 870 23341 5 (Schaum et al., 2018)

QS Trachea 1350 19992 4 (Schaum et al., (2018)

Table 5.
Summary of Benchmark Datasets

Dataset Cells Gene Class Donors Reference

PBMC 6857 20,387 10 — (Min et al., 2024)

Baron (human) 8569 13,808 14 4 (Liu et al., 2024)

Baron (mouse) 1886 12,629 13 2 (Liu et al., 2024)