FDS-CAP: Modeling Fragmented Disease Subgraphs with Component-Level Attention for Comorbidity Prediction

Abstract

Understanding comorbidity between human diseases is essential for uncovering shared pathophysiological mechanisms and improving diagnostic and therapeutic strategies. Although prior studies have investigated genetic and network-based disease associations, they often overlook the fragmented nature of disease modules within the human interactome. To address this limitation, we introduce Fragmented Disease Subgraphs with Component-Level Attention for Comorbidity Prediction (FDS-CAP), a novel graph-based deep learning framework. FDS-CAP first embeds fragmented disease subgraphs using Subgraph Neural Networks (SUBGNN) with component-level attention, then applies a variational comorbidity predictor built upon a Variational Graph Auto-Encoder that is used to predict comorbid disease associations within the Human Disease Network. SUBGNN encodes disease subgraphs by propagating information at the connected component level across three property-aware channels—capturing positional, neighborhood, and structural roles—and integrates a component-level attention mechanism that weighs each connected component based on its significance to the overall subgraph representation. A core contribution of our method is the attention-based aggregation of connected component embeddings, enabling more accurate and expressive disease representations that reflect the biological complexity in fragmented disease subgraphs for improved comorbidity prediction. FDS-CAP achieves state-of-the-art performance for comorbidity prediction on a benchmark dataset, with an AUROC of 0.966. We further illustrate its biological interpretability through a single representative case study on glioma, showing that attention-weighted subgraph components capture meaningful patterns associated with disease mechanisms.

Keywords

component-level attention, disease comorbidity, fragmented subgraphs, GNNs, human interactome, and HDNs

1. INTRODUCTION

Comorbidity, defined as the co-occurrence of two or more diseases in an individual, is a growing challenge in modern health care, particularly among aging populations. In the United States, approximately 79% of adults aged 60 and older live with multiple chronic conditions (National Council on Aging, 2024), which complicates clinical management, increases health care costs, and worsens patient outcomes. Understanding the molecular mechanisms underlying comorbid diseases is therefore crucial to advancing disease diagnosis and treatment. Recent advances in high-throughput molecular profiling have enabled researchers to investigate disease relationships from a systems biology perspective. One widely used approach involves the construction of Human Disease Networks (HDNs), where diseases are represented as nodes and edges capture shared molecular or phenotypic features (Hidalgo et al., 2009; Barabási et al., 2011), including gene expression profiles (Yang et al., 2016), symptom co-occurrence (Zhou et al., 2014), and health care cost patterns (Ma et al., 2020). Among these, gene-centric HDNs, which connect diseases based on shared or functionally related genes (Goh et al., 2007), provide a biologically grounded framework for mapping disease-associated genes onto the human interactome, thereby facilitating the exploration of molecular mechanisms underlying disease associations. Within the interactome, diseases are often conceptualized as modules of interacting proteins that collectively represent the molecular footprint of a pathological condition (Barabási et al., 2011; Menche et al., 2015). Network-based methods have leveraged this module-based view to identify disease modules (Ghiassian et al., 2015), infer novel disease-gene associations (Wu et al., 2008), and study disease similarity and comorbidity based on topological proximity or structural coherence (Menche et al., 2015; Akram and Liao, 2019; Ni et al., 2020; Qin and Liao, 2025). 
However, a key limitation of many network-based approaches for disease comorbidity analysis is the implicit assumption that genes associated with each disease form a dense, contiguous module within the human interactome. While Menche et al. (2015) acknowledged that disease modules may be fragmented across multiple components due to the incompleteness of current molecular interaction data, their method relies on partial connectivity and module overlap, which may limit its ability to fully capture the complexity of dispersed disease mechanisms relevant to disease comorbidity. Accordingly, we analyze a curated human interactome and benchmark disease-gene association dataset constructed by Menche et al. (2015) and subsequently used in comorbidity prediction models, including Weighted Geometric Embedding (WGE) (Akram and Liao, 2019) and Biologically Supervised Embedding (BSE) (Qin and Liao, 2025). This analysis reveals pervasive fragmentation within each disease subgraph, as mapping disease-associated genes onto the interactome typically yields a subgraph with multiple connected components rather than a single cohesive disease module. As shown in Figure 1, 27.1% of disease subgraphs contain more than 50 connected components (CCs), and notably, in over 92% of cases, fewer than half of the disease-associated genes reside in the largest connected component (LCC) of each subgraph. Moreover, most connected components within disease subgraphs are small and structurally isolated, reflecting the biological diversity of disease mechanisms. These observations underscore a critical modeling limitation, wherein existing approaches often overlook or oversimplify the fragmented nature of disease subgraphs, potentially missing meaningful biological signals encoded within the topology of their connected components. 
We posit that each connected component within a disease subgraph may hold distinct topological and functional significance, reflecting the inherent complexity and dispersed nature of disease-related perturbations. Therefore, to represent such a fragmented disease subgraph accurately, modeling methods must account for intra-subgraph diversity by incorporating variations in positional context, neighborhood relationships, and the connectivity of its connected components within the interactome, while also considering the distinct contribution of each connected component to the overall disease subgraph representation that supports accurate comorbidity prediction. To address these modeling challenges, we propose FDS-CAP, a novel two-stage graph-based deep learning framework for modeling Fragmented Disease Subgraphs with Component-level Attention for comorbidity Prediction (Fig. 2). The first stage models fragmented disease subgraphs using Subgraph Neural Networks (SUBGNN) with a component-level attention mechanism to generate topologically informative disease embeddings (Alsentzer et al., 2020). The second stage, based on a Variational Graph Auto-Encoder (VGAE) (Kipf and Welling, 2016), leverages these embeddings to predict comorbid associations between diseases. Our main contributions are as follows:

FIG. 1.

Fragmentation characteristics of disease modules. Each panel summarizes fragmentation features of 299 disease modules projected onto the human interactome. The left panel illustrates the number of connected components (CCs) per module, capturing overall fragmentation. The middle panel reports the average size of connected components within each module. Lower values indicate more severe fragmentation into smaller components, while higher values suggest that modules, though potentially fragmented, retain larger substructures. The right panel depicts the proportion of genes in the largest connected component (LCC) of each module, reflecting the degree of internal cohesion.

FIG. 2.

Overview of the FDS-CAP framework. (a) Fragmented disease subgraph modeling with component-level attention: disease modules are represented as subgraphs within the human interactome and encoded using Subgraph Neural Networks (SUBGNN). Each subgraph $S_{i}$ is decomposed into multiple connected components $C C_{i, j}$ , each embedded via message passing across multiple property-aware channels (position $A_{P}$ , neighborhood $A_{N}$ , and structure $A_{S}$ ). The resulting component embeddings $h_{i, j}$ are then aggregated using learnable attention weights $α_{i, j}$ to form a unified disease embedding $z_{S_{i}}$ . The Human Disease Network (HDN) adjacency matrix $A_{HDN}$ is constructed using clinical Relative Risk data, where each edge $e_{S_{i}, S_{j}} \in \{0, 1\}$ denotes a statistically significant comorbid association. (b) Variational comorbidity predictor: the disease embedding matrix $Z_{S}$ and HDN adjacency matrix $A_{HDN}$ are inputs to the Variational Graph Auto-Encoder (VGAE). The encoder computes the mean $μ$ and log-variance $\log σ^{2}$ vectors for all disease nodes in the HDN using Graph Convolutional Networks (GCNs), followed by sampling of latent disease embedding ${\tilde{z}}_{S_{i}}$ . A feedforward neural network (FNN) decoder then predicts the probability ${\hat{y}}_{S_{i}, S_{j}} \in [0, 1]$ of comorbidity for a given disease pair, using the concatenation of their sampled latent embeddings $[{\tilde{z}}_{S_{i}} ‖ {\tilde{z}}_{S_{j}}]$ .

• We model fragmented disease subgraphs in the human interactome and adopt multi-channel message passing to capture positional, neighborhood, and structural features of this fragmentation.
• We introduce a component-level attention mechanism that operates over the connected components within each fragmented disease subgraph, capturing their individual relevance and weighting their contributions to the overall disease subgraph representation.
• We apply VGAE to predict disease comorbidities by modeling latent structure in the HDN.
• We validate the model’s effectiveness through comorbidity prediction experiments and assess its biological interpretability using a single illustrative glioma case study.

2. RELATED WORK

Recent advances in disease similarity prediction have followed two prominent methodological directions: geometric embedding and graph-based deep learning methods. On one hand, geometric embedding approaches aim to project the human interactome into low-dimensional spaces that preserve structural relationships between genes in the interactome. WGE uses Multidimensional Scaling (Shepard, 1980) to embed the human interactome into a low-dimensional geometric space and generate embeddings of genes associated with each disease for comorbidity prediction (Akram and Liao, 2019). Building on this direction, BSE selects biologically informative dimensions of embeddings for disease-associated genes from the interactome’s geometric embedding space to enhance comorbidity prediction performance (Qin and Liao, 2025). On the other hand, graph-based deep learning methods have been applied to model biological relationships among genes and identify disease similarities. Li et al. (2021) combined two gene-related resources, including the human gene network and the Gene Ontology (GO) hierarchy, to learn embeddings of disease-related genes that preserve both first- and second-order proximities using the Large-scale Information Network Embedding (Tang et al., 2015). Similarly, Chen et al. (2022) introduced CoGO, a contrastive learning framework that also integrates the human gene network and GO hierarchy. CoGO utilizes Graph Convolutional Networks (GCNs) and Relational GCNs to embed genes and their annotated GO terms into a shared embedding space, which is then leveraged to infer disease similarities.

Although geometric embedding methods do not explicitly assume that disease subgraphs are fully connected, they collapse each disease into a single vector obtained by aggregating embeddings of its associated genes, thereby ignoring the topological diversity induced by the fragmentation within a disease subgraph, wherein its associated genes form multiple components distributed across the interactome. This limitation is evidenced by our analysis of the same benchmark interactome and disease-gene associations used by WGE and BSE, which shows that most disease subgraphs decompose into many scattered connected components spanning distinct interactome regions. By collapsing this fragmented, multicomponent organization into a single cohesive module, geometric embedding approaches tend to overlook biologically meaningful signals encoded in component-level topology and, as a result, may fail to fully capture the complex and distributed nature of disease-related perturbations in the interactome.

3. MATERIALS AND METHODS

3.1. Materials

3.1.1. Human interactome

The human interactome is constructed as a large-scale protein–protein interactions (PPIs) network that encompasses a variety of experimentally validated molecular interactions (Menche et al., 2015). This network integrates regulatory interactions inferred from transcription factor binding to regulatory elements; literature-curated protein interactions primarily derived from low-throughput experimental studies; and binary interactions obtained from high-throughput yeast two-hybrid experiments and curated literature sources. It also includes metabolic enzyme-coupled interactions, protein complexes, kinase–substrate pairs, and signaling interactions. The union of these diverse sources yields a comprehensive network of 13,460 protein-coding genes, connected by 141,296 unique interactions between their protein products. We model the human interactome as an undirected graph $G_{inter} = (V, E)$ , where V is the set of these genes, and $E \subseteq V \times V$ denotes the interactions between their protein products. For comorbidity analysis, we extract the LCC of $G_{inter}$ , which spans 13,329 nodes and 141,150 edges, covering more than 99% of $G_{inter}$ . Our focus on the LCC ensures that disease subgraphs are embedded within a well-connected region of $G_{inter}$ , while avoiding isolated nodes and disconnected regions that may result from incomplete data in $G_{inter}$ . Table 1 summarizes the key statistics of the human interactome.

Table 1.
Statistics of the Human Interactome and the Human Disease Networks

Network	Nodes (genes or diseases)	Edges (interactions or associations)
Human Interactome (Full)	13,460	141,296
Human Interactome (LCC^a)	13,329	141,150
HDN^b (RR^c threshold = 0)	299	8,874
HDN^b (RR^c threshold = 1)	299	6,269

^a Largest connected component.
^b Human disease network.
^c Relative risk.

3.1.2. Human disease network construction

To investigate disease comorbidity from both clinical and biological perspectives, we construct an HDN that integrates patient-level disease co-occurrence data with molecular data (Fig. 2a). In this network, each node represents a disease, and edges indicate statistically significant comorbid associations between diseases. These associations are quantified using the Relative Risk (RR), defined as the ratio between the observed co-occurrence of two diseases and the co-occurrence expected under the assumption of independence. An RR value greater than a specified threshold indicates that two diseases co-occur more frequently than would be expected by chance, and an edge is assigned between them in the HDN. The RR values are computed using a large-scale Medicare dataset comprising the disease history of approximately 30 million individuals aged 65 and older (Menche et al., 2015). From this dataset, 10,743 disease pairs were identified, of which 8,874 and 6,269 satisfy $RR > 0$ (threshold = 0) and $RR > 1$ (threshold = 1), respectively (Table 1). These disease-pair subsets are used to construct two HDNs, which serve as inputs to the variational comorbidity predictor in our framework. We integrate the comorbid associations in each HDN with molecular data by projecting disease-associated genes onto the human interactome. Each disease is mapped to a curated set of genes known to be implicated in its pathology, obtained from a benchmark dataset (Menche et al., 2015) covering 299 diseases. When these gene sets are overlaid onto $G_{inter}$ described previously, they define disease-specific subgraphs, that is, induced subgraphs of $G_{inter}$ that represent the molecular footprint of each disease. Formally, we define a disease subgraph $S_{i} = (V_{S_{i}}, E_{S_{i}}) \subseteq G_{inter}$ , where $V_{S_{i}} \subseteq V$ represents the disease-associated genes, and $E_{S_{i}} = \{(u, v) \in E \mid u, v \in V_{S_{i}}\}$ captures the interactions between their protein products. 
Each disease subgraph $S_{i}$ consists of multiple connected components, represented as $S_{i}^{(CC)} = \{C C_{i, 1}, C C_{i, 2}, \dots, C C_{i, n_{i}}\}$ , where $n_{i}$ is the number of connected components in the disease subgraph $S_{i}$ .
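
As an illustration of the definitions above, the following minimal Python sketch builds the induced subgraph $S_{i}$ for one disease and splits it into connected components. It is not the authors' pipeline: the edge list, gene names, and helper names (`induced_subgraph`, `connected_components`) are hypothetical, and only the standard library is used.

```python
from collections import deque

def induced_subgraph(edges, disease_genes):
    """Adjacency dict of the subgraph induced by disease_genes (V_Si, E_Si)."""
    genes = set(disease_genes)
    adj = {g: set() for g in genes}
    for u, v in edges:
        # Keep an interactome edge only if both endpoints are disease genes.
        if u in genes and v in genes and u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def connected_components(adj):
    """Connected components via breadth-first search, largest first."""
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), set()
        while queue:
            node = queue.popleft()
            comp.add(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(comp)
    return sorted(components, key=len, reverse=True)

# Toy interactome and disease gene set (hypothetical identifiers).
toy_edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"),
             ("g5", "g6"), ("g6", "g7"), ("g4", "g8"), ("g9", "g3")]
toy_disease_genes = ["g1", "g2", "g4", "g5", "g6", "g9"]

adj = induced_subgraph(toy_edges, toy_disease_genes)
ccs = connected_components(adj)                       # S_i^(CC)
lcc_fraction = len(ccs[0]) / len(toy_disease_genes)   # share of genes in the LCC
```

In this toy case the six disease genes split into four components, two of which are singletons, which mirrors the kind of fragmentation summarized in Figure 1.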

3.2. Methods

We present FDS-CAP, a novel two-stage graph-based deep learning framework for predicting disease comorbidity from fragmented disease subgraphs within the human interactome. In the first stage, SUBGNN is employed to generate disease subgraph representations by capturing the fragmentation of disease subgraphs into multiple connected components and incorporating component-level attention to weigh the relative importance of each connected component within the subgraphs. In the second stage, VGAE is used to model latent relationships between diseases in the HDN and infer their comorbid associations. Each stage is described in detail below.

3.2.1. Disease subgraph representation

Let $S = \{S_{1}, S_{2}, \dots, S_{m}\}$ denote a set of disease subgraphs within the human interactome graph $G_{inter}$ , where each subgraph $S_{i}$ consists of multiple connected components. To model component-level topology within each disease subgraph, we treat each connected component as an independent substructure and leverage SUBGNN, a neural message-passing architecture designed to learn expressive representations of topological patterns within a disease subgraph by capturing the unique characteristics of each connected component (Fig. 2a). Specifically, each connected component $C C_{i, j} \in S_{i}^{(CC)}$ is encoded through multiple property-aware channels, each comprising an internal subchannel (capturing the component’s internal topology) and a border subchannel (capturing its interactions with the surrounding interactome $G_{inter}$ ), as detailed in Table 2.

Table 2.
Overview of Six Topological Properties Used in SUBGNN to Learn Representation of the Disease Subgraph $S_{i}$ Within the Human Interactome Graph $G_{inter}$

Channel	Subchannel	Description
Position	Internal	Distance between $S_{i}$ ’s components
Position	Border	Distance between $S_{i}$ and the rest of $G_{inter}$
Neighborhood	Internal	Defines a set of internal neighbor nodes of $S_{i}$
Neighborhood	Border	Defines a set of border neighbor nodes of $S_{i}$
Structure	Internal	Internal connectivity of each component within $S_{i}$
Structure	Border	Border connectivity of each component within $S_{i}$

For each channel, SUBGNN randomly samples a set of anchor patches $\mathcal{A}_{x}$ from the human interactome graph $G_{inter}$ and propagates messages that quantify the relevance of each anchor to each connected component in a disease subgraph. Each message from an anchor patch to a disease subgraph is defined as ${MSG}_{x}^{A \to S_{i}} = κ_{x} (C C_{i, j}, A_{x}) \cdot a_{x}$ , where $κ_{x} : (C C_{i, j}, A_{x}) \to [0, 1]$ is a channel-specific similarity function computed between each connected component $C C_{i, j}$ in subgraph $S_{i}$ and anchor patch $A_{x} \in \mathcal{A}_{x}$ , and $a_{x} \in R^{d}$ is the embedding of the anchor patch $A_{x}$ . For the position channel, we define similarity based on the shortest path distance between the connected component $C C_{i, j}$ and the position anchor patch $A_{P}$ , as $κ_{P} (C C_{i, j}, A_{P}) = \frac{1}{d_{s p} (C C_{i, j}, A_{P}) + 1}$ , where $d_{s p}$ is the average shortest path length. For the neighborhood channel, $κ_{N} (C C_{i, j}, A_{N})$ is set to 1 for internal neighborhood anchors, while for border neighborhood anchors, it satisfies $κ_{N} (C C_{i, j}, A_{N}) \leq K$ , where K is the maximum number of hops from any node in the component to its boundary neighbors in the anchor. For the structure channel, we define similarity based on the alignment of ordered degree sequences. Specifically, we compute $κ_{S} (C C_{i, j}, A_{S}) = \frac{1}{DTW (d_{C C_{i, j}}, d_{A_{S}}) + 1}$ , where DTW refers to the normalized dynamic time warping distance (Mueen and Keogh, 2016), which quantifies the minimum cumulative alignment cost between the degree sequences $d_{C C_{i, j}}$ and $d_{A_{S}}$ of the connected component and the structure anchor patch, respectively. After message propagation, each connected component within the disease subgraph aggregates the information across all anchor patches per channel and combines the channel-wise vectors into a unified component embedding $h_{i, j} \in R^{d}$ .
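
To make the position and structure similarities concrete, the sketch below evaluates $κ_{P}$ and $κ_{S}$ on toy inputs. It is an illustration under stated assumptions, not the SUBGNN implementation: the DTW here is the textbook dynamic program with an absolute-difference cost and no normalization (the normalization used in the paper is not specified in this section), degree sequences are ordered by sorting, and all function names are hypothetical.

```python
def position_similarity(avg_shortest_path):
    """kappa_P = 1 / (d_sp + 1); equals 1 at distance 0 and decays with distance."""
    return 1.0 / (avg_shortest_path + 1.0)

def dtw_distance(seq_a, seq_b):
    """Unnormalized dynamic time warping with |a - b| as the local cost."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(seq_a[i - 1] - seq_b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def structure_similarity(degrees_component, degrees_anchor):
    """kappa_S = 1 / (DTW + 1) on ordered (here: sorted) degree sequences."""
    a = sorted(degrees_component)
    b = sorted(degrees_anchor)
    return 1.0 / (dtw_distance(a, b) + 1.0)

# Toy values: a component three hops (on average) from a position anchor,
# and a path-like component compared with a ring-like structure anchor.
k_pos = position_similarity(3.0)
k_struct = structure_similarity([3, 1, 2], [2, 2, 2])
```

Both functions return values in $[0, 1]$ for non-negative distances, consistent with the stated range of $κ_{x}$.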

While all connected components $C C_{i, j} \in S_{i}^{(CC)}$ belong to the same disease subgraph $S_{i}$ , their contributions to the overall subgraph representation may vary substantially. Certain components may contain densely connected or disease-relevant genes that play a central role in disease pathogenesis, whereas others may be topologically marginal or biologically less informative. To account for this heterogeneity, we incorporate the additive attention mechanism introduced by Bahdanau et al. (2015) into the SUBGNN-based disease representation model. Specifically, the attention mechanism operates over the set of topology-aware component embeddings $\{h_{i, j}\}_{j}$ generated by SUBGNN for the connected components $C C_{i, j} \in S_{i}^{(CC)}$ . It assigns a learnable relevance weight $α_{i, j}$ to each connected component based on its embedding $h_{i, j}$ , defined as follows:

$$α_{i, j} = \frac{\exp (v^{⊤} \tanh (W y + U h_{i, j}))}{\sum_{k = 1}^{n_{i}} \exp (v^{⊤} \tanh (W y + U h_{i, k}))} \tag{1}$$

where $v \in R^{d \times 1}$ is a scoring vector, $y \in R^{d}$ is a query vector, $W, U \in R^{d \times d}$ are transformation matrices, and tanh denotes the hyperbolic tangent activation function. All attention parameters are trainable, shared across disease subgraphs, and optimized jointly with the component embeddings as part of the SUBGNN-based disease representation objective, enabling consistent comparison of connected component relevance. Consequently, this design allows the model to upweight connected components whose embeddings encode the most informative positional, neighborhood, and structural signals within fragmented disease subgraphs in the human interactome, thereby enhancing the expressiveness of the learned disease representations.

The final disease subgraph embedding $z_{S_{i}} \in R^{d}$ is computed as the attention-weighted sum of all connected component embeddings within the subgraph:

$$z_{S_{i}} = \sum_{j = 1}^{n_{i}} α_{i, j} \cdot h_{i, j} \tag{2}$$
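
Equations (1) and (2) can be sketched in a few lines of plain Python. This is a minimal illustration with hand-set toy parameters, not the trained model: in FDS-CAP the parameters $v$, $y$, $W$, and $U$ are learned, whereas the values below, and the helper names `matvec` and `attention_pool`, are hypothetical.

```python
import math

def matvec(M, x):
    """Matrix-vector product for lists of lists."""
    return [sum(m * xk for m, xk in zip(row, x)) for row in M]

def attention_pool(H, v, W, U, y):
    """Additive attention over component embeddings H = [h_{i,1}, ..., h_{i,n_i}].

    Returns (alpha, z): the softmax weights of Eq. (1) and the
    attention-weighted sum of Eq. (2).
    """
    Wy = matvec(W, y)
    scores = []
    for h in H:
        Uh = matvec(U, h)
        hidden = [math.tanh(a + b) for a, b in zip(Wy, Uh)]
        scores.append(sum(vk * tk for vk, tk in zip(v, hidden)))
    top = max(scores)  # subtracting the max leaves the softmax unchanged but avoids overflow
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    d = len(H[0])
    z = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(d)]
    return alpha, z

# Toy example with d = 2 and three connected components.
identity = [[1.0, 0.0], [0.0, 1.0]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, z = attention_pool(H, v=[1.0, -1.0], W=identity, U=identity, y=[0.0, 0.0])
```

Because the weights are normalized within each disease subgraph, they sum to one over its components, so $z_{S_{i}}$ is a convex combination of the component embeddings.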

Although the attention mechanism is learned jointly as part of the SUBGNN-based disease representation objective, it is not directly constrained by component-level supervision, i.e., there are no labels specifying which connected components should receive higher weights. Nevertheless, its relevance is supported by two complementary lines of evidence. From a predictive perspective, disease embeddings aggregated using component-level attention yield substantially improved comorbidity prediction performance compared with state-of-the-art geometric embedding methods such as WGE and BSE under identical data and evaluation settings (Tables 3 and 4), suggesting that the attention mechanism prioritizes the most informative connected components for disease representation and thereby improves downstream comorbidity prediction. From a biological interpretability perspective, the glioma case study in Section 5.2 shows that connected components assigned higher attention weights tend to be more biologically informative and are enriched for genes with known or emerging relevance to glioma biology, such as GSDMC, CDKN2B-AS1, RTEL1, and IDH1. Together, these findings indicate that the learned attention weights highlight biologically meaningful substructures within fragmented disease subgraphs.

Table 3.

Comparison of Average Model Performance Between the Proposed FDS-CAP and State-of-the-Art Models under the RR Threshold = 0 Setting

Evaluation metric	FDS-CAP (Ours)	WGE^a	BSE^b	p-value	std^c
AUROC^d	0.966	0.90	0.827	5.22E–11	0.0116
Accuracy	0.919	0.90	0.827	3.66E–09	0.0124
F1	0.921	0.89	0.843	8.72E–09	0.0117
AP^e	0.957	—	0.753	1.59E–11	0.0150

^a Weighted geometric embedding.
^b Biologically supervised embedding.
^c Standard deviation.
^d Area under the receiver operating characteristic curve.
^e Average precision.

Bold values indicate the best (highest) performance among the compared models for each evaluation metric.

3.2.2. Comorbidity prediction

The learned disease embeddings produced by SUBGNN in the first stage of our framework are represented by a matrix $Z_{S} \in R^{m \times d}$ , where each row $z_{S_{i}}$ corresponds to the attention-aggregated embedding of disease subgraph $S_{i}$ . These embeddings are passed into the variational comorbidity predictor, modeled after VGAE (Fig. 2b), to infer disease comorbidity. The comorbidity predictor captures latent relationships between diseases by reconstructing the HDN adjacency matrix $A_{HDN}$ , where each entry indicates the presence or absence of a statistically significant comorbidity association between two diseases. Following the VGAE model, the predictor’s encoder computes a mean matrix $μ = {GCN}_{μ} (Z_{S}, A_{HDN})$ and a log-variance matrix $\log σ^{2} = {GCN}_{σ} (Z_{S}, A_{HDN})$ . The graph convolutional networks ${GCN}_{μ}$ and ${GCN}_{σ}$ are implemented as two-layer GCNs, defined in the form $GCN (Z_{S}, A_{HDN}) = {\tilde{A}}_{HDN} \cdot ReLU ({\tilde{A}}_{HDN} Z_{S} W_{0}) W_{1}$ , where ${\tilde{A}}_{HDN} = D^{- 1 / 2} A_{HDN} D^{- 1 / 2}$ is the symmetrically normalized adjacency matrix. The first-layer weight matrix $W_{0}$ is shared across both networks, while the second-layer weight matrices $W_{1}$ are distinct. The rectified linear unit function is defined as $ReLU (x) = \max (0, x)$ .
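
The encoder computation can be illustrated with a small, dependency-free sketch of the symmetric normalization and the two-layer GCN form given above. The three-node adjacency matrix, the weights, and the helper names are hypothetical toy choices; a practical implementation would use a tensor library and learned weights.

```python
def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; isolated nodes get zero rows."""
    n = len(A)
    deg = [sum(row) for row in A]
    inv_sqrt = [d ** -0.5 if d > 0 else 0.0 for d in deg]
    return [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]

def matmul(X, Y):
    """Dense matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def relu(X):
    return [[max(0.0, value) for value in row] for row in X]

def gcn_two_layer(A_norm, Z, W0, W1):
    """GCN(Z, A) = A_norm . ReLU(A_norm Z W0) W1."""
    hidden = relu(matmul(matmul(A_norm, Z), W0))
    return matmul(matmul(A_norm, hidden), W1)

# Toy HDN: three diseases on a path (0 - 1 - 2), one-hot input embeddings.
A_hdn = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
Z_s = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W0 = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]   # first layer, shared by both heads
W1_mu = [[1.0], [1.0]]                      # second layer of the mean head
W1_sigma = [[0.5], [0.5]]                   # second layer of the log-variance head

A_norm = normalize_adjacency(A_hdn)
mu = gcn_two_layer(A_norm, Z_s, W0, W1_mu)
log_var = gcn_two_layer(A_norm, Z_s, W0, W1_sigma)
```

Sharing `W0` while keeping `W1_mu` and `W1_sigma` separate reflects the weight-sharing scheme described in the text.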

We model each disease latent variable ${\tilde{z}}_{S_{i}}$ as drawn from a multivariate Gaussian distribution, ${\tilde{z}}_{S_{i}} \sim N (μ_{S_{i}}, diag (σ_{S_{i}}^{2}))$ . To enable gradient-based optimization during training, we apply the reparameterization trick (Kingma and Welling, 2014) to sample a latent disease embedding as ${\tilde{z}}_{S_{i}} = μ_{S_{i}} + ϵ_{S_{i}} ⊙ σ_{S_{i}}$ , where $ϵ_{S_{i}} \sim N (0, I)$ , and $σ_{S_{i}} = \exp (0.5 \cdot \log σ_{S_{i}}^{2})$ .
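
The reparameterization step can be written directly from the expressions above. The following is a minimal sketch for a single disease node; the function name and the toy mean and log-variance values are hypothetical.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + eps * sigma with eps ~ N(0, I) and sigma = exp(0.5 * log_var)."""
    sigma = [math.exp(0.5 * lv) for lv in log_var]
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + e * s for m, e, s in zip(mu, eps, sigma)]

# Toy latent parameters for one disease (d = 2), with a seeded generator.
rng = random.Random(0)
z_tilde = reparameterize([0.5, -1.0], [0.0, -2.0], rng)
```

Because the randomness enters only through `eps`, the sample is a deterministic and differentiable function of the mean and log-variance, which is what allows gradients to reach the encoder.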

The decoder of the predictor is implemented as a feedforward neural network (FNN) that takes the concatenation of sampled latent embeddings of a disease pair, $[{\tilde{z}}_{S_{i}} ‖ {\tilde{z}}_{S_{j}}]$ , and predicts the probability of comorbidity between the two diseases, formulated as follows:

$${\hat{y}}_{S_{i}, S_{j}} = FNN ([{\tilde{z}}_{S_{i}} ‖ {\tilde{z}}_{S_{j}}]) \tag{3}$$
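
A decoder of the form in Equation (3) might look like the sketch below. The layer sizes, the single ReLU hidden layer, and the sigmoid output are assumptions made for illustration only; this section does not specify the FNN architecture, and all parameter values are hypothetical.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def fnn_decoder(z_i, z_j, W_h, b_h, w_out, b_out):
    """Score a disease pair from the concatenation [z_i || z_j].

    Assumed architecture: one ReLU hidden layer followed by a sigmoid output.
    """
    x = list(z_i) + list(z_j)  # concatenation of the two latent embeddings
    hidden = [max(0.0, sum(w * xk for w, xk in zip(row, x)) + b)
              for row, b in zip(W_h, b_h)]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return sigmoid(logit)

# Toy parameters: latent dimension 2, hidden width 2.
W_h = [[1.0, 0.0, 0.0, 1.0], [-1.0, 0.0, 0.0, -1.0]]
b_h = [0.0, 0.0]
y_hat = fnn_decoder([1.0, 0.0], [0.0, 1.0], W_h, b_h, w_out=[1.0, 1.0], b_out=-2.0)
```

Note that concatenation is order-sensitive, so a decoder of this kind does not by construction return the same score for $(S_{i}, S_{j})$ and $(S_{j}, S_{i})$.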

4. EXPERIMENTS

4.1. Experimental setup

We conduct comorbidity prediction experiments using benchmark datasets that integrate disease-gene associations, molecular interaction data, and clinically observed disease relationships. To contextualize disease modules within a coherent molecular framework, each disease is embedded as a subgraph within the LCC of the human interactome graph $G_{inter}$ . These molecular embeddings are subsequently combined with clinically observed comorbidity relationships in the HDN, which serves as the foundation for the comorbidity prediction. In this setting, each edge in the HDN represents a clinically validated comorbid association and serves as a positive ground truth label for training and evaluation of the FDS-CAP comorbidity predictor. To ensure class balance, an equal number of negative samples are generated by randomly sampling disease pairs that are not connected in the HDN to represent non-comorbid relationships. Experiments are performed using a 10-fold cross-validation strategy, where disease pairs are partitioned into ten folds. In each fold, 90% of the data is used for training and validation, and the remaining 10% is held out for testing.
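
The balanced sampling and fold assignment described above can be sketched as follows. This is an illustrative assumption about the procedure, not the authors' code: the toy HDN, the seed, the unstratified fold assignment, and the helper names are all hypothetical.

```python
import random
from itertools import combinations

def sample_negatives(num_diseases, positive_pairs, rng):
    """Draw as many unconnected disease pairs as there are HDN edges.

    Raises ValueError if fewer unconnected pairs exist than positives.
    """
    positives = {tuple(sorted(p)) for p in positive_pairs}
    candidates = [p for p in combinations(range(num_diseases), 2)
                  if p not in positives]
    return rng.sample(candidates, len(positives))

def k_fold_indices(num_items, k, rng):
    """Shuffle item indices and deal them into k folds of near-equal size."""
    order = list(range(num_items))
    rng.shuffle(order)
    return [order[f::k] for f in range(k)]

# Toy HDN with 8 diseases and 10 comorbid pairs.
rng = random.Random(0)
toy_positives = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4),
                 (4, 5), (5, 6), (6, 7), (0, 7), (1, 6)]
toy_negatives = sample_negatives(8, toy_positives, rng)
labelled = [(p, 1) for p in toy_positives] + [(p, 0) for p in toy_negatives]
folds = k_fold_indices(len(labelled), 10, rng)
# Fold f is the test set; the remaining nine folds are used for training and validation.
```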

4.2. Computational complexity and runtime analysis

The first stage uses SUBGNN to encode each disease subgraph by performing message passing over its connected components and a fixed number of randomly sampled anchor patches. As a result, runtime scales approximately linearly with the number of disease subgraphs and their sizes. Following prior SUBGNN work, anchor–subgraph similarities can be precomputed, which reduces both memory usage and runtime. With this precomputation, training SUBGNN on our interactome requires 24 s per epoch on a single RTX 4070 GPU (12 GB VRAM). In our implementation, SUBGNN is executed once to generate fixed embeddings for all disease subgraphs, which are then fed into the second-stage predictor. The variational comorbidity predictor operates on the HDN at the disease level; therefore, its computational cost depends on the number of diseases and comorbidity links rather than the interactome size. On the same RTX 4070 GPU, full 10-fold cross-validation takes 160 s for RR threshold = 0 and 127 s for RR threshold = 1, including training, validation, and testing.

4.3. Evaluation metrics and comparison

We evaluate the performance of FDS-CAP for comorbidity prediction using standard binary classification metrics, including area under the receiver operating characteristic curve (AUROC), accuracy, F1-score, and average precision (AP).
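
For reference, the four metrics can be computed from labels and predicted scores as in the sketch below. The definitions are the standard ones, but the tie handling, the 0.5 decision threshold, and the toy labels and scores are assumptions; library implementations may treat tied scores differently.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of precision@k over the ranks k at which a positive is retrieved."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits  # assumes at least one positive label

def accuracy_and_f1(labels, scores, threshold=0.5):
    """Accuracy and F1 after thresholding scores (threshold is an assumed default)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1

# Toy predictions for six disease pairs.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auroc = auroc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
toy_acc, toy_f1 = accuracy_and_f1(toy_labels, toy_scores)
```

AUROC and AP depend only on the ranking of the scores, whereas accuracy and F1 depend on the chosen threshold.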

We compare FDS-CAP against WGE and BSE, two state-of-the-art geometric embedding methods evaluated on the same gene-centric human interactome, disease-gene associations, and comorbid disease pairs used in this study. Both methods first obtain gene embeddings by projecting the human interactome into a low-dimensional geometric space and then construct disease embeddings by aggregating the embeddings of disease-associated genes. However, these methods do not explicitly model the fragmentation of disease subgraphs into multiple connected components. Instead, they treat disease-associated genes as belonging to a single cohesive module and thus ignore component-level topology. WGE and BSE are therefore the most directly comparable methods for isolating the impact of component-level modeling addressed by FDS-CAP. To ensure consistency in evaluation, we reproduce BSE using the official code released by the authors and evaluate it on the same 10-fold cross-validation splits used in FDS-CAP. For WGE, we report the results published in the original study.

5. RESULTS

5.1. Performance analysis on comorbidity prediction

The averaged performance of FDS-CAP across 10 test folds on the comorbidity prediction task using RR thresholds of 0 and 1 is summarized in Tables 3 and 4. While RR threshold = 1 restricts the HDN to stronger, clinically confident disease comorbid associations, RR threshold = 0 includes a broader set of comorbid pairs, leading to increased edge density in the HDN. This enhanced connectivity supports richer information propagation between diseases in the HDN, enabling the comorbidity predictor to better capture latent comorbid relationships. Therefore, under the RR threshold = 0 setting, FDS-CAP achieves its highest performance, reflecting its ability to exploit richer network structure and capture complex comorbidity patterns. Although performance decreases slightly under the more stringent RR threshold = 1 setting due to reduced connectivity, the model still demonstrates strong and robust predictive performance across all evaluation metrics. Figures 3 and 4 present the receiver operating characteristic (ROC) and precision-recall curves, along with their corresponding AUROC and area under the precision-recall curve values, illustrating the model’s consistent performance across test folds under both RR threshold settings.

FIG. 3.

Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves of the FDS-CAP model under the Relative Risk (RR) threshold = 0 setting. Each curve represents the test performance of an individual fold from 10-fold cross-validation. The left panel depicts ROC curves with the Area Under the ROC Curve (AUROC) values, along with the average performance across all folds. The right panel presents PR curves with the Area Under the PR Curve (AUPRC) values, along with the average performance.

FIG. 4.

Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves of the FDS-CAP model under the Relative Risk (RR) threshold = 1 setting. Each curve represents the test performance of an individual fold from 10-fold cross-validation. The left panel depicts ROC curves with the Area Under the ROC Curve (AUROC) values, along with the average performance across all folds. The right panel presents PR curves with the Area Under the PR Curve (AUPRC) values, along with the average performance.
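For readers who wish to reproduce per-fold summaries of this kind, the sketch below computes AUROC and average precision from predicted scores in pure Python and averages them across folds. The labels and scores are hypothetical toy values, not outputs of FDS-CAP, and average precision is used here as a step-wise estimate of the area under the PR curve.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


# Two hypothetical test folds: (labels, predicted scores).
toy_folds = [
    ([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]),
    ([1, 0, 1, 0], [0.95, 0.4, 0.85, 0.1]),
]

fold_metrics = [(auroc(y, s), average_precision(y, s)) for y, s in toy_folds]
mean_auroc = sum(m[0] for m in fold_metrics) / len(fold_metrics)
mean_ap = sum(m[1] for m in fold_metrics) / len(fold_metrics)
print(fold_metrics, mean_auroc, mean_ap)
```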

In the comparative evaluation, FDS-CAP consistently outperforms state-of-the-art geometric embedding methods for comorbidity prediction across all core metrics. Gains in AUROC, accuracy, and F1-score are reported relative to the average performance of WGE and BSE, whereas gains in AP are reported relative to BSE alone because no AP value is available for WGE. Under the RR threshold = 0 setting (Table 3), FDS-CAP achieves an AUROC of 0.966 and an AP of 0.957, corresponding to relative improvements of 11.8% in AUROC and 27.0% in AP; accuracy and F1-score increase by 6.4% and 6.2%, respectively. Under the stricter RR threshold = 1 setting (Table 4), FDS-CAP maintains strong predictive performance, achieving an AUROC of 0.942 and an AP of 0.920, which represent relative gains of 18.1% in AUROC and 20.5% in AP; accuracy and F1-score improve significantly, by 16.0% and 16.3%, respectively.
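As a worked check of the relative gains quoted for the RR threshold = 1 setting, the following sketch recomputes them from the rounded values in Table 4. Because the inputs are rounded to two or three decimals, the results can differ in the last digit from the percentages stated in the text, which were presumably derived from unrounded values. The helper name `relative_improvement` is illustrative.

```python
def relative_improvement(ours, baseline):
    """Relative gain of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Rounded values copied from Table 4; None marks the AP value not reported for WGE.
table4 = {
    "AUROC": {"FDS-CAP": 0.942, "WGE": 0.76, "BSE": 0.834},
    "Accuracy": {"FDS-CAP": 0.890, "WGE": 0.70, "BSE": 0.834},
    "F1": {"FDS-CAP": 0.894, "WGE": 0.69, "BSE": 0.847},
    "AP": {"FDS-CAP": 0.920, "WGE": None, "BSE": 0.763},
}

gains = {}
for metric, row in table4.items():
    # Average over the baselines that report this metric (BSE only for AP).
    baselines = [row[m] for m in ("WGE", "BSE") if row[m] is not None]
    baseline = sum(baselines) / len(baselines)
    gains[metric] = relative_improvement(row["FDS-CAP"], baseline)

for metric, gain in gains.items():
    print(f"{metric}: {gain:.1f}%")
```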

Table 4.

Comparison of Average Model Performance Between the Proposed FDS-CAP and State-of-the-Art Models under the RR Threshold = 1 Setting

Evaluation metrics	FDS-CAP (Ours)^f	WGE^a	BSE^b	p-value	std^c
AUROC^d	0.942	0.76	0.834	9.08E–13	0.0058
Accuracy	0.890	0.70	0.834	4.56E–08	0.0101
F1	0.894	0.69	0.847	4.54E–07	0.0110
AP^e	0.920	—	0.763	3.07E–12	0.0097

^a Weighted geometric embedding.

^b Biologically supervised embedding.

^c Standard deviation.

^d Area under the receiver operating characteristic curve.

^e Average precision.

^f Bold values indicate the best (highest) performance among the compared models for each evaluation metric.

5.2. Case study: Glioma

In Section 5.1, we evaluate FDS-CAP on comorbidity associations among all 299 diseases and show consistent gains in comorbidity prediction performance. To assess the biological interpretability of FDS-CAP, we present a single representative case study demonstrating that component-level attention over topology-aware connected component embeddings within a fragmented disease subgraph can reveal biologically meaningful substructures. Specifically, we focus on glioma, a group of tumors that arise from glial cells in the brain and spinal cord (National Cancer Institute, 2011), and analyze the ten connected components receiving the highest attention weights in the glioma subgraph embedding. Figure 5 displays these connected components, including representative genes such as GSDMC, CDKN2B-AS1, IDH1, and ERBB2, along with their corresponding attention weights.

FIG. 5.

Top 10 connected components from the glioma subgraph $S_{glioma}$, ranked by attention weights. Each connected component $CC_{glioma,j} \in S_{glioma}^{(CC)}$ is labeled with its representative genes.
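The ranking shown in Figure 5 relies on attention weights assigned to connected component embeddings. The sketch below illustrates the general idea under stated assumptions: the scoring function (a dot product between a hypothetical learned vector `query` and the tanh-transformed embedding), the embedding values, and the component names are all placeholders and do not reproduce the attention formulation or learned parameters of FDS-CAP.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attend(component_embeddings, query):
    """Return per-component attention weights and the weighted-sum embedding."""
    names = list(component_embeddings)
    # Assumed scoring function: dot product of `query` with tanh(embedding).
    scores = [
        sum(q * math.tanh(h) for q, h in zip(query, component_embeddings[name]))
        for name in names
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * component_embeddings[name][d] for w, name in zip(weights, names))
        for d in range(len(query))
    ]
    return dict(zip(names, weights)), pooled


def top_k(weights, k):
    """Components sorted by decreasing attention weight, truncated to k."""
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical two-dimensional component embeddings for one disease subgraph.
toy_components = {"cc_a": [1.0, 0.5], "cc_b": [0.1, -0.2], "cc_c": [-0.5, 0.3]}
toy_query = [1.0, 1.0]

weights, subgraph_embedding = attend(toy_components, toy_query)
ranking = top_k(weights, 2)
print(ranking, subgraph_embedding)
```

Because the weights are normalized with a softmax, they sum to one across the components of a subgraph, which makes them directly comparable within a disease but not across diseases with different numbers of components.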

The highest-weighted connected component contains the gene GSDMC (Gasdermin C). Although GSDMC has not been widely recognized in canonical glioma signaling pathways, recent findings suggest that it may play a critical regulatory role via pyroptosis. In particular, Hou et al. (2020) showed that under hypoxic conditions, programmed death-ligand 1 (PD-L1) and phosphorylated STAT3 can transcriptionally activate GSDMC, shifting tumor cell death from apoptosis to pyroptosis. This regulatory switch has implications for tumor progression, immune evasion, and therapeutic resistance, all of which are central challenges in glioma. The model’s prioritization of GSDMC is therefore consistent with a noncanonical but functionally significant role for this gene in glioma biology. The second-highest weighted connected component includes CDKN2B-AS1, a long noncoding RNA that regulates cell cycle progression and senescence through multiple cell cycle-related pathways (Rathi et al., 2020). In glioma, CDKN2B-AS1 is upregulated and promotes tumor cell proliferation and invasion by regulating the miR-199a-5p/DDR1 axis (Lu et al., 2022). Collectively, these findings suggest that FDS-CAP highlights biologically meaningful substructures by prioritizing connected components that are highly informative within the glioma subgraph yet are not typically emphasized by canonical glioma enrichment analyses.

Several moderately weighted connected components also include genes with established relevance to glioma. For instance, high expression of RTEL1 is associated with telomere elongation and poor prognosis in glioma patients, especially in patients with wild-type telomerase reverse transcriptase. RTEL1 promotes tumorigenesis by increasing glioma cell proliferation, migration, and invasion through activation of the ROS-mediated JNK/ELK1 signaling cascade (Wang et al., 2024). Similarly, IDH1 is a well-characterized glioma driver that disrupts cellular metabolism and induces widespread epigenetic alterations (Ward et al., 2010; Turcan et al., 2012).

Finally, while the lower-weighted connected components among the top 10 contain broadly oncogenic genes such as ERBB2, BRCA2, and BRAF, their limited contributions to the glioma subgraph representation suggest that the model effectively distinguishes glioma-specific mechanisms from more ubiquitous cancer signals. This distinction is critical for accurate comorbidity prediction and disease-specific subgraph characterization. In summary, the glioma case study underscores FDS-CAP’s potential to uncover biologically significant, non-canonical signals from fragmented disease subgraphs via component-level attention over topology-aware representations of their connected components.

6. CONCLUSION

In this study, we introduce FDS-CAP, a novel graph-based deep learning framework that addresses the challenge of modeling fragmented disease subgraphs for comorbidity prediction. FDS-CAP treats each disease subgraph as a collection of connected components and applies component-level representation learning via SUBGNN. The framework incorporates three property-aware channels (position, neighborhood, and structure) to capture the topological characteristics of each connected component, and uses component-level attention to prioritize the components that contribute most to the subgraph representation, generating informative disease embeddings. These embeddings are then passed to the variational comorbidity predictor, which leverages the HDN to infer latent disease relationships and predict comorbid associations among them. Our approach outperforms existing methods under both RR threshold settings and achieves its highest performance (AUROC of 0.966) under the RR threshold = 0 setting, which preserves a denser HDN. While FDS-CAP effectively integrates component-level attention across multiple channels, the relative contribution of each channel to the overall weighting of connected component embeddings remains an open question. In future work, we aim to disentangle and assess the individual impact of each channel on the attention weights and the resulting disease embeddings. Such analysis may offer deeper biological interpretability and help reveal which topological features are most informative for learning these embeddings and, consequently, for predicting disease comorbidity.

CODE AND DATA AVAILABILITY

The source code and data required to reproduce the comorbidity prediction experiments are publicly available at https://github.com/Ashwag-ta/FDS_CAP.

Footnotes

AUTHOR DISCLOSURE STATEMENT

No interests to disclose.

FUNDING INFORMATION

No funding was received for this article.

Supplemental Material

References

1. Akram, Liao. Prediction of comorbid diseases using weighted geometric embedding of human interactome. BMC Med Genomics 2019;12(Suppl 7):161; doi: 10.1186/s12920-019-0605-5

2. Alsentzer, Finlayson, Li, et al. Subgraph neural networks. In: Proceedings of the 34th Conference on Neural Information Processing Systems. Vancouver, Canada, 2020; pp. 8017–8029; doi: 10.48550/arXiv.2006.10538

3. Bahdanau, Cho, Bengio. Neural machine translation by jointly learning to align and translate. In: Proceedings of the International Conference on Learning Representations. San Diego, USA, 2015; doi: 10.48550/arXiv.1409.0473

4. Barabási A-L, Gulbahce, Loscalzo. Network medicine: A network-based approach to human disease. Nat Rev Genet 2011;12(1):56–68; doi: 10.1038/nrg2918

5. Chen, Hu, et al. CoGO: A contrastive learning framework to predict disease similarity based on gene network and ontology structure. Bioinformatics 2022;38(18):4380–4386; doi: 10.1093/bioinformatics/btac520

6. Ghiassian, Menche, Barabási A-L. A DIseAse MOdule Detection (DIAMOnD) algorithm derived from a systematic analysis of connectivity patterns of disease proteins in the human interactome. PLoS Comput Biol 2015;11(4):e1004120; doi: 10.1371/journal.pcbi.1004120

7. Goh K-I, Cusick, Valle, et al. The human disease network. Proc Natl Acad Sci USA 2007;104(21):8685–8690; doi: 10.1073/pnas.0701361104

8. Hidalgo, Blumm, Barabási A-L, et al. A dynamic network approach for the study of human phenotypes. PLoS Comput Biol 2009;5(4):e1000353; doi: 10.1371/journal.pcbi.1000353

9. Hou, Zhao, Xia, et al. PD-L1-mediated gasdermin C expression switches apoptosis to pyroptosis in cancer cells and facilitates tumor necrosis. Nat Cell Biol 2020;22(10):1264–1275; doi: 10.1038/s41556-020-0575-z

10. Kingma, Welling. Auto-encoding variational bayes. In: Proceedings of the International Conference on Learning Representations. Banff, Canada; 2014; doi: 10.48550/arXiv.1312.6114

11. Kipf, Welling. Variational graph auto-encoders. In: NIPS Workshop on Bayesian Deep Learning. NIPS: Barcelona, Spain, 2016; doi: 10.48550/arXiv.1611.07308

12. Li, Keqi, Wang. Evaluating disease similarity based on gene network reconstruction and representation. Bioinformatics 2021;37(20):3579–3587; doi: 10.1093/bioinformatics/btab252

13. Lu, Chen, Wen, et al. LncRNA CDKN2B-AS1 contributes to glioma development by regulating the miR-199a-5p/DDR1 axis. J Gene Med 2022;24(1):e3389; doi: 10.1002/jgm.3389

14. Ma, Li, Shia, et al. Human disease cost network analysis. Stat Med 2020;39(9):1237–1249; doi: 10.1002/sim.8472

15. Menche, Sharma, Kitsak, et al. Uncovering disease-disease relationships through the incomplete interactome. Science 2015;347(6224):1257601; doi: 10.1126/science.1257601

16. Mueen, Keogh. Extracting optimal performance from dynamic time warping. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM: San Francisco, CA; 2016; pp. 2129–2130; doi: 10.1145/2939672.2945383

17. National Cancer Institute. NCI dictionary of cancer terms: Glioma. Bethesda, MD; 2011. Available from: https://www.cancer.gov/publications/dictionaries/cancer-terms/def/glioma [Last accessed: May 10, 2025].

18. National Council on Aging. The top 10 most common chronic conditions in older adults. Washington, DC; 2024. Available from: https://www.ncoa.org/article/the-top-10-most-common-chronic-conditions-in-older-adults [Last accessed: May 10, 2025].

19. Ni, Wang, Zhong, et al. Constructing disease similarity networks based on disease module theory. IEEE/ACM Trans Comput Biol Bioinform 2020;17(3):906–915; doi: 10.1109/TCBB.2018.2817624

20. Qin, Liao. Improving disease comorbidity prediction with biologically supervised graph embedding. In: Computational Advances in Bio and Medical Sciences. (Bansal, et al. eds.) Springer: Cham, Switzerland, 2025; pp. 178–190; doi: 10.1007/978-3-031-82768-6_16

21. Rathi, Danford, Gudiseva, et al. Molecular genetics and functional analysis implicate CDKN2BAS1-CDKN2B involvement in POAG pathogenesis. Cells 2020;9(9):1934; doi: 10.3390/cells9091934

22. Shepard. Multidimensional scaling, tree-fitting, and clustering. Science 1980;210(4468):390–398; doi: 10.1126/science.210.4468.390

23. Tang, Qu, Wang, et al. LINE: Large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee: Florence, Italy, 2015; pp. 1067–1077; doi: 10.1145/2736277.2741093

24. Turcan, Rohle, Goenka, et al. IDH1 mutation is sufficient to establish the glioma hypermethylator phenotype. Nature 2012;483(7390):479–483; doi: 10.1038/nature10866

25. Wang, Ren, Li, et al. High expression of RTEL1 predicates worse progression in gliomas and promotes tumorigenesis through JNK/ELK1 cascade. BMC Cancer 2024;24(1):385; doi: 10.1186/s12885-024-12134-8

26. Ward, Patel, Wise, et al. The common feature of leukemia-associated IDH1 and IDH2 mutations is a neomorphic enzyme activity converting alpha-ketoglutarate to 2-hydroxyglutarate. Cancer Cell 2010;17(3):225–234; doi: 10.1016/j.ccr.2010.01.020

27. Wu, Jiang, Zhang, et al. Network-based global inference of human disease genes. Mol Syst Biol 2008;4(1):189; doi: 10.1038/msb.2008.27

28. Yang, Wu S-J, Yang S-Y, et al. DNetDB: The human disease network database based on dysfunctional regulation mechanism. BMC Syst Biol 2016;10(1):36; doi: 10.1186/s12918-016-0280-5

29. Zhou, Menche, Barabási A-L, et al. Human symptoms–disease network. Nat Commun 2014;5(1):4212; doi: 10.1038/ncomms5212


Channel	Type	Description
Position	Internal	Distance between $S_{i}$’s components
Position	Border	Distance between $S_{i}$ and the rest of $G_{inter}$
Neighborhood	Internal	Defines a set of internal neighbor nodes of $S_{i}$
Neighborhood	Border	Defines a set of border neighbor nodes of $S_{i}$
Structure	Internal	Internal connectivity of each component within $S_{i}$
Structure	Border	Border connectivity of each component within $S_{i}$