MCAMamba: A multimodal method with bidirectional cross-attention and state space model for cancer survival prediction

Abstract

Cancer survival prediction is crucial for clinical decision-making and personalized treatment planning. The joint analysis of pathological images and genomic profiles provides complementary information at both histological and molecular levels, offering a more comprehensive foundation for patient prognosis assessment. However, existing multimodal survival prediction methods face two major challenges: (1) How to efficiently capture global dependencies in high-dimensional, long sequence features while maintaining linear complexity? (2) How to fully preserve and utilize the inherent valuable information of each modality while achieving cross-modal interaction? To address these challenges, we propose MCAMamba, a multimodal method with bidirectional cross-attention and a state space model for cancer survival prediction. This method employs a parallel encoder–decoder architecture, leveraging Mamba’s efficient long sequence modeling capabilities to capture global dependencies and discriminative features within each modality. Meanwhile, a Bidirectional Cross-Attention module is integrated into the framework to achieve semantic alignment and complementary information exchange across modalities, enhancing the prediction of patient survival risk. Experimental results on four public TCGA cancer datasets (BLCA, BRCA, UCEC, and LUAD) demonstrate that MCAMamba significantly outperforms existing methods in predictive performance. The c-index improves by 2.47%-17.9%, validating the superior performance of the method in multimodal cancer survival prediction.

Keywords

whole slide image; multimodal learning; bidirectional cross-attention; Mamba; survival prediction

1. Introduction

Cancer, characterized by its high incidence and mortality rates, has become one of the most serious public health challenges worldwide.¹ Accurate prediction of patient survival is crucial for clinical decision-making, as it enables early risk stratification and supports the development of personalized treatment plans. However, due to the complex nature of cancer, relying solely on single-modality data often fails to comprehensively reflect a patient’s true condition. As cornerstones of modern oncology, pathology and genomics provide unique insights into cancer research at the macroscopic morphological and microscopic molecular levels, respectively. Pathological images, particularly whole-slide images (WSIs) at gigapixel resolution, capture spatial information such as tumor architecture, spatial heterogeneity, and the tumor microenvironment, whereas genomic profiles reveal underlying molecular mechanisms. These two modalities are inherently complementary.^2–4 Therefore, obtaining feature representations that are both discriminative and biologically interpretable, and achieving efficient integration of information between pathological images and genomic profiles, has become a major research focus and challenge in cancer survival prediction.

Due to the extremely high resolution of WSIs, survival prediction tasks commonly employ Multiple Instance Learning (MIL) methods. In MIL, a WSI is first divided into multiple patches, which are encoded into low-dimensional feature representations using pre-trained models.^5–7 These instance features are then aggregated into a bag for downstream tasks. This process formulates WSI feature extraction as a long-sequence modeling problem, in which inter-instance correlations and global contextual information must be captured to learn discriminative features. Although Transformer-based methods^8–10 effectively capture global dependencies in long sequences, their computational complexity grows quadratically with sequence length, creating a major bottleneck in long-sequence analysis. To address this limitation, the selective state space model Mamba¹¹ was introduced as an efficient alternative for long-sequence modeling. Existing research demonstrates that Mamba achieves comparable or superior performance to Transformers across multiple tasks while requiring only half the parameters.^12,13 For example, MamMIL¹⁴ integrates Mamba into the MIL framework for WSI analysis, thereby enabling effective modeling of global instance dependencies with linear computational complexity. These findings provide the theoretical foundation and primary motivation for introducing Mamba into multimodal survival prediction tasks in this work.

In recent years, with advances in multimodal learning methods, an increasing number of studies^15–19 have integrated pathological images with genomic data for cancer survival analysis, significantly improving the accuracy of patient survival prediction. Existing multimodal fusion methods can be broadly categorized into two types. The first category includes direct feature fusion methods, which integrate modalities at the feature level through techniques such as concatenation^20,21 and bilinear pooling.^16,22,23 Although these methods generate joint feature representations, they often overlook potential interactions between modalities, resulting in limited representational capacity of the fused features. The second category comprises cross-modal interaction methods, which introduce attention mechanisms to guide and align information across modalities, thereby capturing latent dependencies between pathological and genomic features. For example, Chen et al.²⁴ proposed the Multimodal Co-Attention Transformer (MCAT) framework, which facilitates interaction between pathological and genomic features through a genome-guided co-attention mechanism. Jaume et al.¹⁷ proposed the SurvPath model, which employs sparse attention to model interactions between genomic pathways and histological patch tokens, thereby enhancing feature complementarity. Although these methods have made notable progress in improving predictive performance, they primarily emphasize shared intermodal features and fail to fully preserve and utilize the valuable information inherent to each modality, leading to the loss of critical intra-modal insights. Therefore, research should focus on effectively modeling cross-modal interactions while preserving informative intra-modal representations, thereby fully leveraging the complementary strengths of multimodal data to improve cancer survival prediction.

Based on the above observations, we propose MCAMamba, a multimodal method with Bidirectional Cross-Attention and a State Space Model for cancer survival prediction. The proposed method is designed to efficiently model intra-modal representations of pathological images and genomic profiles while exploring cross-modal correlations and complementary information. Specifically, MCAMamba adopts a parallel encoder–decoder architecture. Leveraging Mamba’s efficient long sequence modeling capability, the framework captures intra-modal feature representations for the pathological and genomic modalities. Subsequently, a Bidirectional Cross-Attention (BCA) module is embedded into the architecture to explicitly model cross-modal correlations and facilitate bidirectional interaction and semantic alignment between the two modalities. Finally, Self-Attention Pooling (SAP) aggregates global representations of each modality for patient survival risk prediction. The main contributions of this work are summarized as follows:

(1) We propose MCAMamba, a multimodal method based on bidirectional cross-attention and a state space model for cancer survival prediction. This method enables deep interaction and fusion of pathological images and genomic profiles, significantly improving survival prediction accuracy.

(2) Leveraging Mamba’s efficient long sequence modeling capability, we construct a cross-symmetric encoder–decoder architecture to fully capture global dependencies and discriminative features within each modality.

(3) We introduce a Bidirectional Cross-Attention module embedded within the encoder–decoder architecture. This module explicitly models intrinsic correlations between pathological images and genomic data, enabling bidirectional information guidance and efficient transfer of complementary features.

(4) Extensive experiments on four public TCGA cancer datasets (BLCA, BRCA, UCEC, and LUAD) evaluate the effectiveness of our proposed model. The results demonstrate that our model significantly outperforms existing methods in predictive performance.

The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 introduces the basic concepts of the state space model, Mamba, and survival prediction, followed by a detailed description of the components of the MCAMamba framework. Section 4 introduces the cancer datasets, experimental setup, experimental results, and model interpretability. Section 5 provides a discussion of the proposed method. Finally, Section 6 concludes the paper.

2. Related work

2.1. Survival prediction from single modality

Pathological images and genomic profiles are increasingly recognized as crucial prognostic indicators for cancer, demonstrating substantial potential in survival prediction tasks. In pathological image analysis, survival prediction research is primarily conducted within the MIL framework. Lee et al.²⁵ proposed DeepSets, pioneering the integration of set-based learning concepts into pathological image feature modeling. This method enables models to learn global representations from unordered patch inputs, laying the foundation for subsequent MIL studies. Ilse et al.²⁶ developed AttentionMIL, which achieves adaptive aggregation of patch-level features through an attention-weighted mechanism, significantly enhancing the model’s discriminative power and interpretability. Shao et al.⁸ introduced TransMIL, incorporating the Transformer architecture into the MIL framework. By explicitly capturing global dependencies and correlations among patches via a self-attention mechanism, it strengthens the model’s ability to learn long-range dependencies. Yao et al.²⁷ proposed DeepAttnMISL, employing an attention-guided MIL pooling strategy to adaptively weight patch features from WSIs at the patient level for cancer survival analysis, thereby further improving model interpretability. In genomics, features are typically represented as one-dimensional measurements (1×1 vectors). Feature modeling can be achieved using methods such as the Multi-Layer Perceptron (MLP),²⁸ Self-Normalizing Networks (SNN),²⁹ and DeepSurv.³⁰ Among these, DeepSurv serves as a foundational model for deep neural network–based survival prediction. By integrating the Cox proportional hazards model with deep neural networks, it enables the learning of nonlinear risk functions, enhancing the expressive power and flexibility of survival prediction.
Although these single modality methods have achieved substantial progress in feature extraction and risk modeling, they characterize tumor features only from a single dimension. This limitation hinders a comprehensive understanding of the complex mechanisms underlying cancer development and progression, constraining further improvements in predictive performance.

2.2. Survival prediction from multiple modalities

To overcome the limitations of single modality methods, increasing research in recent years has explored the joint modeling of pathological images and genomic profiles. This line of work aims to fully leverage the complementary information within multimodal datasets, thereby enhancing the accuracy of cancer survival prediction and improving model generalization capability. Cao et al.¹⁶ proposed PORPOISE, introducing bilinear pooling into multimodal fusion. By modeling higher-order interactions, it explicitly captures genotype–phenotype associations; however, its fusion strategy remains globally biased, with insufficient emphasis on locally critical regions. To further refine cross-modal alignment precision, Xu et al.³¹ introduced the MOTCat framework, which establishes fine-grained correspondences between pathological and genomic features through optimal transport matching and global structural consistency constraints, yielding more accurate and reliable cross-modal mappings. Liu et al.³² proposed the Mutual-Guided Cross-Modal Transformer (MGCT), which simulates genotype–phenotype interactions within the tumor microenvironment to enhance the consistency and contextual relevance of cross-modal representations. Zhou and Chen¹⁹ proposed the Cross-modal Translation and Alignment (CMTA) framework to explore intrinsic cross-modal correlations and extract latent complementary information. Yang et al.³³ introduced MMsurv, which combines bilinear pooling with a transformer architecture to effectively integrate diverse data types, thereby strengthening the feature expression capability of predictive models. 
Although the aforementioned methods have achieved notable results in multimodal survival analysis, the cross-modal interaction process may still lead to the loss or redundancy of inherent valuable information.^34,35 To address this issue, we propose an improved cross-modal fusion strategy based on the SAMambar,³⁶ aiming to more comprehensively explore potential cross-modal correlations and complementarities, thereby enhancing the accuracy of survival analysis.

2.3. Mamba

In recent years, State Space Models (SSMs) have demonstrated significant advantages in handling dynamic systems and modeling long-range dependencies, gradually gaining widespread application in medical image analysis and multimodal learning. Yang et al.³⁷ proposed the MambaMIL model, which integrates Mamba into a MIL framework. By employing a sequence reordering strategy, it captures long-range dependencies among dispersed instances. This method effectively mitigates overfitting and computational overhead while maintaining linear complexity, thereby enhancing the model’s ability to capture key discriminative features. Dang et al.³⁸ proposed the LoG-VMamba model, which further integrates local and global feature modeling. It outperformed CNN and Transformer-based baselines in various 2D and 3D medical image segmentation tasks, validating Mamba’s potential for modeling long-range dependencies and spatial consistency. For survival prediction, Chen et al.³⁹ proposed SurvMamba, a multi-granularity and multimodal interaction model. This model comprises two modules: Hierarchical Interaction Mamba (HIM) and Interaction Fusion Mamba (IFM). HIM captures correlations among features at different levels of granularity, while IFM facilitates the fusion of cross-modal interactions. This dual-module architecture enhances feature representation at both fine-grained and global levels, thereby significantly improving the accuracy and efficiency of multimodal survival prediction. Song et al.⁴⁰ proposed the DSCASurv framework, which combines the local feature extraction capability of convolutional layers with the long-range dependency modeling ability of Mamba. This method captures intrinsic correlations between pathology and genomics during cross-modal fusion and alignment, further improving multimodal survival prediction performance.
In summary, Mamba provides a novel paradigm for modeling complex medical data through its exceptional long sequence modeling capability and structural flexibility. We construct a parallel encoder-decoder architecture based on Mamba to fully capture global dependencies and discriminative features within modalities, thereby enhancing the expressive power of multimodal representations and improving survival prediction performance.

3. Method

In this section, we systematically introduce the proposed multimodal survival prediction framework, MCAMamba, as shown in Figure 1. Section 3.1 briefly reviews the fundamentals of state space models and their variant, Mamba. Section 3.2 introduces the survival prediction objective. Section 3.3 presents the data processing and feature extraction methods for both modalities. Section 3.4 describes the pathology encoder and genomics encoder. Section 3.5 details the Bidirectional Cross-Attention module. Section 3.6 describes the pathology decoder and genomics decoder. Section 3.7 discusses feature fusion and survival prediction.

Figure 1.

Framework of the MCAMamba method. (1) Segment the WSI into patches and use the pre-trained feature extractor CTransPath to obtain representative pathological features. (2) Perform gene enrichment analysis on genomic profiles to identify biologically enriched pathways and generate pathway-level features. (3) Construct a parallel encoder–decoder architecture, embedding a Bidirectional Cross-Attention module to explore intrinsic cross-modal correlations and transmit latent cross-modal information. Finally, aggregate features within each modality using SAP and feed the fused features into an MLP for final survival prediction.

3.1. Preliminaries

State Space Model. In recent years, state space models have demonstrated significant advantages in modeling long sequences. Their core concept involves describing the evolution of input sequences through dynamic equations of latent variables. SSMs are typically regarded as continuous linear time-invariant systems. They map a one-dimensional input signal $x (t) \in R$ to an output signal $y (t) \in R$ via a latent state $h (t) \in R^{N}$ . This process satisfies the following ordinary differential equations:

$$h'(t) = A h(t) + B x(t) \tag{1}$$

$$y(t) = C h(t) \tag{2}$$

where $A \in R^{N \times N}$ denotes the state matrix, and $B, C \in R^{N}$ represent the input and output projection matrices.

To adapt this model for deep learning systems, the Structured State Space Sequence Model (S4) discretizes the continuous system. Using the Zero Order Hold method and introducing a time step Δ, the continuous parameters A and B are transformed into discrete parameters $\bar{A}$ and $\bar{B}$ :

$$\bar{A} = \exp(\Delta A) \tag{3}$$

$$\bar{B} = (\Delta A)^{-1} (\exp(\Delta A) - I) \cdot \Delta B \tag{4}$$

After discretization, the discrete version with step size Δ can be expressed as a recurrence over time steps t:

$$h_{t} = \bar{A} h_{t-1} + \bar{B} x_{t} \tag{5}$$

$$y_{t} = C h_{t} \tag{6}$$

Finally, the SSM model utilizes a global convolution to compute the output:

$$\bar{K} = (C \bar{B}, C \bar{A} \bar{B}, \dots, C \bar{A}^{L-1} \bar{B}) \tag{7}$$

$$y = x * \bar{K} \tag{8}$$

where L denotes the length of the input sequence, and $\bar{K}$ is the structured convolution kernel.
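The relationship between the discretized recurrence and the global convolution of Eqs. (7) and (8) can be checked numerically. The following is a minimal NumPy sketch, not the implementation used in this work. It assumes a diagonal state matrix A stored as a vector (so the matrix inverse in Eq. (4) becomes an element-wise division), a zero initial state, and hypothetical helper names `discretize_zoh`, `ssm_recurrent`, and `ssm_convolutional`.

```python
import numpy as np

def discretize_zoh(A, B, delta):
    """Zero-order-hold discretization (Eqs. 3-4) for a diagonal A given as a vector."""
    A_bar = np.exp(delta * A)
    # (delta*A)^{-1} (exp(delta*A) - I) * delta*B, element-wise for diagonal A
    B_bar = (A_bar - 1.0) / (delta * A) * (delta * B)
    return A_bar, B_bar

def ssm_recurrent(A_bar, B_bar, C, x):
    """Step-by-step recurrence: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros_like(A_bar)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)

def ssm_convolutional(A_bar, B_bar, C, x):
    """Global convolution with kernel K_k = C A_bar^k B_bar (Eqs. 7-8)."""
    L = len(x)
    K = np.array([np.sum(C * A_bar**k * B_bar) for k in range(L)])
    # Causal convolution: keep only the first L outputs of the full convolution
    return np.convolve(x, K)[:L]

rng = np.random.default_rng(0)
N, L = 4, 16                              # state size and sequence length (illustrative)
A = -np.arange(1, N + 1, dtype=float)     # stable diagonal state matrix (assumption)
B = rng.normal(size=N)
C = rng.normal(size=N)
x = rng.normal(size=L)

A_bar, B_bar = discretize_zoh(A, B, delta=0.1)
y_rec = ssm_recurrent(A_bar, B_bar, C, x)
y_conv = ssm_convolutional(A_bar, B_bar, C, x)
print(np.allclose(y_rec, y_conv))
```

Both routes compute the same output; the recurrent form costs a constant amount of work per step, while the convolutional form can be parallelized over the whole sequence when the parameters are fixed.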

Mamba. Although S4 demonstrates strong performance in sequence modeling, its fixed parameters A, B, and C limit the model’s adaptability to dynamic inputs. To address this limitation, Mamba¹¹ introduces an input-dependent parameter selection mechanism. The parameters B(x), C(x), and Δ(x) are adaptively generated from the input sequence x through linear transformations, as follows:

$$\bar{B} = f_{B}(x_{t}) \tag{9}$$

$$\bar{C} = f_{C}(x_{t}) \tag{10}$$

$$\Delta = \theta_{A}(P + f_{A}(x_{t})) \tag{11}$$

Here, $f_{B}(x_{t})$, $f_{C}(x_{t})$, and $f_{A}(x_{t})$ are linear functions.

This mechanism allows the model to dynamically adjust the state evolution process according to different inputs, thereby more effectively modeling long-range dependency information with linear time complexity. In this study, Mamba is employed to model genomic pathway sequences and pathological patch sequences, providing efficient representations with long-range dependency structures for subsequent multimodal fusion.
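To make the input-dependent parameterization concrete, the sketch below runs a selective scan in plain NumPy. It is an illustration under stated assumptions rather than the Mamba kernel itself: the state matrix is diagonal per channel, $\theta_{A}$ is taken to be a softplus (as in the original Mamba design), the zero-order-hold rule of Eq. (4) is applied at every step, and the names `selective_scan`, `W_B`, `W_C`, `W_delta`, and `p` are hypothetical.

```python
import numpy as np

def softplus(v):
    return np.log1p(np.exp(v))

def selective_scan(x, A, W_B, W_C, W_delta, p):
    """Sequential selective SSM scan.

    x: (L, D) input sequence; A: (D, N) negative diagonal state entries;
    W_B, W_C: (D, N) linear maps producing B_t and C_t; W_delta: (D, D);
    p: (D,) bias inside the step-size nonlinearity.
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))
    y = np.empty((L, D))
    for t in range(L):
        B_t = x[t] @ W_B                              # (N,), input-dependent B
        C_t = x[t] @ W_C                              # (N,), input-dependent C
        delta_t = softplus(p + x[t] @ W_delta)        # (D,), input-dependent step size
        A_bar = np.exp(delta_t[:, None] * A)          # (D, N)
        B_bar = (A_bar - 1.0) / A * B_t[None, :]      # (D, N), ZOH for diagonal A
        h = A_bar * h + B_bar * x[t][:, None]         # state update
        y[t] = h @ C_t                                # (D,), readout
    return y

rng = np.random.default_rng(0)
L, D, N = 12, 3, 4
x = rng.normal(size=(L, D))
A = -np.abs(rng.normal(size=(D, N))) - 0.1            # strictly negative, so no division by zero
W_B = rng.normal(size=(D, N))
W_C = rng.normal(size=(D, N))
W_delta = rng.normal(size=(D, D))
p = np.zeros(D)

y = selective_scan(x, A, W_B, W_C, W_delta, p)
print(y.shape)  # (12, 3)
```

The loop performs one constant-cost update per token, which is where the linear time complexity in sequence length comes from; practical implementations replace the Python loop with a hardware-aware parallel scan.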

3.2. Survival prediction objective

In multimodal survival analysis tasks, each patient’s sample data is represented as a quadruple $Z_{i} = (P_{i}, G_{i}, s_{i}, t_{i})$, where $P_{i} \in R^{N_{H} \times d}$ denotes the feature set of pathological images, $G_{i} \in R^{N_{P} \times d}$ denotes the feature set of genomic profiles, $s_{i} \in \{0, 1\}$ indicates the event status (1 for deceased, 0 for censored), and $t_{i} \in R^{+}$ represents the overall survival time (in months). Assuming T is a continuous random variable representing a patient’s actual survival time, the model aims to estimate the hazard function $f_{hazard}(T = t \mid T \geq t, Z)$ at time t. This function quantifies the instantaneous rate at which an event (e.g., death) occurs at time t, given that the patient has survived up to that time. It is defined as:

$$f_{hazard}(T = t \mid T \geq t, Z) = \lim_{\Delta t \to 0} \frac{P(t \leq T \leq t + \Delta t \mid T \geq t, Z)}{\Delta t} \tag{12}$$

Since accurately predicting a patient’s exact survival time is challenging, we employ discrete-time modeling to estimate the probability of surviving beyond a set of discrete time points. Based on the hazard function, the survival function $f_{sur}(T > t \mid Z)$ can be expressed as:

$$f_{sur}(T > t \mid Z) = \prod_{u = 1}^{t} \left(1 - f_{hazard}(T = u \mid T \geq u, Z)\right) \tag{13}$$
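As a small worked example of the discrete-time formulation, the survival curve is the running product of one minus the per-interval hazards. The sketch below assumes four time intervals and made-up hazard values; the function name `survival_from_hazards` is hypothetical.

```python
import numpy as np

def survival_from_hazards(hazards):
    """Running product of (1 - hazard) over discrete time intervals."""
    hazards = np.asarray(hazards, dtype=float)
    return np.cumprod(1.0 - hazards, axis=-1)

# Illustrative per-interval hazards for one patient (not real data)
hazards = np.array([0.1, 0.2, 0.5, 0.25])
survival = survival_from_hazards(hazards)
print(survival)  # approximately [0.9, 0.72, 0.36, 0.27]
```

Because every factor lies in [0, 1], the resulting curve is non-increasing, as a survival function must be.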

3.3. Data processing and feature extraction

Pathological images. For the pathological modality, we first adopted the CLAM¹⁵ method to segment tissue regions in each whole slide image. At a magnification of 20×, we cropped $N_{H}$ non-overlapping 256×256 pixel patches, denoted as $X_{p} = \{h_{1}, h_{2}, \dots, h_{N_{H}}\} \in R^{N_{H} \times d}$. Subsequently, each patch underwent feature extraction using the pre-trained Swin Transformer network (CTransPath),⁵ yielding a 768-dimensional patch feature vector. These features were then uniformly mapped to a d-dimensional embedding space through a fully connected layer to ensure dimensional consistency and facilitate subsequent modeling.

Each WSI typically contains thousands or even tens of thousands of patches, with dense distributions and highly redundant features. Directly processing all patches would incur extremely high computational costs and increase model instability. To address this issue, we introduce a Patch Clustering Layer (PCL)³⁶ to extract representative key regions from the extensive patch feature set. Specifically, the PCL module maintains K trainable cluster centers $\{c_{1}, c_{2}, \dots, c_{K}\} \in R^{K \times d}$. During each forward pass, it calculates the Euclidean distance between each input patch feature and all cluster centers, selecting the patch with the smallest distance to each center as its representative:

$$\hat{x}_{k} = \arg\min_{x_{i} \in X_{p}} \| x_{i} - c_{k} \|_{2}, \quad k = 1, 2, \dots, K \tag{14}$$

Through the aforementioned process, K representative cluster features are selected from the original large-scale patch collection to construct patch feature sequences for pathological images. The PCL module compresses the original long patch sequence into a representative patch feature sequence of length K, effectively reducing computational complexity for the subsequent pathology encoder and Bidirectional Cross-Attention module. Ultimately, each patient’s pathological features can be formally represented as $X_{p} = \{\hat{x}_{1}, \hat{x}_{2}, \dots, \hat{x}_{K}\} \in R^{K \times d}$, where K denotes the number of retained patches and d represents the feature dimension of each patch.
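The nearest-patch selection of Eq. (14) can be sketched in a few lines of NumPy. This is an illustrative sketch only: in the method the cluster centers are trainable, whereas here they are fixed arrays, and the function name `select_representatives` and the toy sizes are assumptions.

```python
import numpy as np

def select_representatives(patches, centers):
    """For each center, pick the patch feature with the smallest Euclidean distance.

    patches: (N_H, d) patch features; centers: (K, d) cluster centers.
    Returns the (K, d) selected features and their indices into `patches`.
    """
    # Pairwise Euclidean distances, shape (K, N_H); this materializes a
    # (K, N_H, d) array, so very large bags would need chunking.
    diff = centers[:, None, :] - patches[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    idx = dist.argmin(axis=1)
    return patches[idx], idx

rng = np.random.default_rng(0)
patches = rng.normal(size=(20, 8))           # toy bag: 20 patches, d = 8
centers = patches[[3, 7, 1]] + 1e-3          # centers placed next to known patches
reps, idx = select_representatives(patches, centers)
print(idx)  # [3 7 1]
```

Note that two centers may select the same patch; the text does not say whether duplicates are filtered, so this sketch leaves them in place.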

Genomic profiles. Genomic profiles, as vital carriers of individual genetic characteristics, play a pivotal role in disease prognosis prediction. Following previous studies,³⁶ we utilized biological pathway information from the KEGG database to perform gene set enrichment analysis⁴¹ (GSEA) on genomics data, including gene expression (RNA-seq), copy number variations, and mutation status. Specifically, we conducted GSEA separately for each cancer type, selecting pathway sets most significantly associated with differentially expressed genes. Only pathway subsets meeting the threshold of FDR ≤ 0.05 were retained to ensure the biological relevance of the selected pathways.

Subsequently, we employed SNN²⁹ to perform embedding mapping on the filtered pathway features, uniformly projecting pathway features of different dimensions into a low-dimensional space to obtain more discriminative pathway representations. Ultimately, each patient’s genomic features are represented as $X_{g} = \{g_{1}, g_{2}, \dots, g_{N_{P}}\} \in R^{N_{P} \times d}$, where $N_{P}$ denotes the total number of pathways and d represents the embedding dimension of each pathway.

3.4. Pathology encoder and genomics encoder

Given that both pathological images and genomic profiles exhibit characteristics such as high dimensionality and long-range dependencies,^15,32 we leverage Mamba’s efficient long sequence modeling capability to construct parallel encoder–decoder architectures. This design captures global dependencies and discriminative features within each modality. Specifically, the structure of the pathology decoder aligns with that of the genomics encoder, while the genomics decoder mirrors the structure of the pathology encoder.

Pathology Encoder. Patch sequences in pathological images contain not only local morphological features but also cross-regional contextual dependencies and spatial structural relationships.⁸ To effectively capture this information, we introduce the Spatial Reversible Mamba (SR-Mamba) module³⁷ as the core component of the pathology encoder, whose architecture is illustrated in Figure 2. The SR-Mamba module employs a dual-branch parallel modeling mechanism that separately focuses on temporal dependencies and spatial structural relationships within patch sequences. Through a gating mechanism, it dynamically modulates and fuses features, thereby enhancing the representation of key pathological regions such as tumor areas, boundary structures, and the microenvironment.

Figure 2.

Structural diagram of SR-Mamba.

To further elucidate the working mechanism of SR-Mamba, its specific implementation steps are detailed below.

Step 1: Sequence Partitioning. Given a pathological patch feature sequence $X_{p} \in R^{K \times d}$, to enhance local context capture, we partition $X_{p}$ into non-overlapping segments of length R, yielding $N = \lceil K/R \rceil$ subsequences. If the sequence length K is not divisible by R, zero-padding is applied to the end of the sequence so that every segment has length R and the segment ordering is preserved in subsequent operations. For simplicity, the padded sequence is still denoted as $P \in R^{K \times d}$.

Step 2: Dual-Branch Modeling. The partitioned sequences are simultaneously processed by two complementary branches: the temporal modeling branch and the spatial rearrangement branch.

(1) Temporal Modeling Branch. This branch preserves the original sequence order P. It captures long-range dependencies using a causal convolutional layer and a state space model, producing the output representation Y:

$$Y = \mathrm{SSM}(\mathrm{SiLU}(\mathrm{Conv1D}(\mathrm{Linear}(P)))) \tag{15}$$

To achieve dynamic modulation, we generate a gating signal from the input sequence:

$$Z = \mathrm{SiLU}(\mathrm{Linear}(P)) \tag{16}$$

where $Z \in R^{K \times d}$ represents gating weights with the same shape as the input sequence.

Subsequently, the gating signal Z modulates Y element-wise, yielding the discriminative features of the temporal branch:

$$P' = Z \odot Y \tag{17}$$

where ⊙ denotes element-wise multiplication, and $P' \in R^{K \times d}$ represents the gated temporal output.

(2) Spatial Rearrangement Branch. This branch reshapes the sequence $P \in R^{K \times d}$ into a two-dimensional map $X_{2d} \in R^{R \times N \times d}$ and applies a fragment rearrangement strategy to form a new instance feature sequence $P_{r} \in R^{K \times d}$. The rearranged sequence $P_{r}$ is then input into the SSM module to model spatial dependencies, producing the output representation $Y_{r}$:

$$Y_{r} = \mathrm{SSM}(\mathrm{SiLU}(\mathrm{Conv1D}(\mathrm{Linear}(P_{r})))) \tag{18}$$

Consistent with the Temporal Modeling Branch, the gating signal Z is used to weight the reordered output features $Y_{r}$:

$$P_{r}' = Z \odot \psi(Y_{r}) \tag{19}$$

where ψ(⋅) denotes the sequence restoration operation, which restores the reordered sequence to its original arrangement for positional alignment with $P'$, and $P_{r}' \in R^{K \times d}$ represents the gated spatial-consistency features.

Step 3: Feature Fusion. The discriminative instance features $P^{'}$ and ${P_{r}}^{'}$ from both branches are fused through additive integration and a linear transformation to obtain the final pathological modality representation:

$$H_{p} = \mathrm{Linear}(P' + P_{r}') + P \tag{20}$$

The dual-branch architecture of SR-Mamba preserves cross-regional contextual dependencies while maintaining the spatial consistency of local organizational structures. The final pathological feature $H_{p}$ is then fed into the Bidirectional Cross-Attention module, providing more discriminative features for multimodal feature fusion.
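The partitioning, rearrangement, and restoration steps can be illustrated on a toy sequence. The text does not spell out the exact fragment rearrangement, so the sketch below assumes the simplest choice: reshape the padded sequence into N segments of length R, transpose the segment and position axes, and flatten. The helper names `pad_to_multiple`, `reorder`, and `restore` are hypothetical, and `restore` plays the role of ψ(⋅).

```python
import numpy as np

def pad_to_multiple(X, R):
    """Zero-pad a (K, d) sequence at the end so its length is a multiple of R."""
    K, d = X.shape
    N = -(-K // R)                      # ceil(K / R)
    pad = N * R - K
    if pad:
        X = np.concatenate([X, np.zeros((pad, d), dtype=X.dtype)], axis=0)
    return X, N

def reorder(P, R):
    """Assumed rearrangement: view as (N, R, d), swap the first two axes, flatten."""
    length, d = P.shape
    N = length // R
    return P.reshape(N, R, d).transpose(1, 0, 2).reshape(length, d)

def restore(P_r, R):
    """Inverse of `reorder`: puts every token back at its original position."""
    length, d = P_r.shape
    N = length // R
    return P_r.reshape(R, N, d).transpose(1, 0, 2).reshape(length, d)

X = np.arange(10, dtype=float).reshape(10, 1)   # toy sequence: K = 10, d = 1
P, N = pad_to_multiple(X, R=4)                  # padded length 12, N = 3 segments
P_r = reorder(P, R=4)
print(P_r[:, 0].tolist())
# [0.0, 4.0, 8.0, 1.0, 5.0, 9.0, 2.0, 6.0, 0.0, 3.0, 7.0, 0.0]
print(np.array_equal(restore(P_r, R=4), P))     # True
```

After reordering, tokens that were R positions apart become neighbors, so the scan in the spatial branch sees a different adjacency than the temporal branch; restoring the order before gating keeps the two branch outputs position-aligned for the sum in Eq. (20).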

Genomics Encoder. Genomic profiles typically manifest as high-dimensional sequences with complex regulatory relationships and long-range dependencies among pathways.⁴² To effectively capture these relationships, we employ Mamba as the genomics encoder, whose structure is illustrated in Figure 3. Mamba decomposes pathway sequences into convolutional and gated branches: the convolutional branches enhance local feature representation, while the gated branches employ dynamic modulation mechanisms to adaptively model long-range dependencies.

Figure 3.

Structural diagram of Mamba.

The specific process is as follows. Given a pathway feature sequence $X_{g} \in R^{N_{P} \times d}$, the input features are first projected into two channels, x and z, via linear mappings. Channel x undergoes a one-dimensional depthwise convolution and nonlinear activation to enhance local representations:

$$x' = \sigma(\mathrm{Conv1D}(\mathrm{Linear}(X_{g}))) \tag{21}$$

Subsequently, the state space parameters are dynamically generated based on x' and fed into the SSM to model cross-channel long-range dependencies:

$$Y = \mathrm{SSM}(x'; B(x'), C(x'), \Delta(x')) \tag{22}$$

The output feature Y is then weighted by the gate signal z, which is computed from the same input $X_{g}$:

$$z = \sigma(\mathrm{Linear}(X_{g})), \quad G' = z \odot Y \tag{23}$$

where ⊙ denotes element-wise multiplication.

Finally, the genomic modality output representation is obtained through a linear transformation:

$$H_{g} = \mathrm{Linear}(G') \tag{24}$$

This genomics encoder efficiently captures cross-pathway long-range dependencies while preserving discriminative local pathway features. The resulting genomic feature $H_{g}$ serves as input to the subsequent Bidirectional Cross-Attention module.

3.5. Bidirectional cross-attention module

Pathological images and genomic profiles reflect tumor characteristics at the levels of histological morphology and molecular mechanisms, respectively. During cancer progression, morphological alterations in specific regions of pathological images are often accompanied by abnormal expression of related genes.^43,44 Based on this observation, we propose a Bidirectional Cross-Attention module to explicitly model bidirectional dependencies between pathological and genomic features. This module establishes mutually guided attention pathways to achieve dynamic semantic alignment and feature complementarity between the two modalities. It further uncovers intrinsic correlations between pathology and genomics, enabling mutual guidance and bidirectional transmission of cross-modal information.

Specifically, we first concatenate pathological features $H_{p} \in R^{K \times d}$ and genomic features $H_{g} \in R^{N_{P} \times d}$ along the token dimension to construct a unified feature sequence: $X_{c a t} = [H_{g}; H_{p}] \in R^{(N_{P} + K) \times d}$ . This operation provides a shared contextual representation space for the bidirectional cross-attention mechanism, facilitating dynamic cross-modal mapping and effective information alignment between modalities.

Within the BCA, each modality serves as the query vector, while the other modality’s features act as key and value vectors, enabling bidirectional information exchange:

{Q u e r y}_{g} = X_{g}, {K e y - V a l u e}_{p} = X_{p}

(25)

{Q u e r y}_{p} = X_{p}, {K e y - V a l u e}_{g} = X_{g}

(26)

The cross-attention process from pathological modality to genomic modality (Path→Gen) is:

X_{p \to g} = \mathrm{Softmax}\left(\frac{Q_{p} K_{g}^{T}}{\sqrt{d}}\right) V_{g}

(27)

The cross-attention process from the genomic modality to the pathological modality (Gen→Path) is:

X_{g \to p} = \mathrm{Softmax}\left(\frac{Q_{g} K_{p}^{T}}{\sqrt{d}}\right) V_{p}

(28)

where Q, K, and V represent the query, key, and value features obtained via linear projection.

Through this bidirectional cross-attention mechanism, the BCA enables two-way information exchange and complementary data transmission between modalities. This allows the pathological modality to focus spatially on regions associated with key pathways, while the genomic modality acquires more contextually meaningful representations, thereby facilitating deep interactions between pathological and genomic characteristics.
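
The two attention directions in equations (27) and (28) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the learned linear projections that produce Q, K, and V are replaced by identity maps, a single attention head is used, and the helper names (`matmul`, `transpose`, `softmax_rows`, `cross_attention`) are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*a)]

def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cross_attention(queries, keys_values):
    """Softmax(Q K^T / sqrt(d)) V with identity projections (an assumption)."""
    d = len(queries[0])
    scores = matmul(queries, transpose(keys_values))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), keys_values)

# Toy inputs: N_P = 2 pathway tokens, K = 3 patch-cluster tokens, d = 2.
x_g = [[1.0, 0.0], [0.0, 1.0]]
x_p = [[1.0, 1.0], [0.5, 0.0], [0.0, 2.0]]

x_p_to_g = cross_attention(x_p, x_g)  # Eq. (27): pathology queries, genomic keys/values
x_g_to_p = cross_attention(x_g, x_p)  # Eq. (28): genomic queries, pathology keys/values
```

Each output keeps the token count of its query side: `x_p_to_g` has one row per pathology token and `x_g_to_p` has one row per pathway token, and every output row is a convex combination of the value rows of the other modality.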

3.6. Pathology decoder and genomics decoder

To further map the cross-modal features obtained from bidirectional cross-attention into representations consistent with the target modality’s semantic space, we introduce two modality decoders after the bidirectional cross-modal interaction module. This enables bidirectional information flow and semantic complementarity during the decoding phase, thereby enhancing the alignment and fusion of cross-modal features. Specifically, the pathology decoder takes genome-guided pathological features $X_{g \to p}$ as input to generate cross-modal pathological representations:

X_{p}^{'} = \mathrm{Mamba}(X_{g \to p})

(29)

Here, $X_{p}^{'}$ denotes the semantically consistent pathological representation generated under genomic feature guidance. This maps molecular-level information onto the pathological representation space, thereby strengthening cross-modal alignment and highlighting prognosis-relevant pathological features.

Correspondingly, the genomics decoder takes the pathology-guided genomic features $X_{p \to g}$ as input to generate a cross-modal genomic representation:

X_{g}^{'} = \mathrm{SR\text{-}Mamba}(X_{p \to g})

(30)

Here, $X_{g}^{'}$ denotes the representation generated under pathological feature guidance that aligns with genomic semantics. It is used to enhance the modeling of gene pathway representations associated with tissue morphology, thereby improving cross-modal consistency.

In the structural design, the pathology decoder adopts the Mamba architecture consistent with the genomics encoder, while the genomics decoder employs the SR-Mamba architecture aligned with the pathology encoder. This ensures symmetric representational capacity in both Gen→Path and Path→Gen directions, reducing directional bias and enhancing alignment stability. The decoders output $X_{p}^{'}$ and $X_{g}^{'}$, providing more complementary cross-modal features for subsequent self-attention pooling and survival prediction.

3.7. Feature fusion and survival prediction

To further focus on key tissue regions and genomic pathways strongly associated with disease, we feed the decoded pathological features $X_{p}^{'}$ and genomic features $X_{g}^{'}$ into a Self-Attention Pooling (SAP)³⁶ module. SAP introduces a learnable global query vector that adaptively assigns a different weight to each token feature, thereby more accurately highlighting pathological regions or genomic pathways relevant to patient survival. Ultimately, the sequence from each modality is aggregated into a discriminative and semantically complete global representation:

Z_{p} = \mathrm{SelfAttPool}(X_{p}^{'})

(31)

Z_{g} = \mathrm{SelfAttPool}(X_{g}^{'})

(32)

Subsequently, the global features from pathological images and genomic data are concatenated to form a unified multimodal representation: $X = [Z_{p}; Z_{g}] \in R^{1 \times d}$. Finally, a Multi-Layer Perceptron is employed to learn a nonlinear risk function for survival risk modeling.
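
The pooling step in equations (31) and (32) can be illustrated with a short sketch. It assumes a scaled dot-product score between a single global query vector and each token, followed by a softmax; the exact parameterization of the pooling module (for example, additional projections) is not specified here, and the function name `self_attention_pool` is hypothetical.

```python
import math

def self_attention_pool(tokens, query):
    """Aggregate a token sequence into one vector using a global query.

    tokens: list of d-dimensional token features
    query:  d-dimensional vector standing in for the learnable global query
    Returns the pooled vector and the attention weight of each token.
    """
    d = len(query)
    scores = [sum(q * x for q, x in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    mx = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(d)]
    return pooled, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # fixed here; in training this vector would be learned
z, w = self_attention_pool(tokens, query)
```

In this toy case the first and third tokens align with the query and receive equal, larger weights than the second token, so the pooled vector `z` is dominated by them.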

To optimize model training, we adopt the Negative Log-Likelihood loss with censoring, a widely used objective function in survival analysis. This loss maximizes the likelihood of the observed survival outcomes while appropriately handling censored data, thereby improving predictive accuracy. The loss function is defined as follows:

L_{surv} = -c \log(f_{sur}(t \mid X)) - (1 - c) \log(f_{sur}(t - 1 \mid X)) - (1 - c) \log(f_{hazard}(t \mid X))

(33)

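where X represents the fused multimodal features, t is the discrete time interval in which the event or censoring was observed, $f_{hazard}$ and $f_{sur}$ are the predicted hazard and survival functions, and c denotes the censoring status (c = 1 for a censored patient, for whom only the survival term contributes).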
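
For a single patient, equation (33) can be evaluated as in the sketch below. It assumes the common discrete-time setup in which the network outputs one hazard value per time interval and the survival function is the running product of (1 − hazard), with survival equal to 1 before the first interval. The convention c = 1 for censored patients, the small constant `eps`, and the function name `nll_survival_loss` are assumptions made for illustration.

```python
import math

def nll_survival_loss(hazards, t, c, eps=1e-7):
    """Censoring-aware negative log-likelihood for one patient (Eq. 33).

    hazards: predicted hazard for each discrete interval, each in (0, 1)
    t:       index of the interval in which the event or censoring occurred
    c:       1 if the patient is censored, 0 if the event was observed (assumed)
    eps:     small constant guarding against log(0) (assumed)
    """
    surv = [1.0]  # surv[k] holds S(k - 1); S(-1) = 1 by convention
    for h in hazards:
        surv.append(surv[-1] * (1.0 - h))
    s_t = surv[t + 1]   # f_sur(t | X)
    s_prev = surv[t]    # f_sur(t - 1 | X)
    censored_term = -c * math.log(s_t + eps)
    event_term = -(1 - c) * (math.log(s_prev + eps) + math.log(hazards[t] + eps))
    return censored_term + event_term

hazards = [0.5, 0.5]
loss_event = nll_survival_loss(hazards, t=0, c=0)     # event in the first interval
loss_censored = nll_survival_loss(hazards, t=1, c=1)  # censored in the second interval
```

An observed event is scored by the probability of surviving up to the interval and then failing in it, whereas a censored patient is scored only by the probability of surviving through the interval.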

4. Experimental results and analysis

In this section, we conduct extensive experiments on four public TCGA datasets to evaluate the effectiveness of our method. First, we introduce the datasets, evaluation metrics, and implementation details used in our study. Next, we compare our method with several state-of-the-art methods to demonstrate its superiority. Then, we perform ablation studies to evaluate the contribution of each key module in our method. Finally, we provide visual statistical analysis and interpretation to verify the validity and reliability of the proposed method in survival prediction.

4.1. Datasets and settings

Datasets. To validate the effectiveness of MCAMamba, we conducted extensive experiments on four publicly available cancer datasets from The Cancer Genome Atlas (TCGA), each containing paired whole-slide images, corresponding survival outcomes, and high-throughput genomic profiles. The datasets include 373 cases of bladder urothelial carcinoma (BLCA), 957 cases of breast invasive carcinoma (BRCA), 453 cases of lung adenocarcinoma (LUAD), and 480 cases of uterine corpus endometrial carcinoma (UCEC). Table 1 summarizes key dataset characteristics, where “Patient” denotes the number of cases, “WSI” denotes the number of input WSI images, “Gene” denotes the number of input genes, “Pathway” denotes the number of filtered pathways, “Censored” represents the proportion of censored patients in the dataset, and “Time” denotes the maximum follow-up duration (months).

Table 1.

Dataset statistics across different cancer types.

Cancer Type	Patient	WSI	Gene	Pathway	Censored	Time (months)
BLCA	373	437	20,395	285	0.453	163.2
BRCA	957	1,023	20,971	284	0.136	282.7
UCEC	480	539	9,072	284	0.156	225.3
LUAD	453	516	21,146	188	0.349	238.1

Evaluation Metrics. To comprehensively evaluate predictive performance, we adopted a 5-fold cross-validation strategy and compared the results with those of existing baseline methods. The concordance index (c-index) was used to assess the predictive accuracy of risk scores in survival prediction. This metric measures the agreement between the model’s predicted risk ranking and actual survival times, with higher values indicating stronger predictive accuracy. To visually assess risk stratification, Kaplan-Meier curves were plotted for high-risk and low-risk groups, and differences between groups were evaluated using the log-rank test. A p-value < 0.05 was considered statistically significant.
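
To make the ranking interpretation of the c-index concrete, the following sketch computes it by enumerating comparable patient pairs. It is a textbook formulation of Harrell's concordance index, not necessarily the implementation used in our experiments; the handling of tied risk scores (counted as 0.5) and the function name `concordance_index` are assumptions. Note that `events` uses 1 for an observed event, which is the opposite of a censoring flag.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs whose predicted risks are ordered correctly.

    A pair (i, j) is comparable when patient i has an observed event and a
    shorter survival time than patient j. The pair is concordant when patient i
    also has the higher predicted risk; tied risks count as 0.5 (assumed).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [0.9, 0.8, 0.3, 0.1])
reversed_order = concordance_index(times, events, [0.1, 0.3, 0.8, 0.9])
uninformative = concordance_index(times, events, [0.5, 0.5, 0.5, 0.5])
```

A model that ranks every comparable pair correctly scores 1.0, a fully reversed ranking scores 0.0, and a constant risk score gives 0.5, which is the chance level against which the values in the result tables should be read.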

Implementation Details. For pathological image feature extraction, we used the pre-trained CTransPath model to encode whole-slide images, producing a 768-dimensional feature representation for each image patch. These features were then mapped to 256-dimensional embeddings by a fully connected layer. To reduce redundancy and enhance semantic representation, the PCL module clustered the patch features into 256 representative clusters, which form the pathological modality input sequence. For genomic data, enrichment analysis was performed using KEGG biological pathway information to generate pathway-level expression matrices, which were subsequently embedded into 256-dimensional feature vectors using an SNN. The SSM dimension was set to 16 for both Mamba and SR-Mamba modules. All experiments were implemented in PyTorch 2.0.1 on a single NVIDIA GeForce RTX 4090 GPU. Training used the RAdam optimizer with a learning rate of 2e-4 and a weight decay of 1e-5 for 30 epochs. The batch size was set to 1 because of the large number of patch features in each patient’s pathological images and the substantial GPU memory overhead incurred by multimodal joint modeling.

4.2. Comparison with state-of-the-art methods

To comprehensively evaluate the proposed multimodal survival prediction method MCAMamba, we compared it with a variety of existing methods using a 5-fold cross-validation strategy on four TCGA cancer datasets (BLCA, BRCA, UCEC, and LUAD). The comparison included both single-modal methods and multi-modal methods.

(1) Single-modal methods.

To assess the predictive capability of each individual modality, we conducted single-modal comparison experiments on pathological images and genomic data separately. For the genomic modality, we selected three representative baseline models: MLP,²⁸ SNN,²⁹ and DeepSurv.³⁰ For pathological images, we compared several representative MIL methods, including DeepSets,²⁵ AttentionMIL,²⁶ DeepAttnMISL,²⁷ and TransMIL.⁸ As shown in Table 2, MCAMamba outperformed all single-modality methods on the four datasets, achieving the best survival prediction performance. Compared with genomic unimodal methods, the relative improvement in overall (mean) c-index ranged from 15.4% to 18.9%; compared with pathology unimodal methods, it ranged from 18.2% to 38.1%. These results demonstrate that relying solely on a single modality fails to capture the complex phenotype–molecular dependencies in cancer, whereas integrating histological and genomic information, as MCAMamba does, markedly enhances survival prediction performance and model generalization.

(2) Multi-modal methods.

Table 2.

Performance comparison of different methods across four TCGA datasets, with the best performer highlighted in bold.

Methods	Patho.	Geno.	BLCA	BRCA	UCEC	LUAD	Overall
MLP		✓	0.577±0.041	0.606±0.072	0.608±0.045	0.573±0.021	0.5910
SNN		✓	0.598±0.034	0.600±0.066	0.638±0.054	0.579±0.027	0.6038
DeepSurv		✓	0.578±0.049	0.612±0.054	0.634±0.058	0.608±0.026	0.6090
DeepSets	✓		0.500±0.000	0.503±0.005	0.524±0.027	0.509±0.022	0.5090
AttentionMIL	✓		0.596±0.036	0.592±0.015	0.629±0.061	0.563±0.027	0.5950
DeepAttnMISL	✓		0.524±0.043	0.504±0.042	0.597±0.059	0.548±0.050	0.5433
TransMIL	✓		0.530±0.052	0.633±0.043	0.600±0.049	0.527±0.012	0.5725
MCAT	✓	✓	0.616±0.028	0.607±0.072	0.642±0.098	0.590±0.062	0.6138
PORPOISE	✓	✓	0.595±0.024	0.546±0.077	0.645±0.066	0.598±0.041	0.5960
MGCT	✓	✓	0.612±0.030	0.637±0.062	0.617±0.076	0.629±0.046	0.6237
MOTCat	✓	✓	0.659±0.012	0.674±0.029	0.718±0.035	0.668±0.019	0.6798
CMTA	✓	✓	0.656±0.029	0.648±0.044	0.713±0.053	0.652±0.044	0.6673
SAMamba	✓	✓	0.654±0.015	0.644±0.032	0.742±0.023	0.668±0.015	0.6768
SurMoE	✓	✓	0.672±0.020	0.660±0.043	0.731±0.052	0.681±0.032	0.6860
MCAMamba	✓	✓	0.682±0.039	0.676±0.019	0.762±0.061	0.692±0.037	0.7030

To validate the effectiveness of our model, we compared MCAMamba with seven state-of-the-art multimodal survival prediction models that share the same task definition as this paper and provide publicly available implementations: MCAT,²⁴ PORPOISE,¹⁶ MGCT,³² MOTCat,³¹ CMTA,¹⁹ SAMamba,³⁶ and SurMoE.⁴⁵ As shown in Table 2, MCAMamba achieved the best performance among all compared methods. Specifically, it outperformed all other models on the BLCA (0.682), BRCA (0.676), UCEC (0.762), and LUAD (0.692) datasets. Compared with existing multimodal methods, MCAMamba achieved relative improvements in overall (mean) c-index ranging from 2.47% to 17.9%, demonstrating the model’s superiority in multimodal survival prediction. Notably, compared with SAMamba, the structurally closest multimodal survival prediction method, MCAMamba achieved higher c-index values on all four datasets. These results indicate that embedding a Bidirectional Cross-Attention module within a parallel encoder–decoder architecture enhances cross-modal alignment and preserves intra-modal features, thereby improving the model’s ability to predict patient survival.
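
The reported improvement range can be reproduced from the “Overall” column of Table 2 with the short calculation below. It assumes that each improvement is the relative gain of MCAMamba over a baseline's overall c-index, an interpretation that matches the reported figures up to rounding; the variable and function names are illustrative.

```python
# Overall (mean) c-index of the multimodal baselines, copied from Table 2.
overall = {
    "MCAT": 0.6138,
    "PORPOISE": 0.5960,
    "MGCT": 0.6237,
    "MOTCat": 0.6798,
    "CMTA": 0.6673,
    "SAMamba": 0.6768,
    "SurMoE": 0.686,
}
ours = 0.7030  # overall c-index of MCAMamba

def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

gains = {name: relative_gain(ours, score) for name, score in overall.items()}
smallest = min(gains.values())
largest = max(gains.values())
```

The smallest gain is over SurMoE (about 2.48%) and the largest is over PORPOISE (about 17.95%), which agrees with the reported 2.47% to 17.9% up to rounding.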

4.3. Ablation studies

To evaluate the contribution of each key module in our method to survival prediction performance, we conducted ablation experiments on four TCGA cancer datasets. We first verified the validity of each module within the model. Then, we explored the impact of the Mamba backbone and of the number of cluster centers in the PCL module on model performance.

Component validation: By removing or replacing core components, we analyzed the role of each module in model performance, including the Patch Clustering Layer, SR-Mamba module, Bidirectional Cross-Attention Module, and Self-Attention Pooling. The results of the ablation experiments are presented in Table 3.

(1) Impacts of Patch Clustering Layer: To validate the effectiveness of phenotype clustering in the PCL, we conducted experimental evaluations by removing this module from MCAMamba (w/o PCL). Removing PCL resulted in performance declines across all datasets, with the most significant decreases in BLCA (−3.96%) and UCEC (−3.67%). This indicates that the PCL module contributes positively to overall model performance by reducing redundant patches and preserving key regional features.

(2) Impacts of SR-Mamba module: To validate the effectiveness of state-space modeling in SR-Mamba, we replaced the SR-Mamba component in the pathology image encoder with the standard Mamba architecture (Mamba). Replacing SR-Mamba with standard Mamba led to substantial performance degradation across all datasets: BLCA decreased by 5.73%, BRCA by 5.29%, UCEC by 4.09%, and LUAD by 3.12%. This demonstrates that the SR-Mamba module more effectively captures spatial structures and cross-regional dependencies in pathological images, thereby enhancing the representational capability of the pathological modality.

(3) Impacts of Bidirectional Cross-Attention Module: To validate the interactive effects of BCA, we conducted experimental evaluations by removing this module from MCAMamba (w/o BCA). Removing BCA caused substantial degradation in cross-modal interaction capability. Specifically, the c-index decreased by 11.2% for BLCA, 11.3% for BRCA, 7.9% for UCEC, and 2.5% for LUAD. These results indicate the pivotal role of the BCA in facilitating mutual guidance and fusion between pathological and genomic features, as well as in capturing cross-modal complementary information. Without this mechanism, the model struggles to effectively integrate correlated features across modalities, resulting in significantly reduced predictive performance.

(4) Impacts of Self-Attention Pooling: To validate the aggregation effect of SAP, we conducted experimental evaluations by removing this module from MCAMamba (w/o SAP). Removing SAP led to varying impacts across datasets. UCEC and LUAD showed smaller declines (3.67% and 1.61%, respectively), whereas BLCA and BRCA exhibited c-index declines of 7.74% (from 0.682 to 0.633) and 6.28% (from 0.676 to 0.636). This demonstrates that the SAP module enables the model to focus on disease-relevant pathological regions and gene pathways through its adaptive weighting mechanism. When removed, the model’s ability to aggregate local features weakens, leading to diminished risk characterization.

Table 3.

Experimental results after removing the following key components, with the complete MCAMamba model highlighted in bold.

Methods	BLCA	BRCA	UCEC	LUAD	Overall
w/o PCL	0.656±0.021	0.661±0.042	0.735±0.030	0.682±0.047	0.6835
Mamba	0.645±0.019	0.642±0.027	0.732±0.029	0.671±0.028	0.6725
Transformer	0.637±0.042	0.633±0.022	0.720±0.043	0.652±0.024	0.6605
w/o BCA	0.613±0.037	0.607±0.055	0.706±0.078	0.675±0.057	0.6503
w/o SAP	0.633±0.022	0.636±0.039	0.735±0.047	0.681±0.036	0.6713
MCAMamba	0.682±0.039	0.676±0.019	0.762±0.061	0.692±0.037	0.7030

Overall, removing or replacing any core module led to performance degradation, with the removal of the BCA producing the most pronounced effect (a maximum decline of 11.3%, on BRCA). Throughout this subsection, percentage declines are expressed relative to the c-index of the ablated variant. These results validate the critical role of each component in the MCAMamba model: PCL enhances input feature quality, SR-Mamba strengthens spatial dependency modeling, BCA promotes cross-modal feature alignment and complementarity, and SAP improves key feature aggregation. The synergistic interactions among these modules collectively underpin the model’s superior performance in multimodal cancer survival prediction.

Impact of Mamba on Performance: To evaluate the contribution of Mamba/SR-Mamba as the sequence-model backbone in MCAMamba, we replaced all Mamba/SR-Mamba encoding and decoding modules with standard Transformer blocks, while keeping the remaining network architecture and training settings unchanged. After this replacement, the model’s c-index decreased across all datasets: BLCA by 7.06%, BRCA by 6.79%, UCEC by 5.83%, and LUAD by 6.13%. These results indicate that Mamba/SR-Mamba plays a crucial role as the sequence-model backbone in this multimodal survival prediction task: it models pathological and omics sequence features more effectively, capturing long-range dependencies and promoting cross-modal consistency, which leads to stable performance gains. Furthermore, to systematically compare the inference efficiency and computational overhead of the Transformer, Mamba, and SR-Mamba backbones, we fixed the batch size B=1, the feature dimension D=256, and the layer count L=2 on the BLCA cohort. Under identical precision and hardware conditions, we swept the sequence length K∈{256,512,1024,2048} and measured single-inference time (reported as median/p95) for each backbone (see Figure 4). At K=2048, we also recorded each model’s parameter count and peak GPU memory (see Table 4). Theoretically, the dominant computation in Transformer self-attention scales as O(K²) with sequence length K, whereas that of Mamba scales linearly, as O(K). We therefore compare their complexities empirically through the latency and peak GPU memory measured across the K sweep.
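
The median/p95 latency summary used in this comparison can be obtained along the lines of the sketch below. The number of warm-up and timed runs, the nearest-rank percentile rule, and the function names are assumptions made for illustration; they are not settings reported in this paper. When timing GPU models, the device must additionally be synchronized (for example with `torch.cuda.synchronize()`) before each clock reading, which is omitted here to keep the sketch dependency-free.

```python
import math
import statistics
import time

def summarize_latency(samples_ms):
    """Return (median, p95) of latency samples; p95 uses the nearest-rank rule."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return statistics.median(ordered), ordered[rank - 1]

def time_inference(fn, n_warmup=10, n_runs=100):
    """Time repeated calls of `fn`, a zero-argument callable, in milliseconds."""
    for _ in range(n_warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return summarize_latency(samples)

# Deterministic check of the summary on synthetic samples of 1..100 ms.
median_ms, p95_ms = summarize_latency(list(range(1, 101)))
```

Reporting the p95 alongside the median exposes occasional slow runs that a mean or median alone would hide.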

Figure 4.

The trend of single inference time with respect to different backbone networks. (a) Median latency curve. (b) p95 latency curve.

Table 4.

Efficiency and resource usage (K=2048, batch=1).

Type	Params (M)	Peak GPU memory (MB)	Latency median/p95 (ms)
Mamba	0.876	456.91	0.45/0.46
SR-Mamba	0.965	475.79	0.95/0.98
Transformer	1.580	513.32	1.20/1.21

Impact of the number of Cluster Centers: By default, we set the number of cluster centers K in the PCL module to 256. To evaluate the effect of K on model performance, we conducted comparative experiments with K∈{32,64,128,256,512}, while keeping all other training configurations consistent. As shown in Figure 5, the model achieves the best c-index when K=256. Therefore, we adopt K=256 as the default setting in the PCL module.

Figure 5.

C-index under different numbers of cluster centers K.

4.4. Survival analysis

To evaluate the effectiveness of the MCAMamba model in patient risk stratification, we plotted Kaplan-Meier curves based on the model-predicted risk scores, as shown in Figure 6. Specifically, patients were divided into high-risk and low-risk groups at the median predicted risk score; to ensure a fair comparison, all baseline models used the same median stratification strategy. The Kaplan-Meier curves estimate the cumulative survival probability of patients in each risk group. Survival differences between groups were assessed using the log-rank test, with the significance level set to α=0.05 (all p-values in Figure 6 originate from this test). Across the four cancer datasets (BLCA, BRCA, UCEC, and LUAD), the Kaplan-Meier curves of MCAMamba show a more pronounced separation between high-risk and low-risk groups than those of the other two competitive multimodal methods, and the log-rank test indicates statistically significant survival differences between the two groups for all datasets. On the BLCA dataset, the log-rank p-values were 4.3989e-12 (SAMamba), 1.8946e-05 (SurMoE), and 5.4127e-16 (MCAMamba), with MCAMamba showing the most significant intergroup difference. On the BRCA dataset, the p-values were 1.3786e-07 (SAMamba), 1.0723e-06 (SurMoE), and 5.3560e-07 (MCAMamba); all three are statistically significant, with MCAMamba’s value lying between those of SurMoE and SAMamba. On the UCEC dataset, the p-values were 1.7847e-11 (SAMamba), 1.6692e-05 (SurMoE), and 3.4375e-16 (MCAMamba), again indicating a more pronounced stratification effect for MCAMamba. On the LUAD dataset, the p-values were 3.1704e-09 (SAMamba), 4.3872e-09 (SurMoE), and 5.6824e-17 (MCAMamba), with MCAMamba again achieving the most significant survival difference between the high-risk and low-risk groups. These results further validate MCAMamba’s ability to effectively distinguish patient cohorts with differing prognostic risks, demonstrating stable and competitive risk stratification capabilities across multimodal survival prediction tasks.
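
The stratification procedure described above (median split, Kaplan-Meier estimation, and two-group log-rank test) can be sketched with the standard library alone. These are the textbook estimators, written for illustration; the software actually used to produce Figure 6 is not specified here, and the function names are hypothetical. In this sketch `events` uses 1 for an observed death and 0 for censoring.

```python
import math
import statistics

def median_split(risks):
    """Label a patient high-risk (True) if the score exceeds the cohort median."""
    cut = statistics.median(risks)
    return [r > cut for r in risks]

def kaplan_meier(times, events):
    """Return [(t, S(t))] evaluated at each distinct observed event time."""
    surv, curve = 1.0, []
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve

def logrank_p(times, events, high):
    """Two-group log-rank test; p-value from a chi-square with 1 degree of freedom."""
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        n = sum(1 for ti in times if ti >= t)                       # at risk, both groups
        n1 = sum(1 for ti, g in zip(times, high) if ti >= t and g)  # at risk, high-risk group
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d1 = sum(1 for ti, e, g in zip(times, events, high) if ti == t and e and g)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0.0:
        return 1.0  # no information to separate the groups
    stat = obs_minus_exp ** 2 / var
    return math.erfc(math.sqrt(stat / 2.0))  # survival function of chi-square, 1 dof

times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 1, 1, 1, 1]
risks = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
high = median_split(risks)
p_value = logrank_p(times, events, high)
```

In this toy cohort the four high-risk patients all die before any low-risk patient, so the test rejects equal survival at α=0.05; when both groups have identical outcomes the statistic is zero and the p-value is 1.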

Figure 6.

Kaplan-Meier curves and log-rank test results.

4.5. Interpretability

To further investigate the interpretability of the model’s predictions, we conducted a visual analysis of the basis of the model’s predictions from three perspectives: pathological regions, genomic pathways, and cross-modal correlations. As shown in Figure 7, at the pathological level, the model mainly attends to regions with abundant stromal components, disrupted glandular structures, and marked cellular atypia in the high-risk case, indicating that its risk assessment relies on morphological cues associated with aggressive growth and disorganized tissue architecture. In contrast, in the low-risk case, the model focuses on regions with more intact glandular structures and more continuous tumor epithelial arrangements, suggesting that it relies more on areas with better differentiation and more stable tissue architecture. These results indicate that the model can capture key histological patterns associated with different prognostic states, thereby providing a pathological basis for risk prediction. At the pathway level, the top pathways in the high-risk case included Autophagy-animal, Phosphatidylinositol signaling system, MAPK signaling pathway, Endocytosis, and Endometrial cancer, reflecting enhanced tumor cell survival, activation of oncogenic signaling, and processes related to invasion and migration. By contrast, the top pathways in the low-risk case mainly included the GnRH signaling pathway, mTOR signaling pathway, Cellular senescence, and TGF-beta signaling pathway. Compared with the high-risk case, these pathways were more dispersed and were generally associated with endocrine regulation, maintenance of cellular state, and growth regulation, rather than with invasion, migration, and stress-survival characteristics. 
Overall, the model can distinguish distinct pathway patterns across risk groups: high-risk cases are characterized by stronger tumor progression-related pathways, whereas low-risk cases exhibit a more dispersed and relatively regulatory pathway pattern. This observation is consistent with the top-patch analysis, indicating good agreement between the model’s pathway-level and histopathological interpretations.

Figure 7.

Interpretability and attention visualization.

From the perspective of cross-modal correlations, we selected the same four representative pathways for visualization in both groups: MicroRNAs in cancer⁴⁶ (human diseases); the MAPK signaling pathway⁴⁷ (signal transduction); Focal adhesion⁴⁸ (cellular processes); and the FoxO signaling pathway⁴⁹ (signal transduction). The resulting heat maps depict the distribution of cross-modal attention weights assigned by the model to different pathological regions under the given pathway conditions. All attention values underwent min-max normalization within each WSI (range [0, 1]) to enhance spatial contrast within the same sample. Consequently, color intensity primarily reflects the relative spatial distribution within a sample rather than absolute numerical comparisons between cases. A unified turbo colormap is employed, where warm colors (red/orange) denote relatively high attention weights and cool colors (blue) indicate relatively low weights. Figure 7 reveals distinguishable spatial weight distribution patterns between the two case groups. High-risk cases exhibit more extensive distributions of medium-to-high attention intensity for the MAPK signaling and MicroRNAs in cancer pathways, whereas high-attention regions in low-risk cases are relatively more localized. This discrepancy suggests that the model learns distinct pathway-morphology association patterns across risk states. Additionally, the Focal adhesion pathway shows relatively clustered high-weight distributions in both case groups. The FoxO signaling pathway also exhibits variations in weight intensity and spatial distribution across cases, indicating that the model does not assign uniform attention patterns to all pathways but instead dynamically adjusts cross-modal weight allocation based on sample characteristics. The small window patches at the bottom of the figure correspond to high-weight regions in each pathway heat map, providing local histomorphology information. This allows the process of “which regions the model focuses on under which pathway conditions” to be visualized intuitively. It is crucial to emphasize that attention weights reflect the correlation strengths learned by the model during multimodal fusion rather than the actual biological activation levels of molecular pathways. Overall, these visualizations reveal distinct spatial attention patterns for pathways across different risk phenotypes, demonstrating that MCAMamba can learn structured molecular-morphological correlation features, thereby providing interpretable support for risk stratification.
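
The effect of per-slide min-max normalization on the heat maps can be seen in a small example. The attention values below are made up for illustration, and the handling of a completely flat map is an assumption; only the normalization rule itself is taken from the description above.

```python
def min_max_normalize(weights):
    """Rescale the attention weights of one slide to the range [0, 1]."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return [0.0 for _ in weights]  # flat map: no contrast to show (assumed handling)
    return [(w - lo) / (hi - lo) for w in weights]

# Two hypothetical slides whose raw attention values differ by a factor of ten.
slide_a = min_max_normalize([0.02, 0.05, 0.11])
slide_b = min_max_normalize([0.20, 0.50, 1.10])
```

Both slides map to approximately [0, 1/3, 1] despite the tenfold difference in raw scale, which is why the colors support comparisons of spatial pattern within a slide but not of absolute attention magnitude between cases.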

5. Discussion

Existing multimodal survival prediction methods still face two major challenges: efficiently capturing global dependencies in high-dimensional long-sequence features while maintaining linear complexity, and preserving informative modality-specific representations while enabling effective cross-modal interaction. To address these challenges, we propose MCAMamba, a multimodal method with bidirectional cross-attention and a state space model for cancer survival prediction. This method aims to extract both intrinsic intra-modal information and complementary inter-modal information from pathological images and genomic profiles, enabling efficient modeling and fusion of cross-modal features. Specifically, for the pathology modality, we employ PCL to cluster pathological images and select representative regions. Subsequently, by integrating the pathology encoder and decoder, the model captures long-range dependencies and spatial structural consistency within the sequence of pathology patches while maintaining computational efficiency. For the genomic modality, gene enrichment analysis is performed on genomic data for each cancer type using biological pathway information from the KEGG database. Subsequently, by combining the gene encoder and decoder to model pathway-level gene sequences, the model captures dependencies in complex molecular mechanisms and ensures biological interpretability. Building upon this foundation, we embed a Bidirectional Cross-Attention module within the encoder–decoder architecture. This framework explores intrinsic correlations between pathology and genomics, enabling bidirectional information exchange and complementary feature transfer across modalities. Finally, self-attention pooling aggregates global representations from all modalities, yielding more accurate survival predictions.

To validate the effectiveness of MCAMamba, we conducted extensive experiments on four public TCGA cancer datasets. The results show that MCAMamba significantly outperforms multiple representative baseline models in survival prediction, demonstrating its superior performance in multimodal cancer survival prediction. To further assess the contribution of key modules to predictive performance, we designed ablation experiments to systematically analyze the roles of patch clustering, the SR-Mamba encoder, the BCA module, and the SAP module. However, the current results primarily demonstrate the validity of the model within the TCGA datasets: independent external validation datasets have not yet been incorporated, and cross-cohort evaluation has not been performed. In addition, Kaplan–Meier survival curves were used to visualize the risk stratification results, which showed that MCAMamba has strong risk stratification capability. We also conducted an interpretability analysis of the model’s predictions from three perspectives based on representative high-risk and low-risk cases. However, this interpretability analysis remains preliminary and lacks systematic case-level evaluation. In future work, we will further assess the model’s clinical utility and generalizability through specific case studies and validation on independent external datasets.

Although the method proposed in this study has achieved significant results in multimodal cancer survival prediction, it currently relies primarily on pathological image representations at a single resolution. This limitation partially constrains the model’s ability to fully exploit the inherent pyramidal structure of WSIs, thereby affecting the accuracy of survival predictions. WSIs at different magnifications present hierarchical microscopic views, revealing multi-scale pathological information ranging from tissue phenotypes (5×) to cellular structures (20×) and even individual cells (40×). Based on this observation, future research will introduce multi-scale modeling strategies to comprehensively extract feature information at different scales and capture complementary morphological patterns, thereby enhancing cross-modal alignment and fusion capabilities between pathological and genomic features.

6. Conclusion

In this study, we propose MCAMamba, a multimodal cancer survival prediction method based on a bidirectional cross-attention mechanism and a state space model. The proposed method is designed to efficiently model intra-modal representations of pathological images and genomic profiles while exploring cross-modal correlations and complementary information, thereby improving cancer survival prediction accuracy. Specifically, MCAMamba adopts a parallel encoder–decoder architecture to model pathological images and genomic data separately, extracting inherent valuable feature representations. Building on this architecture, we embed a Bidirectional Cross-Attention module to capture cross-modal correlations between the pathological and genomic modalities, while facilitating bidirectional information exchange and complementary feature transfer. Finally, Self-Attention Pooling aggregates global representations across modalities to achieve more accurate survival prediction. Experimental results on four public TCGA cancer datasets demonstrate that MCAMamba achieves significant improvements in survival prediction, with the c-index increasing by 2.47%-17.9%.

Footnotes

ORCID iD

Haiyan Cui

Ethical considerations

This study constitutes a secondary analysis of publicly available data obtained from the open-access Cancer Genome Atlas (TCGA) repository. The TCGA project obtained ethics committee approval and informed consent from participants during the data collection phase. This research exclusively uses de-identified public data for analysis, involves no personally identifiable information, and requires no additional participant recruitment or intervention. Therefore, this study typically does not require further institutional ethics review.

Consent to participate

Informed consent was obtained from participants during the data collection phase of the original TCGA study. This study uses only de-identified public data for secondary analysis and does not require additional individual informed consent.

Author contributions

All authors contributed to this work. Writing—original draft, HY.C.; Writing—review and editing, WP. D, WX. W; Collect and process data, XF. L. All authors have read and agreed to the published version of the manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

All data generated or analyzed during this study are included in this article and its supplementary materials.
