Abstract
Accurate motion prediction of traffic agents is crucial for the safety and stability of intelligent decision-making in autonomous driving systems. In this paper, we introduce GAMDTP, a novel graph attention-based network tailored for dynamic trajectory prediction. Specifically, in each graph convolution layer we fuse the outputs of self-attention and Mamba-SSM through a gate mechanism, leveraging the strengths of both to extract features more efficiently and accurately. GAMDTP encodes the high-definition map (HD map) data and the agents’ historical trajectory coordinates, and decodes the network’s output to generate the final prediction results. Recent approaches predominantly focus on dynamically fusing historical forecast results and rely on two-stage frameworks consisting of proposal and refinement. To further enhance the performance of such two-stage frameworks, we also design a scoring mechanism that evaluates the prediction quality during the proposal and refinement processes. Experiments on the Argoverse and INTERACTION datasets demonstrate that GAMDTP achieves state-of-the-art performance and is better at capturing interaction features and ensuring safety in dynamic trajectory prediction.
Introduction
Accurate motion forecasting of surrounding traffic agents, including vehicles, pedestrians, and other road participants, is critical to guarantee the safety and stability of autonomous driving systems. Predicting the trajectories of traffic agents with high precision allows autonomous systems to anticipate future states, make informed decisions in real-time and avoid risks while driving.
Early research mainly used rasterized semantic images to represent map information (Lee et al., 2017; Phan-Minh et al., 2020). However, because rasterization loses information, Gao et al. (2020) and Liang et al. (2020) each designed a vector-based method in which agents and roads are modeled as collections of vectors. Azadani and Boukerche (2023), Chen et al. (2022), Wang et al. (2020), and Zhang et al. (2022) build on this representation and leverage GNNs (Velickovic et al., 2017) and LSTMs (Hochreiter, 1997) to fuse spatio-temporal information for accurate and socially plausible vehicle trajectory prediction. However, LSTM-based methods are bottlenecked by limited parallelization, memory efficiency, long-term dependency modeling, and training speed. Among recent advances, HiVT (Zhou et al., 2022) achieves good results by modeling the relationships between agents and the scene and among agents, together with factors such as the choice of coordinate-system orientation. QCNet (Zhou et al., 2023) further investigates the impact of reusing historical computations on the final prediction results. It presents an efficient multi-modal trajectory prediction framework with a novel two-stage design, consisting of proposal and refinement, built on a query-centric paradigm. By reusing scene encodings and combining anchor-based refinement strategies, it achieves both fast inference and high prediction accuracy, making it well suited for real-time autonomous driving scenarios. Moreover, HPNet (Tang et al., 2024) integrates historical predictions with real-time context through its Historical Prediction Attention module, which dynamically models the relationship between successive predictions, resulting in more accurate and stable trajectory forecasts.
In addition, given the uncertainty of the future, many previous works (Chai et al., 2019; Gu et al., 2021b; Liu & Meidani, 2024, 2025; Liu et al., 2021; Tang et al., 2024; Varadarajan et al., 2022; Zhou et al., 2023, 2022) output multi-modal future trajectories rather than a single trajectory, and we follow this practice in this paper.
Most of those approaches are based on Graph Attention Networks (GAT) (Velickovic et al., 2017), which bring GNNs and Transformers together, and Transformers (Vaswani, 2017) can capture long-range dependencies among nodes in a graph. However, their feature fusion strategies employ fixed-weight combinations, which cannot dynamically adapt to varying traffic scenarios. This rigidity often leads to suboptimal feature representation, particularly in complex, dynamic environments where the relative importance of spatial and temporal features changes continuously. Recently, a new state space model (SSM) (Gu et al., 2021a), Mamba (Dao & Gu, 2024; Gu & Dao, 2023), has demonstrated potential in sequence modeling and in capturing long-term dependencies, with linear computational complexity and improved GPU efficiency across tasks in natural language processing (He et al., 2024; Lieber et al., 2024; Team et al., 2024) and computer vision (Li et al., 2025; Zhang et al., 2025; Zhu et al., 2024). Despite its potential, Mamba-SSM remains underexplored in the context of graph-based trajectory prediction frameworks.
To address these limitations, we propose GAMDTP, a novel approach that fuses Graph Attention Networks (GAT) (Velickovic et al., 2017) with the selective capabilities of Mamba-SSM (Gu & Dao, 2023). Inspired by Ding et al. (2024) in computational pathology, GAMDTP leverages the unique strengths of both GAT (Velickovic et al., 2017) and Mamba-SSM (Dao & Gu, 2024; Gu & Dao, 2023) through a gate mechanism, combining the self-attention mechanism’s adaptability to complex inter-agent interactions with Mamba’s efficient handling of long-range dependencies through structured state spaces. This fusion allows GAMDTP to deliver accurate and efficient feature extraction, scalable computational performance, and the ability to adapt to diverse and dynamic driving environments, making it particularly suited for real-time trajectory prediction.
Additionally, recognizing the limitations of existing two-stage trajectory prediction frameworks, where the proposal and refinement stages often lack effective cooperation, we introduce a Quality Scoring Mechanism following SmartRefine (Zhou et al., 2024). This mechanism evaluates the prediction quality at both stages, prioritizing high-quality trajectory proposals and improving the refinement process, ultimately leading to more accurate and reliable trajectory forecasts.
Our approach is evaluated on the Argoverse (Chang et al., 2019) and INTERACTION (Zhan et al., 2019) datasets, both of which are standard benchmarks for autonomous driving scenarios, where GAMDTP demonstrates state-of-the-art performance. This enhancement in prediction capability not only strengthens the robustness of trajectory predictions but also contributes to the overall safety and stability of autonomous driving systems.
In summary, our work has the following contributions:
Our work adapts Mamba for graph-based trajectory prediction by developing a novel fusion strategy with graph attention networks.
GAMDTP incorporates a scoring mechanism that evaluates the prediction results of the proposal and refinement stages to improve the performance of the refinement process.
Experiments on the Argoverse (Chang et al., 2019) and INTERACTION (Zhan et al., 2019) datasets demonstrate that GAMDTP achieves state-of-the-art performance and is better at capturing interaction features and ensuring safety in dynamic trajectory prediction.
Related Work
GNNs and Temporal Models for Trajectory Prediction
The development of accurate and efficient trajectory prediction models is critical for autonomous driving, as they allow the system to anticipate the future states of traffic agents, ensuring safety and operational stability in real-time decisions. To model the social, spatial, and temporal interactions among agents and between agents and lanes, Liang et al. (2020) and Wang et al. (2020) apply message-passing GNNs, encoding agents and lanes as nodes and speed, direction, and other dynamic information as edges. GNNs work by iteratively gathering information from neighboring nodes to update the current node’s representation, with different GNN types employing distinct aggregation and update functions. This process enables GNNs to learn representations that encapsulate the graph data’s topological structure. To model historical trajectories and other sequence data, early approaches relied heavily on Recurrent Neural Networks (RNNs) (Schmidt, 2019) and Long Short-Term Memory networks (LSTMs) (Hochreiter, 1997) to capture temporal dependencies (Lee et al., 2017; Zyner et al., 2019). LSTMs have been widely used in autonomous driving applications for their ability to maintain sequential information over time and handle agent-specific histories (Alahi et al., 2016; Chen et al., 2022; Deo & Trivedi, 2018; Xing et al., 2019). Compared to LSTMs, Transformers are stronger in parallelization and long-term dependency capture, which benefits both training and memory efficiency. Therefore, the attention mechanism (Vaswani, 2017) has become the dominant method adopted by recent works such as Hou et al. (2022) and Li et al. (2023). Azadani and Boukerche (2023), Gu et al. (2021b), Ngiam et al. (2021), Wang et al. (2024), and Zhou et al. (2022) fuse GNNs and Transformers and model different scenarios for different cases.
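The gather, aggregate, and update cycle of message passing described above can be illustrated with a minimal pure-Python sketch. The mean aggregation and the equal-weight update rule are assumptions chosen for brevity, and the function name `message_passing_step` is hypothetical; none of the cited methods is claimed to use exactly this form.

```python
from collections import defaultdict


def message_passing_step(node_feats, edges):
    """One round of mean-aggregation message passing.

    node_feats: dict mapping node id -> list of floats (feature vector).
    edges: list of (src, dst) pairs; messages flow from src to dst.
    """
    # Gather: collect the features of every incoming neighbor.
    inbox = defaultdict(list)
    for src, dst in edges:
        inbox[dst].append(node_feats[src])

    updated = {}
    for node, feat in node_feats.items():
        msgs = inbox.get(node, [])
        if not msgs:
            # Isolated nodes keep their features unchanged.
            updated[node] = list(feat)
            continue
        # Aggregate: element-wise mean over neighbor features.
        mean_msg = [sum(vals) / len(msgs) for vals in zip(*msgs)]
        # Update (assumed rule): average own features with the message.
        updated[node] = [0.5 * (a + b) for a, b in zip(feat, mean_msg)]
    return updated


feats = {"a": [1.0, 0.0], "b": [0.0, 1.0], "lane": [2.0, 2.0]}
out = message_passing_step(feats, [("a", "b"), ("lane", "b")])
```

Here node `b` receives messages from an agent node and a lane node, while `a` and `lane` have no incoming edges and are left unchanged. Attention-based variants replace the uniform mean with learned, per-neighbor weights.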
Recently, a novel state space model (SSM) (Gu & Dao, 2023), Mamba, has shown promise in sequence modeling and capturing long-term dependencies (Gu et al., 2021a). Mamba introduces a selective mechanism into the SSM, enabling it to identify critical information similarly to an attention mechanism. Studies have highlighted Mamba’s potential across domains like natural language processing (He et al., 2024; Lieber et al., 2024; Team et al., 2024) and computer vision. However, Mamba’s potential in combination with GATs remains underexplored. In this paper, we fuse Mamba and the attention mechanism in a graph neural network with a gate mechanism for encoding HD map data and historical trajectory information.
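To make the selective mechanism concrete, the following toy sketch runs a scalar state-space recurrence in which the step size and the input and output projections depend on the current input. It is a heavily simplified illustration under stated assumptions: the state is a single scalar, the projections are plain scalar weights (`w_delta`, `w_b`, `w_c` are hypothetical names), and the discretization uses exp(delta * a) for the state transition and delta * b for the input. It is not the Mamba or Mamba2 implementation.

```python
import math


def softplus(x):
    # Numerically stable softplus; keeps the step size positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Scalar selective state-space recurrence (toy version).

    h_t = exp(delta_t * a) * h_{t-1} + delta_t * b_t * x_t
    y_t = c_t * h_t
    delta_t, b_t and c_t are functions of the input x_t, which is what
    makes the model "selective".
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)  # input-dependent step size
        b = w_b * x                    # input-dependent input projection
        c = w_c * x                    # input-dependent output projection
        h = math.exp(delta * a) * h + delta * b * x
        ys.append(c * h)
    return ys


ys = selective_scan([1.0, 0.5, -0.5])
```

The loop visits each token once, which is the source of the linear complexity in sequence length mentioned above. A large `delta` lets the current input overwrite the state, while a small one preserves the history.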
Two-Stage Motion Forecasting
Inspired by refinement networks (Carion et al., 2020; Ren et al., 2016) in computer vision, refinement strategies have recently been applied to motion forecasting. This framework typically involves a proposal stage, where multiple candidate trajectories are generated, followed by a refinement stage, where these proposals are optimized based on the context. QCNet (Zhou et al., 2023) employs a two-stage approach to improve efficiency and accuracy. Specifically, it leverages a query-centric paradigm to forecast the trajectory in the proposal stage and predicts an offset in the refinement stage. HPNet (Tang et al., 2024) introduces a historical prediction attention module to encode the dynamic relation between successive predictions in the proposal stage; in the refinement stage, it encodes the prediction with a two-layer MLP and then recomputes the result in the same way. However, this does not produce better cooperation between the two stages. SmartRefine (Zhou et al., 2024) introduces a new refinement framework and designs a quality score mechanism. Inspired by it, we design a scoring mechanism between the proposal and refinement stages of the HPNet (Tang et al., 2024) framework.
Method
In this section, we first introduce the problem formulation for dynamic trajectory prediction in 3.1. To verify the performance of the modules we designed and to make our network easier to understand, we introduce the selected backbone network in 3.2. Then, we present our proposed Graph Attention Mamba Network and the quality scoring mechanism in the two-stage framework in 3.3 and 3.4, respectively. Finally, we introduce the training objective and its loss function in 3.5.
Problem Formulation
The goal of trajectory prediction is to predict the future paths of agents of interest based on their past movements. Given a fixed-length sequence of history status frames,
Our work is based on a SOTA approach, HPNet (Tang et al., 2024). Following HPNet, the encoder is implemented as a two-layer MLP that encodes the features of agents and the HD map as embeddings:
Each agent at each time step and each lane segment are treated as nodes in the graph. The edge features are represented as
The output embeddings from the encoder are used as the input of the Backbone. The Backbone network contains three main modules driven by our proposed module, namely Agent GAM, Historical Prediction GAM and Mode GAM. Agent GAM first takes the prediction embeddings as input:
Then Historical Prediction GAM takes the result of Agent GAM as input to model the correlation between historical predictions and the current forecast. Finally, the results of the previous modules are fed into Mode GAM, which models interactions among different future trajectory modes, and the modules above are repeated
An overview of our method is shown in Figure 1. Our proposed module is applied in the Backbone and is designed to enhance the feature extraction and prediction capabilities of the network.

Overview of GAMDTP. The Encoder Processes Raw Input Features Such as HD Map and Agent Trajectory Information. Our Proposed Graph Attention Mamba Module is Applied in the Components Agent GAM, Historical Prediction GAM and Mode GAM, Which Extract Spatio-Temporal Features. The Decoder Generates the Final Predicted Trajectories and Probabilities, and the Score Decoder Further Evaluates and Prioritizes Trajectory Candidates for Refinement by Generating a Score for Each Result, Ensuring Accurate and Reliable Predictions.
As illustrated in Figure 2(b), we use a normal Graph Attention layer as the GAT block. Graph Attention (Velickovic et al., 2017) uses an attention mechanism to learn the importance of each neighboring node to the current node. This makes the message-passing process focus on the most relevant nodes when making predictions. The edge features are concatenated with the neighboring

Our Proposed Graph Attention Mamba Module (a), Which Integrates a Mamba Block and a Normal Graph Attention Block (b). The Input Features Include Node Features and Edge Features, Which are First Normalized Through a LayerNorm (LN) Layer Before Being Processed by the Mamba and GAT Blocks. The Outputs from These Blocks are Fused Using a Gate Mechanism, Where the Sigmoid Function Dynamically Generates a Gate Signal G to Balance Their Contributions.
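The gated fusion of the two branch outputs can be sketched as follows. This is a minimal illustration under assumptions, not the exact formulation used in GAMDTP: the gate is computed per feature dimension from both branch outputs with hypothetical scalar weights (`w_gat`, `w_mamba`, `bias`), and the fused output is assumed to be the convex combination G * GAT + (1 - G) * Mamba.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(gat_out, mamba_out, w_gat, w_mamba, bias):
    """Fuse two feature vectors with an element-wise sigmoid gate.

    For each dimension i (assumed form):
        g_i     = sigmoid(w_gat_i * gat_i + w_mamba_i * mamba_i + bias_i)
        fused_i = g_i * gat_i + (1 - g_i) * mamba_i
    """
    fused = []
    for g_feat, m_feat, wg, wm, b in zip(gat_out, mamba_out, w_gat, w_mamba, bias):
        gate = sigmoid(wg * g_feat + wm * m_feat + b)
        fused.append(gate * g_feat + (1.0 - gate) * m_feat)
    return fused


# With zero weights and bias the gate is 0.5, so both branches count equally.
balanced = gated_fusion([1.0, 3.0], [3.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
# A large positive bias pushes the gate toward 1, favoring the GAT branch.
gat_heavy = gated_fusion([1.0, 3.0], [3.0, 1.0], [0.0, 0.0], [0.0, 0.0], [50.0, 50.0])
```

Because the gate depends on the features themselves, the balance between the two branches can change from node to node, in contrast to a fixed-weight combination.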
As illustrated in Figure 2(a), we use a Mamba2 layer as the Mamba block. The
In our work, each node feature in the graph is a token of the input sequence as mentioned above. To simplify the expression, the following will use function
The input graph node features and edge features are
We also design a gate mechanism to fuse
To enhance the performance of the two-stage trajectory prediction framework, we introduce a scoring mechanism that evaluates the prediction quality of both the proposal and refinement stages. At the training stage, the quality of a predicted trajectory can be assessed according to the ground truth trajectory
In detail, using the maximum predicted error between the predicted result and the ground truth among all iterations, represented by
To enable GAMDTP to predict the quality score, we utilize a Mamba2 layer to process the prediction embedding at the proposal stage. Subsequently, an MLP is employed to produce the quality score, as shown in Algorithm 1:
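As a complement to the description of the scoring mechanism, the sketch below shows one way a training target for the quality score could be derived from the per-stage errors. It is not Algorithm 1. It assumes a SmartRefine-style normalization in which each stage's error is compared against the maximum error over all stages, uses the average displacement as the error measure, and introduces the hypothetical names `displacement_error` and `quality_targets`.

```python
import math


def displacement_error(pred, gt):
    """Average L2 distance between a predicted and a ground-truth trajectory."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def quality_targets(stage_preds, gt, eps=1e-6):
    """Map per-stage errors to quality targets in [0, 1).

    stage_preds: one trajectory per stage (e.g. proposal, refinement).
    Assumed target: q_i = (d_max - d_i) / (d_max + eps), so the stage with
    the largest error gets 0 and better stages get higher scores.
    """
    errors = [displacement_error(pred, gt) for pred in stage_preds]
    d_max = max(errors)
    return [(d_max - err) / (d_max + eps) for err in errors]


gt = [(0.0, 0.0), (1.0, 0.0)]
proposal = [(0.0, 2.0), (1.0, 2.0)]  # average error 2.0
refined = [(0.0, 1.0), (1.0, 1.0)]   # average error 1.0
targets = quality_targets([proposal, refined], gt)
```

In this toy case the refined trajectory halves the error of the proposal, so its target is close to 0.5 while the proposal's target is 0. At inference time the ground truth is unavailable, which is why a learned score predictor is needed.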
To optimize the proposed model, we follow the winner-takes-all strategy, which ensures that the most relevant mode, based on the minimum endpoint displacement, is selected for optimization. Specifically, the
The probability
In summary, the final training objective combines the loss functions above:
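The winner-takes-all selection described in this section can be sketched as follows. The mode is chosen by minimum endpoint displacement, as stated above. The regression loss used here (average L2 distance) is an assumption for illustration and may differ from the loss actually used, and `winner_takes_all` is a hypothetical helper name.

```python
import math


def winner_takes_all(mode_trajs, gt):
    """Select the mode closest to the ground truth at the endpoint.

    mode_trajs: list of K predicted trajectories, each a list of (x, y).
    gt: ground-truth trajectory as a list of (x, y).
    Returns (best_index, loss), where the loss is computed on the best mode only.
    """
    endpoint_errors = [math.dist(traj[-1], gt[-1]) for traj in mode_trajs]
    best = min(range(len(mode_trajs)), key=endpoint_errors.__getitem__)
    # Assumed regression loss: average L2 distance over all time steps.
    loss = sum(math.dist(p, g) for p, g in zip(mode_trajs[best], gt)) / len(gt)
    return best, loss


gt = [(0.0, 0.0), (1.0, 0.0)]
modes = [
    [(0.0, 0.0), (5.0, 5.0)],  # good start, poor endpoint
    [(0.0, 1.0), (1.0, 1.0)],  # constant offset, better endpoint
]
best, loss = winner_takes_all(modes, gt)
```

Only the winning mode receives a regression gradient, which keeps the remaining modes free to cover other plausible futures. The index of the winner can also serve as the classification target for the mode probabilities.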
Datasets
To evaluate the performance of our model, we conduct experiments on the Argoverse and INTERACTION datasets.
In summary, Argoverse provides a large, diverse dataset with detailed sensor data and HD maps for a wide range of autonomous driving tasks, especially for training perception and motion forecasting models. INTERACTION is more specialized, focusing on capturing and analyzing complex interactions between different road users, which is particularly useful for behavior modeling and safety assessment in challenging driving scenarios.
Metrics
We utilized standard trajectory forecasting metrics, ensuring a comprehensive assessment across different prediction scenarios. These metrics include evaluations on both Argoverse (Chang et al., 2019) and INTERACTION (Zhan et al., 2019) datasets, capturing the accuracy, reliability, and multimodal capabilities of the predictions.
For the Argoverse dataset, we employ minimum Average Displacement Error (minADE) and minimum Final Displacement Error (minFDE) to measure the accuracy of trajectory predictions. Specifically, minADE computes the average
For the INTERACTION (Zhan et al., 2019) dataset, we employ minJointADE, minJointFDE and Cross Collision Rate to evaluate the performance of joint trajectory prediction. MinJointADE measures the average
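For reference, the displacement metrics named in this section can be sketched in a few lines of Python. The definitions follow the commonly used benchmark conventions and are stated here as assumptions: minADE and minFDE take the best of K modes per agent, the Brier variant adds (1 - p)^2 for the probability p of the mode with the smallest endpoint error, and the joint metrics pick a single mode index shared by all agents in the scene. The Cross Collision Rate is not covered.

```python
import math


def ade(pred, gt):
    """Average L2 displacement over all time steps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def fde(pred, gt):
    """L2 displacement at the final time step."""
    return math.dist(pred[-1], gt[-1])


def min_ade(modes, gt):
    return min(ade(m, gt) for m in modes)


def min_fde(modes, gt):
    return min(fde(m, gt) for m in modes)


def brier_min_fde(modes, probs, gt):
    # Penalize low confidence in the mode with the best endpoint.
    best = min(range(len(modes)), key=lambda k: fde(modes[k], gt))
    return fde(modes[best], gt) + (1.0 - probs[best]) ** 2


def min_joint_ade(joint_modes, gts):
    # joint_modes[k][a] is the trajectory of agent a under joint mode k.
    return min(
        sum(ade(t, g) for t, g in zip(mode, gts)) / len(gts)
        for mode in joint_modes
    )


def min_joint_fde(joint_modes, gts):
    return min(
        sum(fde(t, g) for t, g in zip(mode, gts)) / len(gts)
        for mode in joint_modes
    )


gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
modes = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],
    [(0.0, 0.0), (1.0, 0.0), (2.0, 2.0)],
]
gts = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 5.0), (1.0, 5.0)]]
joint_modes = [
    [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 5.0), (1.0, 9.0)]],
    [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 5.0), (1.0, 6.0)]],
]
```

In the joint example, picking the best mode separately per agent would give a lower error than any single joint mode achieves, which is why joint metrics are the stricter test of scene-level consistency.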
Comparison with State-of-the-Art
Definitions of Essential Variables in This Work.
Comparison of GAMDTP With the State of the Art Methods on the Argoverse Test Set. The b-minFDE is the Official Ranking Metric. For Each Metric, the Best Result is in

Comparison of Our GAMDTP With the Baseline.
However, our method shows a slight decline in b-minFDE, minFDE and the single-mode metrics, which indicates that GAMDTP falls short in predicting trajectory endpoints and in single-mode accuracy. Specifically, the decline in b-minFDE and in the 6-mode and 1-mode FDE indicates that while our method better captures overall motion trends, it pays a small penalty in final-position accuracy. We attribute this to Mamba-SSM’s bias toward global temporal coherence, which can slightly dilute focus on local spatial constraints near trajectory endpoints.
Additionally, to assess the inference speed and computational cost of our model, we ran inference speed tests on the Argoverse validation set and recorded the parameter count during the training phase.
As shown in Table 3, the inference speed of our GAMDTP has been accelerated because the Mamba block is more parallel and computationally efficient. However, the improvement is limited because the model structure still contains the attention mechanism. Moreover, the addition of the Mamba branch resulted in an increase in the number of parameters in the training phase and led to an increase in the training time cost.
Comparison of Inference Time and Parameters on Argoverse Validation Set. For Each Metric, the Best Result is in
Comparison of GAMDTP With the State of the Art Methods on the INTERACTION Test Set. For Each Metric, the Best Result is in
Although our model outperforms HPNet in minJointADE and CCR, it shows a 0.78% increase in minJointFDE. The results on both Argoverse and INTERACTION suggest that GAMDTP has a subtle limitation in final-position accuracy, possibly because Mamba-SSM’s temporal bias marginally dilutes spatial precision at trajectory endpoints.
Our model demonstrates superior performance on the INTERACTION dataset, and this advantage appears particularly pronounced in scenarios involving complex multi-agent interactions and safety-critical situations. This suggests that GAMDTP’s dynamic gating mechanism and quality scoring may be particularly effective at modeling the nuanced interaction patterns and collision avoidance behaviors prevalent in this dataset. However, the performance gap narrows in simpler, more predictable scenarios, indicating that our model’s added complexity may not always be justified when interaction complexity is low.
To check the effectiveness of the key components in our model, we conduct a series of ablation experiments on the INTERACTION (Zhan et al., 2019) test set. Specifically, we evaluate the impact of the gate mechanism, quality scoring mechanism and the number of Mamba layers, which represents the number of stacked Mamba layers in GAM module. The results are summarized in Table 5.
Ablation Study on INTERACTION Test Set.
Our GAMDTP trains on 1 RTX A6000 GPU for 64 epochs, using the AdamW (Merity et al., 2017) optimizer with a batch size of 4, dropout rate of 0.1, and weight decay of
On Argoverse, we set a radius of 50 (50 meters in the real world) for all local areas as the interaction field. On INTERACTION, the radius is 80.
Conclusion
In this paper, we introduced GAMDTP, a novel framework for accurate and efficient trajectory forecasting in autonomous driving scenarios. By integrating Mamba-SSM and Graph Attention Networks (GAT) through a dynamic gating mechanism, our model effectively captures interaction features, improves safety, and accelerates inference. To further enhance the two-stage trajectory prediction framework, we designed a Quality Scoring Mechanism, which evaluates trajectory proposals and prioritizes high-quality candidates during refinement. Our experimental results on the Argoverse and INTERACTION datasets demonstrate that GAMDTP achieves state-of-the-art performance.
Despite its strengths, our model exhibits certain limitations. It occasionally underperforms the baseline method, particularly in predicting trajectory endpoints and in single-mode prediction. In addition, the integration of Mamba and GAT, while effective, introduces higher computational costs compared to simpler architectures. Specifically, the dynamic gating mechanism increases the number of parameters relative to baseline models. This trade-off is justified by the improved accuracy but may limit deployment on resource-constrained edge devices.
In summary, GAMDTP offers a scalable and reliable solution for dynamic trajectory forecasting, advancing the capabilities of autonomous driving systems. In the near future, we will focus on addressing these identified limitations. Exploring advanced gating techniques, such as attention-based gating or mixture-of-experts frameworks, could potentially enhance the model’s performance in dense interaction scenarios. Additionally, an adaptive mechanism that dynamically manages computational resources based on agent density and interaction complexity could further improve both prediction accuracy and inference speed. Finally, expanding the model’s evaluation across diverse datasets with varying agent densities and interaction complexities would validate its robustness and practical applicability in real-world autonomous driving systems.
Footnotes
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
