GAMDTP: Dynamic Trajectory Prediction with Graph Attention Mamba Network

Abstract

Accurate motion prediction of traffic agents is crucial for the safety and stability of autonomous driving systems that make intelligent decisions. In this paper, we introduce GAMDTP, a novel graph attention-based network tailored for dynamic trajectory prediction. Specifically, in each graph convolution layer we fuse the outputs of self-attention and Mamba-SSM through a gate mechanism, leveraging the strengths of both to extract features more efficiently and accurately. GAMDTP encodes the high-definition map (HD map) data and the agents’ historical trajectory coordinates, and decodes the network’s output to generate the final prediction results. Additionally, recent approaches predominantly focus on dynamically fusing historical forecast results and rely on two-stage frameworks consisting of proposal and refinement. To further enhance the performance of such two-stage frameworks, we also design a scoring mechanism to evaluate the prediction quality during the proposal and refinement processes. Experiments on the Argoverse and INTERACTION datasets demonstrate that GAMDTP achieves state-of-the-art performance and has advantages in capturing interaction features and ensuring safety in dynamic trajectory prediction.

Keywords

Trajectory prediction, graph attention network, Mamba-SSM, dynamic intelligent systems

1. Introduction

Accurate motion forecasting of surrounding traffic agents, including vehicles, pedestrians, and other road participants, is critical to guarantee the safety and stability of autonomous driving systems. Predicting the trajectories of traffic agents with high precision allows autonomous systems to anticipate future states, make informed decisions in real-time and avoid risks while driving.

Early research mainly used rasterized semantic images to represent map information (Lee et al., 2017; Phan-Minh et al., 2020). However, because rasterization loses information, Gao et al. (2020) and Liang et al. (2020) both designed vector-based methods in which agents and roads are modeled as collections of vectors. Building on this, Azadani and Boukerche (2023), Chen et al. (2022), Wang et al. (2020) and Zhang et al. (2022) leverage GNNs (Velickovic et al., 2017) and LSTMs (Hochreiter, 1997) to fuse spatio-temporal information for accurate and socially plausible vehicle trajectory prediction. However, LSTM-based methods are bottlenecked by limited parallelization, memory efficiency, long-term dependency modeling and training speed. Among recent advances, HiVT (Zhou et al., 2022) achieves good results by considering the relationships between agents and the scene and among agents, as well as the choice of coordinate-system orientation and other factors. QCNet (Zhou et al., 2023) further investigates the impact of reusing historical computations on the final prediction results. It presents an efficient multi-modal trajectory prediction framework with a novel two-stage design, consisting of proposal and refinement, under a query-centric paradigm. By reusing scene encodings and combining anchor-based refining strategies, it achieves both fast inference and high prediction accuracy, making it well suited for real-time autonomous driving scenarios. Moreover, HPNet (Tang et al., 2024) integrates historical predictions with real-time context through its Historical Prediction Attention module, which dynamically models the relationship between successive predictions, resulting in more accurate and stable trajectory forecasts.
In addition, given the uncertainty of the future, many previous works (Chai et al., 2019; Gu et al., 2021b; Liu & Meidani, 2024, 2025; Liu et al., 2021; Tang et al., 2024; Varadarajan et al., 2022; Zhou et al., 2023, 2022) output multi-modal future trajectories rather than a single trajectory, and we follow this practice in this paper.

Most of those approaches are based on Graph Attention Networks (GAT) (Velickovic et al., 2017), which bring GNNs and Transformers together, and Transformers (Vaswani, 2017) can capture long-range dependencies among nodes in a graph. However, their feature fusion strategies employ fixed-weight combinations, which cannot dynamically adapt to varying traffic scenarios. This rigidity often leads to suboptimal feature representation, particularly in complex, dynamic environments where the relative importance of spatial and temporal features changes continuously. Recently, Mamba (Dao & Gu, 2024; Gu & Dao, 2023), a new state space model (SSM) (Gu et al., 2021a), has demonstrated potential in sequence modeling and long-term dependency capture, with linear computational complexity and improved GPU efficiency across tasks in natural language processing (He et al., 2024; Lieber et al., 2024; Team et al., 2024) and computer vision (Li et al., 2025; Zhang et al., 2025; Zhu et al., 2024). Despite its potential, Mamba-SSM remains underexplored in the context of graph-based trajectory prediction frameworks.

To address these limitations, we propose GAMDTP, a novel module that fuses Graph Attention Networks (GAT) (Velickovic et al., 2017) with the selective capabilities of Mamba-SSM (Gu & Dao, 2023). Inspired by Ding et al. (2024) in computational pathology, GAMDTP leverages the unique strengths of both GAT (Velickovic et al., 2017) and Mamba-SSM (Dao & Gu, 2024; Gu & Dao, 2023) through a gate mechanism, combining the self-attention mechanism’s adaptability to complex inter-agent interactions with Mamba’s efficient handling of long-range dependencies through structured state spaces. This fusion allows GAMDTP to deliver accurate and efficient feature extraction, scalable computational performance, and the ability to adapt to diverse and dynamic driving environments, making it particularly suited for real-time trajectory prediction.

Additionally, recognizing the limitations of existing two-stage trajectory prediction frameworks, where the proposal and refinement stages often lack effective cooperation, we introduce a Quality Scoring Mechanism following SmartRefine (Zhou et al., 2024). This mechanism evaluates the prediction quality at both stages, prioritizing high-quality trajectory proposals and improving the refinement process, ultimately leading to more accurate and reliable trajectory forecasts.

Our approach is evaluated on the Argoverse (Chang et al., 2019) and INTERACTION (Zhan et al., 2019) datasets, both of which are standard benchmarks for autonomous driving scenarios, where GAMDTP demonstrates state-of-the-art performance. This enhancement in prediction capability not only strengthens the robustness of trajectory predictions but also contributes to the overall safety and stability of autonomous driving systems.

In summary, our work has the following contributions:

Our work uniquely adapts Mamba for graph-based trajectory prediction by developing a novel fusion strategy with graph attention networks.

GAMDTP incorporates a scoring mechanism that evaluates the prediction results of the proposal and refinement stages to improve the performance of the refinement process.

Experiments on the Argoverse (Chang et al., 2019) and INTERACTION (Zhan et al., 2019) datasets demonstrate that GAMDTP achieves state-of-the-art performance and has advantages in capturing interaction features and ensuring safety in dynamic trajectory prediction.

2. Related Work

2.1. GNNs and Temporal Models for Trajectory Prediction

The development of accurate and efficient trajectory prediction models is critical for autonomous driving, as they make it possible to anticipate the future states of traffic agents, ensuring safety and operational stability in real-time decision-making. To model the social, spatial and temporal interactions among agents and between agents and lanes, Liang et al. (2020) and Wang et al. (2020) apply message-passing GNNs, encoding agents and lanes as nodes and speed, direction and other dynamic information as edges. GNNs work by iteratively gathering information from neighboring nodes to update the current node’s representation, with different GNN types employing distinct aggregation and update functions. This process enables GNNs to learn representations that encapsulate the graph data’s topological structure. To model historical trajectories and other sequence data, early approaches relied heavily on Recurrent Neural Networks (RNNs) (Schmidt, 2019) and Long Short-Term Memory networks (LSTMs) (Hochreiter, 1997) to capture temporal dependencies (Lee et al., 2017; Zyner et al., 2019). LSTMs have been widely used in autonomous driving applications for their ability to maintain sequential information over time and handle agent-specific histories (Alahi et al., 2016; Chen et al., 2022; Deo & Trivedi, 2018; Xing et al., 2019). Compared to LSTMs, Transformers are more powerful in parallelization and long-term dependency capture, which affects both training and memory efficiency. Therefore, the attention mechanism (Vaswani, 2017) has become the dominant method adopted by recent works (Hou et al., 2022; Li et al., 2023). Azadani and Boukerche (2023), Gu et al. (2021b), Ngiam et al. (2021), Wang et al. (2024) and Zhou et al. (2022) fuse GNNs and Transformers and model different scenarios for different cases.

Recently, a novel state space model (SSM) (Gu & Dao, 2023), Mamba, has shown promise in sequence modeling and capturing long-term dependencies (Gu et al., 2021a). Mamba introduces a selective mechanism into the SSM, enabling it to identify critical information similarly to an attention mechanism. Studies have highlighted Mamba’s potential across domains such as natural language processing (He et al., 2024; Lieber et al., 2024; Team et al., 2024) and computer vision. However, Mamba’s potential in combination with GATs remains underexplored. In this paper, we fuse Mamba and the attention mechanism in a graph neural network through a gate mechanism to encode HD map data and historical trajectory information.

2.2. Two-Stage Motion Forecasting

Inspired by the refinement networks (Carion et al., 2020; Ren et al., 2016) in computer vision, refinement strategies have recently been applied in motion forecasting. This framework typically involves a proposal stage, where multiple candidate trajectories are generated, followed by a refinement stage, where these proposals are optimized based on the context. QCNet (Zhou et al., 2023) employs a two-stage approach to improve efficiency and accuracy. Specifically, it leverages a query-centric paradigm to forecast the trajectory in the proposal stage and predicts the offset in the refinement stage. HPNet (Tang et al., 2024) introduces a historical prediction attention module to encode the dynamic relation between successive predictions in the proposal stage; in the refinement stage, it encodes the prediction with a two-layer MLP and then recalculates the result in the same way. However, this does not produce better cooperation between the two stages. SmartRefine (Zhou et al., 2024) introduces a new framework for refinement and designs a quality score mechanism. Inspired by it, we design a scoring mechanism between the proposal and refinement stages, following HPNet (Tang et al., 2024).

3. Method

In this section, we first introduce the problem formulation for dynamic trajectory prediction in Section 3.1. To verify the performance of the modules we designed and to make our network easier to understand, we introduce the selected backbone network in Section 3.2. Then, we present our proposed Graph Attention Mamba Network and the quality scoring mechanism of the two-stage framework in Sections 3.3 and 3.4, respectively. Finally, we introduce the training objective and the loss function in Section 3.5.

3.1. Problem Formulation

The goal of trajectory prediction is to predict the future paths of the agents of interest from their past movements. Given a fixed-length sequence of historical status frames $\{f_{-T+1}, f_{-T+2}, \dots, f_{0}\}$, the task is to predict $K$ diverse possible trajectories for each of the $N$ agents:

$$L_{0} = \{L_{0,n,k}\}_{n \in [1,N],\, k \in [1,K]} \tag{1}$$

where $f_{t} = \{a_{t}^{1 \sim N}, M\}$, $a_{t}^{1 \sim N}$ represents the features of all agents in the scene at time $t$, $M$ denotes the HD map, which includes $N_{M}$ lane segments, and $k$ is the predicted trajectory index in the multi-modal prediction framework. Specifically, $a_{t}^{n} = \{p_{x}^{t,n}, p_{y}^{t,n}, θ^{t,n}, v_{x}^{t,n}, v_{y}^{t,n}, c_{a}^{t,n}\}$, where $(p_{x}^{t,n}, p_{y}^{t,n})$ is the location, $θ^{t,n}$ is the orientation, $(v_{x}^{t,n}, v_{y}^{t,n})$ is the speed and $c_{a}^{t,n}$ is the attribute, including agent type (ego vehicle, nearby vehicle, pedestrian) and agent length. Every trajectory includes future locations for the next $F$ time steps, and the model predicts the future trajectories of multiple agents simultaneously:

$$L_{0,n,k} = \{l_{1,n,k}, l_{2,n,k}, \dots, l_{F,n,k}\} \tag{2}$$

where $l_{i,n,k} \in \mathbb{R}^{2}$ represents the predicted position at time step $i$ of mode $k$ for agent $n$.

3.2. HPNet Backbone

Our work is based on the state-of-the-art approach HPNet (Tang et al., 2024). Following HPNet, the encoder uses two-layer MLPs to encode the features of agents and the HD map as embeddings:

$$E_{a}^{t,n} = \mathrm{MLP}_{ω}(v^{t,n}, φ^{t,n}, c_{a}^{t,n}) \tag{3}$$

$$E_{m} = \mathrm{MLP}_{β}(l_{m}, c_{m}) \tag{4}$$

where $l_{m}$ is the length of the lane segments, $E_{m} \in \mathbb{R}^{M \times D}$, $E_{a}^{t,n} \in \mathbb{R}^{D}$, $D$ is the hidden dimension and $c_{m}$ denotes the attributes of the lane segments. Specifically, $c_{m}$ includes lane heading, lane turn direction and whether the segment is an intersection. Because the reference frame has been changed to a local polar coordinate system with the agent’s orientation as the positive direction, the velocity is represented as $(v^{t,n}, φ^{t,n})$.

Each agent at each time step and each lane segment is treated as a node in the graph. The edge features are represented as $\{d_{e}, ϕ_{e}, ψ_{e}, δ_{e}\}$, where $d_{e}$ denotes the distance between the source and target nodes, $ϕ_{e}$ represents the orientation of the edge in the reference frame of the target node, $ψ_{e}$ is the relative orientation between the source and target nodes, and $δ_{e}$ corresponds to the time difference between them. The edge features are encoded into edge embeddings through a two-layer MLP:

$$E_{e} = \mathrm{MLP}_{γ}(d_{e}, ϕ_{e}, ψ_{e}, δ_{e}) \tag{5}$$

where $E_{e} \in \mathbb{R}^{Y \times D}$ and $Y$ is the number of edges. The subscripts $ω$, $β$ and $γ$ of $\mathrm{MLP}$ indicate that the three encoders are trained independently.
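The raw inputs to these encoders can be illustrated with a short sketch. It computes the polar velocity $(v, φ)$ of an agent in its own frame and the four edge features $(d_{e}, ϕ_{e}, ψ_{e}, δ_{e})$ between two nodes. The function names, the dictionary layout of a node, and the sign conventions (source minus target for angles and time) are assumptions made for illustration; the text does not specify them.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def polar_velocity(vx, vy, theta):
    """Return (v, phi): speed and velocity direction relative to heading theta.

    Assumption: phi is measured from the agent's own orientation.
    """
    speed = math.hypot(vx, vy)
    phi = wrap_angle(math.atan2(vy, vx) - theta)
    return speed, phi


def edge_features(src, dst):
    """Return (d_e, phi_e, psi_e, delta_e) for an edge from src to dst.

    Each node is a dict with keys x, y, theta, t (hypothetical layout).
    """
    dx, dy = src["x"] - dst["x"], src["y"] - dst["y"]
    d_e = math.hypot(dx, dy)                               # source-target distance
    phi_e = wrap_angle(math.atan2(dy, dx) - dst["theta"])  # edge direction in target frame
    psi_e = wrap_angle(src["theta"] - dst["theta"])        # relative orientation
    delta_e = src["t"] - dst["t"]                          # time difference
    return d_e, phi_e, psi_e, delta_e
```

Expressing every quantity relative to the target node keeps the features invariant to the global position and heading of the scene, which is the motivation for the local reference frame described above.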

The output embeddings from the encoder are used as the input of the backbone. The backbone network contains three main modules driven by our proposed module, namely Agent GAM, Historical Prediction GAM and Mode GAM. Agent GAM first takes the prediction embeddings as input to model the interactions among agents:

$$P_{t,n,k} = \mathrm{HP}(E_{m}, E_{e}, E_{a}^{t,n}) \tag{6}$$

where the function $\mathrm{HP}$ denotes the preliminary processing method of HPNet (Tang et al., 2024).

Historical Prediction GAM then takes the output of Agent GAM as input to model the correlation between historical predictions and the current forecast. Finally, the results of the previous modules are fed into Mode GAM, which models interactions among different future trajectory modes. The three modules are repeated $N_{rep} = 2$ times. To further model the sequence relationships, a Mamba block is employed at the end of the three modules.

3.3. Graph Attention Mamba Module with Gate Mechanism

An overview of our method is shown in Figure 1. Our proposed module is applied in the backbone and is designed to enhance the feature extraction and prediction capabilities of the network.

Figure 1.

Overview of GAMDTP. The Encoder Processes Raw Input Features Such as HD Map and Agent Trajectory Information. Our Proposed Graph Attention Mamba Module is Applied in the Components Agent GAM, Historical Prediction GAM and Mode GAM, Which Extract Spatio-Temporal Features. The Decoder Generates the Final Predicted Trajectories and Probabilities, and the Score Decoder Further Evaluates and Prioritizes Trajectory Candidates for Refinement by Generating a Score for Each Result, Ensuring Accurate and Reliable Predictions.

3.3.1. Graph Attention Block

As illustrated in Figure 2(b), we use a standard Graph Attention layer as the GAT block. Graph Attention (Velickovic et al., 2017) uses an attention mechanism to learn the importance of each neighboring node to the current node. This makes the message-passing process focus on the most relevant nodes when making predictions. The edge features are concatenated with the neighboring ($nei$) and current ($cur$) node features when computing the attention coefficients:

$$α_{cur,nei} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{T}[W X_{cur} ‖ W X_{nei} ‖ W E_{cur,nei}]\right)\right)}{\sum_{all \in N_{cur}} \exp\left(\mathrm{LeakyReLU}\left(a^{T}[W X_{cur} ‖ W X_{all} ‖ W E_{cur,all}]\right)\right)} \tag{7}$$

where $T$ represents transposition, $‖$ is concatenation, $a$ is a learnable parameter vector that produces the attention coefficients, $W$ is a shared linear transformation applied to each node, $X$ denotes node features, $E$ denotes edge features, $nei$ is a neighboring node of the current node $cur$, and $all$ ranges over all neighboring nodes $N_{cur}$ of the current node $cur$. After $α_{cur,nei}$ has been computed for all neighboring nodes, the features of the current node $cur$ are updated by the weighted sum of its neighboring node features. To simplify notation, we use the function $\mathrm{GAT}(\cdot)$ to denote the GAT block in the following.
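The attention computation in Eq. (7) can be sketched for a single node in plain Python. This is a minimal illustration, not the implementation used in the paper: it assumes that edge features have the same dimension as node features so that the shared $W$ can be applied to both, and the names `attention_coefficients` and `aggregate` are hypothetical.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU with a configurable negative slope."""
    return x if x > 0 else slope * x


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def attention_coefficients(x_cur, neighbors, edges, W, a, slope=0.2):
    """Softmax-normalized attention of one node over its neighbors, as in Eq. (7)."""
    h_cur = matvec(W, x_cur)
    logits = []
    for x_nei, e in zip(neighbors, edges):
        # [W X_cur || W X_nei || W E_cur,nei] as one concatenated list
        z = h_cur + matvec(W, x_nei) + matvec(W, e)
        logits.append(leaky_relu(sum(ai * zi for ai, zi in zip(a, z)), slope))
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [v / total for v in exps]


def aggregate(neighbors, alphas, W):
    """Update the current node as the attention-weighted sum of W X_nei."""
    out = [0.0] * len(W)
    for x_nei, alpha in zip(neighbors, alphas):
        for i, v in enumerate(matvec(W, x_nei)):
            out[i] += alpha * v
    return out
```

With an all-zero attention vector `a`, every neighbor receives the same weight, so the update reduces to a plain mean over neighbors; a learned `a` moves the weights toward the neighbors that matter most.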

Figure 2.

Our Proposed Graph Attention Mamba Module (a), Which Integrates a Mamba Block and a Normal Graph Attention Block (b). The Input Features Include Node Features and Edge Features, Which Are First Normalized Through a Layernorm (LN) Layer Before Being Processed by the Mamba and GAT Blocks. The Outputs from These Blocks are Fused Using a Gate Mechanism, Where the Sigmoid Function Dynamically Generates a Gate Signal G to Balance Their Contributions.

3.3.2. Mamba-ssm Block

As illustrated in Figure 2(a), we use a Mamba2 layer as the Mamba block. Here $x(j) \in \mathbb{R}^{2}$ and $y(j) \in \mathbb{R}^{2}$ denote the input and output of the Mamba block, respectively, and $h(j) \in \mathbb{R}^{2}$ is the hidden state:

$$h'(j) = A h(j) + B x(j) \tag{8}$$

$$y(j) = C h(j) \tag{9}$$

where $j$ indexes the $j$-th element in the sequence, $A$ is the state matrix that compresses all past information of the sequence, $B$ is the input matrix and $C$ is the output matrix. Collectively, $A h(j)$ captures the autonomous evolution of the current state, $B x(j)$ encodes the input-driven state modification, and $C h(j)$ governs the transformation from latent state to output. $A$ and $B$ are discretized with a step size $Δ$:

$$\bar{A} = \exp(Δ A), \quad \bar{B} = (Δ A)^{-1} (\exp(Δ A) - I)\, Δ B \tag{10}$$

To prevent $A$, $B$ and $C$ from being fixed constants that do not change with the input $x(j)$, Mamba introduces a selection mechanism that dynamically filters relevant information by conditioning the parameters $B$, $C$ and $Δ$ on the input $x(j)$. This enables context-aware modulation: $B$ regulates how strongly the current input $x(j)$ updates the hidden state $h(j)$, while $C$ governs the contribution of $h(j)$ to the output $y(j)$. The discretization step $Δ$ acts as a data-dependent gating factor, where larger values of $Δ$ shift the model’s attention toward the current input $x(j)$ rather than historical states, effectively controlling the retention or suppression of incoming information.

In our work, each node feature in the graph is a token of the input sequence, as mentioned above. To simplify notation, we use the function $\mathrm{Mam}(\cdot)$ to denote the Mamba block in the following.
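The effect of the step size $Δ$ in Eq. (10) can be seen in a scalar sketch of the discretized recurrence. This is an illustration under simplifying assumptions, not the Mamba2 layer itself: the state is one-dimensional, $A$ must be nonzero, and only $Δ$ varies per step, whereas Mamba also conditions $B$ and $C$ on the input. The function names are hypothetical.

```python
import math


def discretize(A, B, delta):
    """Zero-order-hold discretization of a scalar SSM, following Eq. (10).

    Requires A != 0 and delta != 0.
    """
    A_bar = math.exp(delta * A)
    B_bar = (A_bar - 1.0) / (delta * A) * (delta * B)
    return A_bar, B_bar


def ssm_scan(xs, A, B, C, deltas):
    """Run h_j = A_bar * h_{j-1} + B_bar * x_j and y_j = C * h_j over a sequence."""
    h, ys = 0.0, []
    for x, delta in zip(xs, deltas):
        A_bar, B_bar = discretize(A, B, delta)  # per-step delta acts as a gate
        h = A_bar * h + B_bar * x
        ys.append(C * h)
    return ys


# With A < 0, a large delta almost erases the previous state and passes the
# current input through, while a small delta keeps most of the history.
forgetful = ssm_scan([1.0, 0.0], A=-1.0, B=1.0, C=1.0, deltas=[10.0, 10.0])
retentive = ssm_scan([1.0, 0.0], A=-1.0, B=1.0, C=1.0, deltas=[0.01, 0.01])
```

In the `forgetful` run the state after the second step is close to zero, because the first input has been discarded; in the `retentive` run the second output is about 99% of the first, which matches the gating interpretation of $Δ$ given above.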

3.3.3. Graph Attention Mamba

The input graph node features and edge features are $P_{t,n,k}$ and $[P_{t,n',k}, E_{e}]$, respectively, where $n'$ ranges over all agents within a certain radius of the $n$-th agent at the same time step and mode. Node features are passed into both the Mamba block and the GAT block, while edge features pass only through the GAT block:

$$P_{t,n,k}^{M} = P_{t,n,k} + \mathrm{Mam}(\mathrm{LN}(P_{t,n,k})) \tag{11}$$

$$P_{t,n,k}^{A} = P_{t,n,k} + \mathrm{GAT}(\mathrm{LN}(P_{t,n,k}), \mathrm{LN}([P_{t,n',k}, E_{e}])) \tag{12}$$

where $P_{t,n,k}$ is the input sequence, $\mathrm{GAT}$ is the graph attention layer, $\mathrm{LN}$ denotes layer normalization, and $P_{t,n,k}^{M}$ and $P_{t,n,k}^{A}$ are the outputs of the Mamba block and the GAT block, respectively.

We also design a gate mechanism to fuse $P_{t,n,k}^{M}$ and $P_{t,n,k}^{A}$:

$$G_{t,n,k} = σ(F_{fc}(P_{t,n,k}^{M} + P_{t,n,k}^{A})) \tag{13}$$

$$P_{t,n,k}^{G} = P_{t,n,k} + G_{t,n,k} \cdot P_{t,n,k}^{A} + (1 - G_{t,n,k}) \cdot P_{t,n,k}^{M} \tag{14}$$

where $F_{fc}$ is a fully connected layer, $σ$ is the sigmoid function, and $P_{t,n,k}^{G}$ is the output of the whole module.
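The gated fusion of Eqs. (13) and (14) can be sketched for a single feature vector as follows. The sketch assumes that the gate is applied element-wise and that $F_{fc}$ maps a $D$-dimensional vector to a $D$-dimensional vector; the function names and the explicit weight and bias arguments are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def linear(W, b, x):
    """Fully connected layer: W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def gated_fusion(p, p_mamba, p_gat, W, b):
    """Fuse the Mamba and GAT branch outputs following Eqs. (13)-(14).

    Returns the fused feature and the gate so the balance can be inspected.
    """
    summed = [m + a for m, a in zip(p_mamba, p_gat)]
    gate = [sigmoid(z) for z in linear(W, b, summed)]   # Eq. (13)
    fused = [pi + g * a + (1.0 - g) * m                 # Eq. (14), element-wise
             for pi, g, a, m in zip(p, gate, p_gat, p_mamba)]
    return fused, gate
```

Because the gate lies in (0, 1) and is recomputed for every node, the mixture between the two branches can change from one agent or scene to another, in contrast to a fixed-weight sum.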

3.4. Quality Scoring Mechanism

To enhance the performance of the two-stage trajectory prediction framework, we introduce a scoring mechanism that evaluates the prediction quality of both the proposal and refinement stages. During training, the quality of a predicted trajectory can be assessed from the ground-truth trajectory $p_{gt}$ and the predicted trajectory $p_{out}$, inspired by SmartRefine (Zhou et al., 2024). Specifically, the final prediction of the two-stage framework is the sum of the outputs of the two stages; that is, the result of the refinement stage $Δp$ is a correction of the result of the proposal stage:

$$p_{out} = p + Δp \tag{15}$$

where $p$ is the result of the proposal stage. To simplify the following expressions, we use $p$ and $p_{out}$ instead of $\{L_{0}, L_{1}, \dots, L_{F}\}$.

In detail, let $d_{max}$ denote the maximum prediction error between the predicted result and the ground truth among all iterations, as explained in Algorithm 1. The quality score is the ratio of the absolute difference between the proposal-stage and refinement-stage errors to the absolute difference between $d_{max}$ and the refinement-stage error:

$$\mathrm{Dis}(p, p_{gt}) = | p - p_{gt} | \tag{16}$$

$$d_{p} = \mathrm{Dis}(p, p_{gt}), \quad d_{r} = \mathrm{Dis}(p_{out}, p_{gt}) \tag{17}$$

$$Q(d_{max}, d_{p}, d_{r}) = \frac{| d_{p} - d_{r} |}{| d_{max} - d_{r} | + ϵ} \tag{18}$$

$$q_{t,n} = \mathrm{MLP}_{μ}(Q(d_{max}, d_{p}, d_{r})) \tag{19}$$

where $d_{p}$ is the prediction error at the proposal stage, $d_{r}$ is the prediction error at the refinement stage, and $Q(\cdot)$ represents the score function. To keep the ratio well defined and differentiable when the denominator would otherwise vanish, we add a very small nonzero value $ϵ$ to the denominator.

To enable GAMDTP to predict the quality score, we use a Mamba2 layer to process the prediction embedding at the proposal stage. Subsequently, an MLP is employed to produce the quality score, as shown in Algorithm 1:

$${\hat{q}}_{t,n} = \mathrm{Mam}(\mathrm{LN}(q_{t,n})) \tag{20}$$

where ${\hat{q}}_{t,n}$ represents the predicted score.
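The labeled score of Eqs. (15) to (18) can be sketched for one trajectory as follows. The sketch assumes that the distance in Eq. (16) is reduced to a scalar by averaging the Euclidean distance over time steps, which the text does not specify; the learned mapping of Eq. (19) is omitted, and all function names are hypothetical.

```python
import math


def refine(proposal, offsets):
    """Eq. (15): add the refinement offsets to the proposal, point by point."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(proposal, offsets)]


def displacement(pred, gt):
    """Scalar error between two trajectories (assumed form of Dis in Eq. 16)."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def quality_score(d_max, d_p, d_r, eps=1e-6):
    """Eq. (18): gain of refinement relative to the remaining gap to d_max."""
    return abs(d_p - d_r) / (abs(d_max - d_r) + eps)


gt = [(0.0, 0.0), (2.0, 0.0)]
proposal = [(0.0, 1.0), (2.0, 1.0)]
final = refine(proposal, [(0.0, -0.5), (0.0, -0.5)])
d_p = displacement(proposal, gt)   # error of the proposal stage
d_r = displacement(final, gt)      # error after refinement
score = quality_score(d_max=1.5, d_p=d_p, d_r=d_r)
```

In this example the refinement halves the error from 1.0 to 0.5, and with a hypothetical $d_{max}$ of 1.5 the score is close to 0.5. The constant `eps` only matters when $d_{r}$ equals $d_{max}$, where it prevents a division by zero.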

3.5. Training Loss

To optimize the proposed model, we follow the winner-takes-all strategy, which ensures that the most relevant mode, based on the minimum endpoint displacement, is selected for optimization. Specifically, the $k_{t,n}$-th mode to be optimized is determined by minimizing the endpoint displacement between the predicted trajectories $\{L_{t,n,k}\}_{k \in [1,K]}$ and the ground-truth trajectory $P_{t,n}^{gt} = \{p_{t+1,n}^{gt}, p_{t+2,n}^{gt}, \dots, p_{t+F,n}^{gt}\}$:

$$k_{t,n} = \underset{k \in [1,K]}{\arg\min}\ \mathrm{Dis}(l_{t+F,n,k}, p_{t+F,n}^{gt}) \tag{21}$$

Then two Huber losses are employed to optimize the trajectories in the proposal and refinement stages:

$$L_{reg1}^{t,n} = L_{Huber}(L_{t,n,k_{t,n}}^{p}, P_{t,n}^{gt}) \tag{22}$$

$$L_{reg2}^{t,n} = L_{Huber}(L_{t,n,k_{t,n}}^{r}, P_{t,n}^{gt}) \tag{23}$$

where $L_{t,n,k_{t,n}}^{p}$ is the predicted result in the proposal stage and $L_{t,n,k_{t,n}}^{r}$ is the predicted result in the refinement stage.

The probabilities $α_{t,n,k}$ of the predicted trajectories are optimized using a cross-entropy loss:

$$L_{cls}^{t,n} = L_{CE}(\{α_{t,n,k}\}_{k \in [1,K]}, k_{t,n}) \tag{24}$$

For the quality scoring mechanism, we calculate the $ℓ_{1}$ loss between the predicted score ${\hat{q}}_{t,n}$ and the labeled score $q_{t,n}$:

$$L_{s} = ∥ 2 \times {\hat{q}}_{t,n} - q_{t,n} ∥_{1} \tag{25}$$

In summary, the final training objective combines the loss functions above:

$$L = \frac{1}{T N} \sum_{t=-T+1}^{0} \sum_{n=1}^{N} \left( L_{reg1}^{t,n} + L_{reg2}^{t,n} + L_{cls}^{t,n} + λ \cdot L_{s} \right) \tag{26}$$

where $λ$ is a hyper-parameter to balance the four loss terms.
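The per-sample part of the objective, that is, the term inside the double sum of Eq. (26), can be sketched in plain Python. Several details are assumptions for illustration: the winning mode is selected on the proposal-stage trajectories, the Huber loss uses a threshold of 1.0 and is averaged over coordinates, and the function names are hypothetical.

```python
import math


def huber(a, b, delta=1.0):
    """Huber loss for one scalar residual."""
    r = abs(a - b)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)


def trajectory_huber(traj, gt):
    """Mean Huber loss over all coordinates of a trajectory."""
    terms = [huber(c, g) for pt, gpt in zip(traj, gt) for c, g in zip(pt, gpt)]
    return sum(terms) / len(terms)


def winner_mode(trajs, gt):
    """Eq. (21): index of the mode whose endpoint is closest to the ground truth."""
    return min(range(len(trajs)), key=lambda k: math.dist(trajs[k][-1], gt[-1]))


def sample_loss(proposals, refined, probs, gt, q_hat, q, lam=1.0):
    """Loss for one (t, n) pair: L_reg1 + L_reg2 + L_cls + lam * L_s."""
    k = winner_mode(proposals, gt)             # assumption: chosen on proposals
    reg1 = trajectory_huber(proposals[k], gt)  # Eq. (22)
    reg2 = trajectory_huber(refined[k], gt)    # Eq. (23)
    cls = -math.log(probs[k])                  # Eq. (24), cross-entropy on the winner
    score = abs(2.0 * q_hat - q)               # Eq. (25), as written in the text
    return reg1 + reg2 + cls + lam * score
```

Only the winning mode receives a regression gradient, which keeps the remaining modes free to cover other plausible futures instead of collapsing onto the mean trajectory.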

4. Experiments

4.1. Datasets

To evaluate the performance of our model, we conduct experiments on the Argoverse and INTERACTION datasets.

Argoverse (Chang et al., 2019) is a widely used benchmark for motion forecasting and perception tasks in autonomous driving. It comprises 324,557 interesting vehicle trajectories extracted from over 1,000 driving hours in real-world scenarios. This rich dataset includes high-definition (HD) maps and recordings of sensor data, referred to as “log segments,” collected in two U.S. cities: Miami and Pittsburgh. These cities were chosen for their distinct urban driving challenges, including unique road geometries, local driving habits, and a variety of traffic conditions.

INTERACTION (Zhan et al., 2019) is a comprehensive resource designed to support research in autonomous driving, particularly in behavior-related areas such as motion prediction, behavior cloning, and behavior analysis. It offers a large-scale collection of naturalistic motions from various traffic participants, including vehicles and pedestrians, across a diverse set of highly interactive driving scenarios from different countries.

In summary, Argoverse excels in providing a large, diverse dataset with detailed sensor data and HD maps for a wide range of autonomous driving tasks, especially for training perception and motion forecasting models. INTERACTION is more specialized, focusing on capturing and analyzing complex interactions between different road users, which is particularly useful for behavior modeling and safety assessment in challenging driving scenarios.

4.2. Metrics

We utilized standard trajectory forecasting metrics, ensuring a comprehensive assessment across different prediction scenarios. These metrics include evaluations on both Argoverse (Chang et al., 2019) and INTERACTION (Zhan et al., 2019) datasets, capturing the accuracy, reliability, and multimodal capabilities of the predictions.

For the Argoverse dataset, we employ minimum Average Displacement Error (minADE) and minimum Final Displacement Error (minFDE) to measure the accuracy of trajectory predictions. Specifically, minADE computes the average $ℓ_{2}$-norm distance between the predicted trajectory and the ground truth across all time steps, while minFDE focuses on the $ℓ_{2}$-norm distance at the final trajectory point. To further assess reliability, we include the Miss Rate (MR), which calculates the proportion of scenarios in which none of the predicted trajectory endpoints lies within 2.0 meters of the ground-truth endpoint. Additionally, we employ the Brier minimum Final Displacement Error (b-minFDE), which extends minFDE by adding a confidence term $(1 - \hat{α})^{2}$, where $\hat{α}$ represents the predicted probability of the best trajectory. This metric combines endpoint accuracy with the model’s confidence, offering deeper insight into the reliability of its predictions.
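The marginal metrics can be sketched for a single agent as follows. This is an illustrative computation rather than the official evaluation code: each metric takes its own minimum over the $K$ modes, the miss rate counts a scenario as a miss when none of the $K$ endpoints lies within the threshold, and the function names are hypothetical.

```python
import math


def ade(traj, gt):
    """Average l2 distance over all time steps."""
    return sum(math.dist(p, g) for p, g in zip(traj, gt)) / len(gt)


def fde(traj, gt):
    """l2 distance at the final time step."""
    return math.dist(traj[-1], gt[-1])


def min_ade(trajs, gt):
    return min(ade(t, gt) for t in trajs)


def min_fde(trajs, gt):
    return min(fde(t, gt) for t in trajs)


def brier_min_fde(trajs, probs, gt):
    """minFDE plus (1 - probability of the best mode) squared."""
    fdes = [fde(t, gt) for t in trajs]
    best = min(range(len(fdes)), key=fdes.__getitem__)
    return fdes[best] + (1.0 - probs[best]) ** 2


def miss_rate(scenarios, threshold=2.0):
    """Fraction of (trajs, gt) scenarios whose best endpoint misses the threshold."""
    misses = sum(min_fde(trajs, gt) > threshold for trajs, gt in scenarios)
    return misses / len(scenarios)
```

The Brier term penalizes a model that finds the right endpoint but assigns it a low probability: a best mode with probability 0.6 adds 0.16 to minFDE, while a fully confident best mode adds nothing.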

For the INTERACTION (Zhan et al., 2019) dataset, we employ minJointADE, minJointFDE and Cross Collision Rate to evaluate the performance of joint trajectory prediction. MinJointADE measures the average $ℓ_{2}$-norm distance between the predicted and ground-truth trajectories of all agents, while minJointFDE evaluates the $ℓ_{2}$-norm distance at the final time step for all agents. To assess the model’s ability to capture multimodal outputs, we set $K = 6$ for both marginal and joint predictions.
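The difference between joint and marginal evaluation can be made concrete with a small sketch. It assumes that a joint mode assigns one trajectory to every agent, that errors are averaged over agents within a mode, and that the minimum is then taken over modes; the function names are hypothetical and the collision metric is not covered.

```python
import math


def joint_errors(scene_modes, gts):
    """Return one (jointADE, jointFDE) pair per joint mode.

    scene_modes[k][n] is the trajectory of agent n under joint mode k,
    and gts[n] is the ground-truth trajectory of agent n.
    """
    results = []
    for mode in scene_modes:
        ades = [sum(math.dist(p, g) for p, g in zip(traj, gt)) / len(gt)
                for traj, gt in zip(mode, gts)]
        fdes = [math.dist(traj[-1], gt[-1]) for traj, gt in zip(mode, gts)]
        results.append((sum(ades) / len(ades), sum(fdes) / len(fdes)))
    return results


def min_joint_ade_fde(scene_modes, gts):
    """Minimum over joint modes of the agent-averaged ADE and FDE."""
    errs = joint_errors(scene_modes, gts)
    return min(e[0] for e in errs), min(e[1] for e in errs)


gts = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (0.0, 1.0)]]
scene_modes = [
    [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (0.0, 3.0)]],  # agent 0 exact
    [[(0.0, 0.0), (2.0, 0.0)], [(0.0, 0.0), (0.0, 1.0)]],  # agent 1 exact
]
joint_ade, joint_fde = min_joint_ade_fde(scene_modes, gts)
```

In this example each agent is predicted exactly in one of the two modes, so a marginal evaluation that picks the best mode per agent would report zero error. The joint metrics are 0.25 and 0.5 instead, because all agents must share the same mode index.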

4.3. Comparison with State-of-the-Art

Results on Argoverse. The results for marginal trajectory prediction on the Argoverse (Chang et al., 2019) dataset are presented in Table 2; Table 1 summarizes the variables used in this work. Among single models, GAMDTP achieves the best minADE for $K = 6$ and remains competitive on the other metrics. Compared to HPNet (Tang et al., 2024), the second-best model on the Argoverse leaderboard, GAMDTP improves minADE from 0.7612 to 0.7603 for mode number $K = 6$ and MR from 0.5514 to 0.5509 for mode number $K = 1$. These improvements highlight the effectiveness of our model in accurately capturing whole-trajectory patterns across multiple modes. We attribute this to the ability of our dynamic gate mechanism to adaptively balance spatial and temporal features. We show some examples in Figure 3.

Table 1.
Definitions of Essential Variables in This Work.

Agent Variables	Embedding Variables
$(p_{x}^{t, n}, p_{y}^{t, n})$	Agent location	$E_{a}^{t, n}$	Agent embedding
$θ^{t, n}$	Orientation	$E_{m}$	HD map embedding
$(v_{x}^{t, n}, v_{y}^{t, n})$	Speed	$E_{e}$	Edge embedding
$c_{a}^{t, n}$	Attribute
Edge Variables	Prediction Variables
$d_{e}$	Source-target distance	$P_{t, n, k}$	Proposal embedding
$ϕ_{e}$	Edge orientation	$P_{t, n, k}^{M}$	Mamba output
$ψ_{e}$	Source-target orientation	$P_{t, n, k}^{A}$	GAT output
$δ_{e}$	Time difference	$P_{t, n, k}^{G}$	GAM output
Other Variables
$l_{i, n, k}$	Predicted position	$p$	Proposal prediction
$(v^{t, n}, φ^{t, n})$	Velocity	$p_{o u t}$	Final prediction
$q_{t, n}$	Prediction score	$Δ p$	Refinement prediction

Table 2.

Comparison of GAMDTP With the State of the Art Methods on the Argoverse Test Set. The b-minFDE is the Official Ranking Metric. For Each Metric, the Best Result is in Bold, the Second Best Result is Underlined.

Method	$\mathrm{b\text{-}minFDE}_{6}$ ↓	$\mathrm{minFDE}_{6}$ ↓	$\mathrm{minADE}_{6}$ ↓	$\mathrm{MR}_{6}$ ↓	$\mathrm{minFDE}_{1}$ ↓	$\mathrm{minADE}_{1}$ ↓	$\mathrm{MR}_{1}$ ↓
LaneGCN (Liang et al., 2020)	2.0539	1.3622	0.8703	0.1620	3.7624	1.7019	0.5877
mmTransformer (Wang et al., 2024)	2.0328	1.3383	0.8436	0.1540	4.0033	1.7737	0.6178
THOMAS (Gilles et al., 2021b)	1.9736	1.4388	0.9423	0.1038	3.5930	1.6686	0.5613
HOME+GOHOME (Gilles et al., 2021a)	1.8601	1.2919	0.8904	0.0846	3.6810	1.6986	0.5723
DenseTNT (Gu et al., 2021b)	1.9759	1.2815	0.8817	0.1258	3.6321	1.6791	0.5843
MultiModalTransformer (Huang et al., 2022)	1.9393	1.2905	0.8372	0.1429	3.9007	1.7350	0.6023
HiVT (Zhou et al., 2022)	1.8422	1.1693	0.7735	0.1267	3.5328	1.5984	0.5473
MultiPath++ (Varadarajan et al., 2022)	1.7932	1.2144	0.7897	0.1324	3.6141	1.6235	0.5645
HPNet(w/o ensemble) (Tang et al., 2024)	1.7375	1.0986	0.7612	0.1067	3.7632	1.7346	0.5514
GAMDTP(ours)	1.7690	1.1256	0.7603	0.1088	3.8807	1.7813	0.5509

Figure 3.

Comparison of Our GAMDTP With the Baseline.

However, GAMDTP is slightly worse than HPNet in b-minFDE, minFDE and the single-mode displacement metrics, which shows that it falls short in predicting trajectory endpoints and in single-mode accuracy. Specifically, the higher b-minFDE and the higher FDE for both 6 modes and 1 mode indicate that, while our method better captures overall motion trends, it pays a small penalty in final-position accuracy. We attribute this to Mamba-ssm’s bias toward global temporal coherence, which can slightly dilute the focus on local spatial constraints near trajectory endpoints.
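For reference, the endpoint-focused metrics discussed above can be computed as in the following minimal sketch. It assumes the standard Argoverse definitions: minFDE is the smallest final-position error over the K predicted modes, minADE is the average displacement error of the mode with the smallest endpoint error, and b-minFDE adds the Brier penalty $(1 - p)^{2}$ for the probability $p$ assigned to that mode. The function name and the toy trajectories are illustrative and are not taken from our implementation.

```python
import math


def displacement_metrics(modes, probs, gt):
    """Compute minFDE, minADE and b-minFDE for one agent.

    modes: K predicted trajectories, each a list of (x, y) points.
    probs: K mode probabilities.
    gt:    ground-truth trajectory as a list of (x, y) points.
    """
    def l2(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    # Final displacement error of every predicted mode.
    fde = [l2(traj[-1], gt[-1]) for traj in modes]
    # The "best" mode is the one with the smallest endpoint error.
    best = min(range(len(modes)), key=fde.__getitem__)
    # Average displacement error of the best mode over all time steps.
    ade = sum(l2(p, g) for p, g in zip(modes[best], gt)) / len(gt)
    return {
        "minFDE": fde[best],
        "minADE": ade,
        # Brier penalty: low confidence in the best mode is punished.
        "b-minFDE": fde[best] + (1.0 - probs[best]) ** 2,
    }


# Toy example with K = 2 modes and 3 time steps (hypothetical data).
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
modes = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)],
]
probs = [0.4, 0.6]
metrics = displacement_metrics(modes, probs, gt)
```

In this toy case the second mode is selected, so a model can obtain a low minADE while still being penalized in minFDE and b-minFDE if only its last point drifts.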

Additionally, to evaluate the inference speed and the training cost of our model, we ran inference speed tests on the Argoverse validation set and recorded the number of parameters during the training phase.

As shown in Table 3, GAMDTP infers faster than HPNet (67.37 ms versus 70.15 ms), which we attribute to the Mamba block being more parallel and computationally efficient. However, the improvement is limited because the model structure still contains the attention mechanism. Moreover, adding the Mamba branch increases the number of parameters (3.8M versus 3.6M) and therefore the training time cost.

Table 3.

Comparison of Inference Time and Parameters on Argoverse Validation Set. For Each Metric, the Best Result is in Bold, the Second Best Result is Underlined.

Method	Inference Time (ms)	Parameters
LaneGCN (Liang et al., 2020)	173	3,710K
DenseTNT (Gu et al., 2021b)	531	1,103K
HPNet (Tang et al., 2024)	70.15	3.6M
GAMDTP(ours)	67.37	3.8M

Results on INTERACTION. Table 4 presents the performance of our method on the INTERACTION (Zhan et al., 2019) multi-agent track, where we achieved state-of-the-art results. Our approach outperforms the first-ranked FJMP (Rowe et al., 2023) by a significant margin, with improvements of 0.0223 in minJointADE, 0.0923 in minJointFDE and 0.0379 in Cross Collision Rate (CCR). Compared with the backbone HPNet (Tang et al., 2024), CCR improves by about 0.4% (from 0.1480 to 0.1474), which reflects our stronger social interaction modeling. The graph attention branch effectively captures inter-agent dependencies, which is particularly crucial in INTERACTION’s dense merge scenarios. The minJointADE improvement indicates better joint motion consistency, which we attribute to Mamba-ssm’s global trajectory coherence modeling. This proves especially valuable in roundabouts and interactions where agents’ motions are highly correlated. These results demonstrate that our GAMDTP is both a simple and effective solution for joint trajectory prediction.

Table 4.

Comparison of GAMDTP With the State of the Art Methods on the INTERACTION Test Set. For Each Metric, the Best Result is in Bold, the Second Best Result is Underlined.

Method	minJointADE ↓	minJointFDE ↓	CCR ↓
THOMAS (Gilles et al., 2021b)	0.4164	0.9679	0.1791
DenseTNT (Gu et al., 2021b)	0.4195	1.1288	0.2240
Traj-MAE (Chen et al., 2023)	0.3066	0.9660	0.1831
HDGT (Jia et al., 2023)	0.3030	0.9580	0.1938
FJMP (Rowe et al., 2023)	0.2752	0.9218	0.1853
HPNet(w/o ensemble) (Tang et al., 2024)	0.2548	0.8231	0.1480
GAMDTP(ours)	0.2529	0.8295	0.1474

Although our model outperforms HPNet in minJointADE and CCR, its minJointFDE is 0.78% higher. The results on both Argoverse and INTERACTION suggest that GAMDTP has a subtle limitation in final-position accuracy, because Mamba-ssm’s temporal bias may marginally dilute spatial precision at trajectory endpoints.

Our model demonstrates superior performance on the INTERACTION dataset, and this advantage appears particularly pronounced in scenarios involving complex multi-agent interactions and safety-critical situations. This suggests that GAMDTP’s dynamic gating mechanism and quality scoring may be particularly effective at modeling the nuanced interaction patterns and collision avoidance behaviors prevalent in this dataset. However, the performance gap narrows in simpler, more predictable scenarios, indicating that our model’s added complexity may not always be justified when interaction complexity is low.
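The absolute and relative differences quoted for Table 4 can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the dictionaries copy the Table 4 rows for FJMP, HPNet (w/o ensemble) and GAMDTP, and the helper names are hypothetical.

```python
# Rows copied from Table 4 (INTERACTION test set).
fjmp = {"minJointADE": 0.2752, "minJointFDE": 0.9218, "CCR": 0.1853}
hpnet = {"minJointADE": 0.2548, "minJointFDE": 0.8231, "CCR": 0.1480}
gamdtp = {"minJointADE": 0.2529, "minJointFDE": 0.8295, "CCR": 0.1474}


def absolute_gain(baseline, ours):
    """Positive values mean our error is lower than the baseline's."""
    return {k: round(baseline[k] - ours[k], 4) for k in baseline}


def relative_change_pct(baseline, ours):
    """Signed relative change in percent; negative means lower (better)."""
    return {k: round(100.0 * (ours[k] - baseline[k]) / baseline[k], 2)
            for k in baseline}


gain_vs_fjmp = absolute_gain(fjmp, gamdtp)
# minJointADE 0.0223, minJointFDE 0.0923, CCR 0.0379

change_vs_hpnet = relative_change_pct(hpnet, gamdtp)
# minJointADE -0.75 %, minJointFDE +0.78 %, CCR -0.41 %
```

All metrics are lower-is-better, so the positive minJointFDE change against HPNet is the endpoint penalty discussed above, while the negative minJointADE and CCR changes are improvements.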

4.4. Ablation Study

To check the effectiveness of the key components in our model, we conduct a series of ablation experiments on the INTERACTION (Zhan et al., 2019) test set. Specifically, we evaluate the impact of the gate mechanism, the quality scoring mechanism and the number of stacked Mamba layers in the GAM module. The results are summarized in Table 5.

Table 5.
Ablation Study on INTERACTION Test Set.

Backbone	Gate	Score	1-layer	3-layers	5-layers	minJointADE ↓	minJointFDE ↓	CCR ↓	minJointMR ↓
✓			✓			0.2641	0.8610	0.1515	0.1717
✓	✓		✓			0.2543	0.8342	0.1473	0.1530
✓		✓	✓			0.2641	0.8614	0.1516	0.1700
✓	✓	✓	✓			0.2529	0.8295	0.1474	0.1525
✓	✓	✓		✓		0.2643	0.8614	0.1511	0.1722
✓	✓	✓			✓	0.2706	0.8687	0.1548	0.1665

Effect of gate mechanism. The gate mechanism significantly influences model performance by selectively integrating self-attention and Mamba outputs. Our analysis reveals that although the gate efficiently balances computational load and feature representation, it employs a relatively simple fusion strategy. As shown in Table 5, removing the gate mechanism leads to a noticeable drop in performance, increasing minJointADE from 0.2529 to 0.2641, minJointFDE from 0.8295 to 0.8614 and the Cross Collision Rate (CCR) from 0.1474 to 0.1516. These results highlight the importance of dynamically balancing the contributions of GAT and Mamba-SSM for effective feature extraction.
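To make the idea of gated fusion concrete, the following is a minimal sketch of a generic element-wise sigmoid gate that blends two feature vectors. It is an assumption-laden illustration rather than our exact formulation: the gate here is a linear map over the concatenated inputs followed by a sigmoid, and the names `gated_fusion`, `gate_weights` and `gate_bias` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(attn_out, mamba_out, gate_weights, gate_bias):
    """Blend two equally sized feature vectors with a learned gate.

    gate_weights has one row per output feature; each row has length
    2 * d and is applied to the concatenation [attn_out, mamba_out].
    """
    concat = list(attn_out) + list(mamba_out)
    fused = []
    for i, (a, m) in enumerate(zip(attn_out, mamba_out)):
        logit = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(logit)  # g in (0, 1): share taken from the attention branch
        fused.append(g * a + (1.0 - g) * m)
    return fused


# With zero weights and bias the gate is 0.5, so the output is the mean
# of the two branches (hypothetical two-dimensional features).
attn_out = [1.0, 2.0]
mamba_out = [3.0, 0.0]
zero_weights = [[0.0] * 4 for _ in range(2)]
fused = gated_fusion(attn_out, mamba_out, zero_weights, [0.0, 0.0])
```

Removing the gate in such a scheme amounts to fixing the blend in advance, which is one way to read the ablation row without the gate.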

Effect of score mechanism. The quality scoring mechanism evaluates the reliability of trajectory proposals and guides the refinement process. To evaluate its impact, we compare the model with and without this mechanism. Removing the scoring mechanism increases minJointADE by 0.0014 and minJointFDE by 0.0047, while CCR is essentially unchanged (0.1473 without the mechanism versus 0.1474 with it). Although the gain is not as pronounced as that of the other modules, our repeated experiments (see Section 4.5) show that the quality scoring mechanism enhances the refinement process by prioritizing reliable trajectories.

Effect of different numbers of Mamba layers. We investigate the impact of varying the number of Mamba layers in the GAM module. Specifically, we test configurations with 1, 3 and 5 layers; all variants share the same graph attention structure for a fair comparison. The results in Table 5 indicate that more Mamba layers degrade performance: minJointADE increases from 0.2529 to 0.2643 with 3 layers and 0.2706 with 5 layers, minJointFDE increases from 0.8295 to 0.8614 and 0.8687, and CCR increases from 0.1474 to 0.1511 and 0.1548. This is likely because additional layers introduce computational redundancy and make convergence more difficult. That is, a single Mamba layer already captures sufficient temporal dynamics for most traffic scenarios, and additional layers introduce noise rather than useful features, as evidenced by the CCR increase. To balance overall performance and computational efficiency, we use 1 Mamba layer in the GAM module of our GAMDTP network.

4.5. Implementation Details

Our GAMDTP is trained on 1 RTX A6000 GPU for 64 epochs, using the AdamW (Merity et al., 2017) optimizer with a batch size of 4, a dropout rate of 0.1, and a weight decay of $1 \times 10^{-4}$. The initial learning rate is $5 \times 10^{-4}$ for Argoverse and $3 \times 10^{-4}$ for INTERACTION, decayed with a cosine annealing scheduler. The results reported in the ablation and inference speed experiments are obtained by training 5 models for each setting and averaging the results of their separate inference runs on the test set.

On Argoverse, we set a radius of 50 (50 meters in the real world) for all local areas as the interaction field. On INTERACTION, the radius is 80.
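As an illustration of the learning-rate decay, the sketch below evaluates the standard cosine annealing formula with the hyperparameters listed above. Several details are assumptions made only for this example: the rate is stepped once per epoch, the annealing horizon equals the 64 training epochs, the minimum rate is 0, and there is no warm-up. The dictionary keys and the function name are hypothetical.

```python
import math

# Hyperparameters as stated in the text.
config = {
    "epochs": 64,
    "batch_size": 4,
    "dropout": 0.1,
    "weight_decay": 1e-4,
    "base_lr": {"argoverse": 5e-4, "interaction": 3e-4},
}


def cosine_annealed_lr(base_lr, epoch, total_epochs, min_lr=0.0):
    """Standard cosine annealing: base_lr at epoch 0, min_lr at total_epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


# Learning rate at the start of every epoch for the Argoverse setting.
schedule = [
    cosine_annealed_lr(config["base_lr"]["argoverse"], e, config["epochs"])
    for e in range(config["epochs"] + 1)
]
```

Under these assumptions the rate starts at the initial value, passes through half of it at the midpoint of training, and approaches zero at the final epoch.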
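The interaction field can be pictured as a simple distance filter around each agent. The following sketch assumes a Euclidean distance in the map frame and an inclusive boundary; the function name, the agent identifiers and the coordinates are hypothetical and serve only to show the effect of the two radii.

```python
import math


def agents_within_radius(center, agents, radius):
    """Return the ids of agents whose distance to `center` is at most `radius`."""
    cx, cy = center
    return [
        agent_id
        for agent_id, (x, y) in agents.items()
        if math.hypot(x - cx, y - cy) <= radius
    ]


# Hypothetical agent positions relative to a focal agent at the origin.
agents = {"a": (10.0, 0.0), "b": (30.0, 30.0), "c": (60.0, 0.0), "d": (0.0, 79.0)}

argoverse_field = agents_within_radius((0.0, 0.0), agents, 50.0)
interaction_field = agents_within_radius((0.0, 0.0), agents, 80.0)
```

With the radius of 50 only the two nearest agents are kept, whereas the radius of 80 keeps all four, so a larger radius enlarges the set of candidate interactions at the cost of a larger graph.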

5. Conclusion

In this paper, we introduced GAMDTP, a novel framework for accurate and efficient trajectory forecasting in autonomous driving scenarios. By integrating Mamba-SSM and Graph Attention Networks (GAT) through a dynamic gating mechanism, our model effectively captures interaction features, ensures security and also accelerates inference. To further enhance the two-stage trajectory prediction framework, we designed a Quality Scoring Mechanism, which evaluates trajectory proposals and prioritizes high-quality candidates during refinement. Our experimental results on the Argoverse and INTERACTION datasets demonstrate that GAMDTP achieves state-of-the-art performance.

Despite its strengths, our model exhibits certain limitations. It occasionally underperforms the baseline method, particularly in predicting trajectory endpoints and in single-mode prediction. In addition, the integration of Mamba and GAT, while effective, introduces higher computational costs than simpler architectures. Specifically, the dynamic gating mechanism increases the number of parameters relative to baseline models. This trade-off is justified by the improved accuracy but may limit deployment on resource-constrained edge devices.

In summary, GAMDTP offers a scalable and reliable solution for dynamic trajectory forecasting, advancing the capabilities of autonomous driving systems. In the near future, we will focus on addressing these identified limitations. Exploring advanced gating techniques, such as attention-based gating or mixture-of-experts frameworks, could potentially enhance the model’s performance in dense interaction scenarios. Additionally, an adaptive mechanism that dynamically manages computational resources based on agent density and interaction complexity could further improve both prediction accuracy and inference speed. Finally, expanding the model’s evaluation across diverse datasets with varying agent densities and interaction complexities would validate its robustness and practical applicability in real-world autonomous driving systems.

Footnotes

ORCID iD

Hongkuo Niu

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Alahi, Goel, Ramanathan, Robicquet, Fei-Fei, Savarese (2016). Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 961–971).

2. Azadani, M. N., Boukerche (2023). Stag: A novel interaction-aware path prediction method based on spatio-temporal attention graphs for connected automated vehicles. Ad Hoc Networks, 138, 103021.

3. Carion, Massa, Synnaeve, Usunier, Kirillov, Zagoruyko (2020). End-to-end object detection with transformers. In European conference on computer vision (pp. 213–229). Springer.

4. Chai, Sapp, Bansal, Anguelov (2019). Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449.

5. Chang, M. F., Lambert, Sangkloy, Singh, Bak, Hartnett, Wang, Carr, Lucey, Ramanan, Hays (2019). Argoverse: 3D tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8748–8757).

6. Chen, Wang, Shao, Liu, Hao, Guan, Chen, Heng, P. A. (2023). Traj-MAE: Masked autoencoders for trajectory prediction. In Proceedings of the IEEE/CVF International conference on computer vision (pp. 8351–8362).

7. Chen, Zhang, Zhao, Tan, Yang (2022). Intention-aware vehicle trajectory prediction based on spatial-temporal dynamic attention network for internet of vehicles. IEEE Transactions on Intelligent Transportation Systems, 23(10), 19471–19483.

8. Dao (2024). Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060.

9. Deo, Trivedi, M. M. (2018). Convolutional social pooling for vehicle trajectory prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 1468–1476).

10. Ding, Luong, K. D., Rodriguez, da Silva, A. C. A. L., Hsu (2024). Combining graph neural network and mamba to capture local and global tissue spatial relationships in whole slide images. arXiv preprint arXiv:2406.04377.

11. Gao, Sun, Zhao, Shen, Anguelov, Schmid (2020). VectorNet: Encoding HD maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11525–11533).

12. Gilles, Sabatini, Tsishkou, Stanciulescu, Moutarde (2021a). HOME: Heatmap output for future motion estimation. In 2021 IEEE International intelligent transportation systems conference (ITSC) (pp. 500–507). IEEE.

13. Gilles, Sabatini, Tsishkou, Stanciulescu, Moutarde (2021b). THOMAS: Trajectory heatmap output with learned multi-agent sampling. arXiv preprint arXiv:2110.06607.

14. Dao (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

15. Goel, Ré (2021a). Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396.

16. Sun, Zhao (2021b). DenseTNT: End-to-end trajectory prediction from dense goal sets. In Proceedings of the IEEE/CVF International conference on computer vision (pp. 15303–15312).

17. Han, Tang, Wang, Yang, Guo, Wang (2024). DenseMamba: State space models with dense hidden connection for efficient large language models. arXiv preprint arXiv:2403.00818.

18. Hochreiter (1997). Long short-term memory. Munich, Germany: Neural Computation MIT-Press.

19. Hou, S. E., Yang, Wang, Nakano (2022). Structural transformer improves speed-accuracy trade-off in interactive trajectory prediction of multiple surrounding vehicles. IEEE Transactions on Intelligent Transportation Systems, 23(12), 24778–24790.

20. Huang (2022). Multi-modal motion prediction with transformer-based neural network for autonomous driving. In 2022 International conference on robotics and automation (ICRA) (pp. 2605–2611). IEEE.

21. Jia, Chen, Liu, Yan (2023). HDGT: Heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11), 13860–13875.

22. Lee, Choi, Vernaza, Choy, C. B., Torr, P. H., Chandraker (2017). DESIRE: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 336–345).

23. Wang, Qiao (2025). VideoMamba: State space model for efficient video understanding. In European conference on computer vision (pp. 237–255). Springer.

24. Wang, Zuo (2023). Interaction-aware prediction for cut-in trajectories with limited observable neighboring vehicles. IEEE Transactions on Intelligent Vehicles, 8(3), 2148–2161.

25. Liang, Yang, Chen, Liao, Feng, Urtasun (2020). Learning lane graph representations for motion forecasting. In Computer Vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16 (pp. 541–556). Springer.

26. Lieber, Lenz, Bata, Cohen, Osin, Dalmedigos, Safahi, Meirom, Belinkov, Shalev-Shwartz, Abend, Alon, Asida, Bergman, Glozman, Gokhman, Manevich, Ratner, Rozen, Shwartz, ... Shoham (2024). Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887.

27. Liu, Meidani (2024). End-to-end heterogeneous graph neural networks for traffic assignment. Transportation Research Part C: Emerging Technologies, 165, 104695.

28. Liu, Meidani (2025). Multi-class traffic assignment using multi-view heterogeneous graph attention networks. arXiv preprint arXiv:2501.09117.

29. Liu, Zhang, Fang, Jiang, Zhou (2021). Multimodal motion prediction with stacked transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7577–7586).

30. Merity, Xiong, Bradbury, Socher (2017). In Proceedings of the international conference on learning representations.

31. Ngiam, Caine, Vasudevan, Zhang, Chiang, H. T. L., Ling, Roelofs, Bewley, Liu, Venugopal, Weiss, Sapp, Chen, Shlens (2021). Scene transformer: A unified architecture for predicting multiple agent trajectories. arXiv preprint arXiv:2106.08417.

32. Phan-Minh, Grigore, E. C., Boulton, F. A., Beijbom, Wolff, E. M. (2020). CoverNet: Multimodal behavior prediction using trajectory sets. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14074–14083).

33. Ren, Girshick, Sun (2016). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149.

34. Rowe, Ethier, Dykhne, E. H., Czarnecki (2023). FJMP: Factorized joint multi-agent motion prediction over learned directed acyclic interaction graphs. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13745–13755).

35. Schmidt, R. M. (2019). Recurrent neural networks (RNNs): A gentle introduction and overview. arXiv preprint arXiv:1912.05911.

36. Tang, Kan, Shan, Bai, Chen (2024). HPNet: Dynamic trajectory forecasting with historical prediction attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 15261–15270).

37. Team, Lenz, Arazi, Bergman, Manevich, Peleg, Aviram, Almagor, Fridman, Padnos, Gissin, Jannai, Muhlgay, Zimberg, Gerber, E. M., Dolev, Krakovsky, Safahi, Schwartz, Cohen, ... Shoham (2024). Jamba-1.5: Hybrid transformer-mamba models at scale. arXiv preprint arXiv:2408.12570.

38. Varadarajan, Hefny, Srivastava, Refaat, K. S., Nayakanti, Cornman, Chen, Douillard, Lam, C. P., Anguelov, Sapp (2022). Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction. In 2022 International conference on robotics and automation (ICRA) (pp. 7814–7821). IEEE.

39. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, A. N., Kaiser, Polosukhin (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

40. Velickovic, Cucurull, Casanova, Romero, Lio, Bengio (2017). Graph attention networks. Stat, 1050(20), 10–48550.

41. Wang, Tang, Zeng, Pan, Dai, Han (2024). Mm-transformer: A transformer-based knowledge graph link prediction model that fuses multimodal features. Symmetry, 16(8), 961.

42. Wang, Zhao, Zhang, Cheng, Yang (2020). Multi-vehicle collaborative learning for trajectory prediction with spatio-temporal tensor fusion. IEEE Transactions on Intelligent Transportation Systems, 23(1), 236–248.

43. Xing, Cao (2019). Personalized vehicle trajectory prediction based on joint time-series modeling for connected vehicles. IEEE Transactions on Vehicular Technology, 69(2), 1341–1352.

44. Zhan, Sun, Wang, Shi, Clausse, Naumann, Kummerle, Konigshof, Stiller, de La Fortelle, Tomizuka (2019). INTERACTION dataset: An international, adversarial and cooperative motion dataset in interactive driving scenarios with semantic maps. arXiv preprint arXiv:1910.03088.

45. Zhang, Feng (2022). Trajectory prediction for autonomous driving using spatial-temporal graph attention transformer. IEEE Transactions on Intelligent Transportation Systems, 23(11), 22343–22353.

46. Zhang, Liu, Reid, Hartley, Zhuang, Tang (2025). Motion mamba: Efficient and long sequence motion generation. In European conference on computer vision (pp. 265–282). Springer.

47. Zhou, Shao, Wang, Waslander, S. L., Liu (2024). SmartRefine: A scenario-adaptive refinement framework for efficient motion prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 15281–15290).

48. Zhou, Wang, Y. H., Huang, Y. K. (2023). Query-centric trajectory prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 17863–17873).

49. Zhou, Wang (2022). HiVT: Hierarchical vector transformer for multi-agent motion prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8823–8833).

50. Zhu, Liao, Zhang, Wang, Liu, Wang (2024). Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417.

51. Zyner, Worrall, Nebot (2019). Naturalistic driver intention and path prediction using recurrent neural networks. IEEE Transactions on Intelligent Transportation Systems, 21(4), 1584–1594.