PFNet: A Phase Fusion Network for Pedestrian Trajectory Prediction

Abstract

Pedestrian trajectory prediction plays a pivotal role in real-world applications such as autonomous driving, unmanned delivery, and intelligent surveillance. However, existing deep learning approaches still face critical challenges, including mode collapse and the generation of unrealistic trajectories in complex environments. To address these limitations, we propose Phase Fusion Network (PFNet), a novel trajectory prediction framework designed to enhance prediction accuracy in intricate digital media scenarios. PFNet introduces an innovative Graph Encoder (GE) that incorporates a probabilistic modeling strategy to better capture spatial features and pedestrian interactions. To mitigate mode collapse, a common limitation in GAN-based methods, PFNet employs a dual-discriminator mechanism that improves both the realism and diversity of predicted trajectories. Additionally, PFNet adopts a two-phase architecture, where the generation phase strengthens spatial representation and the prediction phase refines temporal consistency. Extensive experiments on standard benchmarks, including ETH, UCY, and the Stanford Drone datasets, demonstrate that PFNet consistently outperforms state-of-the-art methods in terms of both Average Displacement Error (ADE) and Final Displacement Error (FDE).

Keywords

Pedestrian trajectory prediction, phase fusion network, GAN, graph encoder, dual-discriminator

Introduction

With the rapid development of intelligent autonomous systems such as autonomous vehicles, unmanned aerial vehicles (UAVs), and delivery robots, the demand for advanced models capable of accurately perceiving, interpreting, and predicting human behavior has grown significantly. Among these challenges, pedestrian trajectory prediction is particularly critical, as it directly influences the safety, efficiency, and reliability of autonomous systems operating in dynamic and human-centered environments (Cai et al., 2021; Eiffert et al., 2020a; Li et al., 2020).

Early research in this field primarily employs simple motion models, which are only effective in scenarios with minimal pedestrian interaction. As research progresses, probabilistic models such as Hidden Markov Models and Gaussian Mixture Models are introduced to simulate more complex pedestrian behaviors (Lefèvre et al., 2011; Seeger, 2004). However, these methods rely heavily on hand-crafted features and primarily model local interactions, limiting their effectiveness in capturing the intricate dynamics of crowded or highly interactive environments (Liu et al., 2023).

Recent advances in deep learning, particularly the emergence of Generative Adversarial Networks (GANs), have significantly enhanced pedestrian trajectory prediction. GAN-based models have shown promise in modeling complex and dynamic trajectories, yielding substantial improvements in predictive performance (Kosaraju et al., 2019; Liu et al., 2023; Sadeghian et al., 2019). Despite these gains, traditional GANs suffer from notable limitations in this domain. They typically sample trajectories from fixed distributions—such as the Gaussian distribution—which constrains their ability to model the rich spatial-temporal dependencies inherent in real-world pedestrian behavior. As a result, these models often struggle to recognize behavioral patterns in complex environments and fail to capture nuanced pedestrian interactions. Moreover, conventional GANs that employ a single discriminator are prone to mode collapse, which limits the diversity and adaptability of the generated trajectories. This often leads to unrealistic or implausible predictions, including out-of-distribution (OOD) samples that compromise the model’s reliability (Dendorfer et al., 2021). To mitigate this, recent research has incorporated Graph Neural Networks (GNNs) to better model pedestrian interactions through graph structures. For instance, Shi et al. (2021) employed sparse graph convolution to capture interaction patterns, while Yu et al. (2020) integrated Graph Convolutional Networks (GCNs) into a Transformer architecture to extract both spatial and temporal dependencies. Other approaches, such as Social-BiGAT (Kosaraju et al., 2019) and Social-STGCNN (Mohamed et al., 2020), leverage GNNs to uncover the underlying topological structure of pedestrian dynamics. However, GNN-based models still face challenges in densely populated and highly dynamic scenes, as the extracted features may fail to comprehensively represent complex spatial relationships (Shi et al., 2021).

To effectively address these challenges, a novel framework, Phase Fusion Network (PFNet), is proposed to enhance the accuracy and robustness of pedestrian trajectory prediction, as shown in Figure 1. The framework consists of two core components: the Generation Phase and the Prediction Phase. In the Generation Phase, a GAN architecture is adopted, within which a Graph Encoder (GE) block is designed and a new probability distribution is introduced, allowing the generated features to more accurately capture the spatial characteristics of pedestrians and their interactions with one another. Moreover, to mitigate the mode collapse caused by a single discriminator, a dual-discriminator design is incorporated. By evaluating the generated data from multiple perspectives, the dual-discriminator mechanism further strengthens the model’s ability to represent complex environments and pedestrian interactions, which improves the quality and diversity of the generated trajectory features. While the Generation Phase enriches spatial feature data, achieving high-precision trajectory prediction requires effectively combining these features with temporal information. To this end, the Prediction Phase is introduced, which fuses the features generated in the Generation Phase and maps them to the channel dimension. Subsequently, further feature extraction is performed using architectures such as Convolutional Neural Networks (CNNs), ensuring that the prediction of pedestrian trajectories is more accurate and reliable. The main contributions of the paper are as follows:

A novel GE block is designed and integrated into the GAN architecture to better capture spatial interaction features in pedestrian trajectory data.

A specific probability distribution is proposed to effectively model the diversity of trajectory data and the topological characteristics of the graph structure, improving both the realism and variety of generated trajectory features.

A dual-discriminator mechanism is introduced to address the common issue of mode collapse in GAN models. By evaluating the quality of generated trajectory features from multiple perspectives, this approach enhances overall model performance.

A PFNet framework is proposed, consisting of a Generation Phase and a Prediction Phase, which collaboratively exploit spatial and temporal dependencies to achieve accurate and robust trajectory prediction.

Figure 1.

The framework of our proposed model PFNet.

Related Work

Graph Representation Learning

Graph Representation Learning provides a powerful framework for modeling graph-structured data. Among the various methods, Graph Convolutional Networks (GCNs) (Kipf & Welling, 2016a; Li et al., 2015) have emerged as a foundational approach, demonstrating strong performance across diverse domains, including physical system modeling (Battaglia et al., 2016; Li et al., 2018), healthcare (Liu et al., 2019), and recommendation systems (Fan et al., 2019). Building on their effectiveness, GE in this paper is designed following GCN principles. Despite their success, GCNs exhibit notable limitations. In deep architectures, repeated message passing across layers can lead to over-smoothing, where node representations become increasingly similar and lose discriminative power (Yang et al., 2022). Conversely, shallow GCNs often fail to capture sufficient structural information, resulting in limited expressive capacity.

To address these challenges, we propose an enhanced GCN-based architecture that decouples the standard GCN operations, mitigating the trade-off between network depth and representational power.

Generative Networks for Pedestrian Trajectory Prediction

Generative networks have gained increasing attention in trajectory prediction tasks due to their ability to produce multiple plausible outcomes and model the high uncertainty inherent in pedestrian motion. For example, Gu et al. (2022a) developed a diffusion model based on the Transformer architecture to capture temporal dependencies in trajectories. Kosaraju et al. (2019) proposed a GAN that integrates attention mechanisms with LSTM, enabling the generation of realistic trajectories that respect both social and physical constraints. Similarly, Sadeghian et al. (2019) introduced a GAN-based method that incorporates metric learning to structure the latent space, effectively capturing semantic context and trajectory geometry, which improves the quality of generated trajectories. Despite these advances, generative models continue to face challenges such as training instability and the generation of unrealistic trajectories.

To address these issues, we propose PFNet, which defines a more expressive probability distribution and introduces key components including a GE block and a dual-discriminator mechanism. In addition, the incorporation of a Prediction Phase allows for deeper fusion of spatial and temporal features, leading to significantly enhanced performance in trajectory prediction tasks.

Method

Trajectory prediction aims to forecast the future position coordinates of pedestrians. Given a sequence of observed positions over the time steps $\{1, 2, \dots, T_{obs}\}$, the objective is to predict their future positions at time steps $\{T_{obs} + 1, T_{obs} + 2, \dots, T_{pred}\}$.

This section presents a detailed description of the proposed PFNet framework. The Generation Phase is designed to produce feature representations enriched with graph-structured interaction information. These features are then passed to the Prediction Phase, which leverages both spatial and temporal cues to generate accurate predictions of future pedestrian trajectories.

Data Preprocessing

Since the original pedestrian trajectory data is extracted from video frames and lacks explicit feature annotations (Lv et al., 2023), preprocessing is required before the data can be fed into the model. Given that the Generation Phase involves graph representation learning, the raw data must first be transformed into a graph-structured format.

A temporal graph $G_{t} = (V_{t}, A_{t})$ is constructed to represent the positions of pedestrians and their pairwise spatial interactions at each time step. In this graph, $V_{t} \in R^{T \times N \times D}$ denotes the node feature matrix, where $T$ is the number of time steps, $N$ is the number of pedestrians, and $D$ is the dimensionality of the positional data. Each element of $V_{t}$ contains the position coordinates of a pedestrian at a specific time. The adjacency matrix $A_{t} \in R^{T \times N \times N}$ encodes spatial interactions between pedestrian pairs.

These interactions are quantified using the $L_{2}$ norm of positional differences between pairs of pedestrians, as defined by:

L_{2} = \sqrt{(x_{1} - x_{2})^{2} + (y_{1} - y_{2})^{2}}

(1)

where $(x_{1}, y_{1})$ and $(x_{2}, y_{2})$ represent the coordinates of pedestrians 1 and 2, respectively.

Based on Equation 1, each element $A_{t} (i, j)$ in the adjacency matrix is computed according to the following rule:

A_{t} (i, j) = \begin{cases} 1, & \text{if } i = j \\ \frac{1}{\sqrt{(x_{i} - x_{j})^{2} + (y_{i} - y_{j})^{2}}}, & \text{if } i \neq j \text{ and } L_{2} \neq 0 \\ 0, & \text{if } i \neq j \text{ and } L_{2} = 0 \end{cases}

(2)

To effectively process multi-time-step data and capture pedestrian interactions from a broader temporal perspective, a block-diagonal adjacency matrix $A \in R^{T N \times T N}$ is constructed, as shown in Equation 3. This matrix is formed by stacking the individual adjacency matrices $A_{t}$ along the diagonal, which preserves temporal independence while embedding the temporal sequence into a unified graph structure.

To maintain consistency with the block-diagonal matrix $A$ , the node feature matrix $V_{t} \in R^{T \times N \times D}$ is reshaped into $V \in R^{T N \times D}$ by concatenating the features across time steps. This ensures alignment between the node features and the unified graph structure. Based on this construction, a new graph $\hat{G} = (V, A)$ is defined, representing both spatial and temporal interactions among pedestrians throughout the observation period.

A = \begin{bmatrix} A_{0} & 0 & 0 & \dots & 0 \\ 0 & A_{1} & 0 & \dots & 0 \\ 0 & 0 & A_{2} & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & A_{T_{obs} - 1} \end{bmatrix}

(3)
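The construction in Equations 2 and 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`adjacency`, `block_diagonal`, `flatten_nodes`) and the toy coordinates are assumptions introduced here.

```python
import math

def adjacency(positions):
    """Eq. 2: inverse-distance weights for one time step.

    positions: list of (x, y) tuples, one per pedestrian.
    """
    n = len(positions)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                a[i][j] = 1.0
                continue
            dist = math.hypot(positions[i][0] - positions[j][0],
                              positions[i][1] - positions[j][1])
            # Coincident pedestrians (distance 0) get weight 0, as in Eq. 2.
            a[i][j] = 1.0 / dist if dist != 0 else 0.0
    return a

def block_diagonal(blocks):
    """Eq. 3: place the per-step N x N matrices on the diagonal of a TN x TN matrix."""
    n = len(blocks[0])
    size = len(blocks) * n
    big = [[0.0] * size for _ in range(size)]
    for t, block in enumerate(blocks):
        for i in range(n):
            for j in range(n):
                big[t * n + i][t * n + j] = block[i][j]
    return big

def flatten_nodes(frames):
    """Reshape V_t (T x N x D) into V (TN x D) by concatenating time steps."""
    return [list(p) for frame in frames for p in frame]

# Toy input (hypothetical): two pedestrians observed for two time steps.
frames = [[(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 2.0)]]
A = block_diagonal([adjacency(f) for f in frames])
V = flatten_nodes(frames)
```

Because the off-diagonal blocks stay zero, nodes from different time steps never exchange messages directly, which matches the temporal independence described above.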

Generation Phase

To address the challenges outlined in Section “Related Work”, a GAN architecture is proposed that integrates a dual-discriminator mechanism with graph representation learning. The core components of this architecture include a GE, a Generator ( $G$ ), and two discriminators ( $D_{v}$ and $D_{Q}$ ). The GE transforms graph-structured data into compact and expressive latent representations, while the generator and discriminators are trained through an adversarial learning framework. This setup enhances the realism and diversity of the generated data while improving model robustness and generalization capabilities. A probability distribution $P_{z}$ is also defined within this architecture. It serves both as the optimization target for the GE and as the input distribution for the generator. The overall structure of the Generation Phase is illustrated in Figure 2.

Figure 2.

The framework of Generation Phase.

During this phase, the adjacency matrix $A$ and node feature matrix $V$ are first input into the GE to generate their respective latent representations. Simultaneously, data sampled from $P_{z}$ is passed to the generator to produce synthetic features. These generated features, along with the latent representation of $V$ , are then fed into $D_{Q}$ for adversarial training. In parallel, the original feature matrix $V$ and the generator’s outputs are jointly input into $D_{v}$ to further guide the learning process through adversarial signals. To enhance the representational capacity of the GE, an additional optimization objective is applied by backpropagating the difference between the features learned by the GE and those derived from the original data. This encourages the GE to retain meaningful structural information.

Overall, the architectural design not only preserves the topological structure of the graph but also improves the quality of the generated representations, thereby providing more reliable and informative features for the Prediction Phase.

Graph Encoder

To address the limitations caused by either excessive or insufficient GCN layers, a Feature Mapping and Propagation (FMP) architecture is introduced. This design builds upon the theoretical foundation of GCN by decoupling its message-passing operations into distinct components, resulting in a Message Passing Layer (MPL) and a Feed-Forward Layer (FFL). Specifically, multiple FFLs are applied prior to the MPL to enable deep and non-linear modeling of raw node features, thereby capturing complex relationships within the data. The MPL then aggregates these features to extract topological structure information. The formulation of the FMP architecture is presented below.

According to Kipf and Welling (2016b), the standard GCN operation can be expressed as:

Z = f (V, A) = softmax (\hat{A} ReLU (\hat{A} V W^{(0)}) W^{(1)})

(4)

where $\hat{A} \in R^{N \times N}$ denotes the normalized adjacency matrix.

In GCN, message passing corresponds to the two multiplications involving $\hat{A}$ . Based on this, the MPL is formulated as:

V_{u}^{(l)} = \sum_{v \in N_{u} \cup \{u\}} \frac{1}{\sqrt{{\hat{d}}_{u} {\hat{d}}_{v}}} \cdot (A_{u v} + I_{u v}) \cdot V_{v}^{(l)}

(5)

where $l$ denotes the current layer, and ${\hat{d}}_{u}$ and ${\hat{d}}_{v}$ represent the degrees of nodes $u$ and $v$, including self-loops.

In this work, the FFL is implemented using a multilayer perceptron (MLP), defined as:

V_{v}^{(l)} = W^{(l)} (V_{v}^{(l - 1)})

(6)
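The decoupled design can be illustrated with a small pure-Python sketch: feed-forward layers transform node features first, and parameter-free propagation steps follow. The function names, the ReLU non-linearity, and the toy graph are assumptions made for illustration; the text does not specify them.

```python
import math

def normalize_adjacency(a):
    """Symmetric normalization of A + I, matching the weights in Eq. 5."""
    n = len(a)
    a_hat = [[a[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # degrees including self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(x, y):
    """Plain dense matrix product for lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def ffl(v, w):
    """Feed-Forward Layer: per-node linear map, then ReLU (the ReLU is assumed)."""
    return [[max(0.0, h) for h in row] for row in matmul(v, w)]

def fmp(v, a, weights, mpl_steps=2):
    """Apply every FFL first, then propagate mpl_steps times over the graph."""
    for w in weights:
        v = ffl(v, w)
    a_norm = normalize_adjacency(a)
    for _ in range(mpl_steps):
        v = matmul(a_norm, v)
    return v

# Toy graph (hypothetical): two connected nodes with scalar features.
out = fmp([[1.0], [3.0]], [[0.0, 1.0], [1.0, 0.0]], weights=[[[1.0]]])
```

Since the propagation steps carry no weights, their number can be changed without adding parameters, which is the depth versus expressiveness trade-off the decoupling is meant to relax.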

Greedy Energy Distance Sampling

Sampling directly from a normal distribution as the prior is often inappropriate, particularly when dealing with graph data that exhibit complex topological structures.

A commonly used alternative is Kernel Density Estimation (KDE), which models the distribution of node features in a non-parametric manner, as shown in Equation 7. KDE avoids strong assumptions about the distribution’s form and allows for flexible estimation of feature distributions. However, since KDE estimates the distribution solely based on $V$ , it fails to incorporate the graph’s structural information, potentially resulting in a distribution that does not accurately reflect the data’s topology.

P_{z} (z ∣ V) = \frac{1}{n b} \sum_{i = 1}^{n} K (\frac{z - V_{i}}{b})

(7)
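As a concrete reading of Equation 7, the following sketch evaluates a kernel density estimate for scalar features. The Gaussian kernel, the bandwidth value, and the sample values are assumptions chosen for illustration.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(z, samples, bandwidth):
    """Eq. 7 for scalar features: average of kernels centred on each sample."""
    n = len(samples)
    return sum(gaussian_kernel((z - v) / bandwidth) for v in samples) / (n * bandwidth)

# Toy feature values (hypothetical) and an arbitrary bandwidth.
samples = [0.0, 0.5, 2.0]
density_at_half = kde(0.5, samples, bandwidth=0.4)
```

The estimate depends only on the feature values in `samples`; nothing in it refers to the adjacency matrix, which is the limitation noted above.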

To address this issue, a new distribution $P_{z} (z ∣ V, A)$ is constructed as an approximation of $P_{z} (z ∣ V)$ , serving as the sampling distribution in this work. To implement this, a Greedy Energy Distance Sampling (GEDS) method is employed. GEDS selects a representative and diverse subset of samples that captures both feature and structural information by maximizing the energy distance between the selected subset and the full sample set.

The Energy Distance (ED) between two distributions $P$ and $Q$ is defined as:

\begin{aligned} {ED}^{2} (P, Q) = {} & 2 E_{x \sim P, y \sim Q} [‖ x - y ‖] \\ & - E_{x, x^{'} \sim P} [‖ x - x^{'} ‖] - E_{y, y^{'} \sim Q} [‖ y - y^{'} ‖] \end{aligned}

(8)

where $P$ and $Q$ denote the full sample set and the selected subset, respectively.

The GEDS procedure follows a greedy strategy and consists of the following steps:

Step 1: For each pair of nodes $i$ and $j$ , compute $D (i, j)$ as the shortest-path distance derived from the adjacency matrix $A$ , forming a metric matrix $D$ used to measure the energy distance.

Step 2: A subset $S$ of size $k$ is selected from the full sample set using a greedy algorithm that maximizes the ED value between $S$ and the full set. In each iteration, the sample that contributes the most to the increase in ED is added to $S$ , until $k$ samples are selected:

S = \underset{| S | = k}{\arg\max} \, {ED}^{2} (S)

(9)

Step 3: To reduce redundancy, Principal Component Analysis (PCA) is applied to the selected sub-feature matrix $Z = V [S]$ for dimensionality reduction. This ensures that the dominant feature information is preserved while improving the representativeness of the subset. Finally, the KDE method is applied to the reduced matrix to generate the final distribution $P_{z} (z ∣ V, A)$ .
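Steps 1 and 2 above can be sketched as follows. This is an illustrative reading rather than the authors' code: the function names are hypothetical, the expectations in Equation 8 are replaced by sample means under the metric matrix $D$, and how edge lengths are derived from the weights in $A$ (for example, as reciprocals) is left as an assumption, so the sketch takes a matrix of edge lengths directly. The PCA and KDE of Step 3 are omitted.

```python
import math

def shortest_paths(lengths):
    """Step 1: Floyd-Warshall over a matrix of edge lengths (math.inf = no edge)."""
    n = len(lengths)
    d = [row[:] for row in lengths]
    for i in range(n):
        d[i][i] = 0.0
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def mean_dist(d, xs, ys):
    """Mean pairwise distance between two index sets."""
    return sum(d[i][j] for i in xs for j in ys) / (len(xs) * len(ys))

def energy_distance_sq(d, full, subset):
    """Eq. 8 with expectations replaced by sample means under metric D."""
    return (2.0 * mean_dist(d, full, subset)
            - mean_dist(d, full, full)
            - mean_dist(d, subset, subset))

def geds_select(d, k):
    """Step 2: greedily add the node giving the largest ED^2 until k are chosen."""
    full = list(range(len(d)))
    chosen = []
    while len(chosen) < k:
        candidates = [c for c in full if c not in chosen]
        best = max(candidates,
                   key=lambda c: energy_distance_sq(d, full, chosen + [c]))
        chosen.append(best)
    return chosen

# Toy path graph 0 - 1 - 2 - 3 with unit edge lengths (hypothetical).
inf = math.inf
lengths = [[inf, 1.0, inf, inf],
           [1.0, inf, 1.0, inf],
           [inf, 1.0, inf, 1.0],
           [inf, inf, 1.0, inf]]
D = shortest_paths(lengths)
picked = geds_select(D, k=2)
```

Each greedy iteration re-evaluates the energy distance for every remaining candidate, so the cost grows quickly with the number of nodes; a practical implementation would likely update the three means incrementally.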

The effectiveness of GEDS in generating a topology-aware sampling distribution is further analyzed in the subsequent section.

Theoretical Analysis of GEDS

Principal Component Analysis (PCA) is applied to reduce the dimensionality of the selected feature matrix $Z$ , yielding a compact embedding $H_{S}$ . Assuming that the projected features $h_{i} \in H_{S}$ are independently and identically distributed, a structure-aware prior distribution $P_{z} (z ∣ V_{S}, A_{S})$ is estimated using Kernel Density Estimation (KDE), as defined below:

P_{z} (z ∣ V_{S}, A_{S}) = \frac{1}{k} \sum_{i = 1}^{k} K_{b} (z - h_{i}) = \frac{1}{k b} \sum_{i = 1}^{k} K (\frac{z - h_{i}}{b}),

(10)

where $K (\cdot)$ denotes the kernel function and $b$ is the bandwidth parameter.

As illustrated in Equation 11, GEDS progressively refines the sampling distribution, transitioning from a structure-agnostic prior to a fully structure-aware distribution:

P_{z} (z) \to P_{z} (z ∣ V) \to P_{z} (z ∣ V, A) \approx P_{z} (z ∣ V_{S}, A_{S})

(11)

This progression ensures that the learned prior more accurately captures the underlying structure of the graph data, thereby improving the alignment between the latent space and the true data manifold.

Adversarial Loss

The core of generative networks lies in the adversarial loss, which is primarily designed to enable adversarial training between the generator and the discriminator. Its goal is to minimize the discrepancy between the real distribution $v \sim P_{g} (v ∣ V)$ and the sampled latent distribution $z \sim P (z ∣ V, A)$ . To ensure the stability of the Generation Phase, the WGAN-GP training strategy is adopted, as defined in Equation 12:

\begin{aligned} L_{WGAN-GP} = {} & E_{v \sim P_{g}} [f_{D} (G E (v))] - E_{z \sim P} [f_{D} (z)] \\ & + λ E_{\hat{v} \sim P_{\hat{v}}} [{(‖ \nabla_{\hat{v}} f_{D} (\hat{v}) ‖_{2} - 1)}^{2}] \end{aligned}

(12)

where $f_{D} (\cdot)$ denotes the discriminator function, and $G E (\cdot)$ represents the graph encoder.
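To make the three terms of Equation 12 concrete, the sketch below evaluates them for a toy critic. Several choices are assumptions not stated in the text: $\hat{v}$ is drawn as a random interpolation between an encoded sample and a prior sample (the usual WGAN-GP convention), $λ = 10$ is the common default, and gradients are approximated by central differences so that the example needs no deep learning library.

```python
import math
import random

def grad_norm(f, x, eps=1e-6):
    """L2 norm of the gradient of a scalar critic f at x, by central differences."""
    total = 0.0
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        total += ((f(hi) - f(lo)) / (2.0 * eps)) ** 2
    return math.sqrt(total)

def wgan_gp_loss(f, encoded, sampled, lam=10.0, seed=0):
    """Eq. 12: mean f(GE(v)) - mean f(z) + lam * gradient penalty on interpolates."""
    rng = random.Random(seed)
    penalty = 0.0
    for q, z in zip(encoded, sampled):
        eps = rng.random()
        # Interpolated point v_hat between an encoded sample and a prior sample.
        mix = [eps * a + (1.0 - eps) * b for a, b in zip(q, z)]
        penalty += (grad_norm(f, mix) - 1.0) ** 2
    n = len(encoded)
    mean_q = sum(f(q) for q in encoded) / n
    mean_z = sum(f(z) for z in sampled) / n
    return mean_q - mean_z + lam * penalty / n

# Toy linear critic (hypothetical) whose gradient has unit norm everywhere.
critic = lambda x: 0.6 * x[0] + 0.8 * x[1]
encoded = [[1.0, 0.0], [0.0, 1.0]]   # stands in for GE(v)
sampled = [[0.0, 0.0], [1.0, 1.0]]   # stands in for z ~ P
loss = wgan_gp_loss(critic, encoded, sampled)
```

For this critic the penalty term vanishes because the gradient norm is exactly 1; a critic with a steeper or flatter slope is pushed back toward unit gradient norm by the last term.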

The loss functions for GE, discriminators $D_{Q}$ and $D_{v}$ , and the Generator $G$ , are defined as follows:

Adversarial Loss of $D_{Q}$ and $G E$

The latent representation of $V$ , generated by the GE and denoted as $Q \sim G E_{v \sim P_{g} (v ∣ V)} (v)$ , is treated as a negative sample. In contrast, the sampled latent variable $z \sim P (z ∣ V, A)$ serves as a positive sample. Both are input into $D_{Q}$ for adversarial training, as formulated below:

\begin{aligned} L_{D_{Q}} = {} & E_{Q \sim G E (v)} [D_{Q} (Q)] - E_{z \sim P} [D_{Q} (G (z))] \\ & + λ E_{\hat{v} \sim P_{\hat{v}}} [{(‖ \nabla_{\hat{v}} D_{Q} (\hat{v}) ‖_{2} - 1)}^{2}] \end{aligned}

(13)

Following the WGAN-GP formulation, the loss for GE is defined as:

L_{G E} = - E_{Q \sim G E (v)} [D_{Q} (Q)]

(14)

Adversarial Loss of $D_{v}$ and $G$

Similarly, the real feature data $v \sim P_{g} (v ∣ V)$ and generated data $G (z)$ , with $z \sim P (z ∣ V, A)$ , are input into the discriminator $D_{v}$ as positive and negative samples, respectively. The adversarial loss is computed as:

\begin{aligned} L_{D_{v}} = {} & E_{z \sim P} [D_{v} (G (z))] - E_{v \sim P_{g}} [D_{v} (v)] \\ & + λ E_{\hat{v} \sim P_{\hat{v}}} [{(‖ \nabla_{\hat{v}} D_{v} (\hat{v}) ‖_{2} - 1)}^{2}] \end{aligned}

(15)

The corresponding generator loss is:

L_{G} = - E_{z \sim P} [D_{v} (G (z))]

(16)

Loss of $G E$

As described in Section “Generation Phase”, to ensure the quality of the data generated by $G$ , an additional structure-aware loss is introduced for GE.

First, the latent representation of the adjacency matrix, denoted as $Q_{A}$ , is transformed according to the method proposed in Zheng et al. (2020), as shown in Equation 17:

A^{'} = sigmoid (Q_{A} \cdot Q_{A}^{T})

(17)

where $A^{'}$ denotes the reconstructed adjacency matrix.

Then, the differences between $A^{'}$ and the ground-truth adjacency matrix $A$ , as well as between the GE-generated features and the original features $V$ , are computed. These are optimized using cross-entropy loss to guide the encoder toward preserving both feature and structural information. The final loss function for the GE is:

L_{GEnc} = L_{GE} + α L_{total}

(18)

where $L_{total} = L_{feat} + α L_{Adj}$, and the components are defined as follows:

\begin{aligned} L_{feat} = {} & - \sum_{i = 1}^{N} \sum_{j = 1}^{D} [V_{i j} \log G_{Q \sim G E (V)} (Q_{i j}) \\ & + (1 - V_{i j}) \log (1 - G_{Q \sim G E (V)} (Q_{i j}))] \end{aligned}

(19)

L_{Adj} = - \sum_{i = 1}^{N} \sum_{j = 1}^{N} [a_{i j} \log (a_{i j}^{'}) + (1 - a_{i j}) \log (1 - a_{i j}^{'})]

(20)
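Equations 17 and 20 can be illustrated together: the adjacency matrix is reconstructed from the latent codes and compared with the target by binary cross-entropy. The sketch assumes the target entries lie in $[0, 1]$ (for example, after binarizing or rescaling $A$), which the text does not state; the function names and the clamping constant are likewise illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reconstruct_adjacency(q):
    """Eq. 17: A' = sigmoid(Q_A Q_A^T), with q holding one latent row per node."""
    n = len(q)
    return [[sigmoid(sum(a * b for a, b in zip(q[i], q[j]))) for j in range(n)]
            for i in range(n)]

def adjacency_bce(a, a_rec, tiny=1e-12):
    """Eq. 20: summed binary cross-entropy between A and its reconstruction."""
    loss = 0.0
    for row, row_rec in zip(a, a_rec):
        for t, p in zip(row, row_rec):
            p = min(max(p, tiny), 1.0 - tiny)  # clamp so the logarithms stay finite
            loss -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return loss

# Toy target (hypothetical): two nodes linked only to themselves.
target = [[1.0, 0.0], [0.0, 1.0]]
uninformative = adjacency_bce(target, reconstruct_adjacency([[0.0], [0.0]]))
aligned = adjacency_bce(target, reconstruct_adjacency([[3.0, 0.0], [0.0, 3.0]]))
```

Latent codes that are orthogonal for unlinked nodes and large for self-pairs give a lower loss than all-zero codes, which is the behaviour the structure-aware term rewards.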

Prediction Phase

Through adversarial learning in the Generation Phase, feature representations enriched with graph structural information, $H \in R^{T_{obs} \times N \times D}$ , are generated and then passed to the Prediction Phase. The architecture of the Prediction Phase comprises Spatial Convolution (SC) and Temporal Convolution (TC), which further extract spatial and temporal features from $H$ . The overall structure of the Prediction Phase is illustrated in Figure 3.

Figure 3.

The framework of Prediction Phase.

To facilitate spatial feature extraction, a transformation is first applied to $H$ along the feature dimension, resulting in a reshaped tensor $H \in R^{D \times T_{obs} \times N}$ . Subsequently, the processed tensor $V_{t}$ , after passing through multiple spatial convolutional layers, is concatenated with $H$ along the feature dimension.

During the extraction of features from $V_{t}$ , residual connections are employed to retain information from the original input and enhance training stability. This allows the model to effectively capture both fine-grained local details and global spatial patterns. The entire process is formulated as:

\hat{H} = Concat (W^{h} H, S C (V_{t}) + S C_{1} (V_{t}))

(21)

where $\hat{H} \in R^{2 \hat{D} \times T_{obs} \times N}$, and $S C_{1}$ denotes the first layer of the Spatial Convolution operation.

To model temporal dynamics and generate accurate future trajectories for each pedestrian, Temporal Convolution (TC) is applied to $\hat{H}$ . To further improve feature propagation and mitigate gradient vanishing, multiple residual connections are introduced. The process is expressed as:

\tilde{C} = T C (T C (\hat{H}) + T C_{1} (\hat{H})) + T C_{1} (\hat{H})

(22)

where $\tilde{C} \in R^{T_{pred} \times 2 \hat{D} \times N}$, and $T C_{1}$ denotes the first layer of the Temporal Convolution operation.

Loss Function

This formulation follows the assumptions established in Lv et al. (2023) and Eiffert et al. (2020b), where the predicted trajectory coordinates for each pedestrian, obtained from the Prediction Phase, are modeled as following a bivariate Gaussian distribution. Specifically, for the $n$ -th pedestrian at time step $t$ , the predicted coordinates ${\tilde{C}}_{t}^{n} = ({\tilde{x}}_{t}^{n}, {\tilde{y}}_{t}^{n})$ are assumed to follow the distribution $N (μ_{t, x}^{n}, μ_{t, y}^{n}, σ_{t, x}^{n}, σ_{t, y}^{n}, ρ_{t}^{n})$ .

Let $C_{t}^{n} = (x_{t}^{n}, y_{t}^{n})$ denote the ground-truth coordinates of pedestrian $n$ at time $t$ . The loss function for the Prediction Phase is derived via maximum likelihood estimation and is defined as:

L = - \sum_{t = T_{obs} + 1}^{T_{\,pred}} \log [P ((x_{t}^{n}, y_{t}^{n}) ∣ N (μ_{t}^{n}, σ_{t}^{n}, ρ_{t}^{n}))]

(23)
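The per-step term inside Equation 23 is the negative log-density of a bivariate Gaussian. A minimal sketch, with hypothetical function names and parameter ordering, is:

```python
import math

def bivariate_nll(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Negative log-density of a bivariate Gaussian evaluated at (x, y)."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho * rho
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / (2.0 * one_minus_rho2)
    log_norm = math.log(2.0 * math.pi * sigma_x * sigma_y * math.sqrt(one_minus_rho2))
    return quad + log_norm

def trajectory_nll(truth, params):
    """Eq. 23 for one pedestrian: sum the per-step terms over the prediction horizon.

    truth:  list of ground-truth (x, y) positions.
    params: list of (mu_x, mu_y, sigma_x, sigma_y, rho) tuples, one per step.
    """
    return sum(bivariate_nll(x, y, *p) for (x, y), p in zip(truth, params))

# Toy values (hypothetical): a perfect mean versus a mean that is off by one unit.
perfect = trajectory_nll([(1.0, 2.0)], [(1.0, 2.0, 1.0, 1.0, 0.0)])
offset = trajectory_nll([(1.0, 2.0)], [(0.0, 2.0, 1.0, 1.0, 0.0)])
```

In practice the network outputs would need to be constrained so that the standard deviations are positive and $|ρ| < 1$; how that is done is not specified in the text.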

Quantitative Evaluation

Experimental Setup

Evaluation experiments are conducted on publicly available benchmark datasets, including ETH (Pellegrini et al., 2010), UCY (Lerner et al., 2007), and the Stanford Drone Dataset (SDD) (Robicquet et al., 2016). The ETH and UCY datasets cover five distinct scenes: ETH, HOTEL, UNIV, ZARA1, and ZARA2, providing position trajectories for 1,536 pedestrians over multiple time steps. The SDD dataset consists of high-resolution aerial videos recorded above the Stanford campus, with densely annotated trajectories of pedestrians, cyclists, and vehicles navigating complex urban environments.

Following previous methods, the first 3.2 seconds ( $T_{obs} = 8$ frames) are used as the observation window, followed by a 4.8-second prediction horizon ( $T_{\,pred} = 12$ frames).

The Average Displacement Error (ADE) and Final Displacement Error (FDE) are used as evaluation metrics to quantitatively assess the performance of PFNet and other state-of-the-art (SOTA) models. The metrics are defined as follows:

A D E = \frac{1}{N T_{\,pred}} \sum_{n = 1}^{N} \sum_{t = 1}^{T_{\,pred}} {‖ {\tilde{C}}_{t}^{n} - C_{t}^{n} ‖}_{2}

(24)

F D E = \frac{1}{N} \sum_{n = 1}^{N} {‖ {\tilde{C}}_{T_{\,pred}}^{n} - C_{T_{\,pred}}^{n} ‖}_{2}

(25)
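Equations 24 and 25 translate directly into code. The sketch below assumes trajectories are stored as nested lists indexed by pedestrian and then by predicted time step; the function name and the toy values are illustrative.

```python
import math

def ade_fde(pred, truth):
    """Return (ADE, FDE) for predictions and ground truth of identical shape.

    pred, truth: list over pedestrians of lists of (x, y) over predicted steps.
    """
    n = len(pred)
    steps = len(pred[0])
    total, final = 0.0, 0.0
    for p_traj, t_traj in zip(pred, truth):
        errors = [math.hypot(px - tx, py - ty)
                  for (px, py), (tx, ty) in zip(p_traj, t_traj)]
        total += sum(errors)   # Eq. 24: all steps contribute to ADE
        final += errors[-1]    # Eq. 25: only the last step contributes to FDE
    return total / (n * steps), final / n

# Toy example (hypothetical): two pedestrians, two predicted steps each.
pred = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]]
truth = [[(0.0, 0.0), (1.0, 3.0)], [(3.0, 4.0), (0.0, 0.0)]]
ade, fde = ade_fde(pred, truth)
```

Multi-sample evaluation protocols, such as taking the best of several sampled trajectories, would wrap this function; whether such a protocol is used here is not stated in this section.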

In the PFNet model, the hyperparameter $α$ in the Generation Phase is set to 0.01. The FMP architecture consists of two layers each of the Feed-Forward Layer (FFL) and Message Passing Layer (MPL). The Generation and Prediction Phases are trained for 100 and 250 epochs, respectively. Learning rates for $D_{Q}$ , $D_{v}$ , $G E$ , and $G$ are uniformly set to 0.0001, while the learning rate for the Prediction Phase is set to 0.01. All experiments are conducted on an RTX 4090D GPU using the PyTorch 2.5 framework with CUDA 12.1.

Comparison with SOTA Methods

The following SOTA methods are selected as baselines for comparison:

Social-GAN (Liu et al., 2023): A pedestrian trajectory prediction method based on a recurrent sequence-to-sequence architecture that employs a novel pooling mechanism and a recurrent discriminator.

SR-LSTM (Zhang et al., 2019): A data-driven LSTM-based model that incorporates message passing and socially-aware information selection to predict pedestrian trajectories based on the intentions of nearby agents.

STAR (Yu et al., 2020): A spatiotemporal graph-transformer framework that integrates graph convolution and temporal attention mechanisms for trajectory prediction.

PECNet (Mangalam et al., 2020): A human trajectory prediction model that uses a non-local social pooling layer and a truncation strategy to enhance multimodal prediction.

Trajectron++ (Salzmann et al., 2020): A modular, graph-structured recurrent model for multi-agent trajectory forecasting, combining agent dynamics with contextual environmental information.

AgentFormer (Yuan et al., 2021): A Transformer-based multi-agent trajectory prediction model that jointly captures temporal and social dependencies through an agent-aware attention mechanism.

SGCN (Shi et al., 2021): A sparse graph convolutional network designed for pedestrian trajectory prediction that models directed interactions and motion trends using sparse spatiotemporal graphs.

MID (Gu et al., 2022b): A model that encodes motion uncertainty through a parameterized Markov chain and balances trajectory diversity and determinism via a Transformer-based diffusion process.

TUTR (Shi et al., 2023): A Transformer-based model that unifies trajectory components, social interactions, and multimodal prediction within a single framework.

BOsampler (Chen et al., 2023): A Bayesian optimization-based sampling method that models trajectory prediction as a Gaussian process and adaptively explores diverse plausible paths.

LED (Mao et al., 2023): A diffusion-based trajectory prediction framework featuring a learnable jump initializer to accelerate inference while producing accurate and diverse multimodal predictions in real time.

UTD-PTP (Tang et al., 2024): A transformer-based diffusion model designed for complex campus scenarios, leveraging digital twin environments to improve prediction accuracy and generalization.

LADM (Lv et al., 2024): A VAE-based diffusion model that incorporates pedestrian group relationships via a pedestrian–group interaction module (PGIM), enhancing multimodal trajectory prediction through a diffusion-based refinement process.

DTGAN (Xie et al., 2024): A GAN-based model for graph sequence data that automatically captures implicit social interactions using random-weighted graphs, achieving improved trajectory prediction without relying on predefined interaction rules.

Experiments on the ETH and UCY datasets are conducted to evaluate the performance of PFNet, with results summarized in Table 1. As shown, the proposed model achieves the lowest average ADE and FDE over the five benchmark subsets among the compared methods. Notably, compared to the early Social-GAN model, PFNet reduces the average ADE from 0.61 to 0.18 (about 70%) and the average FDE from 1.21 to 0.32 (about 74%).

Table 1.

Comparison with SOTA methods on the ETH/UCY datasets (ADE/FDE, lower is better). Bold indicates the best result in each column.

Method	ETH	HOTEL	UNIV	ZARA1	ZARA2	AVG
Social-GAN	0.87/1.62	0.67/1.37	0.76/1.52	0.35/0.68	0.42/0.84	0.61/1.21
DTGAN	0.68/1.43	0.30/0.52	0.51/1.07	0.31/0.67	0.28/0.59	0.42/0.86
SGCN	0.58/0.99	0.31/0.53	0.37/0.67	0.29/0.51	0.23/0.42	0.36/0.62
STAR	0.57/1.11	0.19/0.37	0.35/0.75	0.26/0.57	0.25/0.58	0.32/0.68
PECNet	0.61/1.07	0.22/0.39	0.34/0.56	0.25/0.45	0.19/0.33	0.32/0.56
SR-LSTM	0.43/0.65	0.24/0.42	0.38/0.53	0.28/0.43	0.24/0.32	0.31/0.47
Trajectron++	0.61/1.03	0.20/0.28	0.30/0.55	0.24/0.41	0.18/0.32	0.31/0.52
BOsampler	0.52/0.95	0.19/0.39	0.30/0.67	0.14/0.33	0.20/0.45	0.27/0.56
AgentFormer	0.46/0.80	0.14/0.22	0.25/0.45	0.18/0.30	0.14/0.24	0.23/0.40
TUTR	0.40/0.61	0.11/0.18	0.23/0.42	0.18/0.34	0.13/0.25	0.21/0.36
UTD-PTP	0.38/0.63	0.13/0.22	0.21/0.43	0.18/0.33	0.13/0.26	0.21/0.37
LADM	0.34/0.63	0.15/0.20	0.19/0.46	0.15/0.17	0.13/0.26	0.19/0.34
MID	0.33/0.62	0.21/0.23	0.20/0.45	0.14/0.27	0.14/0.24	0.20/0.36
LED	0.39/0.58	0.11/0.17	0.26/0.43	0.18/0.26	0.13/0.22	0.21/0.33
PFNet	0.32/0.58	0.13/0.12	0.22/0.42	0.13/0.24	0.12/0.23	0.18/0.32

This significant improvement can be attributed to the incorporation of the GE block into the GAN framework, which enhances the model’s ability to capture spatial interactions among pedestrians. Moreover, the dual-discriminator mechanism strengthens the representational power of the generated data, improving both quality and diversity.

When compared to probabilistic models such as BOsampler and LED, PFNet achieves lower average error (0.18/0.32 versus 0.27/0.56 and 0.21/0.33, respectively). This advantage stems from the introduction of the GEDS method, which avoids the limitations of fixed prior distributions commonly used in traditional methods. In contrast, BOsampler and LED rely on relatively simple sampling strategies that do not fully account for graph structure and topological dependencies. As a result, they may struggle to model complex spatial relationships, leading to suboptimal prediction outcomes in crowded or dynamic scenes.

In comparison to SGCN and STAR, both of which also leverage spatial-temporal modeling, PFNet reduces average ADE by about 50% and 44%, and average FDE by about 48% and 53%, respectively (computed from the AVG column of Table 1). While SGCN and STAR incorporate innovations in graph structure learning and attention mechanisms, their reliance on GNNs alone limits their capacity to model complex generative distributions. In contrast, PFNet enhances the model’s ability to capture intrinsic data distributions through the integration of the GE block and iterative adversarial training. Furthermore, the incorporation of a dedicated Prediction Phase allows the graph-structured representations learned during generation to be effectively utilized for trajectory prediction, leading to more accurate and robust forecasts.

We further compare PFNet with two diffusion-based models, LADM and UTD-PTP. LADM achieves competitive ADE/FDE scores (0.19/0.34 on average), especially on structured scenes like UNIV and ZARA1, where it attains the lowest ADE (0.19) and the lowest FDE (0.17), respectively, but is slightly outperformed by PFNet in more dynamic environments such as ETH. UTD-PTP also shows solid performance (0.21/0.37), particularly on HOTEL, yet its lack of explicit graph structural modeling limits its effectiveness in scenarios with complex pedestrian interactions. In contrast, PFNet achieves the best average results (0.18/0.32) and delivers more consistent predictions across the five subsets, benefiting from its structure-aware design and adversarial training framework.

Finally, we include DTGAN in the comparison, a GAN-based model that learns implicit social interactions via graph representations with random weights. However, its lack of explicit structure-aware priors and temporal fusion leads to underperformance, particularly on highly interactive scenes like UNIV and ZARA1. PFNet surpasses DTGAN by about 57% in average ADE and 63% in average FDE.

These quantitative results validate the superior accuracy and generalization capability of PFNet, which can be attributed to its holistic architecture that integrates graph encoding, structure-aware prior modeling, dual adversarial training, and temporal decoding.
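The relative reductions quoted in this comparison can be recomputed directly from the AVG column of Table 1. The following is a minimal sketch; the helper name `relative_reduction` and the dictionary layout are illustrative choices, and only the numeric values are taken from the table.

```python
# AVG column (ADE, FDE) copied from Table 1 for a subset of the methods.
table1_avg = {
    "Social-GAN": (0.61, 1.21),
    "DTGAN": (0.42, 0.86),
    "SGCN": (0.36, 0.62),
    "STAR": (0.32, 0.68),
    "PFNet": (0.18, 0.32),
}


def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


pfnet_ade, pfnet_fde = table1_avg["PFNet"]

# Map each baseline to its (ADE reduction %, FDE reduction %) versus PFNet.
reductions = {}
for name, (ade, fde) in table1_avg.items():
    if name == "PFNet":
        continue
    reductions[name] = (
        relative_reduction(ade, pfnet_ade),
        relative_reduction(fde, pfnet_fde),
    )

for name, (ade_red, fde_red) in reductions.items():
    print(f"{name}: ADE -{ade_red:.0f}%, FDE -{fde_red:.0f}%")
```

Because the reductions are relative to each baseline's own error, they should not be confused with the absolute differences between table entries (for example, 0.61 − 0.18 = 0.43 for Social-GAN ADE).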

We further evaluate PFNet on the challenging Stanford Drone Dataset (SDD), which features diverse pedestrian behaviors and complex scene layouts. As reported in Table 2, PFNet achieves competitive performance compared to recent SOTA models. Specifically, it attains an ADE of 8.41 and an FDE of 12.13, outperforming the diffusion-based LED model and significantly improving upon Trajectron++. While MID achieves a slightly lower ADE, PFNet offers a better balance between short-term and long-term prediction accuracy, as reflected in its superior FDE. These results demonstrate the effectiveness of PFNet in generating accurate and stable trajectory predictions under complex real-world conditions, further confirming its robustness and generalization capability beyond the ETH/UCY benchmark.

Table 2.

Comparison of SOTA on the SDD Dataset (ADE/FDE). Bold Indicates the Best Result in Each Column.

Method	ADE	FDE
Trajectron++	18.27	31.22
LED	8.67	12.34
MID	7.87	15.37
PFNet	8.41	12.13
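For reference, the two metrics reported in Tables 1 and 2 can be sketched as follows for a single predicted trajectory. This is a minimal illustration of the standard definitions (mean and final Euclidean displacement); the function name and the toy coordinates are assumptions, and any sampling protocol used during evaluation (for example, selecting the best of several sampled trajectories) is not reproduced here.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def displacement_errors(pred: List[Point], truth: List[Point]) -> Tuple[float, float]:
    """Return (ADE, FDE) for one predicted trajectory.

    ADE is the mean Euclidean distance over all predicted time steps;
    FDE is the Euclidean distance at the final time step.
    """
    if not pred or len(pred) != len(truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [math.dist(p, g) for p, g in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]


# Toy example with made-up coordinates: the prediction drifts away over time.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, truth)
print(ade, fde)
```

A model can therefore have a lower ADE but a higher FDE than another, as with MID and PFNet in Table 2, when its error is small early in the horizon but grows toward the endpoint.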

To provide a more intuitive understanding of the prediction results, trajectory visualizations on the ETH and UCY datasets are presented in Figure 4. As shown, the trajectories predicted by PFNet align more closely with the ground-truth paths compared to those generated by the SOTA LED model.

Figure 4.

Visualization of pedestrian trajectory prediction results on the ETH/UCY dataset. Observed trajectories are shown as solid blue lines, ground truth trajectories as solid orange lines, PFNet predictions as dashed red lines, and LED predictions as solid green lines across five scenarios. PFNet predictions demonstrate superior alignment with the ground truth, indicating improved prediction performance. For enhanced clarity, we recommend viewing the figure in color and at an enlarged scale.

Ablation Studies

To further analyze the contribution of each component in the proposed PFNet, ablation studies were conducted on the five benchmark datasets. In these experiments, specific modules of PFNet were either removed or replaced with alternative implementations. The detailed ablation settings are as follows:

PFNet-w/o-GEDS: The GEDS method is replaced with sampling from a Gaussian distribution.

PFNet-w/o-FMP: The Feature Mapping and Propagation (FMP) module is replaced with a standard two-layer GCN.

PFNet-w/o-Concat: The feature fusion operation in the Prediction Phase is replaced with element-wise addition instead of concatenation.

PFNet-w/o-Residual: Residual connections in the Prediction Phase are removed.

PFNet-w/o-discriminator: The dual-discriminator mechanism is replaced with a single discriminator.

The results of the ablation studies are presented in Table 3. As shown, PFNet-w/o-GEDS exhibits a notable performance drop across all datasets due to the substitution of ED-based sampling with a Gaussian distribution. This performance degradation arises because the GEDS method captures the distributional characteristics of trajectory data more effectively, providing a more accurate probability distribution during the Generation Phase. This allows the generator $G$ to learn a latent space that aligns more closely with the true data distribution. Although the Gaussian distribution is widely used in image generation tasks, it proves less effective in pedestrian trajectory prediction. The complexity of spatial structural information in pedestrian scenarios is not well modeled by a Gaussian prior, which leads to suboptimal latent representations. As a result, the generated trajectories deviate further from the ground truth, degrading performance in both ADE and FDE metrics.

Table 3.

Comparison of Ablation Studies on PFNet (ADE/FDE). Bold Indicates the Best Result in Each Column.

Method	ETH	HOTEL	UNIV	ZARA1	ZARA2	AVG
PFNet-w/o-FMP	0.92/1.82	0.87/1.67	0.86/1.65	0.45/0.72	0.52/0.95	0.72/1.36
PFNet-w/o-GEDS	1.92/2.32	2.14/2.78	1.96/2.45	1.75/2.32	1.88/1.95	1.93/2.36
PFNet-w/o-Concat	0.41/0.67	0.25/0.44	0.36/0.51	0.23/0.34	0.31/0.36	0.31/0.46
PFNet-w/o-Residual	0.33/0.57	0.22/0.34	0.31/0.52	0.21/0.32	0.19/0.33	0.25/0.42
PFNet-w/o-discriminator	0.91/1.73	0.87/1.73	0.82/1.81	0.42/0.73	0.62/0.94	0.75/1.37
PFNet	0.32/0.58	0.13/0.12	0.22/0.42	0.13/0.24	0.12/0.23	0.18/0.32

The performance decline observed in PFNet-w/o-FMP is primarily attributed to the replacement of the FMP architecture with a conventional two-layer GCN. The original FMP module, by employing multiple Feed-Forward Layers (FFLs) and integrating several Message Passing Layers (MPLs), facilitates deeper, non-linear feature extraction while preserving structural flexibility. Its decoupled design enhances the model’s capacity to capture intricate relationships and topological dependencies within the data. In contrast, traditional GCN architectures lack this flexibility and expressiveness, leading to weaker feature learning and ultimately poorer performance in both ADE and FDE. These findings underscore the effectiveness of stacking multiple FFLs and MPLs in the FMP architecture for capturing meaningful spatio-temporal representations.

While PFNet-w/o-Concat performs better than the previous two variants, it still falls short of the full model. Replacing the concatenation operation with element-wise addition reduces the expressiveness of feature fusion in the Prediction Phase, limiting the model’s ability to retain and integrate both original and generated features, thus negatively affecting prediction accuracy.
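The difference between the two fusion operations can be illustrated with a small sketch. The feature vectors below are made-up toy values and the function names are hypothetical; the sketch only shows why element-wise addition can discard information that concatenation retains, not how PFNet implements its Prediction Phase.

```python
from typing import List


def fuse_concat(original: List[float], generated: List[float]) -> List[float]:
    """Concatenation keeps both feature vectors side by side (dimension doubles)."""
    return original + generated


def fuse_add(original: List[float], generated: List[float]) -> List[float]:
    """Element-wise addition merges the two vectors into one of the same dimension."""
    return [o + g for o, g in zip(original, generated)]


# Two different (original, generated) pairs, chosen so that their sums coincide.
pair_a = ([1.0, 2.0], [3.0, 4.0])
pair_b = ([3.0, 4.0], [1.0, 2.0])

# Addition maps both pairs to the same fused vector, so the downstream layers
# can no longer tell which part came from the original features.
added_a, added_b = fuse_add(*pair_a), fuse_add(*pair_b)

# Concatenation keeps the two sources distinguishable.
concat_a, concat_b = fuse_concat(*pair_a), fuse_concat(*pair_b)

print(added_a == added_b, concat_a == concat_b)
```

The cost of concatenation is a wider input to the following layer, which is the trade-off the ablation variant explores.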

PFNet-w/o-Residual shows the smallest decline among the variants (0.25/0.42 versus 0.18/0.32 on average), indicating that residual connections make a modest but positive contribution. Their inclusion helps preserve feature information and alleviate gradient vanishing during training, thereby supporting improved convergence and prediction performance in the full PFNet model.

Using a single discriminator (PFNet-w/o-discriminator) raises the average ADE from 0.18 to 0.75 and the average FDE from 0.32 to 1.37; equivalently, the full model reduces ADE by 76% and FDE by 77% relative to this variant. This significant degradation highlights the importance of the dual-discriminator design in enhancing both the accuracy and stability of trajectory generation.

In summary, the ablation study demonstrates that each component plays a critical role in the overall performance of PFNet. The careful architectural design and integration of these modules collectively enhance the model’s ability to accurately and robustly predict pedestrian trajectories in complex scenarios.

Model Efficiency Analysis

To assess the practicality of PFNet in real-world scenarios, we conduct a comprehensive evaluation of model efficiency, including parameter size, training time, and inference latency, as summarized in Table 4.

Table 4.

Comparison of Model Efficiency in Terms of Parameters, Training, and Inference Time. Bold Indicates the Best Result in Each Column.

Model	Parameters count (M)	Training time (s)	Inference time (ms)
MID	9043.22	8474.56	25.41
LADM	1709.61	7394.36	6.15
STAR	964.90	6499.21	4.37
PFNet	9.78	7054.68	4.211

PFNet demonstrates significant advantages in terms of computational efficiency. Despite incorporating a dual-phase architecture and adversarial training mechanism, PFNet requires only 9.78 million parameters, substantially fewer than MID (9043.22M), LADM (1709.61M), and STAR (964.90M). In addition, PFNet achieves a competitive training time of 7054.68 seconds, second only to STAR (6499.21 seconds) among the compared models. Most notably, it achieves the lowest inference time of 4.211 milliseconds, enabling responsive prediction in time-sensitive applications such as autonomous driving and intelligent surveillance.

These results collectively highlight PFNet’s efficient design and strong deployment potential, demonstrating that it can deliver high prediction accuracy with manageable computational overhead.
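Inference latency figures of the kind reported in Table 4 are typically obtained by timing repeated forward passes after a warm-up period. The sketch below shows one such harness under stated assumptions: `measure_latency_ms` is a hypothetical helper, `dummy_predict` is a constant-velocity stand-in rather than PFNet, and the 12-step horizon is an assumed parameter. The measurement procedure actually used for Table 4 is not described in this section.

```python
import time
from statistics import mean
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def measure_latency_ms(predict_fn: Callable[[List[Point]], List[Point]],
                       sample: List[Point],
                       warmup: int = 10,
                       repeats: int = 100) -> float:
    """Mean wall-clock time of one call to `predict_fn`, in milliseconds."""
    for _ in range(warmup):  # warm-up calls are excluded from the average
        predict_fn(sample)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


def dummy_predict(observed: List[Point], horizon: int = 12) -> List[Point]:
    """Stand-in model: extrapolate the last observed velocity (not PFNet)."""
    (x0, y0), (x1, y1) = observed[-2], observed[-1]
    vx, vy = x1 - x0, y1 - y0
    return [(x1 + vx * k, y1 + vy * k) for k in range(1, horizon + 1)]


observed = [(0.1 * t, 0.05 * t) for t in range(8)]  # made-up observed path
latency = measure_latency_ms(dummy_predict, observed)
print(f"mean latency: {latency:.4f} ms")
```

For GPU models, a fair measurement would additionally need device synchronization before each timestamp, which this CPU-only sketch omits.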

Robustness Analysis

To assess the stability and reliability of PFNet, the standard deviations of ADE and FDE across the five subsets of the ETH/UCY benchmark are reported in Table 5, based on five independent runs. The results show that PFNet consistently achieves low standard deviations, indicating robust and stable performance across diverse scenarios. For instance, in the ETH and UNIV scenes, which involve complex pedestrian interactions, PFNet maintains standard deviations of only 0.068 (ADE) and 0.012 (FDE), respectively. Similarly, in structured environments such as HOTEL and ZARA1, the fluctuations remain minimal, with standard deviations as low as 0.013. These observations demonstrate that PFNet not only delivers accurate trajectory predictions but also exhibits high consistency across different environments, validating its generalization capability and robustness under real-world conditions.

Table 5.

Standard Deviation of PFNet on the ETH/UCY Dataset (ADE/FDE) with Paired-test Statistics ($t$, $W$).

	ADE			FDE
Dataset	mean $\pm$ sd	$t$	$W$	mean $\pm$ sd	$t$	$W$
ETH	0.32 $\pm$ 0.068	10.523	15.0	0.58 $\pm$ 0.023	—	—
HOTEL	0.13 $\pm$ 0.013	−22.359	0.0	0.12 $\pm$ 0.015	17.891	15.0
UNIV	0.22 $\pm$ 0.017	39.503	15.0	0.42 $\pm$ 0.012	12.679	15.0
ZARA1	0.13 $\pm$ 0.013	52.684	15.0	0.24 $\pm$ 0.020	25.713	15.0
ZARA2	0.12 $\pm$ 0.013	13.259	0.0	0.23 $\pm$ 0.025	−28.622	0.0

In addition, the table reports paired-test statistics against the best baseline model (LED): the paired t-statistic ($t$) and the Wilcoxon signed-rank statistic ($W$). The symbol “—” denotes identical scores to LED, for which the test statistics are not applicable. Overall, the statistics favor PFNet over LED on most subsets, as indicated by positive $t$ values, which coincide with lower PFNet error in Table 1.
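The two statistics can be sketched as follows for five paired runs. The per-run errors below are made-up illustrative values, not the measurements behind Table 5, and two conventions are assumed rather than confirmed by the text: differences are taken as baseline minus PFNet (so a positive $t$ means PFNet has lower error), and $W$ is the sum of ranks of the positive differences. Under that assumed convention, five runs that all favor one model yield $W = 15$ or $W = 0$, matching the values that appear in the table.

```python
import math
from statistics import mean, stdev
from typing import List


def paired_t_statistic(baseline: List[float], ours: List[float]) -> float:
    """Paired t-statistic of the differences (baseline - ours).

    Undefined when all differences are identical (zero variance), which is
    the situation for rows with identical scores.
    """
    diffs = [b - o for b, o in zip(baseline, ours)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def wilcoxon_w_plus(baseline: List[float], ours: List[float]) -> float:
    """Sum of ranks of positive differences (assumes no ties in |difference|)."""
    diffs = [b - o for b, o in zip(baseline, ours) if b != o]
    ranked = sorted(diffs, key=abs)  # rank 1 = smallest absolute difference
    return float(sum(rank for rank, d in enumerate(ranked, start=1) if d > 0))


# Hypothetical ADE values over five independent runs (illustration only).
baseline_runs = [0.40, 0.39, 0.41, 0.38, 0.40]
pfnet_runs = [0.33, 0.30, 0.35, 0.28, 0.32]

t_stat = paired_t_statistic(baseline_runs, pfnet_runs)
w_stat = wilcoxon_w_plus(baseline_runs, pfnet_runs)
print(f"t = {t_stat:.3f}, W = {w_stat}")
```

With only five pairs, the Wilcoxon statistic can take few distinct values, so it mainly indicates the direction of the difference rather than its size.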

Conclusion

In this paper, PFNet is proposed for trajectory prediction. By employing the GEDS method and a dual-discriminator mechanism, challenges in traditional GAN-based trajectory prediction models, including OOD issues and mode collapse, are effectively addressed. Furthermore, the integration of the GE into the GAN framework not only resolves a series of issues arising from the depth of traditional GCN but also ensures that the data generated in the Generation Phase retains a compact and information-rich graph structure. Finally, with the Prediction Phase, the fusion of graph structural information and original data features is strengthened, allowing the final feature representation to comprehensively model both the spatial and temporal characteristics of pedestrian trajectories. A series of evaluation experiments and visual analyses demonstrate that the proposed model is capable of generating accurate and reasonable predictions across diverse scenarios, making it applicable to trajectory prediction in complex environments.

Since PFNet adopts a GAN-based framework, future work will primarily focus on improving training stability and convergence speed. Additionally, when handling large-scale graph data, the introduction of advanced graph pooling techniques is considered a feasible and effective approach to further enhance the expressive power and computational efficiency of the model.

Footnotes

Acknowledgments

The authors declare that there are no acknowledgments for this work.

ORCID iD

Hui Zhang

Ethical Approval and Informed Consent

This study does not involve human participants or animals; therefore, ethical approval and informed consent are not applicable.

Funding

The author(s) received no financial support for the research, authorship and/or publication of this article.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The data supporting the findings of this study are available from the corresponding author upon reasonable request.

References

Battaglia, Pascanu, Lai, Jimenez Rezende (2016). Interaction networks for learning about objects, relations and physics. Advances in Neural Information Processing Systems, 29, 4509–4517.

Cai, Dai, Wang, Chen, Sotelo, M. A. (2021). Pedestrian motion trajectory prediction in intelligent driving from far shot first-person perspective video. IEEE Transactions on Intelligent Transportation Systems, 23(6), 5298–5313.

Chen, Fan, Zhang (2023). Unsupervised sampling promoting for stochastic human trajectory prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 17874–17884).

Dendorfer, Elflein, Leal-Taixé (2021). MG-GAN: A multi-generator model preventing out-of-distribution samples in pedestrian trajectory prediction. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 13158–13167).

Eiffert, Kong, Pirmarzdashti, Sukkarieh (2020a). Path planning in dynamic environments using generative RNNs and Monte Carlo tree search. In 2020 IEEE international conference on robotics and automation (ICRA) (pp. 10263–10269). IEEE.

Eiffert, Shan, Worrall, Sukkarieh, Nebot (2020b). Probabilistic crowd GAN: Multimodal pedestrian trajectory prediction using a graph vehicle-pedestrian attention network. IEEE Robotics and Automation Letters, 5(4), 5026–5033.

Fan, Zhao, Tang, Yin (2019). Graph neural networks for social recommendation. In The world wide web conference (pp. 417–426).

Gu, Chen, Lin, Rao, Zhou (2022a). Stochastic trajectory prediction via motion indeterminacy diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 17113–17122).

Gu, Chen, Lin, Rao, Zhou (2022b). Stochastic trajectory prediction via motion indeterminacy diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 17113–17122).

Kipf, T. N., Welling (2016a). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Kipf, T. N., Welling (2016b). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Kosaraju, Sadeghian, Martín-Martín, Reid, Rezatofighi, Savarese (2019). Social-BiGAT: Multimodal trajectory forecasting using Bicycle-GAN and graph attention networks. Advances in Neural Information Processing Systems, 32, 137–146.

Lefèvre, Laugier, Ibañez-Guzmán (2011). Exploiting map information for driver intention estimation at road intersections. In 2011 IEEE intelligent vehicles symposium (IV) (pp. 583–588). IEEE.

Lerner, Chrysanthou, Lischinski (2007). Crowds by example. In Computer graphics forum (Vol. 26, pp. 655–664). Wiley Online Library.

Li, Shan, Narula, Worrall, Nebot (2020). Socially aware crowd navigation with multimodal pedestrian trajectory prediction for autonomous vehicles. In 2020 IEEE 23rd international conference on intelligent transportation systems (ITSC) (pp. 1–8). IEEE.

Li, Tarlow, Brockschmidt, Zemel (2015). Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.

Li, Tedrake, Tenenbaum, J. B., Torralba (2018). Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. arXiv preprint arXiv:1810.01566.

Liu, Sun, Jia, Xing, Gao, Sun, Boulnois, Fan (2019). Chemi-Net: A molecular graph convolutional network for accurate drug property prediction. International Journal of Molecular Sciences, 20(14), 3389.

Liu, Zhang, Qiao, Worrall, Li, Y. F., Kong (2023). Knowledge-aware graph transformer for pedestrian trajectory prediction. In 2023 IEEE 26th international conference on intelligent transportation systems (ITSC) (pp. 4360–4366). IEEE.

Lv, Yuan (2024). Learning autoencoder diffusion models of pedestrian group relationships for multimodal trajectory prediction. IEEE Transactions on Instrumentation and Measurement, 73, 1–12.

Lv, Wang, Zhang (2023). SSAGCN: Social soft attention graph convolution network for pedestrian trajectory prediction. IEEE Transactions on Neural Networks and Learning Systems, 35(9), 11989–12003.

Mangalam, Girase, Agarwal, Lee, K. H., Adeli, Malik, Gaidon (2020). It is not the journey but the destination: Endpoint conditioned trajectory prediction. In Computer Vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16 (pp. 759–776). Springer.

Mao, Zhu, Chen, Wang (2023). Leapfrog diffusion model for stochastic trajectory prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5517–5526).

Mohamed, Qian, Elhoseiny, Claudel (2020). Social-STGCNN: A social spatio-temporal graph convolutional neural network for human trajectory prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14424–14432).

Pellegrini, Ess, Van Gool (2010). Improving data association by joint modeling of pedestrian trajectories and groupings. In Computer Vision–ECCV 2010: 11th European conference on computer vision, Heraklion, Crete, Greece, September 5–11, 2010, Proceedings, Part I 11 (pp. 452–465). Springer.

Robicquet, Sadeghian, Alahi, Savarese (2016). Learning social etiquette: Human trajectory understanding in crowded scenes. In European conference on computer vision (pp. 549–565). Springer.

Sadeghian, Kosaraju, Sadeghian, Hirose, Rezatofighi, Savarese (2019). SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1349–1358).

Salzmann, Ivanovic, Chakravarty, Pavone (2020). Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In Computer Vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16 (pp. 683–700). Springer.

Seeger (2004). Gaussian processes for machine learning. International Journal of Neural Systems, 14(02), 69–106.

Shi, Wang, Long, Zhou, Niu, Hua (2021). SGCN: Sparse graph convolution network for pedestrian trajectory prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8994–9003).

Shi, Wang, Zhou, Hua (2023). Trajectory unified transformer for pedestrian trajectory prediction. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9675–9684).

Tang, Wang (2024). Using a diffusion model for pedestrian trajectory prediction in semi-open autonomous driving environments. IEEE Sensors Journal, 24(10), 17208–17218.

Xie, Zhang, Xia, Xiao, Jiang, Zhou, Qin, Chen (2024). Pedestrian trajectory prediction based on social interactions learning with random weights. IEEE Transactions on Multimedia, 26, 7503–7515.

Yang, Wang, Yan (2022). Graph neural networks are inherently good generalizers: Insights by bridging GNNs and MLPs. arXiv preprint arXiv:2212.09034.

Yu, Ren, Zhao (2020). Spatio-temporal graph transformer networks for pedestrian trajectory prediction. In Computer Vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16 (pp. 507–523). Springer.

Yuan, Weng, Kitani, K. M. (2021). AgentFormer: Agent-aware transformers for socio-temporal multi-agent forecasting. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9813–9823).

Zhang, Ouyang, Zhang, Xue, Zheng (2019). SR-LSTM: State refinement for LSTM towards pedestrian trajectory prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12085–12094).

Zheng, Zhu, Zhang, Liu, Cheng, Zhao (2020). Distribution-induced bidirectional generative adversarial network for graph representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7224–7233).