MAFormer: A cross-channel spatio-temporal feature aggregation method for human action recognition

Abstract

Human action recognition has been widely used in fields such as human–computer interaction and virtual reality. Despite significant progress, existing approaches still struggle with effectively integrating hierarchical information and processing data beyond a certain frame count. To address these challenges, we introduce the Multi-AxisFormer (MAFormer) model, which is organized in terms of spatial, temporal, and channel dimensions of the action sequence, thereby enhancing the model’s understanding of correlations and intricate structures among and within features. Drawing on the Transformer architecture, we propose the Cross-channel Spatio-temporal Aggregation (CSA) structure for more refined feature extraction and the Multi-Axis Attention (MAA) module for more comprehensive feature aggregation. Moreover, the integration of Rotary Position Embedding (RoPE) boosts the model’s extrapolation and generalization abilities. MAFormer surpasses the known state-of-the-art on multiple skeleton-based action recognition benchmarks with the accuracy of 93.2% on NTU RGB+D 60 cross-subject split, 89.9% on NTU RGB+D 120 cross-subject split, and 97.2% on N-UCLA, offering a novel paradigm for hierarchical modeling in human action recognition.

Keywords

Deep learning; Human action recognition; Transformer; RoPE

1. Introduction

In the context of burgeoning advancements within computer vision, the domain of human action recognition has emerged as a focal point for scholarly inquiry. The primary objective of this endeavor is to precisely categorize sequences of human actions gleaned from video or sensor data. The import of this work is profound, given its broad utility, most particularly in the realm of enhancing human-computer interaction [2]. The ability to comprehend and decipher human actions has the potential to significantly transform user engagement and system efficacy. Traditional strategies in human action recognition encompass a spectrum of methodologies, including video-centric [9,22], skeleton-centric [12,33], and heatmap-centric [37] approaches. Among these, the skeleton-based approach stands out for its proficiency in efficient processing and resilience against background noise and lighting interference.

Fig. 1.

Visualization of the data and feature transformation process, illustrating the Cross-channel Spatio-temporal Aggregation (CSA) module with dashed lines indicating the channel grouping stage’s data shape, and solid lines for input data shape. The CSA module, part of the Multi-Axis Grouping (MAG) structure, encodes joint feature sequences at fixed frame rates, leveraging cross-channel aggregation for effective feature extraction across body parts in action sequences. Parameters: C – number of channels, G – number of channel groups, r – sampling frame rate, T – number of frames, V – total number of nodes, $V_{1}$ – number of nodes per space group.

Within the realm of skeleton-centric approaches, Graph Convolutional Networks [13,15,19] and Transformers [1,4,29] have been pivotal in addressing several intrinsic difficulties and challenges. Central to these challenges is the accurate capture and interpretation of the spatial relationships and dynamics of skeleton joints – an area where GCNs shine due to their proficiency in modeling the human body as a graph. This graph-centric methodology facilitates an understanding of the structural interdependencies among body parts, enabling the model to garner more nuanced spatial information. However, these methods often struggle with effectively capturing long-term dependencies. Conversely, Transformer-based methods have garnered increasing attention in recent years. The superiority of Transformer methods over GCNs primarily lies in their adeptness at managing temporal dynamics and long-range dependencies within action sequences. Transformers, equipped with their sophisticated self-attention mechanism, excel in capturing the temporal correlation among joints – a pivotal element in deciphering human actions that manifest across disparate temporal segments. This facet of their functionality enables a more refined interpretation of sequential data relative to GCNs, which predominantly concentrate on spatial relationships. Although an array of innovative and efficacious methodologies has arisen within the sphere of human action recognition, these approaches have encountered certain limitations and opportunities for enhancement remain. It is a well-established fact that humans are endowed with an innate capacity to comprehend complete actions by observing movements of specific body parts, thereby underscoring the pivotal role of local information priors regarding the human body in the realm of action recognition. 
However, a significant drawback of numerous contemporary models is their inadequate ability to glean such complex information, particularly with respect to physics-based hierarchical limb structures. This shortcoming curtails their effectiveness in extracting hierarchical features that are indispensable for a thorough grasp of human actions. Additionally, Transformers frequently incorporate Positional Encoding (PE) to enhance their grasp of the sequential order within inputs. Traditional PE approaches, such as absolute positional encoding, often rely on fixed coding vectors that are bound to absolute positions within the input sequence. This approach can engender positional biases, wherein the model becomes overly reliant on specific positional information, compromising its ability to extrapolate and comprehend the external structure of features, thereby impacting the model’s generalization skills [10,21,27]. Furthermore, Transformer-based methods commonly utilize Multilayer Perceptron (MLP) structures as Feed Forward Networks (FFN). However, research has identified redundancies in the computations within MLPs, emphasizing the necessity for more efficient FFN architectures to fully realize the intrinsic potential of Transformers [32,40]. To address these challenges, we propose enhancing the model’s ability to grasp the correlations and structural complexities both within and among the features. Internally, it is crucial for the model to develop a comprehensive grasp of the distinctive morphology of the human body. This deepened understanding will enable the model to fully appreciate the physiological architecture of key limbs, moving beyond a mere spatial distribution of joint points. Externally, the model must engage in a systematic analysis of the structure of the action itself. This analysis includes the insight that an action is composed of multiple subsequences.
By enhancing the model’s capacity to identify and comprehend the interdependencies among these subsequences, it can attain the proficiency to adeptly handle actions encompassing diverse frame counts. A holistic strategy that analyzes both intrinsic and extrinsic features is essential for facilitating a more nuanced and thorough interpretation of human actions.

In this paper, we introduce a novel architecture designed to model the spatial and temporal structures inherent in human behavior, thereby addressing the aforementioned issues and challenges. Our model endeavors to refine the understanding of correlations and structures within features by employing a novel approach to data organization on multiple dimensional axes, thereby capturing richer dependencies, as illustrated in Fig. 1. We term this architecture Multi-AxisFormer (MAFormer). Within this method, we present a Cross-channel Spatio-temporal Aggregation (CSA) module that encodes joint feature sequences sampled at fixed frame rates according to physical hierarchies, leveraging cross-channel aggregation for effective feature extraction. Specifically, we deploy parallel sub-networks within multiple channel groups to facilitate multi-scale feature extraction from action sequences across various body parts. We name this structure Multi-Axis Grouping (MAG). The features are then fused for cross-spatio-temporal feature aggregation, serving as the result of the Transformer structure. In addition, we propose a Multi-Axis Attention (MAA) module that utilizes the CSA module as its feed-forward network (FFN) to achieve more comprehensive feature aggregation. To mitigate the model’s over-reliance on specific location information, we integrate the Rotary Position Embedding (RoPE), which enhances the model’s extrapolation and generalization capabilities through enhanced relative position awareness. Furthermore, previous research [3,31] has investigated the influence of activation functions on Transformer models in specific quantized settings. Building on this work, our study also examines the effects of various activation functions on the performance of our model, thereby optimizing the potential of self-attention mechanisms.

For this paper, the main contributions are as follows:

To augment the model’s comprehension of correlations and structural intricacies both within and among the features, a MAG structure combining cross-channel multi-scale spatio-temporal aggregation with fixed frame rate sampling is proposed, achieving better robustness and generalization ability.

To better obtain the mutual position correlation among the features, the RoPE is introduced to reduce the position bias of a specific position, enabling the model to process data that exceeds the training frame count and enhancing the model’s extrapolation capability.

To enhance feature extraction while reducing redundancy, an MAA module is proposed, applying the proposed CSA module to aggregate local spatio-temporal features instead of the original FFN method of aggregating only in spatial dimensions, enhancing the model’s long-term information memory capability.

2. Related work

The progress in deep learning has notably propelled research in human action recognition. The domain primarily revolves around the spatial analysis prowess of Graph Convolutional Networks (GCNs), whereas Transformer-based approaches are increasingly being favored for their advanced capability in managing temporal dynamics and scalability.

GCN-based methods. Skeleton data presents challenges for traditional vector sequence processing, which struggles to capture the complex spatio-temporal configuration and correlations of human joints. In response, researchers have increasingly turned to topological graph representations that are more naturally aligned with the structure of skeleton data. Graph Neural Networks (GNNs), and specifically Graph Convolutional Networks (GCNs), have emerged as a focal point of innovation in this area. Chen et al.’s Multi-scale Spatio-Temporal GCN (MST-GCN) [6] expands the receptive field in both spatial and temporal dimensions, offering a more holistic perspective on human movement. Chen (B) et al.’s Channel-wise Topology Refinement GCN (CTR-GCN) [5] focuses on dynamic topology and multi-channel features, starting with a shared topology matrix as a universal prior for channels and refining it through the inference of channel-specific correlations. Chi et al.’s InfoGCN [8] utilizes an information bottleneck learning objective to learn compact and information-rich latent representations, complemented by an attention-based graph convolution to infer context-relevant skeleton topologies. Ke et al. propose the Spatio-Temporal Focus (STF) [13] framework for skeleton-based action recognition, utilizing spatio-temporal gradients to guide the learning process. Lee et al. propose a Hierarchically Decomposed Graph Convolutional Network (HD-GCN) [15] for skeleton-based action recognition, using a novel HD-Graph to identify relationships between distant joint nodes and an A-HA module to highlight key edge sets. Cheng et al. introduce the Feature Refinement Head (FR Head) [39] for differentiating between ambiguous actions, striving for a more discriminative representation of the skeleton.

However, GCN-based methods encounter difficulties in handling long-term dependencies, which limits their ability to capture the complexity and hierarchical features of human actions. To address these challenges, our model employs a Transformer architecture and borrows from GCN-based approaches the idea of capturing local relationships between nodes, while also incorporating physical priors related to the human skeletal structure. This integration allows for more accurate extraction of hierarchical features and a more comprehensive understanding of complex actions, leading to improved performance in action recognition tasks.

Table 1
Analysis among various algorithms

Category	Method	Advantages	Limitations
GCN-based methods	InfoGCN [8]	Proposes an information bottleneck objective and a self-attention graph convolution module, effectively learning the latent representation of actions and extracting behavioral context information.	The performance on large-scale datasets with a large number of categories has not yet been verified, and the scope of application needs to be expanded.
GCN-based methods	STF [13]	By learning dynamic adjacency matrices and spatio-temporal focus, effectively captures the spatio-temporal dependencies in human skeleton action recognition.	Primarily focuses on a single input modality; multi-stream inputs and the applicability of the proposed objectives to other network architectures remain to be explored.
GCN-based methods	CTR-GCN+FR [39]	Proposes a feature refinement module based on contrastive learning, named FR Head, which effectively improves the performance in distinguishing ambiguous actions.	The ability to recognize ambiguous actions in few-shot scenarios still needs to be explored.
Transformer-based methods	IIP-Transformer [30]	Utilizes joint-level and part-level information to effectively extract part-based features.	Less research has been conducted on multi-modal feature fusion.
Transformer-based methods	MotionBERT [41]	Proposes a unified framework for learning human motion representations from large-scale and heterogeneous data sources, enabling its application to various downstream tasks.	Future research could explore fusing the learned motion representations with other video architectures and applying them to more tasks, such as action assessment and segmentation.
Transformer-based methods	STAR-Transformer [1]	Combines spatio-temporal video and skeleton features and can efficiently represent cross-modal features.	The cost is relatively high, and it currently uses only annotated skeleton data.
Transformer-based methods	3Mformer [29]	Efficiently captures complex motion patterns between human joints based on hypergraphs and a Higher-order Transformer (HoT).	The computational complexity of the model is not discussed in detail, and the model may require more computational resources.

Transformer-based methods. Many methods, such as LSTM, Transformer, etc., contribute to solving the long-term dependency problem, and Transformer-based methods stand out among them. Transformers show promising potential in handling sequence data, leading to their application in skeleton sequences for spatio-temporal modeling. Qiu et al. propose STTFormer to capture multi-joint dependencies between adjacent frames and aggregate sub-movement features [20]. Wang et al.’s IIP-Transformer [30] focuses on action recognition using part-level skeleton data encoding. Liu et al. introduce the Kernel Attention Adaptive Graph Transformer Network (KA-AGTN) [17], capturing various high-order joint dependencies. Zhu et al. propose MotionBERT [41], pretraining a motion encoder to recover 3D motion from 2D observations using diverse data, representing a unified approach in the field. Ahn et al. propose the STAR-transformer [1], integrating full, zigzag, and binary spatio-temporal attention modules to effectively represent and balance cross-modal video and skeleton features for improved action recognition performance. Wang et al. introduce 3Mformer [29] to enhance skeletal action recognition by modeling higher-order motion patterns with hypergraph-based Transformers. Chen et al. propose a transformer-based model [4] with a feature fusion module, utilizing a pre-trained Swin-Transformer for hierarchical feature extraction and fusion to enhance human action recognition in still images.

Compared with the GCN method, Transformer-based methods, while powerful in capturing long-term dependencies, often struggle with spatial modeling and lack access to crucial prior information about human body structure, such as the topological information inherent in graph structures. This deficiency hampers their ability to accurately model the complex and hierarchical nature of human actions, particularly in relation to physics-based limb structures. Table 1 exhibits various comparisons among such algorithms. Our approach addresses these limitations by integrating the strengths of Transformer architectures with enhanced spatial modeling capabilities. We incorporate a detailed understanding of the human skeletal structure, leveraging local information priors to better capture the intricacies of human movements. By doing so, our model not only retains the ability of Transformers to manage long-term dependencies but also improves the extraction of hierarchical features essential for a comprehensive understanding of human actions.

3. Preliminaries

3.1. Notations

In this paper, regular font is used for scalars, e.g., ( $x$ , $y$ , $z$ ). Boldface lowercase is used for vectors, e.g., ( $\mathbf{x}$ , $\mathbf{y}$ , $\mathbf{z}$ ). Boldface uppercase is used for matrices, e.g., ( $\mathbf{X}$ , $\mathbf{Y}$ , $\mathbf{Z}$ ). Italic boldface is used for tensors, e.g., ( $\boldsymbol{X}$ , $\boldsymbol{Y}$ , $\boldsymbol{Z}$ ). As shown in Fig. 1, assuming the input is $X \in R^{C_{in} \times T \times V}$ , the output after embedding through a linear layer is $H \in R^{C \times T \times V}$ , where $C_{in}$ is the dimension of the coordinate sequence, C is the embedding feature dimension, T is the number of frames and V is the number of nodes.

3.2. Self-attention mechanism

In Transformer models, the input sequence $X \in R^{N \times d_{model}}$ , where N is the sequence length and $d_{model}$ is the feature dimension, undergoes linear transformations to compute Query (Q), Key (K), and Value (V) as follows: $Q = X W_{Q}$ , $K = X W_{K}$ , $V = X W_{V}$ . Here, $W_{Q}$ , $W_{K}$ , $W_{V} \in R^{d_{model} \times d_{k}}$ are learnable weight matrices, and $d_{k}$ is the dimensionality of queries and keys. The Transformer’s self-attention mechanism computes weighted value vectors based on the similarity between queries and keys: $Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V$ . Here, $\frac{1}{\sqrt{d_{k}}}$ is a scaling factor to prevent the dot product from becoming too large.
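As a concrete illustration of the formula above, the following minimal sketch computes scaled dot-product attention on plain Python lists. It is not the paper's implementation: the helper names (`softmax`, `matmul`, `attention`) and the toy inputs are assumptions made for this example, and the projections $W_{Q}$ , $W_{K}$ , $W_{V}$ are assumed to have been applied already.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]   # K^T, one row per feature
    scores = matmul(q, k_t)                # one row of scores per query
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy input (illustrative values): both keys are identical, so the single
# query attends to them equally and the output is the mean of the value rows.
q = [[1.0, 0.0]]
k = [[1.0, 1.0], [1.0, 1.0]]
v = [[0.0, 2.0], [4.0, 6.0]]
out, weights = attention(q, k, v)
```

Dividing by $\sqrt{d_{k}}$ before the softmax keeps the scores in a range where the exponentials do not saturate as the key dimension grows.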

Fig. 2.

Architecture of MAFormer. The MAG performs physics-based hierarchical coding on key point feature sequences sampled at a fixed frame rate, and the CSA performs cross-channel feature aggregation. The MAA module uses the CSA as FFN and employs the ${Softmax}_{β}$ as the activation function. The TCN module completes feature aggregation in the temporal dimension.

4. Method

The objective of our study is to refine the model’s grasp of the intercorrelations and intricate structures inherent in individual features as well as their relationships, thereby boosting its proficiency in action identification. To this end, we aim to endow the model with a nuanced understanding of anatomical limb architecture derived from physics, along with temporal patterns underpinning action dynamics and multi-dimensional spatial feature interactions. Such information helps the model blend immediate local features with broader contextual global features. In response to these requirements, we introduce MAFormer, a novel framework that employs a Multi-Axis Grouping (MAG) strategy for the hierarchical encoding of keypoint sequences extracted at a consistent frame rate. This is coupled with a Cross-channel Spatio-temporal Aggregation (CSA) module, designed for the efficient amalgamation of cross-channel features. To further augment feature interaction and reduce architectural complexity, we propose the Multi-Axis Attention (MAA) module. This module leverages CSA to facilitate Feed-Forward Network (FFN) operations on feature maps, thereby enhancing cross-channel communication and dimensionality reduction. In addition, the MAA module harnesses Rotary Position Embedding (RoPE) for positional encoding, which bolsters the model’s capacity for extrapolation and generalization. Subsequently, a Temporal Convolutional Network (TCN) module is integrated to facilitate the aggregation of long-term features, optimally utilizing transient relational information. The methodological flowchart delineating the sequence of information processing is presented in Fig. 2. We proceed to outline the technique in alignment with the informational processing sequence, providing a coherent narrative of the model’s architecture and operation.

4.1. RoPE

RoPE is a relative position encoding method proposed in RoFormer to help the model analyze the structure of features, thus enhancing its extrapolation capability [27]. The purpose of RoPE is to find a function f that encodes the vectors Q and K in complex form, so that their positional relationship is obtained through the inner product, as shown in Fig. 2. Specifically, it groups the elements of the vector Q pairwise along the channel dimension, reshapes them into a sequence of two-dimensional vectors, and then transforms each of them into complex form. The same is done for the vector K, as in Eq. (1):

$$\tilde{Q}_{d} = f (Q_{d}, m) = Q_{d} e^{i (\theta_{d} \cdot m)}, \qquad \tilde{K}_{d} = f (K_{d}, m) = K_{d} e^{i (\theta_{d} \cdot m)}, \tag{1}$$

where $Q_{d}$ is the vector encoded by the d-th group of the vector Q, m is the feature sequence position, $f (\cdot)$ is the RoPE function, and $\theta_{d}$ is the coefficient for the d-th group. According to the geometric meaning of complex multiplication, this transformation corresponds to a rotation of the vector, as in Eq. (2) and Eq. (3):

$$R_{\Theta, m}^{d} = \begin{pmatrix} \cos (\theta_{d} \cdot m) & - \sin (\theta_{d} \cdot m) \\ \sin (\theta_{d} \cdot m) & \cos (\theta_{d} \cdot m) \end{pmatrix}, \tag{2}$$

$$f (Q_{d}, m) = R_{\Theta, m}^{d} \begin{pmatrix} Q_{d}^{0} \\ Q_{d}^{1} \end{pmatrix}, \tag{3}$$

where $R_{\Theta, m}^{d}$ denotes the transformation matrix, $Q_{d}^{0}$ is the first element of the d-th group of Q, and $Q_{d}^{1}$ is the second element accordingly. After applying RoPE to Q and K at positions m and n, $\tilde{Q}^{T} \tilde{K}$ can be represented as in Eq. (4):

$$\begin{aligned} \tilde{Q}^{T} \tilde{K} & = Q^{T} {R_{\Theta, m}^{d}}^{T} R_{\Theta, n}^{d} K \\ & = Q^{T} \begin{pmatrix} \cos (\theta_{d} \cdot m) & \sin (\theta_{d} \cdot m) \\ - \sin (\theta_{d} \cdot m) & \cos (\theta_{d} \cdot m) \end{pmatrix} \begin{pmatrix} \cos (\theta_{d} \cdot n) & - \sin (\theta_{d} \cdot n) \\ \sin (\theta_{d} \cdot n) & \cos (\theta_{d} \cdot n) \end{pmatrix} K . \end{aligned} \tag{4}$$

According to linear algebra, ${R_{\Theta, m}^{d}}^{T}$ represents a clockwise rotation of the vector by the angle $\theta_{d} \cdot m$ , and $R_{\Theta, n}^{d}$ represents a counterclockwise rotation of the vector by $\theta_{d} \cdot n$ . The product of the two is a single rotation in which only the relative rotation difference remains, i.e., $\theta_{d} \cdot (n - m)$ . $\theta_{d} \cdot m$ satisfies Eq. (5):

$$\theta_{d} \cdot m = \frac{m}{10000^{2 i / d}}, \tag{5}$$

where i represents the position of the element. RoPE thus uses the complex product to introduce relative position information, while generating position codes beyond a fixed length through the rotation matrix, improving the model’s generalization ability and robustness.
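The relative-position property can be checked numerically. The sketch below is a minimal illustration, not the authors' code: it rotates each pair of channels by an angle proportional to the position and confirms that the inner product of a rotated query and key depends only on the position offset. The function names and the 0-based group index used in the frequency schedule are assumptions, following the common RoFormer convention.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by an angle theta_g * pos.

    Assumption: theta_g = base ** (-2 * g / d), with g the 0-based pair index
    and d the vector length (the usual RoFormer frequency schedule).
    """
    d = len(vec)
    if d % 2:
        raise ValueError("RoPE needs an even number of channels")
    out = []
    for g in range(d // 2):
        angle = pos * base ** (-2.0 * g / d)
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = vec[2 * g], vec[2 * g + 1]
        # 2x2 counterclockwise rotation applied to the pair (x0, x1)
        out.extend((c * x0 - s * x1, s * x0 + c * x1))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]      # illustrative query vector
k = [0.5, -1.0, 2.0, 0.25]    # illustrative key vector

# Same offset n - m = 4 at two very different absolute positions.
near = dot(rope(q, 3), rope(k, 7))
far = dot(rope(q, 103), rope(k, 107))
```

Under these assumptions `near` and `far` agree up to floating-point error, which is what allows attention scores to remain meaningful at positions beyond those seen during training.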

4.2. Encoder

Humans can recognize an action even when only part of the body or part of the movement is visible. This is because, in many actions, some limbs move frequently while others remain nearly stationary. In addition, some motion trajectories can be inferred from the visible parts. Inspired by this intuition, we propose a new architecture that captures rich local information from a part of the action sequence to infer global information, thereby aiding human action recognition.

MAG module. As an important module for enhancing the model’s understanding of the correlation and structural complexity within and among features, the MAG combines cross-channel and multi-scale spatio-temporal aggregation with fixed frame rate sampling. It divides the human body into five parts according to the physical structure, namely body, left arm, right arm, left leg, and right leg. We denote the number of nodes in each part as $V_{i}$ , where $i \in \{\text{body}, \text{left arm}, \text{right arm}, \text{left leg}, \text{right leg}\}$ . Each part is sampled at a fixed frame rate r and divided into N groups (referred to below as fixed frame sampling), each of which has M nodes, where $N = T \times r$ , $M = V_{i} / r$ , and T is the number of frames. The MAG performs feature extraction through the CSA module, which divides the input feature sequence into G groups along the channel dimension. After feature extraction with different convolution kernel sizes through two branches, the features are aggregated through the cross-spatio-temporal attention method, as shown in the CSA part of Fig. 2. Specifically, the grouped input is regarded as $H_{1} \in R^{(C / G) \times N \times M}$ . After average pooling and Softmax, the attention weight descriptors learned by the two branches are used to enhance the spatio-temporal feature representation of interest in the motion feature group of each limb.

As shown in Fig. 2, global average pooling and 1 × 1 convolution are utilized to encode the global spatial and temporal information of the input sequence respectively, splicing them into the same dimension, i.e., $H_{2} \in R^{(C / G) \times (N + M)}$ . The spatial and temporal correlation features are obtained through matrix multiplication and Group Normalization operations, and a weight map is derived using global average pooling followed by a Softmax function. In addition, through the parallel 3 × 3 convolution method, the spatio-temporal correlation at different scales is obtained, and the sum of the correlation weights and features of these two different scales is used as the final spatio-temporal weight map to capture motion information at different scales. After the weights of the two branches are multiplied and added to each other, the features are reorganized in the channel dimension, and the output is $H_{3} \in R^{C \times N \times M}$ . Furthermore, through the aforementioned fixed frame sampling, temporal weights can be calculated on a larger scale, thereby capturing more context dependencies.
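The shape bookkeeping behind fixed frame sampling and channel grouping can be made concrete with a short sketch. It only tracks shapes under the stated relations N = T × r and M = V_i / r. Merging consecutive frames into one group is an assumption of this example (the text does not specify contiguous versus strided grouping), and the function names and the value V_i = 5 are hypothetical.

```python
from fractions import Fraction

def grouped_shape(C, T, V_i, r, G):
    """Shape of one channel group after fixed frame sampling: (C/G, N, M)."""
    N = T * r        # number of temporal groups
    M = V_i / r      # nodes per group
    if N.denominator != 1 or M.denominator != 1 or C % G:
        raise ValueError("T*r, V_i/r and C/G must all be integers")
    return (C // G, int(N), int(M))

def group_frames(seq, r):
    """Merge every 1/r consecutive frames of a (T x V_i) sequence into one group.

    Assumption: groups are contiguous blocks of frames.
    """
    step = int(1 / r)
    if len(seq) % step:
        raise ValueError("sequence length must be a multiple of 1/r")
    return [
        [node for frame in seq[start:start + step] for node in frame]
        for start in range(0, len(seq), step)
    ]

r = Fraction(1, 6)                       # sampling rate reported in the settings
shape = grouped_shape(C=64, T=120, V_i=5, r=r, G=16)

# Tiny sequence: 12 frames, 5 nodes per frame, each node tagged (frame, node).
seq = [[(t, v) for v in range(5)] for t in range(12)]
groups = group_frames(seq, r)
```

With r = 1/6, a part with 5 nodes over 120 frames becomes 20 groups of 30 nodes each, so attention inside a group spans six frames at once.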

MAA module. The output of the MAG module contains spatio-temporal related features of the aforementioned five different limb action sequences, which are aggregated through spatial splicing and Multi-Axis Attention (MAA) mechanisms, as shown in Fig. 2. Here, Q, K, and V are the query, key, and value vectors in the self-attention mechanism, respectively, as shown in Eq. (6):

$$[Q_{i}, K_{i}, V_{i}] = W_{i} H_{3}, \quad i \in \{1, 2, \dots, n_{h}\}, \tag{6}$$

where $n_{h}$ is the number of heads, and $W_{i}$ is a trainable linear transformation. After RoPE is applied, the attention map is calculated as in Eq. (7):

$$\begin{aligned} A_{i} & = \sigma ({Q_{i}}^{T} K_{i} / \sqrt{d_{k}}) \times \alpha, \quad i \in \{1, 2, \dots, n_{h}\}, \\ A_{map} & = \mathrm{concat} (A_{1}, A_{2}, \dots, A_{n_{h}}), \end{aligned} \tag{7}$$

where $d_{k}$ represents the dimension of K, $\alpha$ represents the scaling factor of the attention map, and $\sigma$ is the activation function. The ${Softmax}_{\beta}$ is applied, which adds the contrast factor $\beta$ to the denominator of the Softmax function, as in Eq. (8):

$${Softmax}_{\beta} (x_{i}) = \frac{e^{x_{i}}}{\sum_{j = 1}^{n} e^{x_{j}} + \beta} . \tag{8}$$

The attention map is multiplied by V, then passed through a $1 \times k_{s}$ convolution and a skip connection, and the final output $H_{out}$ is obtained by the CSA-based FFN, as shown in Eq. (9):

$$H_{out} = \mathrm{CSA} (\mathrm{Conv2d}_{1 \times k_{s}} (A_{map} V) + \mathrm{res} (H_{3})), \tag{9}$$

where $\mathrm{res}$ is the residual structure, including a $1 \times 1$ convolution and a BN layer.

In the preprocessing stage of most action recognition methods, a hyperparameter frame count T is set, and the frame count is unified by cutting sequences that exceed T frames and padding sequences shorter than T frames to facilitate computation. As a result, models are often compelled to allocate additional attention scores to unnecessary frame features. The factor β is introduced to adjust the contrast of the Softmax function, thereby reducing the competition among attention scores. This allows the model to assign lower weights in specific situations. In the FFN process, the CSA module is utilized for feature aggregation, enabling the acquisition of multi-scale spatio-temporal feature information.
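The effect of the contrast factor can be seen in a small numerical sketch of Eq. (8). This is an illustration under assumptions, not the authors' code: the function name `softmax_beta` is hypothetical, and the max-subtraction used for numerical stability is an implementation choice that leaves the value of Eq. (8) unchanged.

```python
import math

def softmax_beta(xs, beta=0.0):
    """Softmax with an extra constant beta added to the denominator (Eq. 8).

    Computed as exp(x_i - m) / (sum_j exp(x_j - m) + beta * exp(-m)),
    which equals exp(x_i) / (sum_j exp(x_j) + beta) for m = max(xs).
    """
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    denom = sum(exps) + beta * math.exp(-m)
    return [e / denom for e in exps]

scores = [0.0, 0.0]                       # e.g. two padded, uninformative frames
plain = softmax_beta(scores)              # beta = 0: ordinary softmax
relaxed = softmax_beta(scores, beta=2.0)  # beta > 0: weights no longer sum to 1
```

With β = 0 the weights must sum to one, so even uninformative frames share the full attention mass. With β > 0 the total can fall below one, which lets a head assign low weight to every frame when none is relevant.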

TCN. With fixed frame sampling, the output features contain feature information of some sub-actions of the input [20]. In this module, the model reverses the earlier frame sampling operation to obtain $H_{4} \in R^{C \times T \times V}$ . To exploit the characteristics of temporal series, the model employs the Temporal Convolution Network (TCN) for temporal modeling, which includes a $k_{t} \times 1$ convolution, a BN operation, and a skip connection. The output is $H_{out} \in R^{C_{out} \times T \times V}$ , where $C_{out}$ is the output dimension of a specific layer block. The TCN aggregates features between sub-actions to obtain long-term dependency relationships in action sequences.

4.3. Four-stream ensemble

Existing research indicates that integrating various modalities, like joint, bone, joint motion, and bone motion, can substantially improve human action recognition [24,25]. In this paper, our evaluation focuses on models utilizing these four modality streams. Specifically, the bone stream, which employs bone modality as its input, is adopted from 2s-AGCN [25], while the approaches for joint motion and bone motion streams are in line with DGNN [24]. The final outcome is derived from a weighted average based on the inferential outputs of these models.
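The score-level fusion described above amounts to a weighted average over the four streams. The sketch below illustrates this under stated assumptions: the per-class scores and the stream weights are made-up illustrative values (the weights used in the paper are not given here), and `fuse_streams` is a hypothetical helper name.

```python
def fuse_streams(stream_scores, weights):
    """Weighted average of per-class scores; returns (fused scores, argmax)."""
    if len(stream_scores) != len(weights):
        raise ValueError("one weight per stream is required")
    total = sum(weights)
    num_classes = len(stream_scores[0])
    fused = [
        sum(w * s[c] for w, s in zip(weights, stream_scores)) / total
        for c in range(num_classes)
    ]
    label = max(range(num_classes), key=fused.__getitem__)
    return fused, label

# Illustrative per-class scores for one sample from the four streams.
scores = {
    "joint":        [0.6, 0.3, 0.1],
    "bone":         [0.2, 0.5, 0.3],
    "joint_motion": [0.3, 0.4, 0.3],
    "bone_motion":  [0.1, 0.6, 0.3],
}
weights = [0.6, 0.6, 0.4, 0.4]   # hypothetical stream weights
fused, label = fuse_streams(list(scores.values()), weights)
```

In this toy case the joint stream alone would pick class 0, while the fused scores favor class 1, showing how complementary modalities can change the final decision.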

Table 2
Ablation study of Top-1 accuracy (%) with different groups. The baseline means using MLP as FFN in the original method. Note that in this experiment, the MAG module is not applied

Methods	#.Param.	Accuracy (%)
Baseline	6.23 M	89.0
$G = 4$	6.12 M	89.2
$G = 8$	5.98 M	89.1
$G = 16$	5.94 M	89.4
$G = 32$	5.93 M	88.9

5. Experiments

To validate the effectiveness and advantage of the proposed MAFormer, we conducted comprehensive experiments on the NTU-60, NTU-120 and N-UCLA datasets. Furthermore, we performed a comparative analysis with current popular models and detailed ablation studies to explore the performance of the proposed modules under various conditions.

5.1. Datasets

NTU RGB+D (NTU-60) encompasses 56,880 skeleton sequences across 60 action categories [23]. Each sequence represents a singular action executed by up to two individuals, recorded from three distinct camera angles. This dataset is segregated into two benchmarks: cross-subject (X-Sub) and cross-view (X-View).

NTU RGB+D 120 (NTU-120), an expansion of NTU RGB+D, comprises 114,480 sequences in 120 categories [16]. Captured via three cameras, it includes 32 different settings, each indicative of a unique locale and backdrop. It is partitioned into cross-subject (X-Sub) and cross-setup (X-Set) benchmarks.

Northwestern-UCLA (N-UCLA) features 1,494 video clips in 10 categories. Each clip involves actions by 10 subjects, recorded through three cameras from varying perspectives. The dataset adheres to the evaluation protocol outlined in [28].

5.2. Experimental settings

All experiments are performed on 2 GTX Titan GPUs. The skeleton sequences, processed as in [5], are set to 120, 120, and 56 frames for NTU-60, NTU-120, and N-UCLA, respectively. Most ablation experiments use the joint modality of NTU-60 under the X-Sub setting with the frame number set to 60. No other data processing or augmentation is applied, for fair comparison. Our model is trained using a stochastic gradient descent (SGD) optimizer with a Nesterov momentum of 0.9 and a weight decay of 0.0005, with cross-entropy as the loss function. The number of training epochs is set to 90 for NTU-60 and NTU-120, and to 30 for N-UCLA. The initial learning rate is 0.1, and a warm-up strategy is employed in the first 5 epochs for more stable learning [11]. For the NTU-60, NTU-120, and N-UCLA datasets, the learning rate is decayed at epochs $[60, 80]$ , $[60, 80]$ , and $[15]$ , and the batch size is set to 32, 32, and 16, respectively. The number of MAFormer blocks is set to 8, and their output channels are $[64, 64, 64, 128, 128, 256, 256, 256]$ . The dimensions $d_{q}$ , $d_{k}$ , $d_{v}$ in each layer are set to $0.25 \times C_{out}$ . The convolution kernels $k_{t}$ , $k_{s}$ are set to $[3, 5]$ . The fixed frame sampling rate, a hyperparameter, is set to $1 / 6$ .
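The learning-rate schedule described above can be sketched as a small function. The initial rate of 0.1, the 5 warm-up epochs, and the decay epochs come from the settings above; the linear warm-up shape, the decay factor `gamma=0.1`, and 0-indexed epochs are assumptions, as they are not stated in the text.

```python
def learning_rate(epoch, base_lr=0.1, warmup_epochs=5, milestones=(60, 80), gamma=0.1):
    """Learning rate for a 0-indexed epoch under warm-up plus step decay.

    gamma and the linear warm-up are assumed values, not taken from the paper.
    """
    if epoch < warmup_epochs:
        # Ramp linearly up to base_lr over the warm-up epochs.
        return base_lr * (epoch + 1) / warmup_epochs
    # Multiply by gamma once for every milestone already reached.
    n_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** n_decays
```

Under these assumptions, the NTU-60 and NTU-120 runs would use `milestones=(60, 80)` over 90 epochs, and the N-UCLA run would use `milestones=(15,)` over 30 epochs.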

Table 3
Ablation study of Top-1 accuracy (%) with different activation functions. The baseline means using original Softmax. Note that in this experiment, the MAG module is not applied

Methods	Accuracy (%)
Baseline	88.6
Tanh	89.0
ReLU	88.4
${Softmax}_{β}$ ( $β = 1$ )	89.3

5.3. Ablation studies

To analyze the impact of the different components of the proposed MAFormer, we examine the performance of our model under different configurations and conditions.

Number of G in CSA. In the CSA structure, which serves as the FFN of the MAA module, the channel dimension is divided into multiple groups of sub-features. This experiment explores how the number of channel groups G affects the balance between capturing detailed information and computational efficiency. For each value of G in the CSA, we report the model’s parameter count and its Top-1 accuracy on NTU-60 under the X-Sub setting, as shown in Table 2, with the best performance marked in bold.

Our findings indicate a modest decrease in the model’s parameter count with increasing G values. This trend suggests that the model becomes increasingly compact as G rises. Concurrently, the rate at which parameters decrease diminishes as G increases further, suggesting an approaching limit to this module’s compression capability. The model’s accuracy exhibits minor fluctuations across varying G configurations. Notably, the model attains its peak accuracy at 89.4% with G set to 16, surpassing the Baseline by 0.4%. This configuration also results in a reduction of 0.29M parameters compared to the Baseline.

These observations underscore that elevating G can preserve or slightly enhance accuracy while marginally reducing the model’s complexity. Furthermore, a G value of 16 appears to offer an optimal balance, yielding higher accuracy with a comparatively lower parameter count.
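The diminishing parameter savings can be made concrete with a generic grouped channel projection. This is only an illustrative sketch: the function `grouped_projection_params` is hypothetical, it does not reproduce the CSA structure, and its counts are not meant to match the totals in Table 2. It shows why splitting channels into G groups divides the weight count of a projection by G, so each doubling of G saves half as much as the previous one.

```python
def grouped_projection_params(channels_in, channels_out, groups, kernel_size=1):
    """Weight count of a grouped channel projection (bias terms ignored)."""
    if channels_in % groups or channels_out % groups:
        raise ValueError("groups must divide both channel counts")
    # Each group maps channels_in/groups inputs to channels_out/groups outputs.
    per_group = (channels_in // groups) * (channels_out // groups) * kernel_size
    return groups * per_group


# Weight counts of a 256 -> 256 projection for the group sizes in Table 2.
counts = {g: grouped_projection_params(256, 256, g) for g in (1, 4, 8, 16, 32)}
```

For a 256-to-256 projection, the counts are 65536, 16384, 8192, 4096, and 2048 weights for G = 1, 4, 8, 16, and 32, so the saving between consecutive settings shrinks from 8192 to 4096 to 2048, consistent with the flattening trend reported above.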

Fig. 3.

Accuracy variation with respect to different parameters. In this experiment, the MAFormer is trained with the RoPE on the 120-frame NTU-60 under the X-Sub setting.

Activation function in FFN. Several studies have examined the impact of activation functions on Transformer models in certain quantized scenarios [3,31]. Inspired by these works, we explore the impact of different activation functions on the performance of our proposed model, attempting to further harness the potential of self-attention mechanisms. The results are shown in Table 3 and Fig. 3a. The model with the Tanh activation function achieves an accuracy of $89.0 %$ , slightly surpassing the baseline’s $88.6 %$ . This improvement may be attributed to the Tanh function’s range between −1 and 1, which could be more effective in processing local features compared to the Softmax function. The introduction of β serves to modulate the contrast of the Softmax function, effectively adjusting the level of competition among attention scores. When β is set to 1, 1.5, and 2, a marginal increase in accuracy is observed. This may be because a larger β yields a more evenly distributed range of attention scores from the activation function, thereby reducing the dominance of certain features in some cases. The performance of the adaptive β slightly exceeds the baseline but is inferior to the fixed settings. This suggests that while making β a learnable parameter adds flexibility to the model, its effectiveness might be compromised by overfitting or other complexities, compared to a fixed optimal value.

RoPE. In this experiment, we explore the impact of the RoPE on the model’s extrapolation and generalization ability. We observe that the MAFormer, when utilizing a traditional PE method, performs poorly in predicting new data that exceeds the frame count range of the training dataset. Conversely, when employing the RoPE operation, the MAFormer shows considerable predictive capability. To validate its effectiveness in recognizing actions beyond the training frame count, we configure the test set with varying frame numbers and compute the model’s test accuracy for each configuration. The experimental results are illustrated in Fig. 3b.

It can be seen that the MAFormer exhibits remarkable stability in accuracy across a large range of frame counts. The variations in accuracy are relatively insignificant, suggesting that the model maintains its performance effectively even as the frame count deviates from the training set’s frame count. Interestingly, the model achieves its peak accuracy at 144 frames (90.7%), slightly higher than the training frame count of 120. This indicates an optimal extrapolation range where the model performs best.

In addition, as the frame count increases beyond 144 frames, there is a gradual but consistent decrease in accuracy. However, this decrease is marginal, suggesting that while the model is optimized for a certain range of frame counts, its performance degradation is controlled and minimal even at higher frame counts. The consistency in performance across different frame counts and the minimal decrease in accuracy at higher frame counts imply that the inclusion of the RoPE in MAFormer effectively enhances its extrapolation ability. The model is capable of handling variations in frame count without significant loss in accuracy, showcasing its robustness and adaptability.

Overall, these results demonstrate the efficacy of the MAFormer model with the RoPE, particularly in maintaining high accuracy across varying frame counts and showing resilience against performance degradation with increased frame numbers. This underlines the model’s strong extrapolation capability and potential for practical application in diverse scenarios where frame count variability is a factor.
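The property that makes rotary embeddings suitable for unseen frame counts can be checked with a small sketch following the general RoPE formulation of [27]: after rotation, the dot product between a query and a key depends only on their relative offset. The helper names `rope` and `dot`, the pairing of adjacent features, and the base of 10000 are illustrative choices and may differ from the MAFormer implementation.

```python
import math


def rope(vector, position, base=10000.0):
    """Rotate consecutive feature pairs of `vector` by position-dependent angles."""
    d = len(vector)
    if d % 2:
        raise ValueError("feature dimension must be even")
    out = []
    for i in range(0, d, 2):
        # The pair starting at index i rotates with frequency base ** (-i / d).
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))
```

Under this sketch, a query at frame 130 and a key at frame 128 produce the same score as the same vectors at frames 3 and 1, even though frames 128 and 130 lie beyond a 120-frame training length. Attention patterns learned on relative offsets can therefore transfer to longer sequences.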

Four-stream ensemble.

Table 4

Top-1 accuracy (%) of the methods using various data modalities on the NTU-60 and NTU-120 datasets

Modality	NTU-60		NTU-120

	X-Sub	X-View	X-Sub	X-Set
Joint	90.5	95.4	87.0	88.6
Bone	89.9	94.6	86.2	87.9
Joint motion	88.1	92.0	83.9	85.7
Bone motion	87.7	91.5	83.6	85.2
Joint + Bone	92.1	96.5	88.5	90.5
4 ensemble	93.2	97.0	89.9	91.2

Table 5

Performance comparison between MAFormer and prevailing methods in skeleton-based human action recognition tasks on the NTU-60, NTU-120, and N-UCLA datasets

Method	Year	NTU-60		NTU-120		N-UCLA

		X-Sub	X-View	X-Sub	X-Set
ST-GCN [35]	AAAI 2018	81.5	88.3	-	-	-
DGNN [24]	CVPR 2019	89.9	96.1	-	-	-
Dynamic-GCN [36]	ACM MM 2020	91.5	96.0	87.3	88.6	-
SGN [38]	CVPR 2020	89.0	94.5	79.2	81.5	92.5
DDGCN [14]	ECCV 2020	91.1	97.1	-	-	-
DC-GCN+ADG [7]	ECCV 2020	90.8	96.6	86.5	88.1	-
MS-G3D [18]	CVPR 2020	91.5	96.2	86.9	88.4	-
MST-GCN [6]	AAAI 2021	91.5	96.6	87.5	88.8	-
CTR-GCN [5]	ICCV 2021	92.4	96.8	88.9	90.6	96.5
InfoGCN [8]	CVPR 2022	92.7	96.9	89.4	90.7	96.6
InfoGCN (6 s) [8]	CVPR 2022	93.0	97.1	89.8	91.2	97.0
STF (2 s) [13]	AAAI 2022	92.5	96.9	88.9	89.9	-
Ta-CNN [34]	AAAI 2022	90.4	94.8	85.4	86.8	96.1
EfficientGCN [26]	TPAMI 2022	91.7	95.7	88.3	89.1	-
AES [19]	TNNLS 2022	91.6	96.3	88.2	89.2	-
CTR-GCN+FR [39]	CVPR 2023	92.8	96.8	89.5	90.9	96.8
STAR-Transformer [1]	WACV 2023	92.0	96.5	-	-	-
MotionBERT [41]	ICCV 2023	93.0	97.2	-	-	-
Ours (Joint only)	-	90.5	95.4	87.0	88.6	95.2
Ours (2 s)	-	92.1	96.5	88.5	90.5	96.2
Ours (4 s)	-	93.2	97.0	89.9	91.2	97.2

In our study, a four-stream ensemble approach is employed to evaluate the trained models, encompassing joint, bone, joint motion, and bone motion streams. These individual streams, along with the ensemble methods, are assessed using the MAFormer on the NTU-60 and NTU-120, with results detailed in Table 4.

Analyses reveal a progressive enhancement in model performance with an increasing number of streams in the ensemble method. Specifically, on the NTU-60 under the X-Sub benchmark, employing joint + bone and the four-stream ensemble methods results in accuracy improvements of $1.6 %$ and $2.7 %$ , respectively, over the joint-only modality. This finding underscores the efficacy of multi-modal representation in diversifying input features, thereby augmenting the representational capacity and generalization potential of the model.

5.4. Comparison with the prevailing methods

The performance of the proposed MAFormer was assessed on three widely recognized benchmarks and compared against several contemporary leading methods. As shown in Table 5, the MAFormer outperforms the compared methods in various settings, particularly when the number of streams is equal. Specifically, without the joint motion and bone motion streams, the MAFormer (2 s) achieves a $0.6 %$ higher accuracy than STF (2 s) [13] on the NTU-120 dataset under the X-Set benchmark. Moreover, with only four streams, the MAFormer (4 s) surpasses the six-stream InfoGCN (6 s) [8] by a margin of 0.2% on the NTU-60 under the X-Sub benchmark, exceeding the compared SOTA methods in this setting.

The comparative analysis illustrates the robust performance of the MAFormer and highlights its promising potential in the realm of human action recognition.

6. Discussion and analysis

The visualization of attention scores comparing a specific joint to all other joints within sampled segments of three distinct actions, as depicted in Fig. 4, reveals a striking pattern. The figures predominantly show an increased focus on joints in the initial frames of the actions. This phenomenon can be attributed to the fact that initial movements are typically more unique and informative, thus playing a critical role in action recognition. For actions such as “Taking off Glasses” and “Touching Pocket,” the early movements are essential for the model to differentiate them from similar actions. The variance in attention scores across different actions suggests that the model modulates its focus in response to the distinct dynamic characteristics of each action. The observed pattern in the attention scores reflects the model’s learning mechanism, which emphasizes the early stages of an action. This emphasis could be a consequence of the training data’s characteristics or the inherent design of the model’s attention mechanism, which may be optimized to capture the onset of actions with greater precision. It implies that models should not only concentrate on the entire duration of an action but also should assign greater importance to the initial movements. These insights can inform future advancements in model architecture and training methodologies, thereby enhancing the accuracy and efficiency of human action recognition.

Fig. 4.

Attention distribution between certain joints and all other joints in different sampled action clips. Thicker and darker dotted lines indicate higher attention scores. For simplicity, only lines with scores exceeding 0.7 are shown.

7. Limitations

Although the MAFormer exhibits commendable performance on the NTU-60, NTU-120, and N-UCLA datasets, these datasets are not without limitations or biases. For example, their specific demographics, actions, and environmental conditions may not fully represent all real-world scenarios. In practical applications, optimizing the MAFormer’s performance relies on video data captured from multiple perspectives by different cameras. For scenarios where multiple cameras cannot be deployed, data augmentation techniques can be used to simulate actions from different perspectives, thereby extending the model’s adaptability. Furthermore, the current implementation of the MAFormer is restricted to the fusion of four streams, leaving the potential advantages of additional stream combinations largely untapped. An exciting direction for future research is to replace certain modules in the MAFormer with Vision Transformers (ViTs) or similar architectures, and to apply the model to a broader range of tasks, such as scene classification and segmentation.

8. Conclusions

In summary, this study contributes to the domain of human action recognition. The proposed MAFormer model, which is hierarchically structured and operates across spatial, temporal, and channel dimensions, addresses gaps in existing methodologies, especially in handling hierarchical information within action sequences. The employment of the CSA structure for feature extraction, serving as the FFN of the MAA module, enhances the model’s capacity to process multi-scale information, thus boosting its generalization performance. The integration of the RoPE within the MAFormer noticeably strengthens the model’s extrapolation and generalization abilities, differentiating it from traditional approaches. Our comparative experiments and evaluations demonstrate the MAFormer’s advantage over mainstream methods and its effectiveness in capturing local dependencies within action sequences. By improving the understanding of correlations and structural complexities both within and among the features, the MAFormer achieves competitive performance against state-of-the-art methods.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (62376286 and 62105038) and the Research and Development Program of Beijing Municipal Education Commission (KM202211232001). We are grateful for the support of these organizations.

References

1. Ahn, Kim, Hong and B.C. Ko, Star-transformer: A spatio-temporal cross attention transformer for human action recognition, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 3330–3339.

2. Ashraf, Saleem, Ahmed, Aslam and Shuaeeb, Iris and foot based sustainable biometric identification approach, in: 2020 International Conference on Software, Telecommunications and Computer Networks (SoftCOM), 2020, pp. 1–6. doi:10.23919/SoftCOM50211.2020.9238333.

3. Bondarenko, Nagel and Blankevoort, Quantizable transformers: removing outliers by helping attention heads do nothing, 2023. arXiv preprint arXiv:2306.12929.

4. Chen and Mo, Swin-fusion: Swin-transformer with feature fusion for human action recognition, Neural Processing Letters 55(8) (2023), 11109–11130. doi:10.1007/s11063-023-11367-1.

5. Chen, Zhang, Yuan, Li, Deng and Hu, Channel-wise topology refinement graph convolution for skeleton-based action recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.

6. Chen, Li, Yang, Li and Liu, Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition, in: AAAI, 2021.

7. Cheng, Zhang, Cao, Shi, Cheng and Lu, Decoupling gcn with dropgraph module for skeleton-based action recognition, in: Computer Vision–ECCV 2020: 16th European Conference, 2020.

8. H.-G. Chi, M.H. Ha, Chi, S.W. Lee, Huang and Ramani, Infogcn: Representation learning for human skeleton-based action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

9. A.C. Cob-Parro, Losada-Gutiérrez, Marrón-Romera, Gardel-Vicente and Bravo-Muñoz, A new framework for deep learning video based human action recognition on the edge, Expert Systems with Applications 238 (2024), 122220. doi:10.1016/j.eswa.2023.122220.

10. Delétang, Ruoss, Grau-Moya, Genewein, L.K. Wenliang, Catt, Cundy, Hutter, Legg, Veness et al., Neural networks and the Chomsky hierarchy, 2022. arXiv preprint arXiv:2207.02098.

11. He, Zhang, Ren and Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

12. Hu, Fang, Han and Qi, Multi-scale adaptive graph convolution network for skeleton-based action recognition, IEEE Access (2024).

13. Ke, K.-C. Peng and Lyu, Towards to-a-T spatio-temporal focus for skeleton-based action recognition, in: AAAI, 2022.

14. Korban and Li, Ddgcn: A dynamic directed graph convolutional network for action recognition, in: Computer Vision–ECCV 2020: 16th European Conference, 2020.

15. Lee, Lee and Lee, Hierarchically decomposed graph convolutional networks for skeleton-based action recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10444–10453.

16. Liu, Shahroudy, Perez, Wang, L.-Y. Duan and A.C. Kot, NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding, IEEE Transactions on Pattern Analysis and Machine Intelligence (2019), 2684–2701.

17. Liu, Zhang, Xu and He, Graph transformer network with temporal kernel attention for skeleton-based action recognition, Knowledge-Based Systems 240 (2022), 108146. doi:10.1016/j.knosys.2022.108146.

18. Liu, Zhang, Chen, Wang and Ouyang, Disentangling and unifying graph convolutions for skeleton-based action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

19. Qin, Liu, Ji, Kim, Wang, R.I. McKay, Anwar and Gedeon, Fusing higher-order features in graph neural networks for skeleton-based action recognition, IEEE Transactions on Neural Networks and Learning Systems 35(4) (2022), 4783–4797. doi:10.1109/TNNLS.2022.3201518.

20. Qiu, Hou, Ren and Zhang, Spatio-temporal tuples transformer for skeleton-based action recognition, 2022. arXiv preprint arXiv:2201.02849.

21. Ruoss, Delétang, Genewein, Grau-Moya, Csordás, Bennani, Legg and Veness, Randomized positional encodings boost length generalization of transformers, 2023. arXiv preprint arXiv:2305.16843.

22. E.M. Saoudi, Jaafari and S.J. Andaloussi, Advancing human action recognition: A hybrid approach using attention-based LSTM and 3D CNN, Scientific African 21 (2023), e01796. doi:10.1016/j.sciaf.2023.e01796.

23. Shahroudy, Liu, T.-T. Ng and Wang, NTU RGB+D: A large scale dataset for 3D human activity analysis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

24. Shi, Zhang, Cheng and Lu, Skeleton-based action recognition with directed graph neural networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

25. Shi, Zhang, Cheng and Lu, Two-stream adaptive graph convolutional networks for skeleton-based action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

26. Y.-F. Song, Zhang, Shan and Wang, Constructing stronger and faster baselines for skeleton-based action recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (2022), 1474–1488. doi:10.1109/TPAMI.2022.3157033.

27. Su, Lu, Pan, Murtadha, Wen and Liu, Roformer: Enhanced transformer with rotary position embedding, 2021. arXiv preprint arXiv:2104.09864.

28. Wang, Nie, Xia, Wu and S.-C. Zhu, Cross-view action modeling, learning and recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.

29. Wang and Koniusz, 3mformer: Multi-order multi-mode transformer for skeletal action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5620–5631.

30. Wang, Peng, Shi, Liu, He and Weng, Iip-transformer: Intra-inter-part transformer for skeleton-based action recognition, 2021. arXiv preprint arXiv:2110.13385.

31. Wortsman, Lee, Gilmer and Kornblith, Replacing softmax with relu in vision transformers, 2023. arXiv preprint arXiv:2309.08586.

32. Xie, Wang, Yu, Anandkumar, J.M. Alvarez and Luo, SegFormer: Simple and efficient design for semantic segmentation with transformers, in: Advances in Neural Information Processing Systems, Vol. 34, 2021, pp. 12077–12090.

33. Xie, Meng, Zhao, Nguyen, Yang and Zheng, Dynamic semantic-based spatial graph convolution network for skeleton-based human action recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 6225–6233.

34. Xu, Ye, Zhong and Xie, Topology-aware convolutional neural network for efficient skeleton-based action recognition, in: AAAI, 2022.

35. Yan, Xiong and Lin, Spatial temporal graph convolutional networks for skeleton-based action recognition, in: AAAI, 2018.

36. Ye, Pu, Zhong, Li, Xie and Tang, Dynamic gcn: Context-enriched topology learning for skeleton-based action recognition, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 55–63. doi:10.1145/3394171.3413941.

37. Yuan, He, Wang, Xu and Ma, Improving small-scale human action recognition performance using a 3D heatmap volume, Sensors 23(14) (2023), 6364. doi:10.3390/s23146364.

38. Zhang, Lan, Zeng, Xing, Xue and Zheng, Semantics-guided neural networks for efficient skeleton-based human action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

39. Zhou, Liu and Wang, Learning discriminative representations for skeleton based action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

40. Zhou, Z.-Q. Cheng, Li, Fang, Geng, Xie and Keuper, Hypergraph transformer for skeleton-based action recognition, 2022. arXiv preprint arXiv:2211.09590.

41. Zhu, Ma, Liu, Wu and Wang, Motionbert: A unified perspective on learning human motion representations, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15085–15099.
