Human action recognition with transformer based on convolutional features

Abstract

Human action recognition is a key research direction in computer vision, with applications in video surveillance, human-computer interaction, sports analysis, and healthcare. However, the diversity and complexity of human actions pose many challenges, such as handling complex actions, distinguishing similar actions, coping with changes in viewing angle, and overcoming occlusion. To address these challenges, this paper proposes a framework for human action recognition that combines a recent pose estimation algorithm, pre-trained CNN models, and a Vision Transformer. First, the pose estimation algorithm extracts human pose information from real RGB image frames. Then, a pre-trained CNN model performs feature extraction on the extracted pose information. Finally, the Vision Transformer model fuses and classifies the extracted features. Experimental validation is conducted on two benchmark datasets, UCF 50 and UCF 101, to demonstrate the effectiveness and efficiency of the proposed framework. The applicability and limitations of the framework in different scenarios are further explored through quantitative and qualitative experiments, providing insights for future research.

Keywords

Human action recognition, convolutional features, pose estimation, transformer

1. Introduction

Human Action Recognition (HAR) aims to classify human actions and is now commonly addressed with attention-based deep neural networks. HAR is fundamental to several application domains, such as robotics [1], surveillance and security [2, 3], and autonomous driving [4, 5]. Despite these applications, HAR remains a difficult problem because of the large volume of video data, untrimmed videos, multiple actors, the number of action classes, etc. Previous works [6, 7, 8] have focused on long-time HAR, which depends on long-term past and future information. Real-time applications, however, call for short-time HAR, which relies only on near-past information.

In solving HAR problems, Convolutional Neural Network (CNN) based methods are usually complex and require a large number of parameters to train, which increases the computational overhead. The Transformer network [9], on the other hand, can use far fewer trainable parameters and processes the elements of a sequence in parallel, which improves computational efficiency compared with the sequential input processing of an LSTM. The Transformer was proposed in 2017 as an attention-based alternative to recurrent models such as the LSTM and has since been widely used in the field of Natural Language Processing (NLP). In 2020, Dosovitskiy et al. [10] first introduced the Transformer to the field of computer vision (CV) by proposing the Vision Transformer (ViT). In recent years, researchers have started combining the attention mechanism with convolutional neural networks for human action recognition, which effectively improves model performance. Among them, Mazzia et al. [11] proposed the Action Transformer (AcT) model, which applies the Transformer to human action recognition using a dataset based on body joints (keypoints) extracted from video frames. Body joints are extracted with pose estimation algorithms such as OpenPose [12] and PoseNet [13]. However, these methods have limitations in terms of speed and accuracy. To address this problem, we propose a new approach that combines a CNN with a Transformer. First, image keypoint features are extracted using MoveNet [14], and then spatial features are extracted using a pre-trained CNN model. Finally, spatio-temporal features are processed by a Transformer encoder to classify the frame sequence into activity classes. This approach combines high-precision keypoint detection with high-speed detection capability, improving the performance of the HAR model. The main contributions of this paper are summarized as follows:

1. We propose a new framework for human action recognition that extracts information-rich spatial features using three pre-trained CNN models through transfer learning. The convolutional feature extractor works on real RGB image frames to provide a richer learning space.
2. We use a recent pose estimation algorithm as the model's feature extractor and build a training and inference pipeline that feeds convolutional features to a Transformer encoder.
3. We perform varied experiments to demonstrate the relevance of various architectural design choices and their impact on performance and complexity. We also evaluate the proposed framework on two benchmark datasets, UCF 101 and UCF 50, and achieve accuracies of 87.50% and 83.41%, respectively.

The rest of the paper is organized as follows. Section 2 analyzes related work in the field of human action recognition. Section 3 describes our proposed framework for human action recognition in detail. Section 4 evaluates the performance of the proposed framework on benchmark datasets and discusses the results. Finally, Section 5 summarizes the main contributions of this paper and suggests some future research directions.

2. Literature review

HAR is one of the most active research directions in video understanding. Current HAR methods draw on a variety of modalities, such as appearance, depth, optical flow, and body skeleton, to understand video content more comprehensively. Deep learning models, especially CNNs, play an important role in HAR, as CNNs have been the core technology for many visual recognition tasks. For HAR, Simonyan and Zisserman [15] proposed a standard approach that averages the predictions of two separately trained 2D CNNs to capture complementary appearance and motion information: one CNN processes RGB frames and the other processes optical flow frames. Another popular HAR method is the 3D ConvNet, which can model temporal information without relying on optical flow. The work of Ji et al. [16] further developed 3D ConvNets by performing 3D convolutions to extract motion features from spatial and temporal information.

With further research, pose estimation algorithms have gradually been combined with HAR, providing new ideas for solving the occlusion problem. For example, Angelini et al. [17] proposed ActionXPose, an action recognition method based on 2D pose estimation. The method uses OpenPose for skeleton feature extraction and employs a long short-term memory (LSTM) network and a 1D CNN for classification. The classification process also incorporates the MLSTM-FCN structure [18], which combines a 1D CNN, an LSTM [19], and squeeze-and-excitation attention [20].

Graph Convolutional Network (GCN) based approaches have advantages in processing data with generalized topologies, such as human skeletal data. Compared with CNNs and RNNs, GCNs can dig deeper into the intrinsic features of such data, and graph convolution [21, 22] performs well on human skeletal data. Yan et al. [23] first applied a graph convolutional network to action recognition and proposed the spatio-temporal graph convolutional network (ST-GCN). This method uses OpenPose to extract skeleton data from videos and combines graph convolution with spatio-temporal information to classify actions, providing important insights for subsequent researchers. Li et al. [24] further optimized this approach by proposing the actional-structural graph convolutional network (AS-GCN) and introduced an encoder-decoder structure, called the A-link inference module (AIM), to capture action-specific latent dependencies. This approach combines actional-structural graph convolution and temporal convolution to learn spatial and temporal features for action recognition. Shi et al. [25] also proposed the two-stream adaptive graph convolutional network (2s-AGCN) based on their previous work; this approach makes full use of the correlation information between the joints and the bones of the skeleton to further improve the accuracy of action recognition.

However, skeleton-based action recognition faces a challenge: the lack of long-term contextual information. To solve this problem, Cho et al. [26] applied a self-attention network to skeleton-based action recognition, which addresses the problem by focusing on long-term dependencies. In recent years, the Transformer proposed by Vaswani et al. [9] for machine translation, together with powerful architectures built on it such as BERT [27] and GPT [28], has made remarkable achievements in the field of NLP. Plizzari et al. [29] explored further on this basis and proposed a two-stream spatio-temporal Transformer network (ST-TR). This model fuses the Transformer’s self-attention mechanism with graph convolution and shows excellent performance in capturing dynamic dependencies between joints.

The Transformer architecture has also been applied to vision tasks. Dosovitskiy et al. [10] introduced the Vision Transformer (ViT), a pure Transformer model that performs classification directly on image patches. Gowda et al. [30] further showed that convolutional stems can significantly improve optimization stability and increase peak performance. In addition, Mazzia et al. [11] extended the idea of ViT to the field of HAR by proposing AcT, which encodes the pose information estimated from the video and performs action classification using ViT. Unlike previous approaches, this paper proposes a human action recognition architecture based on a CNN and ViT. In addition, we use a recent pose estimation algorithm as a feature extractor for the model and investigate the model at different scales, examining the effects of the number of parameters and the number of attention heads.

3. Proposed method

Our goal is to classify a temporal sequence of images, i.e., a video, into one of several class labels representing the type of action happening in the scene. A graphical visualization is given in Fig. 1. More formally, given a sequence of frames $F=f_{1},\ldots,f_{i},\ldots,f_{K}$ , we want to estimate a class label $a\in A$ that represents the action happening in those frames. Here $A$ is the pre-defined set of action labels. The sequence of frames $F$ has length $K$ and each individual frame is $f_{i}\in R^{H\times W\times C}$ where $H$ , $W$ and $C$ correspond to height, width and channels respectively.

Figure 1.

Architecture of proposed model.

3.1 MoveNet

MoveNet is a lightweight human pose estimation model based on the CenterNet model, proposed by Google in 2021. The model uses MobileNetV2 [31] combined with a Feature Pyramid Network (FPN) as a feature extractor. The feature extraction network feeds four prediction heads. The Center head predicts the geometric center of the human body: each pixel of its output is multiplied by a weight that decreases with distance from the image center, so that the body center closest to the center of the image is selected. The Keypoint Regression head uses the selected body center coordinates to perform a rough regression of the 17 joint coordinates. The Keypoint Heatmap head is used to obtain more accurate keypoint locations. Finally, the corresponding values from the Local Offset head are added to these coordinates to obtain the final pose estimation result. This process is illustrated in Fig. 2.
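As a rough illustration of the four decoding steps just described, the sketch below decodes a single pose from toy head outputs with NumPy. The function name `decode_single_pose`, the array layouts, and the inverse-distance weight `1 / (1 + distance)` are assumptions made for this example; it is not the MoveNet implementation.

```python
import numpy as np


def decode_single_pose(center_hm, kpt_regress, kpt_hm, kpt_offsets):
    """Toy single-person decoder (all shapes below are assumptions).

    center_hm:   (H, W)        body-center heatmap
    kpt_regress: (H, W, J, 2)  (dy, dx) from a grid cell to each joint
    kpt_hm:      (H, W, J)     per-joint heatmaps
    kpt_offsets: (H, W, J, 2)  local (dy, dx) refinement per joint
    Returns a (J, 2) array of (y, x) joint positions in grid units.
    """
    H, W, J = kpt_hm.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)

    # Step 1: weight the center heatmap by inverse distance to the frame
    # center and pick the best-scoring cell as the body center.
    dist_to_mid = np.hypot(ys - (H - 1) / 2.0, xs - (W - 1) / 2.0)
    weighted = center_hm / (1.0 + dist_to_mid)
    cy, cx = np.unravel_index(np.argmax(weighted), weighted.shape)

    joints = np.zeros((J, 2))
    for j in range(J):
        # Step 2: coarse joint position regressed from the body center.
        coarse = np.array([cy, cx]) + kpt_regress[cy, cx, j]
        # Step 3: favor heatmap peaks that lie close to the coarse estimate.
        dist = np.hypot(ys - coarse[0], xs - coarse[1])
        scores = kpt_hm[:, :, j] / (1.0 + dist)
        py, px = np.unravel_index(np.argmax(scores), scores.shape)
        # Step 4: add the local offset predicted at the selected cell.
        joints[j] = np.array([py, px]) + kpt_offsets[py, px, j]
    return joints
```

With 17 joints, `kpt_hm` would have `J = 17` channels; the grid coordinates returned here would still need to be mapped back to image pixels.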

Figure 2.

MoveNet architecture.

3.2 Convolutional feature extraction

Feature extraction from input frames using convolutional layers helps transformers to focus on better feature representations. Given a sequence of input frames $F=f_{1},\ldots,f_{i},\ldots,f_{K}$ , we extract a convolutional feature representation $x_{i}$ using a feature extractor network $FE$ with weights $\theta_{f}$ . This feature representation is extracted for each frame in $F$ separately, so we obtain a sequence of feature representations $X=x_{1},\ldots,x_{i},\ldots,x_{K}$ :

$\displaystyle x_{i}=FE_{\theta_{f}}(f_{i})$ (1)

Here $x_{i}\in R^{d}$ where $d$ is the dimension of the feature extractor output. Note that $d$ is kept the same across the subsequent sequence-level embedding and the Transformer encoder blocks; it therefore plays the role of $d_{\textit{model}}$ in ViT. The feature extractor network is similar to the convolutional stem of models pre-trained on ImageNet [32]. We experimented with ResNet-18 [33], EfficientNet (v2 s, v2 m, and B0) [34, 35] and WideResNet 50-2 [36]. For a fair comparison with the baseline architecture, we also set the feature extractor $FE$ to a 2-layer MLP, which represents the linear projection applied to the input poses in the baseline work, and compare the accuracy.
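The per-frame extraction of Eq. (1) can be sketched as follows. The toy extractor (global average pooling followed by a linear map) is only a stand-in for a pre-trained CNN, and all names and shapes are assumptions chosen for illustration.

```python
import numpy as np


def toy_feature_extractor(frame, weight):
    """Stand-in for FE: average over H and W, then project C channels to d."""
    pooled = frame.mean(axis=(0, 1))   # (C,)
    return pooled @ weight             # (d,)


def extract_sequence(frames, weight):
    """Apply the extractor to each frame independently: (K, H, W, C) -> (K, d)."""
    return np.stack([toy_feature_extractor(f, weight) for f in frames])


rng = np.random.default_rng(0)
K, H, W, C, d = 5, 16, 16, 3, 8
frames = rng.random((K, H, W, C))        # K RGB frames
theta_f = rng.standard_normal((C, d))    # hypothetical extractor weights
X = extract_sequence(frames, theta_f)    # sequence of K feature vectors
```

Because every frame is processed separately, temporal relations between frames are left entirely to the Transformer encoder that consumes `X`.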

3.2.1 ResNet-18

ResNet-18 is a ResNet with a depth of 18 layers, as shown in Fig. 3. ResNet (Residual Network) addresses vanishing and exploding gradients during the training of deep neural networks by introducing residual blocks, enabling networks to be trained at greater depths. Residual blocks form the core component of ResNet-18, each comprising two 3 $\times$ 3 convolutional layers, each followed by Batch Normalization and a ReLU activation function. In traditional convolutional neural networks, after a series of convolution, activation, and pooling operations, the spatial information and detailed features of the input data gradually diminish. In ResNet, however, the flow of information is maintained through residual connections, which allow the input to bypass one or more layers and then be summed with the outputs of these layers. Specifically, if the input of a residual block is $x$ , and the output after two convolutional layers is $F(x)$ , then the final output of that residual block is $H(x)=F(x)+x$ . The “ $+$ ” operation here is an element-wise summation, which requires the dimensions of $F(x)$ and $x$ to be the same. If the dimensions differ, the dimension of $x$ can be adjusted to match $F(x)$ by a 1 $\times$ 1 convolution.
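The residual connection $H(x)=F(x)+x$ and the optional 1 $\times$ 1 projection of the shortcut can be sketched in a few lines. The residual branch is passed in as an arbitrary callable, so the two 3 $\times$ 3 convolutions are not implemented here; the function names are hypothetical.

```python
import numpy as np


def conv1x1(x, w):
    """1x1 convolution: mixes channels at each position. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def residual_block(x, residual_fn, proj_w=None):
    """Return H(x) = F(x) + x, projecting x with a 1x1 conv when proj_w is given."""
    fx = residual_fn(x)
    shortcut = x if proj_w is None else conv1x1(x, proj_w)
    if fx.shape != shortcut.shape:
        raise ValueError("F(x) and the shortcut must have the same shape")
    return fx + shortcut
```

When the branch changes the number of channels, passing `proj_w` with shape `(C_out, C_in)` makes the shortcut match before the element-wise sum.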

Figure 3.

ResNet-18 architecture.

3.2.2 EfficientNet

EfficientNet is a family of models proposed by Google Research in 2019 to optimize model size and computational efficiency while maintaining high accuracy under different computational and resource constraints, as shown in Fig. 4. It achieves this by scaling network depth, width, and input resolution jointly with a compound scaling method. The same scaling method can be applied to existing baselines (e.g., ResNet or MobileNet), while the EfficientNet family itself is scaled up from the EfficientNet-B0 baseline network. Through depthwise separable convolutions and linear bottlenecks, the EfficientNet building block reduces the number of parameters and the computational effort of the network while maintaining good performance, and its inverted bottleneck structure with an expansion convolution further improves efficiency. Compound scaling aims at the best performance for given computational resources. Specifically, it first fixes the compound coefficient and uses a search to find the best combination of the depth, width, and resolution coefficients $\alpha$ , $\beta$ , and $\gamma$ . Then, by gradually scaling up the network with these three coefficients, eight network models from $B0$ to $B7$ are obtained. This scaling strategy is one of the core innovations of EfficientNet: accuracy improves gradually across the eight versions from $B0$ to $B7$ , and each version is designed for its specific resource constraints.

In April 2021, the Google team released EfficientNetV2, an optimized version. Compared with EfficientNet, EfficientNetV2 has fewer parameters and trains faster, owing to improvements in the network structure and training strategy. EfficientNetV2 maintains high performance while achieving faster inference and a smaller model size, providing a better choice for real-world applications.
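To make the compound scaling rule concrete, the sketch below computes the depth, width, and resolution multipliers for a compound coefficient $\phi$ as $\alpha^{\phi}$ , $\beta^{\phi}$ , and $\gamma^{\phi}$ . The constants are the ones reported in the EfficientNet paper for the B0 baseline; the function name is hypothetical, and the integer values of $\phi$ used in any experiment with it are illustrative, since the released B1 to B7 models need not correspond to integer $\phi$ .

```python
# Coefficients reported in the EfficientNet paper (found by a small search at phi = 1
# under the constraint alpha * beta**2 * gamma**2 ~= 2).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution, flops) multipliers for compound coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow roughly linearly with depth and quadratically with width and resolution.
    flops = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return depth, width, resolution, flops
```

Each unit increase of $\phi$ therefore roughly doubles the FLOPs budget, which is why the three dimensions are scaled together rather than one at a time.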

Figure 4.

EfficientNet architecture.

3.2.3 WideResNet 50-2

WideResNet is a network architecture that improves on ResNet by increasing the width of the network, i.e., the number of channels in its convolutional layers, rather than its depth. The wider layers enhance the expressive power of the model and thereby improve performance. Although widening increases the number of parameters and the computational complexity, WideResNet still computes efficiently and achieves excellent results, which makes it an effective network architecture.

3.3 Sequence-level embedding

Our approach can use an arbitrary number of frames $K$ for activity classification, so we need to accumulate the information that is most relevant to the classification task. This is done with a $[\textit{class}]$ token, which we prepend to the sequence $X$ . This idea has its roots in NLP literature such as BERT and, more recently, in ViT. However, in contrast to ViT, we apply it to the sequence of frame representations instead of image patches. Formally, we prepend the class token $x_{0}$ and obtain a new sequence $X_{\textit{cls}}$ of length $K+1$ :

$\displaystyle X_{\textit{cls}}=[x_{0};X]=[x_{0},x_{1},\ldots,x_{i},\ldots,x_{K}]$ (2)

Then, we apply positional encoding $PE$ using sine and cosine functions similar to those in [9]. We also experimented with learnable positional embeddings; however, as in [9], we observed that this did not affect classification accuracy much. Lastly, we obtain a sequence $Z^{0}$ of length $K+1$ , which serves as the input to our Transformer encoder.
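A minimal sketch of this embedding step is given below: it prepends a class token to the frame features and adds the sinusoidal encoding of [9]. The function names are hypothetical, the class token is passed in as a plain array (in training it would be a learnable parameter), and $d$ is assumed to be even.

```python
import numpy as np


def sinusoidal_encoding(length, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(length)[:, None]            # (length, 1)
    even = np.arange(0, d, 2)[None, :]          # (1, d/2), holds the values 2i
    angles = pos / np.power(10000.0, even / d)
    pe = np.zeros((length, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


def build_encoder_input(X, cls_token):
    """Prepend the class token to X (K, d) and add positions: returns Z0 of shape (K+1, d)."""
    X_cls = np.concatenate([cls_token[None, :], X], axis=0)
    return X_cls + sinusoidal_encoding(*X_cls.shape)
```

The class token sits at position 0, so the classification head later reads the first row of the encoder output.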

3.4 Transformer encoder architecture

A complete Transformer Encoder module comprises multiple Transformer Blocks, each including Layer Normalization (Norm), a Multi-Head Self-Attention mechanism, residual connections, and an MLP. The dimension of the input sequence remains unchanged after passing through these blocks. Each block first normalizes the input sequence with a Norm layer to speed up convergence. The multi-head self-attention mechanism then captures contextual information in the input sequence; by splitting the computation into multiple heads that compute attention weights independently, it lets the model view the data from several perspectives. The attention output is added to the original input through a residual connection to mitigate the vanishing gradient problem. The result is normalized again by a Norm layer and passed through an MLP consisting of a fully connected layer, a GELU activation function, and Dropout, which transforms each token into a high-dimensional feature vector.

Multi-head attention is the core component of the Transformer model for capturing contextual information in the input sequence. The mechanism projects the tokens into multiple lower-dimensional subspaces (heads) and computes attention weights independently in each; the outputs of all heads are then combined to produce the final attention output. This design enables the model to attend to different parts of the input sequence simultaneously, thus capturing contextual information more effectively. In each head, every token is mapped to a query, a key, and a value through three fully connected layers; stacked over the sequence, these form the matrices $Q$ , $K$ , and $V$ . These matrices play a crucial role in Scaled Dot-Product Attention: $Q$ encodes what a token is looking for in other tokens, $K$ encodes the attributes by which a token is matched, and $V$ carries the information contained in the token itself.

Each head is a Scaled Dot-Product Attention module. By performing the attention computation separately in several heads, Multi-Head Attention can interpret the input data from several different perspectives, thus enhancing the model’s representational capabilities. This design makes the model more flexible in dealing with complex patterns and improves its ability to retain information over long sequences. At the same time, the $h$ Scaled Dot-Product Attention modules have no dependencies on one another and can run in parallel, which allows Multi-Head Attention to significantly improve training and inference speed compared with traditional RNN models and makes the Transformer architecture more efficient on long sequences. Scaled Dot-Product Attention is calculated as follows:

$\displaystyle\textit{Attention}(Q,K,V)=\textit{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$ (3)

In this way, the Scaled Dot-Product Attention module can capture important features in the input data and compare them with other tokens to generate outputs with rich contextual information. In Multi-Head Attention, multiple Scaled Dot-Product Attention modules work in parallel to understand the input data from different perspectives, which further enhances the model’s representational capabilities.
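Eq. (3) and its multi-head extension can be written compactly with NumPy. This is a sketch under simplifying assumptions: the projection matrices `Wq`, `Wk`, `Wv`, and `Wo` are plain arrays without biases, dropout is omitted, and the model dimension is assumed to be divisible by the number of heads.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Eq. (3): returns the attended values and the attention weights."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.swapaxes(-1, -2) / np.sqrt(d_k))
    return weights @ V, weights


def multi_head_self_attention(Z, Wq, Wk, Wv, Wo, num_heads):
    """Self-attention over Z (n, d) with num_heads heads of size d // num_heads."""
    n, d = Z.shape
    d_head = d // num_heads

    def split(M):
        # (n, d) -> (num_heads, n, d_head): each head sees its own slice of features.
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    heads, _ = scaled_dot_product_attention(split(Z @ Wq), split(Z @ Wk), split(Z @ Wv))
    concat = heads.transpose(1, 0, 2).reshape(n, d)   # concatenate the heads again
    return concat @ Wo
```

All heads are computed in one batched matrix product, which reflects the parallelism discussed above: no head depends on the output of another.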

Figure 5.

The left side demonstrates the Architecture of Transformer Encoder while the right side demonstrates the Multi-Head Self-Attention Model.

Our Transformer Encoder follows an approach similar to the Vision Transformer and the Action Transformer. It consists of repeated encoding blocks, each made up of Multi-headed Self Attention (MSA) and a Multi-layer Perceptron (MLP). MSA uses $H$ self-attention heads, which help to identify important features in the input via the attention mechanism. As shown in Fig. 5, the left side demonstrates the architecture of the Transformer Encoder while the right side demonstrates the Multi-Head Self-Attention model.

$\displaystyle R^{b}=\textit{MSA}(LN(Z^{b-1}))+Z^{b-1}$ (4)

$\displaystyle Z^{b}=\textit{MLP}(LN(R^{b}))+R^{b}$ (5)

$\displaystyle Z^{B}=LN(Z^{B-1})$ (6)

where $R$ represents the intermediate representation between the MSA and MLP layers, $b$ indexes the encoding blocks, $B$ is the number of encoding blocks, and $LN$ represents LayerNorm.
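The block structure of Eqs. (4) to (6) can be sketched as a short loop. Two readings are assumed here and are not confirmed by the text: the MLP in Eq. (5) is applied to the intermediate representation of the same block, and Eq. (6) is a final LayerNorm on the output of the last block, as in ViT. The MSA and MLP sub-layers are passed in as callables, and the LayerNorm below omits the learnable scale and shift.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """LayerNorm over the feature axis (learnable scale and shift omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def encoder(Z0, blocks):
    """Run pre-norm encoder blocks; blocks is a list of (msa, mlp) callables."""
    Z = Z0
    for msa, mlp in blocks:
        R = msa(layer_norm(Z)) + Z    # Eq. (4): attention sub-layer with residual
        Z = mlp(layer_norm(R)) + R    # Eq. (5): MLP sub-layer with residual (assumed reading)
    return layer_norm(Z)              # Eq. (6): final LayerNorm (assumed reading)
```

Because both sub-layers are residual, the sequence keeps its shape `(K+1, d)` from input to output.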

3.5 Classification head

We select the first element $Z_{0}^{B}$ of the sequence $Z^{B}$ output by the Transformer Encoder (corresponding to the class token position) and apply a Linear layer with Softmax activation to obtain the predicted class probabilities for each action, denoted by $\hat{a}$ :

$\displaystyle\hat{a}=\textit{Softmax}(\textit{Linear}(Z_{0}^{B}))$ (7)
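Eq. (7) amounts to a linear map on the class-token row followed by a softmax. In the sketch below, the weight matrix `W` and bias `b` are hypothetical parameters of the Linear layer.

```python
import numpy as np


def classify(Z_B, W, b):
    """Eq. (7): class probabilities from the class-token row of the encoder output.

    Z_B: (K+1, d) encoder output, W: (d, num_classes), b: (num_classes,).
    """
    logits = Z_B[0] @ W + b            # only the class token (row 0) is used
    logits = logits - logits.max()     # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()
```

The remaining rows of `Z_B` influence the prediction only through attention inside the encoder.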

4. Experiments

We first discuss the datasets used for experimentation in Section 4.1, then the evaluation metrics in Section 4.2 and the training setup in Section 4.3, and finally present our results in Section 4.4.

4.1 Datasets

We use UCF 50 [37] and UCF 101 [38], which are challenging because of large variations in camera motion and object appearance, in contrast to the MPOSE 2021 dataset presented by Mazzia et al. We also extract a 10-class subset (UCF 10) from UCF 50, using the first ten classes of the dataset, for experimentation. UCF 50 has 50 action categories, with 4790 training videos and 1891 testing videos. Similarly, UCF 101 has 101 action categories. The first fold of UCF 101, which we use, contains 9537 training videos and 3783 testing videos. Since the videos are long and vary in length, each video is further split into multiple chunks of $K$ frames, and each chunk is assigned the class of its original video. Our implementation also provides an option to skip subsequent frames in the videos, as they might provide redundant information. We scale down the spatial resolution of the UCF 50 and UCF 101 datasets to 128 $\times$ 128 or 224 $\times$ 224 depending on GPU memory limits, and state the resolution when quoting results. We tried different combinations of traditional and automatic data augmentations and found that a combination of RandomResizedCrop and RandomHorizontalFlip along with AugMix [39] worked best in most cases.
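The chunking and frame-skipping step can be sketched as index arithmetic. The function names are hypothetical, and dropping a trailing chunk with fewer than $K$ frames is an assumption of this example; the text does not say how incomplete chunks are handled.

```python
def split_into_chunks(num_frames, k, skip=0):
    """Return lists of frame indices, each of length k, keeping every (skip + 1)-th frame."""
    kept = list(range(0, num_frames, skip + 1))
    # Non-overlapping windows; a trailing window shorter than k is dropped (assumption).
    return [kept[s:s + k] for s in range(0, len(kept) - k + 1, k)]


def label_chunks(chunks, video_label):
    """Assign the class of the original video to each of its chunks."""
    return [(chunk, video_label) for chunk in chunks]
```

For instance, a 10-frame video with `k=3` and `skip=0` yields three chunks covering frames 0 to 8, and the last frame is discarded.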

4.2 Evaluation metrics

We report Top-1 and Top-5 classification accuracy on the test sets of UCF 50 and UCF 101. Moreover, to observe and compare the complexity of our models, we also report GFLOPs, i.e., the number of floating-point operations in billions.
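Top-k accuracy counts a prediction as correct when the true class is among the k highest-scoring classes. A minimal pure-Python sketch, with a hypothetical function name and toy scores, is:

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the first k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = top_k_accuracy(scores, labels, k=1)   # one of three examples is correct
top2 = top_k_accuracy(scores, labels, k=2)   # two of three examples are correct
```

Top-5 accuracy is obtained with `k=5` and is always at least as high as Top-1 on the same predictions.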

4.3 Training

Optimization of the whole pipeline is done using Cross-Entropy Loss, which is averaged over the $N$ examples in a training batch and is given by:

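$\displaystyle\mathcal{L}=-\frac{1}{N}\sum\limits_{n=1}^{N}\sum\limits_{i=1}^{|A|}a_{n,i}\log(\hat{a}_{n,i})$ (8)

where $a_{n,i}$ is 1 if example $n$ belongs to class $i$ and 0 otherwise, and $\hat{a}_{n,i}$ is the predicted probability of class $i$ for example $n$ .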
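The batch-averaged cross-entropy of Eq. (8) can be checked on a toy batch with the pure-Python sketch below. The function name and the clipping constant `eps`, which guards against taking the logarithm of zero, are assumptions of this example.

```python
import math


def cross_entropy(targets, probs, eps=1e-12):
    """Eq. (8): mean over the batch of -sum_i a_i * log(a_hat_i).

    targets: one-hot rows; probs: predicted class probabilities (same shape).
    """
    total = 0.0
    for a, a_hat in zip(targets, probs):
        total += sum(a_i * math.log(max(p_i, eps)) for a_i, p_i in zip(a, a_hat))
    return -total / len(targets)


targets = [[1, 0], [0, 1]]
probs = [[0.5, 0.5], [0.25, 0.75]]
loss = cross_entropy(targets, probs)   # -(log 0.5 + log 0.75) / 2
```

With one-hot targets, only the log-probability assigned to the true class of each example contributes to the sum.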

For optimization, we use AdamW [40] optimizer with a learning rate of 0.00003. We use a maximum batch size of 8 during training, and we could not test it beyond this batch size due to GPU constraints. Moreover, we initialize the weights $\theta_{f}$ using pre-trained ImageNet weights. Further comparison between multiple feature extractors is made in Section 4.4.3.

In the Transformer Encoder, we keep the dimension $d$ equal to the feature extractor’s output dimension. For example, for WideResNet 50-2, the output dimension of the convolutional stem is 2048, so we also use 2048 as $d$ across the Transformer Encoder. The number of attention heads $H$ and the number of encoding blocks $B$ are kept equal in our experiments, i.e., both are 4. We use Dropout [41] with a value of 0.4 and early stopping to select the best model. Lastly, we implement our approach in PyTorch [42].

4.4 Results and ablation studies

We perform multiple experiments and ablation studies to validate our architectural choices. To this end, we set up our experiments on different datasets, i.e., UCF 10, UCF 50, and UCF 101.

4.4.1 Number of attention heads

We perform experiments with different numbers of attention heads, as shown in Table 1. We observe that increasing the number of attention heads to eight leads to overfitting, while decreasing it to two leads to underfitting, so four heads provide a better trade-off. Moreover, the GFLOPs do not differ much. These experiments use WideResNet 50-2 on UCF 50 with 128 $\times$ 128 resolution.

4.4.2 Number of classes

In this experiment, we examine the effect of the number of classes on classification accuracy, as shown in Table 2. A larger number of classes increases the complexity of the problem and makes it harder to classify examples correctly. The results follow this trend at 128 $\times$ 128 resolution, where Top-1 accuracy drops from 98.39% on UCF 10 to 83.41% on UCF 50. UCF 101 is evaluated at the higher 224 $\times$ 224 resolution, so its 87.50% is not directly comparable with the other two rows.

Table 1
Top-1, Top-5 accuracies (%) and GFLOPs for different numbers of attention heads

Attention heads	Top-1	Top-5	GFLOPs
2	79.62	89.30	75.67
4	83.41	93.09	76.73
8	78.90	90.11	78.85

Table 2

Top-1, Top-5 accuracies (%) and GFLOPs for different numbers of classes

Dataset	Resolution	Top-1	Top-5	GFLOPs
UCF 10	128 $\times$ 128	98.39	99.43	76.73
UCF 50	128 $\times$ 128	83.41	93.09	75.67
UCF 101	224 $\times$ 224	87.50	96.00	147.67

4.4.3 Multiple feature extractors

We compare different feature extractors to check how effectively they generate features for the Transformer, as shown in Table 3. WideResNet 50-2 achieves the best accuracy (79.62% Top-1) but uses far more GFLOPs (76.73) than EfficientNet v2 s, which still achieves good accuracy (74.12% Top-1) at 14.12 GFLOPs. We also compare against the baseline and see that convolutional features perform much better. These experiments were performed with two attention heads on UCF 50 with 128 $\times$ 128 resolution.

Table 3
Top-1, Top-5 accuracies (%) and GFLOPs for different feature extractors

Feature extractor	Top-1	Top-5	GFLOPs
Baseline	27.67	48.90	0.76
ResNet-18	66.30	81.20	11.92
EfficientNet v2 m	66.10	77.87	26.58
EfficientNet v2 s	74.12	83.68	14.12
WideResNet 50-2	79.62	89.37	76.73

4.4.4 Fine-tuned vs. frozen feature extractors

We compare freezing the feature extractor weights with fine-tuning end-to-end to analyze the effect of ImageNet pre-training for feature extractors, as shown in Table 4. Fine-tuning the feature extractor end-to-end gives better accuracy on UCF 101, as the gradients flowing back from the Transformer to the feature extractor adapt it to the task. Here the resolution is 224 $\times$ 224.

Table 4
Top-1 accuracy (%) comparing frozen vs. fine-tuned models on UCF 101 dataset

Feature extractor	Frozen	Fine-tuned
EfficientNet B0	67.80	82.70
EfficientNet v2 s	69.10	86.50
WideResNet 50-2	66.20	87.50

4.4.5 Number of frames as input

To observe the impact of the number of input frames, we experiment on UCF 101 with four attention heads. As shown in Table 5, 30 frames per video gave better accuracy than 50 frames per video for all three feature extractors, so a higher number of frames does not imply better accuracy. Here the resolution is 224 $\times$ 224.

Table 5
Top-1 accuracy (%) comparison between 30 and 50 frames on different feature extractors

Feature extractor	30 frames	50 frames
EfficientNet B0	82.70	78.00
EfficientNet v2 s	86.50	81.30
WideResNet 50-2	87.50	81.60

Table 6

Comparison of the performance of our proposed method with state-of-the-art on UCF 101 dataset

Reference	Methods	Accuracy
Sahoo et al. [43]	DISNet	54.96%
Nguyen et al. [44]	Two stream CNN	86.10%
Kim et al. [45]	Two stream CNN	87.50%
Proposed method	CNN $+$ Transformer	87.50%

Table 7

Comparison of the performance of our proposed method with state-of-the-art on UCF 50 dataset

Reference	Methods	Accuracy
Liu et al. [46]	LC $+$ Multi-view pooling	78.60%
Ramya et al. [47]	Distance transform $+$ entropy features	80.00%
Uijlings et al. [48]	HOG $+$ HOF $+$ MBH	81.80%
Kantorov et al. [49]	MPEG flow $+$ FV	82.20%
Proposed method	CNN $+$ Transformer	83.41%

Figure 6.

Classification embeddings comparison between baseline MLP and our method (WideResNet 50-2) using video class tokens.

Figure 7.

Confusion matrix on UCF 50. The vertical axis represents true labels, and the horizontal axis represents predicted labels.

4.4.6 State-of-the-art comparison

On the UCF 101 dataset, our proposed method achieves an accuracy of 87.50%, as shown in Table 6. In comparison, the Dual Input Sequential Network (DISNet) proposed by Sahoo et al. [43] achieves an accuracy of 54.96% on UCF 101, significantly lower than our method. This gap supports the effectiveness of our approach on complex action recognition tasks. The two-stream CNN method of Nguyen et al. [44] attains an accuracy of 86.10% on UCF 101, which is already relatively high but 1.40 percentage points below our method. The pre-trained two-stream CNN method proposed by Kim et al. [45] achieves an accuracy of 87.50% on UCF 101, equal to our method. Our method reaches this accuracy by combining a CNN with a Transformer, which may offer better generalization ability and robustness.

On the UCF 50 dataset, our method achieves an accuracy of 83.41%, as shown in Table 7. The LC + Multi-view pooling method proposed by Liu et al. [46] achieves 78.60%, which is 4.81 percentage points below our method. Ramya et al. [47] proposed a human action recognition method based on the distance transform and entropy features of human contours, achieving 80.00%; this is 3.41 percentage points below our result and points to the benefit of our feature extraction and fusion. The conventional HOG + HOF + MBH descriptors used by Uijlings et al. [48] attain 81.80%. Although such hand-crafted descriptors remain useful in computer vision, their accuracy is still lower than that of our method on UCF 50, which underscores the effectiveness of deep learning models for action recognition. The MPEG flow + FV method proposed by Kantorov et al. [49] achieves 82.20%, a result comparable to ours. However, our method may possess higher generalization ability and robustness than their approach.
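The margins discussed above can be recomputed directly from Tables 6 and 7. The following is a minimal sketch: the accuracies are copied from the two tables, while the dictionary layout and the helper name `margins` are assumptions made for illustration only.

```python
# Accuracies (%) copied from Tables 6 and 7.
# The data layout and function name are illustrative assumptions.
UCF101_BASELINES = {
    "DISNet [43]": 54.96,
    "Two stream CNN [44]": 86.10,
    "Two stream CNN [45]": 87.50,
}
UCF50_BASELINES = {
    "LC + Multi-view pooling [46]": 78.60,
    "Distance transform + entropy features [47]": 80.00,
    "HOG + HOF + MBH [48]": 81.80,
    "MPEG flow + FV [49]": 82.20,
}
PROPOSED = {"UCF 101": 87.50, "UCF 50": 83.41}


def margins(proposed, baselines):
    """Return proposed-minus-baseline gaps in percentage points (2 decimals)."""
    return {name: round(proposed - acc, 2) for name, acc in baselines.items()}


ucf101_margins = margins(PROPOSED["UCF 101"], UCF101_BASELINES)
ucf50_margins = margins(PROPOSED["UCF 50"], UCF50_BASELINES)

for dataset, result in (("UCF 101", ucf101_margins), ("UCF 50", ucf50_margins)):
    for name, gap in result.items():
        print(f"{dataset}: {name}: {gap:+.2f} points")
```

A gap of zero (as for the method of reference [45] on UCF 101) marks a tie rather than an improvement, so small margins should be read with that in mind.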

4.4.7 Comparison of sequence-level embedding with baseline MLP

We visually compare (see Fig. 6) the class tokens learned with the baseline MLP and with the convolutional feature extractor. With the convolutional feature extractor, the class tokens accumulate more useful information for classification and separate examples of different classes more clearly. We use UCF 10 for clearer illustration.
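A visual comparison of this kind can be complemented by a simple numerical score. The sketch below is not part of the reported experiments: it assumes a hypothetical metric, the ratio of the mean distance between class centroids to the mean distance of samples from their own centroid, and it uses made-up 2-D vectors in place of real class tokens.

```python
import math
from collections import defaultdict


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def separation_ratio(embeddings, labels):
    """Mean distance between class centroids divided by the mean distance
    of samples to their own class centroid. Larger values indicate
    tighter, better separated clusters. Assumes at least two classes and
    a non-zero within-class spread."""
    by_class = defaultdict(list)
    for vec, lab in zip(embeddings, labels):
        by_class[lab].append(vec)
    centroids = {lab: centroid(pts) for lab, pts in by_class.items()}

    within = [math.dist(vec, centroids[lab]) for vec, lab in zip(embeddings, labels)]
    names = sorted(centroids)
    between = [
        math.dist(centroids[a], centroids[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    ]
    return (sum(between) / len(between)) / (sum(within) / len(within))


# Toy 2-D "class tokens" for two classes (invented numbers, not from the paper).
labels = ["A", "A", "B", "B"]
loose = [[0.0, 0.0], [2.0, 0.0], [3.0, 0.0], [5.0, 0.0]]
tight = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]]

print(separation_ratio(loose, labels), separation_ratio(tight, labels))
```

Under this assumed metric, an embedding whose classes form compact, well-separated clusters receives a higher score than one whose clusters are spread out and close together.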

4.4.8 Limitations

Our approach suffers from misclassifications when classes contain similar data. For example, actions with similar body movements, such as JumpRope and SoccerJuggling, are confused with each other because of their similar hand and leg movements. See Fig. 7 for the confusion matrix on UCF 50. Such misclassifications suggest that the model may also be vulnerable to adversarial attacks such as evasion attacks.
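Confused class pairs of this kind can be read off a confusion matrix programmatically. The following is a minimal sketch with invented counts for three UCF 50 classes; the numbers are not the values behind Fig. 7, and the function name `most_confused_pairs` is a hypothetical helper. Rows hold true labels and columns hold predicted labels, following the convention of Fig. 7.

```python
def most_confused_pairs(matrix, class_names, top_k=3):
    """Return the top_k off-diagonal (true, predicted, count) entries.

    matrix[i][j] counts samples whose true class is i and whose
    predicted class is j.
    """
    pairs = [
        (class_names[i], class_names[j], matrix[i][j])
        for i in range(len(class_names))
        for j in range(len(class_names))
        if i != j and matrix[i][j] > 0
    ]
    # Sort by count, largest first; ties keep row-major order.
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:top_k]


# Toy counts, invented for illustration only.
class_names = ["JumpRope", "SoccerJuggling", "Biking"]
matrix = [
    [40, 9, 1],
    [7, 42, 1],
    [0, 2, 48],
]

confused = most_confused_pairs(matrix, class_names, top_k=2)
for true_label, predicted_label, count in confused:
    print(f"{true_label} -> {predicted_label}: {count}")
```

Listing the largest off-diagonal entries in this way makes it easy to check whether errors concentrate on pairs with similar body movements.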

5. Conclusion

This paper addresses the human action recognition problem and proposes an innovative deep-learning framework for this purpose. The framework utilizes state-of-the-art pose estimation algorithms to precisely extract human pose information from RGB image frames. To effectively capture rich spatial features, a pre-trained CNN model is employed to extract features from the pose information. Subsequently, a training and inference pipeline is constructed in which the Vision Transformer encoder fuses and classifies the convolutional features, thereby enhancing the model’s accuracy and robustness. To validate the effectiveness and efficiency of the proposed framework, we conduct extensive experiments on two benchmark datasets: UCF 101 and UCF 50. The experimental results demonstrate high accuracy on both datasets, reaching 87.50% and 83.41%, respectively. In the future, we plan to benchmark the framework on more extensive datasets, such as ActivityNet and Kinetics, to validate its generalization ability. Additionally, we aim to explore extending the framework to areas like zero-shot classification and domain adaptation while enhancing the model’s robustness against adversarial attacks.

Acknowledgments

This research was supported by the Open Project Program of The Key Laboratory of Cognitive Computing and Intelligent Information Processing of Fujian Education Institutions, Wuyi University.

Conflict of interest

The authors declare that they have no conflict of interest.

Data availability

The datasets used in this research are publicly available from the following websites:

UCF 101: https://www.crcv.ucf.edu/data/UCF101.php

UCF 50: https://www.crcv.ucf.edu/data/UCF50.php

References

1. Rodríguez-Moreno, Martínez-Otzeta, Goienetxea, Rodriguez-Rodriguez, Sierra. Shedding light on people action recognition in social robotics by means of common spatial patterns. Sensors. 2020; 20(8): 2436.

2. Vallathan, John, Thirumalai, Mohan, Srivastava, Lin JCW. Suspicious activity detection using deep learning in secure assisted living IoT environments. The Journal of Supercomputing. 2021; 77: 3242–3260.

3. Wang, Srivastava. The security of vulnerable senior citizens through dynamically sensed signal acquisition. Transactions on Emerging Telecommunications Technologies. 2022; 33(10): e4037.

4. Martin, Roitberg, Haurilet, Horne, Reiß, Voit, et al. Drive&act: A multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019. pp. 2801–2810.

5. Ben-Younes, Zablocki, Pérez, Cord. Driving behavior explanation with multi-level fusion. Pattern Recognition. 2022; 123: 108421.

6. Huang, Ouyang, Wang. Part-aligned pose-guided recurrent network for action recognition. Pattern Recognition. 2019; 92: 165–176.

7. Varol, Laptev, Schmid. Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017; 40(6): 1510–1517.

8. Luvizon, Picard, Tabia. Multi-task deep learning for real-time 3D human pose estimation and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2020; 43(8): 2752–2764.

9. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017; 30.

10. Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. 2020.

11. Mazzia, Angarano, Salvetti, Angelini, Chiaberge. Action transformer: A self-attention model for short-time pose-based human action recognition. Pattern Recognition. 2022; 124: 108487.

12. Cao, Simon, Wei, Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. pp. 7291–7299.

13. Papandreou, Zhu, Chen, Gidaris, Tompson, Murphy. Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018. pp. 269–286.

14. LeViet, Chen. Pose estimation and classification on edge devices with movenet and tensorflow lite. 2021.

15. Simonyan, Zisserman. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems. 2014; 27.

16. Yang. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2012; 35(1): 221–231.

17. Angelini, Long, Shao, Naqvi. 2D pose-based real-time human action recognition with occlusion-handling. IEEE Transactions on Multimedia. 2019; 22(6): 1433–1446.

18. Karim, Majumdar, Darabi, Harford. Multivariate LSTM-FCNs for time series classification. Neural Networks. 2019; 116: 237–245.

19. Hochreiter, Schmidhuber. Long short-term memory. Neural Computation. 1997; 9(8): 1735–1780.

20. Shen, Sun. Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. pp. 7132–7141.

21. Scarselli, Gori, Tsoi, Hagenbuchner, Monfardini. The graph neural network model. IEEE Transactions on Neural Networks. 2008; 20(1): 61–80.

22. Pan, Chen, Long, Zhang, Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems. 2020; 32(1): 4–24.

23. Yan, Xiong, Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32, 2018.

24. Chen, Zhang, Wang, Tian. Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. pp. 3595–3603.

25. Shi, Zhang, Cheng. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. pp. 12026–12035.

26. Cho, Maqbool, Liu, Foroosh. Self-attention network for skeleton-based human action recognition. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2020. pp. 635–644.

27. Devlin, Chang, Lee, Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.

28. Radford, Child, Luan, Amodei, Sutskever, et al. Language models are unsupervised multitask learners. 2019; 1: 9.

29. Plizzari, Cannici, Matteucci. Spatial temporal transformer network for skeleton-based action recognition. In: Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10–15, 2021, Proceedings, Part III. Springer. 2021. pp. 694–701.

30. Gowda, Rohrbach, Sevilla-Lara. Smart frame selection for action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, 2021. pp. 1451–1459.

31. Sandler, Howard, Zhu, Zhmoginov, Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. pp. 4510–4520.

32. Deng, Dong, Socher, Fei-Fei. Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE. 2009. pp. 248–255.

33. Zhang, Ren, Sun. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. pp. 770–778.

34. Tan. Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning. PMLR. 2019. pp. 6105–6114.

35. Tan. Efficientnetv2: Smaller models and faster training. In: International Conference on Machine Learning. PMLR. 2021. pp. 10096–10106.

36. Zagoruyko, Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146. 2016.

37. Reddy, Shah. Recognizing 50 human action categories of web videos. Machine Vision and Applications. 2013; 24(5): 971–981.

38. Soomro, Zamir, Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. 2012.

39. Hendrycks, Cubuk, Zoph, Gilmer, Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781. 2019.

40. Loshchilov, Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. 2017.

41. Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research. 2014; 15(1): 1929–1958.

42. Paszke, Gross, Massa, Lerer, Bradbury, Chanan, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems. 2019; 32.

43. Sahoo, Modalavalasa, Ari. DISNet: A sequential learning framework to handle occlusion in human action recognition with video acquisition sensors. Digital Signal Processing. 2022; 131: 103763.

44. Nguyen, Ribeiro. Video action recognition collaborative learning with dynamics via PSO-ConvNet Transformer. Scientific Reports. 2023; 13(1): 14624.

45. Kim, Won. Action recognition in videos using pre-trained 2D convolutional neural networks. IEEE Access. 2020; 8: 60179–60188.

46. Liu, Huang, Peng, Wang. Multi-view descriptor mining via codeword net for action recognition. In: 2015 IEEE International Conference on Image Processing (ICIP). IEEE. 2015. pp. 793–797.

47. Ramya, Rajeswari. Human action recognition using distance transform and entropy based features. Multimedia Tools and Applications. 2021; 80: 8147–8173.

48. Uijlings, Duta, Sangineto, Sebe. Video classification with densely extracted hog/hof/mbh features: An evaluation of the accuracy/computational efficiency trade-off. International Journal of Multimedia Information Retrieval. 2015; 4(1): 33–44.

49. Kantorov, Laptev. Efficient feature extraction, encoding and classification for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014. pp. 2593–2600.

