Abstract
3D skeleton data has been widely used in action recognition because skeleton-based methods perform well in complex dynamic environments. The rise of spatio-temporal graph convolutions has drawn much attention to using graph convolution to extract spatial and temporal features jointly in skeleton-based action recognition. However, because spatial and temporal features focus on very different aspects of the data, it is difficult to improve the efficiency of extracting spatio-temporal features. In this paper, we propose a channel attention and multi-scale neural network (CA-MSN) for skeleton-based action recognition, built from a series of spatio-temporal extraction modules. We exploit the relationship of body joints hierarchically through two modules: a spatial module, which uses a residual GCN with a channel attention block to extract high-level spatial features, and a temporal module, which uses a multi-scale TCN to extract temporal features at different scales. We perform extensive experiments on both the NTU-RGBD60 and NTU-RGBD120 datasets to verify the effectiveness of our network. The comparison results show that our method achieves state-of-the-art performance at a competitive computing speed. To test CA-MSN in an application setting, we design a multi-task tandem network consisting of 2D pose estimation, 2D-to-3D pose regression and skeleton action recognition models, and demonstrate end-to-end recognition from RGB video to action type. The code is available at
Introduction
Human action recognition is an important research direction in the field of computer vision. It has wide application scenarios and market value, such as abnormal behavior monitoring [10,37,48], human–computer interaction [8], etc. In particular, skeleton-based human action recognition methods combined with depth estimation technology [11,18] have attracted increasing attention from researchers. A skeleton sequence is a kind of abstract human body movement data, which uses joint types, 3D joint coordinates and joint connections to express the movement of various body parts. Compared to RGB videos, skeleton data also has the following advantages. First, the cost of obtaining human skeleton data has become lower with the development of pose estimation technology and depth cameras. Second, the skeleton data can reduce the overfitting problem in network training and the network’s coupling to subjects’ appearances. Third, the skeleton sequence can more intuitively show the movement of various body parts by using the graph topology representation for joints. Fourth, the skeleton data eliminates environmental noise (e.g., background, clothing, brightness) so that neural networks can focus more on modeling human movements and reduce the cost of feature extraction. Furthermore, skeleton-based action recognition can be used as a supplement to RGB-based action recognition, thereby increasing the information richness and improving the overall recognition accuracy. In this work, we focus on skeleton-based action recognition.
There are three basic directions for performing skeleton-based action recognition: methods based on Recurrent Neural Networks (RNN) [7,20,29,38,43,50,59,65], on Convolutional Neural Networks (CNN) [2,3,15,17,19,23,30], and on Graph Convolutional Networks (GCN) [4,6,21,24,35,36,39,40,44,45,49,51,53,54,56,57,62]. In addition, hybrid methods combine two of these directions [22,25,31,41,42,55,60,61,64]. RNN-based approaches mainly use models such as LSTM/GRU to model the dynamic changes of the skeleton sequence. However, they simply arrange the joint coordinates into a vector in a fixed order and feed it into the recurrent network, so important structural information is ignored because joint types and connections are not distinguished. CNN-based approaches organize the joint coordinates into a 2D map. The 3D coordinates
Many recent studies have focused on exploring the implicit connection between distant joints, such as the relationship between arm swing and foot swing when walking. The two-stream adaptive graph convolution network (2s-AGCN) [39] and actional-structural graph convolution network (AS-GCN) [24] invented the adaptive graph structure. In this structure, the adjacency matrix is not limited to natural bone connections but adaptively explores each joint’s correlation as the dataset changes. However, adaptive graph approaches only focus on exploring the correlation between the spatial joints and do not explicitly model interdependencies between the channels. For actions such as “waving hand”, the channels about the body’s frontal plane are more important than the channels about the body’s median plane.
In terms of time series modeling, researchers mostly use Long Short-Term Memory (LSTM) networks or Temporal Convolutional Networks (TCN). It is challenging for LSTMs to learn long-range temporal correlations because of problems such as vanishing gradients and the loss of useful features. TCN is the mainstream CNN method for modeling skeleton sequences, but its lack of temporal direction and its single scale make it difficult to guarantee the richness and comprehensiveness of the extracted information.
In this work, we address the above limitations from two aspects (Fig. 1). First, we introduce the channel attention mechanism into the GCN module to model interdependencies between the joint channels, so that the GCN spatial feature extraction module can focus on important semantics more efficiently. Second, we use dilated convolutions [58] with different dilation rates to process the time series in parallel, and adopt the deep concatenation technique of [47] to fuse the different receptive fields. This leads to a more powerful network that can not only model long-term skeleton actions, but also recognize short-term repetitive actions such as clapping hands. We also incorporate the joint type and frame index into the network [61], so that both the joint connections and the temporal sequences have directionality. Besides, the skeleton sequence’s dynamics (position/3D coordinates and velocity) are input into the network so that the periodic characteristics of some actions are also merged into the input features. Together, these components form the channel attention and multi-scale neural network, named CA-MSN. We illustrate the overall model architecture in Fig. 2.

The characteristics of the CA-MSN skeleton action recognition model: (1) using channel attention GCN to extract joint relationships. The intensity of the color indicates the channel importance; (2) using multi-scale TCN to model time series.

Model overview. First, the skeleton sequence is extracted from the original video. Then, the information transfer in both spatial and temporal directions is separated.
To verify the effectiveness of the proposed CA-MSN, we conduct extensive experiments on two large-scale datasets: NTU-RGBD60 [38] and NTU-RGBD120 [28]. The experiments demonstrate that the channel attention GCN and the multi-scale TCN significantly improve network accuracy. For better application in real life, a multi-task tandem network is designed to realize the complete action recognition process (RGB video, 2D pose, 3D pose, action type). We summarize our main contributions as follows:
We propose a channel attention mechanism (CA-GCN) for graph convolutional networks, which effectively models the relationship between joint feature channels and improves the spatial feature extractor’s performance.
We propose a multi-scale temporal feature extraction scheme (MS-TCN) for skeleton sequence, so that both long-term continuous actions and short-term repetitive actions can be precisely classified.
We connect the CA-GCN and MS-TCN modules in series to form a powerful skeleton action recognition network, CA-MSN, which achieves state-of-the-art performance on the NTU-RGBD60 and NTU-RGBD120 datasets.
We design a complete end-to-end network from RGB videos to actions, so that the effect of our proposed CA-MSN model in the application process can be displayed.
3D skeleton action recognition
With the rise of deep learning and neural networks, end-to-end approaches have become more competitive than traditional approaches that use hand-crafted features in the field of 3D skeleton action recognition. Most of the earliest end-to-end approaches use recurrent neural networks such as LSTM/GRU. Du et al. [7] divide the human skeleton into five parts according to the human’s physical structure and then separately feed them to five subnets. Zhu et al. [65] take the skeleton as the input at each time slot and introduce a novel regularization scheme to learn the skeleton joints’ co-occurrence features. Inspired by the skeleton graphical structure, Liu et al. [29] propose a more powerful tree-structure-based traversal method. After that, CNN approaches gradually emerge and are widely used in 3D skeleton action recognition. Kim et al. [17] and Liu et al. [30] use the CNN characteristics to explicitly learn interpretable spatio-temporal action representations and visualize them. Ke et al. [15] and Le et al. [19] exploit the correlations between the different time periods of a skeleton sequence. In 2018, ST-GCN [56] sets a precedent for using graph neural network methods to process skeleton sequences. After that, the GCN method gradually becomes the mainstream method in the field of skeleton action recognition. Li et al. [24] and Shi et al. [39] enable the topology of the graph model to capture implicit joint correlations. Shi et al. [40] also represent the skeleton data as a directed acyclic graph (DAG) based on the dependency between the joints and bones in the natural human body. Peng et al. [34] propose the first automatically designed GCN for skeleton-based action recognition. However, the huge computational complexity of graph convolution networks remains a challenge.
Recently, in order to combine the advantages of RNN, CNN and GCN networks, researchers use a GCN module to extract the topological relationship and an RNN or CNN to model the time series [25,31,42,61].
Attention mechanism in computer vision
The attention mechanism originated in NLP and was then widely adopted in computer vision. Its basic idea is to teach the network to ignore irrelevant information and focus on key information. After years of development, attention mechanisms are mainly classified into three types: spatial attention [1,14], channel attention [12,26], and hybrid spatial and channel attention [52].
In most computer vision problems, only task-related regions matter, such as the subject in classification tasks. Spatial attention allows the network to focus more on essential spatial areas. The Spatial Transformer Network (STN) [14] proposed by Google DeepMind is the most representative spatial attention network. Different from the single-stage STN, the Dynamic Capacity Network (DCN) [1] uses two sub-networks: a low-capacity network and a high-capacity network. The low-capacity network processes the entire image and locates the region of interest. The high-capacity network refines the region of interest.
For the feature maps in CNN, the modeling of channel dimensions is also crucial. Squeeze-and-Excitation Networks (SENet) [12] learn the importance of each channel and then enhance or suppress different channels according to different inputs. Furthermore, Selective Kernel Networks (SKNet) [26] and other methods combine such channel weighting idea with the multi-branch network structure to improve the network performance.
The Convolutional Block Attention Module (CBAM) [52] is a representative network of the hybrid spatial and channel attention mechanisms. In the channel dimension, both the max pooling outputs and the average pooling outputs are passed through a shared network, and the two outputs are then merged by element-wise summation. In the spatial dimension, max pooling and average pooling are likewise used to build a concatenated feature map, which is then processed by convolutional layers. In addition, there are many other studies related to the attention mechanism, such as residual attention, multi-scale attention and recursive attention.
Channel attention and multi-scale neural networks

This figure shows the architecture of the proposed channel attention and multi-scale neural networks (CA-MSN). Input is the addition of position and velocity. Joint-type and frame-index are separately incorporated before CA-GCN and MS-TCN. CA-GCN can learn the spatial relationship between joints and the dependencies between channels. MS-TCN can aggregate features of different temporal scales.
Channel attention-aware spatial modeling and multi-scale temporal modeling are the two main modules of our network. For some simple actions with clear direction, the channel attention mechanism can make the GCN network focus more on channels with rich information. The multi-scale temporal modeling makes the skeleton action recognition network more adaptable to actions with different velocities. A skeleton-based video is a sequence of frames formulated as
Encoding input features
The 3D coordinate of the skeleton sequence
Adjacency matrix
Since the natural connection between joints cannot fully represent the coupling relationship between skeleton joints during actions, we need to recalculate the adjacency matrix to represent the dynamic correlation weight. There are currently three mainstream methods for obtaining an adjacency matrix:
The pure inner product method is too simple to explore the potential connection relationship between joints and does not have data-driven characteristics. Therefore, this work uses the adaptive learning method to obtain the adjacency matrix:
Channel attention GCN

CA-GCN block can model the potential relationship among joint channel features by adding a channel feature learning module after graph convolution. Before learning the channel weights, average pooling is used to extract the channel descriptor. The recalculated output is obtained by multiplying each channel by the corresponding weight.
Using the normalized adjacency matrix
Firstly, we squeeze the global spatial and temporal information into a channel descriptor using global average pooling:
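The squeeze step, together with the excitation and rescaling steps described for the CA-GCN block, can be illustrated with a minimal NumPy sketch. It assumes a feature map of shape (C, T, J) and the usual squeeze-and-excitation form (a two-layer bottleneck with ReLU followed by a sigmoid gate); the weight matrices `w1` and `w2`, the tensor sizes and the reduction ratio `r` are hypothetical and are not taken from the paper.

```python
import numpy as np

def channel_attention(features, w1, w2):
    """Squeeze-and-excitation style recalibration of a (C, T, J) feature map."""
    # Squeeze: global average pooling over frames (T) and joints (J).
    descriptor = features.mean(axis=(1, 2))              # shape (C,)
    # Excitation (assumed form): bottleneck with ReLU, then a sigmoid gate.
    hidden = np.maximum(w1 @ descriptor, 0.0)            # shape (C // r,)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))       # shape (C,), in (0, 1)
    # Scale: multiply every channel by its weight.
    return features * weights[:, None, None], weights

rng = np.random.default_rng(0)
C, T, J, r = 8, 20, 25, 2                                # illustrative sizes only
features = rng.standard_normal((C, T, J))
w1 = rng.standard_normal((C // r, C)) * 0.1              # hypothetical weights
w2 = rng.standard_normal((C, C // r)) * 0.1
recalibrated, weights = channel_attention(features, w1, w2)
```

Each channel of `recalibrated` is the corresponding input channel multiplied by a scalar between 0 and 1, which is the sense in which the block emphasizes or suppresses channels.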
To extract the skeleton sequence features of different temporal scales, we use multi-scale TCN as shown in Fig. 5 to model the time series. First, we merge the one-hot encoded frame index:

The structure of the MS-TCN unit. Compared with a single-scale temporal series classifier, the MS-TCN is robust to actions of different temporal scales.
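The multi-scale idea can be sketched as several dilated temporal convolutions applied in parallel to the same input, with their outputs concatenated along the channel axis. The NumPy sketch below is only an illustration under stated assumptions: the dilation rates (1, 2, 3), the kernel size 3 and the channel counts are hypothetical, and the actual MS-TCN unit may contain further layers not shown here.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Zero-padded ('same') dilated convolution over time.

    x has shape (C_in, T); kernel has shape (C_out, C_in, K) with odd K.
    """
    c_out, c_in, k = kernel.shape
    pad = dilation * (k - 1) // 2
    padded = np.pad(x, ((0, 0), (pad, pad)))
    t = x.shape[1]
    out = np.zeros((c_out, t))
    for i in range(k):
        # Tap i reads the input shifted by i * dilation frames.
        window = padded[:, i * dilation:i * dilation + t]
        out += kernel[:, :, i] @ window
    return out

def multi_scale_tcn(x, kernels, dilations):
    """Run one branch per dilation rate and concatenate along channels."""
    branches = [dilated_conv1d(x, k, d) for k, d in zip(kernels, dilations)]
    return np.concatenate(branches, axis=0)

rng = np.random.default_rng(0)
c_in, c_branch, frames = 4, 3, 20
dilations = (1, 2, 3)                                    # hypothetical rates
kernels = [rng.standard_normal((c_branch, c_in, 3)) for _ in dilations]
x = rng.standard_normal((c_in, frames))
y = multi_scale_tcn(x, kernels, dilations)               # shape (9, 20)
```

With kernel size 3, a branch with dilation d covers 2d + 1 frames, so the concatenated output mixes short and long temporal contexts at every frame.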

Illustration of the human skeleton graphs on NTU-RGBD dataset.
To prove the effectiveness of our method, we conduct extensive experiments on two skeleton-based action recognition datasets: NTU-RGBD60 [38] and NTU-RGBD120 [28]. We first perform exhaustive ablation studies to verify the efficiency and capacity of our proposed channel attention graph convolutional networks (CA-GCN) and multi-scale temporal convolutional networks (MS-TCN) on the NTU-RGBD60 dataset. Finally, the network is evaluated on NTU-RGBD60 and NTU-RGBD120 datasets to compare with the other state-of-the-art approaches.
Datasets
All models are trained with the same batch size (64), learning schedule (Adam with an initial learning rate of 0.001, reduced by a factor of 10 at epochs 60, 100 and 120), and number of training epochs (140) using the PyTorch framework on a workstation with an AMD Ryzen 9 3900XT CPU, an NVIDIA RTX 3090 GPU, and 64 GB of ECC RAM. Before the experiments, we preprocess the initial skeleton sequences. Similar to [40], in order to eliminate falsely detected body skeletons, we first define the body energy as the sum of the skeleton’s standard deviation across each channel. Then we keep the two skeletons with the highest energy. Following [29], we split the raw video into 20 clips and randomly select one frame from each clip to compose a sequence of 20 frames. Finally, we center the skeleton in every frame to eliminate the influence of the subject’s position.
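The preprocessing steps above can be sketched as follows. This is one possible reading, not the authors' code: it assumes each body is an array of shape (T, J, 3), takes the standard deviation over time for every coordinate channel, and centers each frame on the mean joint position. The function names and the choice of centering reference are assumptions.

```python
import numpy as np

def body_energy(skeleton):
    """Sum over all coordinate channels of the standard deviation over time."""
    flat = skeleton.reshape(skeleton.shape[0], -1)       # (T, J * 3)
    return float(flat.std(axis=0).sum())

def select_top_bodies(bodies, k=2):
    """Keep the k bodies with the highest energy (drops static false detections)."""
    order = sorted(range(len(bodies)),
                   key=lambda i: body_energy(bodies[i]), reverse=True)
    return [bodies[i] for i in order[:k]]

def sample_clips(num_frames, num_clips=20, rng=None):
    """Split the sequence into equal clips and draw one random frame per clip."""
    rng = rng or np.random.default_rng()
    edges = [i * num_frames // num_clips for i in range(num_clips + 1)]
    return [int(rng.integers(lo, max(hi, lo + 1)))
            for lo, hi in zip(edges[:-1], edges[1:])]

def center(skeleton):
    """Subtract the per-frame mean joint position (assumed centering reference)."""
    return skeleton - skeleton.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
moving = rng.standard_normal((60, 25, 3))                # a body that moves
jitter = 0.01 * rng.standard_normal((60, 25, 3))         # a nearly static body
static = np.ones((60, 25, 3))                            # a static false detection
top = select_top_bodies([static, moving, jitter])
idx = sample_clips(60, 20, rng)
clip = center(moving[idx])                               # shape (20, 25, 3)
```

A completely static detection has zero energy under this definition, so it is always ranked last.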
Ablation study
In this section, we examine the effectiveness of the proposed channel attention GCN block, multi-scale TCN block and their related components. Moreover, in order to have a deeper understanding of the channel attention in skeleton action recognition, we deeply analyze the internal attention distribution and the effect on different actions. To ensure no serious over-fitting and under-fitting problems, the dropout rate is adjusted accordingly in each experiment.
Semantic of joint type and frame index
In Section 3, we introduce the joint type (JT) and the frame index (FI) into the network by concatenation and addition, respectively. These two processing choices are based on the following analysis and experiments. First, we consider the joint type more important than the frame index, because the MS-TCN already encodes the order of the sequence implicitly and the FI only strengthens it. Therefore, adding the FI to the features is sufficient for this purpose and saves computation. In contrast, without the JT the GCN cannot distinguish joint types at all, which is crucial for the overall network. An input concatenated with the JT carries richer semantic information and has stronger expressive ability. The experiments in Table 1 verify these conclusions. In the table, w/o denotes that the semantic information is not used, add denotes that it is added to the features, and cat denotes that it is concatenated with the features.
Verify the difference between introducing semantic information in different ways
For the joint type, adding it to the features improves accuracy by 0.4% and 0.4% in the CS and CV settings, whereas concatenating it improves accuracy by 1.4% and 1.1%. Since the two variants differ by only 0.04M parameters and concatenation is clearly better, we choose concatenation to fuse the joint type with the features. For the FI, adding it improves accuracy by 0.6% and 0.5% in the CS and CV settings, whereas concatenating it improves accuracy by 0.8% and 0.4%. Concatenation requires 0.13M more parameters than addition while the accuracy barely improves, so we choose addition to fuse the frame index with the features.
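The difference between the two fusion methods can be made concrete with a small NumPy sketch. Addition leaves the channel width unchanged, while concatenation doubles it, so the layer that consumes the fused features needs more weights. The tensor sizes, the embedding weights and the 128-channel consumer layer are hypothetical and chosen only for illustration; the parameter differences reported in Table 1 depend on the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T, J, C = 20, 25, 64                     # frames, joints, channels (illustrative)

# One-hot joint type embedded to C channels by a hypothetical FC layer with ReLU.
joint_type = np.eye(J)
w_embed = rng.standard_normal((J, C)) * 0.1
jt_embedding = np.maximum(joint_type @ w_embed, 0.0)         # (J, C)

features = rng.standard_normal((T, J, C))

# "add": broadcast over frames, channel width stays at C.
added = features + jt_embedding                               # (T, J, C)

# "cat": tile over frames and concatenate, channel width becomes 2C.
tiled = np.broadcast_to(jt_embedding, features.shape)
concatenated = np.concatenate([features, tiled], axis=-1)     # (T, J, 2C)

def next_layer_weights(in_channels, out_channels=128):
    """Weight count of a pointwise layer that consumes the fused features."""
    return in_channels * out_channels

extra = (next_layer_weights(concatenated.shape[-1])
         - next_layer_weights(added.shape[-1]))
print(added.shape, concatenated.shape, extra)
```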
As shown in Table 2, we use three GCNs in series without the graph channel attention mechanism as the baseline for spatial modeling. By adding graph channel attention to the GCN at different depths, we examine how to obtain the most benefit from it with the fewest parameters. For example, All CA-GCN means that all three GCN modules use graph channel attention, and First CA-GCN means that only the first GCN module uses it. The results show that adding graph channel attention only to the first GCN module uses fewer parameters and still matches the accuracy obtained when all three GCNs use it. This indicates that simply stacking graph channel attention does not keep improving the performance of the overall network; it helps only where channel interdependencies need to be considered.
Effectiveness of CA-GCN and MS-TCN on the NTU-RGBD60 dataset
In order to explain the results of the experiments in Table 2 and understand the role of the excitation operator in CA-GCN more clearly, in this section, we visualize the channel activation distribution of different network depths and different action classes.
Above all, we observe the difference in channel activation between different actions. Specifically, we sample five categories from the NTU-RGBD60 dataset: type on a keyboard, point to, jump up, hug and eat meal. We select 64 samples from the five action classes to generate their channel activation weights, and then average these activations of each channel to plot Fig. 7.

Activations induced by excitation operator in CA-GCN modules of different depths and different classes. Each set of activations is named according to the following scheme: CA_GCN_blockID.
We make the following two observations about the role of channel attention. Firstly, in the five classes of actions, it can be found that there is a clear difference between the two-player action (hugging) and the other four single-player actions in the line chart. Visualization results show that the channel attention mechanism can clearly distinguish single-player actions and multi-player actions. Secondly, the channel activations of the four single-player actions are basically the same. Similar to the findings of different pictures in [26], the earlier layer features in different single-player skeleton actions are usually general. Only the deeper level features can effectively distinguish the single-player skeleton actions.
Next, we compare the output of the excitation operator in CA-GCN blocks at different depths. We find that as the feature level deepens, the variation of the activations across channels decreases (for all categories), which is very obvious in CA_GCN_3 (the last block of spatial feature extraction). This result suggests that channel attention contributes less to channel recalibration in the GCN module close to the global pooling layer than in earlier modules. It also helps explain the result in Table 2 that using channel attention in the first GCN module is better than using it in a later GCN module.
Effects of different scales branch combination, * indicates the best performing set of experiments in the table
We use the multi-scale TCN to model the time series. The number and type of branches have a great impact on the final modeling effect. Table 3 compares the effects of the combined models with different scale branches. D1 means
Attention influence of different actions

The top-six actions with the greatest accuracy improvement after using the graph channel attention. A27: jump up; A26: hopping; A25: reach into pocket; A23: hand waving; A31: point to something; A30: type on a keyboard.
Our graph channel attention approach has different effects on different skeleton actions. In Fig. 8, we show the accuracy of the six action categories that benefit most from graph channel attention. These actions can be divided into two types according to their motion dimensions. The first type has clear directionality. For example, jumping up and hopping are global movements that are completely perpendicular to the ground; reaching into a pocket and pointing to something are local movements with clear directionality. The second type moves within a fixed surface. For example, when waving a hand, the arm only moves in the frontal plane of the human body; when typing on a keyboard, the movement of both hands is basically limited to the surface of the keyboard. In contrast, channel attention has a weak effect on actions such as brushing teeth, falling down, taking a photo and wielding a knife. In summary, the actions that are more sensitive to graph channel attention show significant differences in information richness across dimensions and joints.
After determining the network architecture, the choice of hyperparameters is also critical to the model performance. There are four FC layers in our network for encoding features. The FC layers that encode the joint position features, the joint velocity features and the joint type have 64 nodes. The FC layer that encodes the frame index has 256 nodes. In Table 4, we explore the effect of the reduction ratio in the graph channel attention module. The experimental data in Table 4 show that the number of parameters is inversely proportional to the reduction ratio. In general, the accuracy decays faster as the reduction ratio increases. An appropriate reduction ratio can be used to balance accuracy and computational complexity. In this paper, the reduction ratio is set to 1 to achieve the best performance. Dropout is used only in the MS-TCN, with a dropout rate of 0.3.
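The inverse relationship between the reduction ratio and the parameter count follows directly from the size of a two-layer bottleneck, which has roughly 2C²/r weights for C channels and reduction ratio r. The short sketch below assumes this squeeze-and-excitation style bottleneck and a hypothetical channel width of 128; it does not reproduce the numbers in Table 4.

```python
def channel_attention_params(channels, reduction, bias=True):
    """Parameter count of a two-layer bottleneck: C -> C/r -> C."""
    hidden = channels // reduction
    weights = channels * hidden + hidden * channels
    biases = hidden + channels if bias else 0
    return weights + biases

# Hypothetical channel width; doubling r halves the weight count.
counts = {r: channel_attention_params(128, r, bias=False) for r in (1, 2, 4, 8, 16)}
for r, n in counts.items():
    print(f"reduction ratio {r:2d}: {n} weights")
```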
Graph channel attention with different reduction ratios, * indicates the best performing set of experiments in the table
In Table 5, we compare various typical methods such as: RNN-based [7,29,38,43,59], CNN-based [15,17], GCN-based [24,39,56], mixed methods-based [41,61] with our skeleton action recognition model CA-MSN (Fig. 3) on the NTU-RGBD60 dataset. In Table 6, we compare the results of our model on the NTU-RGBD120 dataset with other methods to prove the effectiveness of our method for fine-grained motions and object-related actions.
Classification accuracy comparison against state-of-the-art methods on the NTU-RGBD60 Skeleton dataset
Classification accuracy comparison against state-of-the-art methods on the NTU-RGBD120 Skeleton dataset
Some pure LSTM methods in Table 5, such as Part-Aware LSTM [38] and STA-LSTM [43], are about 20% less accurate than our method. The accuracy of typical CNN methods, such as Clips + CNN + MTLN [15] and RotClips + MTCNN [16] in Table 5 and Table 6, has not reached a level suitable for practical application. CA-MSN outperforms the typical spatio-temporal GCN method ST-GCN [56] in Table 5 by 7.9% in accuracy for the CS setting. These results show that using a single method to model both the temporal and spatial characteristics of skeleton sequences is limiting and cannot fully explore the potential spatial and temporal dependencies.
Compared with hybrid models that combine multiple methods, such as SR-TSL [42] and SGN [61], our CA-MSN explores the potential relationships between the frames and the channels of skeleton actions in more depth. In Table 5, CA-MSN improves accuracy over SR-TSL by 4.2% and 2.1% in the CS and CV settings. Notably, our method is the first to integrate the channel attention mechanism into the graph network for skeleton action recognition. The accuracy comparison results also verify the effectiveness of our method.
In the above research, we propose an advanced skeleton-based action recognition model CA-MSN. However, in actual applications, the input is usually RGB videos, while the input of CA-MSN model is 3D skeletons. In this chapter, we will explore how to extract human 3D skeleton sequences from raw videos and classify them with the CA-MSN model in series. As shown in Fig. 9, the OpenPose method is firstly used to extract 2D skeleton from the original RGB image. Subsequently, the semantic graph convolution network (SemGCN) learns the potential relationship between 2D skeleton sequences and 3D skeleton sequences to predict the 3D pose. Finally, the 3D pose is used as the input of the CA-MSN to obtain the final action classification result. It is worth noting that the three networks in Fig. 9 are trained separately and then used in series.

This figure shows the series network architecture of pose estimation and action recognition. The whole model has three parts: OpenPose, SemGCN and CA-MSN. Each sub-network uses the loss of its own task for training.
There are two mainstream methods for extracting 3D skeletons from monocular RGB images. The first method uses the deep learning model to establish an end-to-end mapping from monocular RGB images to 3D coordinates, but the features that need to be learned are too complex for a single model. The second method needs two steps. The first step is to get 2D skeleton using 2D pose estimation model. The second step is to regress the identified 2D skeleton to predict the 3D skeleton using the prior knowledge of the dataset. Although the end-to-end regression method is simple to operate, the accuracy is difficult to guarantee due to the complexity of feature mapping. So we use the two-step method. The 2D pose estimation is implemented using OpenPose proposed by Cao et al. [5] of Carnegie Mellon University (CMU) in 2017. The mapping encoding from 2D to 3D poses is managed by SemGCN proposed by Zhao et al. [63] in 2019. There are two reasons for choosing the SemGCN network. Above all, this method uses graph convolution networks to process joint coordinates, which is consistent with our CA-MSN network. This consistency facilitates subsequent tandem deployment. In addition, due to the serial deployment of three networks, it is difficult to guarantee the speed of calculation, so reducing the amount of calculation in each step is important. The SemGCN we choose has an order of magnitude smaller model size than other algorithms. Figure 10 shows the results of the 3D skeleton extraction experiment using the Human3.6M dataset. The first line is the extracted 2D skeleton; the second line is the 3D skeleton obtained by the 2D to 3D pose regression. We select some representative frames in several continuous actions for visualization. Comparing the predicted 3D skeleton sequence with the ground truth, the error is within an acceptable range. 
The observation confirms that it is feasible to predict the action using 3D skeleton sequences which are regressed from their corresponding 2D projections.

OpenPose-SemGCN-CA-MSN joint deployment test. The actions from top to bottom are eating, greeting and taking pictures. The first line of each action is RGB image + 2D Pose. The second line of each action is 3D Pose.
In actual applications, multiple models need to work together to realize the action recognition function. The end-to-end (RGB video to action type) network is obtained by serially deploying three models: OpenPose, SemGCN and CA-MSN. Figure 10 shows the test results of the joint deployment. We extract video clips of three different actions (eating, greeting and taking photos) as test samples, and visualize the RGB image with the 2D pose and the 3D pose. When the overall model is tested, the three submodels introduce large cumulative errors. There are two connection points in the series network. The first is the 2D pose output by OpenPose, which is the input of SemGCN; the second is the 3D pose output by SemGCN, which is the input of CA-MSN. Our improved OpenPose achieves a single-person AP of 91.3% on the Human3.6M dataset at an OKS threshold of 0.5. Therefore, 8.7% of the 2D poses output by OpenPose deviate substantially from the ground truth 2D poses in Human3.6M, and the actual 2D pose input to SemGCN differs significantly from the standard input. Similarly, the actual 3D pose input to CA-MSN differs significantly from the standard input. Because of the errors at these two stages, SemGCN and CA-MSN perform far worse inside our series network than when tested with standard data. To solve this problem, OpenPose is first pretrained on the COCO dataset and then fine-tuned on the Human3.6M dataset. SemGCN is trained with the predictions of the fine-tuned OpenPose model as input and still uses the 3D skeleton ground truth as the label. The skeleton action recognition model CA-MSN is also retrained on the Human3.6M dataset. With these changes to the training procedure, the domains of the three models are kept relatively close, which prevents the cumulative errors of the series models from seriously degrading the result.
Comparison with video-based models in terms of parameters, computation, FPS (measured on an NVIDIA RTX 2080 Ti GPU), and accuracy
This section combines the skeleton action recognition method with pose estimation to form an integrated network, which uses the skeleton output of the pose estimation to classify the actions in the video. The application shows that, compared with video-based recognition methods, skeleton action recognition involves more steps and suffers from cumulative errors. As shown in Table 7, we compare our skeleton-based method with several advanced video-based action recognition methods from recent years in terms of parameters, computation, speed and accuracy. O-S-C denotes our OpenPose-SemGCN-CA-MSN series network. In terms of accuracy, our method outperforms the RGB video-based methods of comparable computational complexity, because video-based methods are easily disturbed by visual features such as background and clothing and are therefore less robust than skeleton-based methods. In terms of computational complexity, our approach, with 11.81M parameters and 15.54 GFLOPs, has no obvious advantage, because the 2D pose estimation consumes most of the computing resources (84.7% of the parameters and 80.1% of the FLOPs). In future work, a more lightweight 2D pose estimation method would give the skeleton-based approach considerable potential in computational speed. Notably, our series network requires 5.53% fewer GFLOPs than the Gate-Shift method, yet its FPS is 40.81% lower. This shows that a series network is less computationally efficient at similar computational complexity. Making the series network more holistic is one direction for solving this problem.
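To make the cost breakdown concrete, the short calculation below splits the reported totals into the 2D pose estimation part and the remainder (2D-to-3D regression plus action recognition). It uses only the figures quoted in the text; the variable names are illustrative, and the per-stage values are derived here rather than taken from Table 7.

```python
# Totals and shares quoted in the text for the O-S-C series network.
TOTAL_PARAMS_M = 11.81      # million parameters
TOTAL_GFLOPS = 15.54        # GFLOPs
POSE2D_PARAM_SHARE = 0.847  # share of parameters used by 2D pose estimation
POSE2D_FLOP_SHARE = 0.801   # share of FLOPs used by 2D pose estimation

# Derived values (not reported in the source, computed from the shares above).
pose2d_params_m = TOTAL_PARAMS_M * POSE2D_PARAM_SHARE
pose2d_gflops = TOTAL_GFLOPS * POSE2D_FLOP_SHARE
rest_params_m = TOTAL_PARAMS_M - pose2d_params_m
rest_gflops = TOTAL_GFLOPS - pose2d_gflops

print(f"2D pose estimation: {pose2d_params_m:.2f}M params, {pose2d_gflops:.2f} GFLOPs")
print(f"Remaining stages:   {rest_params_m:.2f}M params, {rest_gflops:.2f} GFLOPs")
```

Under these figures, the 2D-to-3D regression and CA-MSN together account for roughly 1.8M parameters and 3.1 GFLOPs, which is why a lighter 2D pose estimator is the most direct route to reducing the overall cost.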
Viewpoint invariance
Viewpoint invariance is indispensable for the practical application of action recognition algorithms. Recognition networks are generally trained on images from a few fixed views, but in practice human actions may be observed from many different views. In Table 8, we compare the viewpoint invariance of video-based, 2D skeleton-based and 3D skeleton-based methods under Cross View settings. 3-TrainV denotes that three of the four views are used for training, and 1-TestV denotes that one of the four views is used for testing. With three views for training and one for testing, the 3D skeleton-based method achieves the best performance of 90.4%, the video-based method achieves 84.6%, and the 2D skeleton-based method performs worst at only 75.2%. In the more difficult task of training on two views and testing on two views, the 3D skeleton-based method drops by 6.3%, the video-based method by 8.7%, and the 2D skeleton-based method by 20%. In terms of viewpoint invariance, the methods therefore rank as 3D skeleton > video > 2D skeleton. The results of the 2D skeleton-based method are close to catastrophic, mainly because it is difficult to learn the 3D spatial motion relationships between joints directly from 2D poses. Moreover, because the 2D-to-3D pose regression method has few parameters, the 2D skeleton-based method has only a weak advantage in computational complexity. Based on the above analysis, the 3D skeleton-based method has clear advantages in view invariance.
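The ranking above can be checked with a few lines of arithmetic. The sketch below assumes that the quoted drops are absolute percentage points relative to the three-view accuracy; the two-view accuracies it prints are implied values under that assumption, and Table 8 remains the authoritative source.

```python
# Accuracies (%) with three training views and one test view, as quoted in the text.
three_view_acc = {"3D skeleton": 90.4, "video": 84.6, "2D skeleton": 75.2}
# Drops when moving to two training views and two test views.
# ASSUMPTION: these are absolute percentage points, not relative changes.
drop = {"3D skeleton": 6.3, "video": 8.7, "2D skeleton": 20.0}

# Implied two-view accuracies under the assumption above.
two_view_acc = {m: round(three_view_acc[m] - drop[m], 1) for m in three_view_acc}
# The same drops expressed relative to the three-view accuracy.
relative_drop = {m: round(100 * drop[m] / three_view_acc[m], 1) for m in three_view_acc}
# Methods ordered from most to least viewpoint-invariant (smallest drop first).
ranking = sorted(drop, key=drop.get)

for method in ranking:
    print(f"{method}: {three_view_acc[method]} -> {two_view_acc[method]} "
          f"(relative drop {relative_drop[method]}%)")
```

Whether the drops are read as absolute or relative, the ordering is the same: the 3D skeleton-based method degrades least and the 2D skeleton-based method degrades most.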
Comparison of the robustness of the three methods (video-based, 2D skeleton-based, 3D skeleton-based) to viewpoint changes under the Cross View setting
In this work, we propose a new spatio-temporal series network based on channel attention GCN and multi-scale TCN to improve the accuracy of skeleton action recognition. We explore the characteristics of the channel attention mechanism in the GCN network while extracting skeleton joint features. In addition, multi-scale dilated convolution enriches the temporal receptive field and benefits the recognition of actions with different cycle periods. Furthermore, we apply global maximum pooling to the joint features of each frame to connect the spatial module and the temporal module. Specializing the feature extraction in this way improves the efficiency of modeling feature representations. The final model exceeds the current state-of-the-art performance on two large-scale datasets: NTU-RGBD60 and NTU-RGBD120. Finally, we design an OpenPose-SemGCN-CA-MSN network to realize the end-to-end (RGB video-to-action type) application. From this application, we find that it is difficult to train the sub-models separately and then test the whole model in series. Therefore, future research will focus on how to design a multi-task framework that jointly estimates 2D or 3D human poses from color images and classifies human actions from video sequences. Meanwhile, the interpretability of GCN will also play a critical role in skeleton action recognition.
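As a small illustration of two mechanisms summarized above, the following pure-Python sketch shows global maximum pooling over the joints of each frame and the temporal receptive field of a dilated convolution. The tensor layout (frames, joints, channels), the function names, and the kernel size and dilation rates are assumptions chosen for illustration and are not the settings of CA-MSN.

```python
from typing import List


def global_max_pool_joints(features: List[List[List[float]]]) -> List[List[float]]:
    """Reduce features[t][v][c] to pooled[t][c] by taking the max over joints v.

    Each frame is collapsed to a single channel vector, which is what lets a
    purely temporal module consume the output of the spatial module.
    """
    pooled = []
    for frame in features:
        num_channels = len(frame[0])
        pooled.append([max(joint[c] for joint in frame) for c in range(num_channels)])
    return pooled


def dilated_receptive_field(kernel_size: int, dilation: int) -> int:
    """Frames covered by one 1D dilated convolution: (k - 1) * d + 1."""
    return (kernel_size - 1) * dilation + 1


if __name__ == "__main__":
    # Two frames, two joints, two channels (toy values).
    feats = [[[1.0, 5.0], [3.0, 2.0]],
             [[0.0, 0.0], [4.0, -1.0]]]
    print(global_max_pool_joints(feats))

    # Hypothetical multi-scale branches: same kernel, different dilation rates.
    for d in (1, 2, 3):
        print(f"kernel 3, dilation {d}: covers {dilated_receptive_field(3, d)} frames")
```

Branches with different dilation rates see different temporal spans at the same parameter cost, which is the sense in which the multi-scale design enriches the temporal receptive field.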
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant U1713211, Grant 62073245, and Grant 61733013; in part by Pudong New Area Science and Technology Development Fund under Grant PKX2019-R18; and in part by Jiangsu Key Research and Development Project under Grant BE2020101.
