Multiple Distilling-based spatial-temporal attention networks for unsupervised human action recognition

Abstract

Unsupervised action recognition based on spatiotemporal fusion feature extraction has attracted much attention in recent years. However, existing methods still have several limitations: (1) Long-term dependencies are not effectively extracted at the temporal level. (2) High-order motion relationships between non-adjacent nodes are not effectively captured at the spatial level. (3) Model complexity is too high when the input sequence of the cascaded layers is long or there are many key points. To solve these problems, a Multiple Distilling-based spatial-temporal attention (MD-STA) network is proposed in this paper. The model extracts temporal and spatial features separately and then fuses them. Specifically, we first propose a Screening Self-attention (SSA) module, which finds long-term dependencies between distant frames and high-order motion patterns between non-adjacent nodes in a single frame through a sparsity measure on dot-product pairs. Then, we propose the Frames and Keypoint-Distilling (FKD) module, which uses extraction operations to halve the input of each cascaded layer, eliminating invalid key point and time frame features and thus reducing time and memory complexity. Finally, the Dim-reduction Fusion (DRF) module is proposed to reduce the dimension of the extracted features and further eliminate redundancy. Extensive experiments on three datasets, NTU-60, NTU-120, and UWA3D, show that MD-STA achieves state-of-the-art performance in skeleton-based unsupervised action recognition.

Keywords

3D human motion prediction distilling unsupervised attention

1. Introduction

With the rapid development of science and technology, artificial intelligence has become one of the most important development directions in all walks of life. Human motion recognition is an important research area that has been widely applied in fields such as human-computer interaction, film and television production, semi-automatic driving, motion analysis, and game entertainment [1, 2, 3]. While supervised human motion models have achieved remarkable success, they still face significant challenges with large-scale datasets, real-world engineering environments, and production scenarios. First, very large datasets cannot be fully labeled and annotated without a great deal of manpower and resources. Second, due to cognitive errors, the assigned labels may be inaccurate or untrue to a certain extent.

Various formats can be used to represent motion data, including video (RGB), depth (+D), and 3D skeletal keypoint data. 3D skeleton key point data is more difficult to process than other data, but it can describe human motion more accurately while avoiding the influence of body shape and clothing as much as possible, and its robustness can deal with the interference of environmental noise. Therefore, in this study, we use the 3D skeleton key point data as the basic data.

At present, there are many unsupervised training models, but they can be roughly divided into three categories according to their overall structure. The first is the encoder-decoder model [4, 5, 6], where the input skeleton sequence first enters the encoder, is transformed into hidden features, and the hidden features then enter the decoder to produce the output sequence. Zheng et al. [4] extracted long-term global motion dynamics in skeleton sequences and designed a conditional skeleton inpainting architecture for learning fixed-dimensional representations. Nie et al. [5] proposed a new Siamese denoising autoencoder, which uses the pose-dependent and view-dependent features separated from human skeleton data as a three-dimensional pose representation and combines the kinematics and geometry of the human skeleton; a sequential bidirectional recurrent network (SeBiReNet) is proposed to model human skeleton data. Ahn et al. [6] proposed the Spatiotemporal Crossover (STAR) Transformer, which can effectively represent two cross-modal features, spatiotemporal features and skeleton features, as one identifiable vector. The encoder-decoder model is now very mature thanks to various tasks, such as skeleton reconstruction [4], skeleton coloring prediction [7], and skeleton displacement prediction [8], and has achieved significant results. The second method is contrastive learning [9, 10, 11]. Because of the high degree of freedom in defining positive and negative samples and its excellent performance, it has become an important research direction in unsupervised learning. This method mainly learns the common features of similar instances and distinguishes the differences between different instances. Rao et al. [9] proposed a contrastive action learning paradigm called AS-CAL, which applies augmentations of different strengths to skeletons to learn action representations. Thoker et al. [10] let the network learn higher-level semantics of 3D skeleton data by learning the similarity between different skeleton representations and enhanced views of the same sequence. Gao et al. [11] proposed a new method for contrastive self-supervised learning that uses enough viewpoints for comparison and finally selects distinguishing features for action recognition. Contrastive learning does not need to attend to the cumbersome details of each instance but only needs to learn to distinguish data in an abstract feature space. Therefore, the complexity of the model is low, and its optimization becomes simple. At the same time, it has a strong generalization ability. The third category is the hybrid method [12, 13, 14], a variant that combines an encoder-decoder model with a contrastive learning model. The parameters and optimization process of this model are extremely complex, but it has achieved some success so far. Su et al. [12] proposed a predict & cluster model, using a weakened decoder, namely a fixed-weight decoder, to force the encoder to learn more effective features, which are then fed into a linear classifier for classification after dimensionality reduction. Su et al. [13] proposed a new self-supervised learning (SSL) method to learn 3D skeleton representations, which interpolates skeleton data to construct continuous skeleton data and models it to enhance the learned features. Chen et al. [14] proposed a self-supervised hierarchical pre-training scheme to learn spatial, short-term, and long-term temporal dependencies.

From the perspective of feature extraction, current skeleton sequence processing methods mainly start from two perspectives: time and space. At the temporal level, many previous methods used RNNs to model time series [4, 12, 15, 16]. Zheng et al. [4] adopted a new conditional framework to capture the long-term global motion dynamics of sequences of different lengths, and Su et al. [12] used an RNN to extract features and reduce the dimension of time series. Lin et al. [15] learn temporal characteristics by solving jigsaw puzzles, and Gao et al. [16] propose Multi-ScaleTCN to capture richer features between adjacent frames. At the spatial level, many existing methods process the overall skeleton sequence and extract the features of the global skeleton sequence [1, 15, 16]. Su et al. [1] use a weakened decoder to force the encoder to learn more global skeleton features. Lin et al. [15] model skeleton dynamics through motion prediction by predicting future sequences. Gao et al. [16] introduce the perception of joint types in the analysis of motion dynamics. Our approach takes into account the characteristics of both perspectives.

However, these methods have the following problems:

(1) Some methods that work from the temporal perspective focus only on features between adjacent frames and rarely extract connections between frames that are far apart in the sequence, which not only loses a lot of key information but also fails to capture long-term dependencies well. Although the models [14, 17, 18] that apply the Transformer extract all features between frames, most of the extracted features are actually invalid, and messy, invalid features affect the prediction results of the decoder. As shown in Fig. 1, the key points of the last few frames of the drinking action change very little. Most methods that work from the spatial perspective only convert the skeleton sequence into a sequence vector or a two-dimensional pseudo-image. Although relatively complete features of the skeleton sequence are extracted in this way, these methods do not pay special attention to the important features at the joints and even ignore the internal relations of a single key point. Different dimensions of a single key point can yield different features, and different three-dimensional perspectives have different effects on different actions. Even the small number of methods that extract a large number of skeleton key point features [19], like the methods that extract features from a sequence perspective, attend to a large number of invalid key points. In fact, in human skeleton action recognition, the displacement of some key points from the first frame to the last frame is very small, and the features extracted from these key points are not only invalid but also impair the effectiveness of the spatial features.

Figure 1.
Visualization of the original skeleton sequence ‘S001C001P002R002A001.skeleton’. The last three frames of the drinking action are frame 82, frame 83, and frame 85. The key points of each of these frames change little compared to the key points of other frames, and the attention between them is basically invalid. Therefore, we ignore these frames when calculating attention and eliminate them in the distilling module.

(2) For methods that extract all features between frames and a large number of skeleton key point features, the spatiotemporal complexity is high. When the input sequence is long, the key points are numerous, and M encoder/decoder layers are stacked, the total memory usage of the time stream is $\mathcal{O}\left(M\cdot L^{2}\right)$ and that of the spatial stream is $\mathcal{O}\left(M\cdot N^{2}\right)$. Because of the many invalid temporal and spatial features, normal training becomes very difficult and the hardware requirements become too high, which limits the scalability of the model when receiving long input sequences.

In response to the first problem, we propose a module called distilling and dim-reduction-based attention (DDA), which combines a self-attention mechanism and traditional RNNs to jointly capture the temporal and spatial features of 3D skeleton sequences. Through the Screening Self-attention (SSA) module, it extracts the relationships between distant frames in the time series and between distant key point coordinates in the spatial series, and thereby captures the long-term dependencies within the time frame sequence and the spatial key point sequence. Through the Frames and Keypoint-Distilling (FKD) module, invalid frames and invalid key point coordinates are eliminated to reduce the influence of cluttered features.

For the second problem, we propose the Multiple Frames and Keypoint-Distilling-attention module, which selects the best dot-product pairs and avoids the calculation of invalid dot-product pairs. The computation of a single attention block is reduced from $\mathcal{O}\left(L^{2}\right)$ to $\mathcal{O}(L\log L)$, and the input spatiotemporal sequence is repeatedly halved, which greatly reduces the total computation in the time stream from $\mathcal{O}(L\log L)$ to $\mathcal{O}\left(2\left(1-1/2^{M}\right)L\ln L+c\right)$. In the spatial stream, it decreases from $\mathcal{O}(N\log N)$ to $\mathcal{O}\left(2\left(1-1/2^{M}\right)N\ln N+c\right)$, where c is a relatively small negative number.
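The halving argument can be checked numerically. The sketch below is an illustration only: it assumes that each of the M stacked attention layers costs $l\ln l$ on its own input length $l$ (unit constant) and that distilling halves $l$ between layers. The function names are hypothetical. Under these assumptions the total splits into the leading term $2(1-1/2^{M})L\ln L$ and a negative remainder $c=-L\ln 2\sum_{j=0}^{M-1}j/2^{j}$.

```python
import math

def stacked_cost(L, M):
    """Sum of per-layer costs (L/2^j) * ln(L/2^j), j = 0..M-1.

    Assumes (for illustration) that the input length is halved
    between consecutive attention layers.
    """
    return sum((L / 2**j) * math.log(L / 2**j) for j in range(M))

def closed_form(L, M):
    """Leading term 2(1 - 1/2^M) L ln L and the remainder c."""
    lead = 2 * (1 - 1 / 2**M) * L * math.log(L)
    # c collects the -j*ln(2) contributions of the shortened layers
    c = -L * math.log(2) * sum(j / 2**j for j in range(M))
    return lead, c

L, M = 50, 3  # sequence length and number of attention layers used as an example
total = stacked_cost(L, M)
lead, c = closed_form(L, M)
print(total, lead + c, c)
```

The two printed totals agree, and the remainder is negative, which matches the description of c as a negative correction to the leading term.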

Figure 2.
A description of our framework: It is a two-flow Embedding-encoder-decoder architecture. Embedding is composed of two independent parts, scalar projection and local time stamp (location), realizing position encoding and preserving local context. The module after Embedding is the encoder, which passes the learned features to the decoder to generate a new sequence. The upper branch (yellow) is the feature extraction process of the time frames, while the lower branch (blue) is the feature extraction process of key points in space. Each branch passes through three layers of Attention and two layers of Distilling. Take the blue branch as an example. Glowing key points indicate the key points with stronger features in Attention. After Distilling, the glowing key points are kept for the next Attention step, while the non-glowing key points are distilled away by the Distilling module.

Based on the above two designs, we propose the Multiple Distilling-based spatial-temporal attention (MD-STA) network. Our model is divided into two paths, temporal and spatial, as shown in Fig. 2. It is composed of Embedding, Screening Self-attention (SSA), Frames and Keypoint-Distilling (FKD), and Dim-reduction Fusion (DRF) modules.

Extensive experiments have been carried out on three datasets: NTU-60, NTU-120, and UWA3D. We show that the proposed MD-STA model performs well under the benchmarks of different dataset splits, including cross-view, cross-subject, and cross-setup, and significantly outperforms other methods. Our visualization results, including t-SNE and the confusion matrix, also demonstrate the effectiveness of our model. The main contributions of this work are summarized as follows:

• We propose a Multiple Distilling-based Spatial-temporal Attention Network (MD-STA), which aims to capture effective and comprehensive motion correlations along the temporal and spatial channels and to resolve the problems of extracting long-term dependencies and capturing high-order motion relationships between non-adjacent nodes for motion relationship learning.
• In MD-STA, we propose probability-based Multiple Frames and Keypoint-Distilling-attention, which contains two core modules: Screening Self-attention (SSA), which extracts the most effective motion representations at the temporal and spatial levels, respectively, and Frames and Keypoint-Distilling (FKD), which reduces the time complexity and memory consumption of the training process.
• Extensive experiments were conducted to verify our model quantitatively. In the cross-subject and cross-view settings of the NTU-60 dataset, the model accuracy is 21.5% and 7.2% higher than the existing methods, respectively. In the cross-subject and cross-setup settings of the NTU-120 dataset, the model accuracy is 11.7% and 14.8% higher than the existing methods, respectively.

2. Related work

2.1 Unsupervised action recognition

Unsupervised skeleton-based action recognition learns important features of skeleton actions from a large amount of unlabeled data, that is, without assigning class labels, and then classifies actions. Because it is lightweight and robust, this line of work has attracted extensive attention from researchers. Zheng et al. [4] originally proposed GAN encoders, which was a milestone in unsupervised training. Su et al. [12] proposed a weak decoder model. Since the decoder needs to regenerate the sequence, this forces the encoder to generate more powerful features and ultimately trains a better encoder. Lin et al. [15] combined various methods, such as motion prediction and skeleton reconstruction, to learn the features of skeleton sequences from different aspects. Ahn et al. [6] proposed the Spatiotemporal Crossover (STAR) Transformer, which can effectively fuse the two cross-modal features of spatiotemporal features and skeleton features. Gao et al. [16] and Rao et al. [9] proposed contrastive learning approaches, which are also a recent mainstream direction, to reduce the gap between classes. Thoker et al. [10] made the network learn a higher level of 3D skeleton data semantics by learning the similarity between different skeleton representations and enhanced views of the same sequence. Chen et al. [14] proposed a hierarchical Transformer to learn more spatial features. In our work, the encoder effectively integrates multidimensional features, and its performance is improved by combining it with a weak decoder.

2.2 Attention mechanism

The attention mechanism is a structure used to learn and calculate the contribution of input data to output data. Bahdanau et al. [20] first proposed additive attention, which reduced word misalignment in the translation task of the encoder-decoder structure. Then Luong et al. [21] proposed location, general, and dot-product attention, significantly improving the accuracy of English-German two-way translation. Later, Vaswani et al. [22] proposed the self-attention-based Transformer model with higher parallelism, significantly improving the BLEU score and reducing the time spent on model training. Ma et al. [23] proposed Cross-Dimensional Self-Attention (CDSA) to capture attention in multiple dimensions, such as time and space, and reduce computational complexity. Child et al. [24] proposed a sparse factorization of the attention mechanism for fast model training. In our work, the model combines attention mechanisms with traditional RNNs to capture multidimensional features of 3D skeleton sequences.

2.3 Extraction of temporal and spatial features

Earlier methods represent data as 2D and 3D coordinate vector sequences and use CNNs and RNNs to process the data [25, 26, 27, 28]. Liu et al. [26] proposed a method to calculate the size and direction of 3D skeleton sequences. Zhang et al. [27] proposed a method to convert 3D skeleton sequences into pseudo-images. Dedeoglu et al. [29] proposed a method for real-time human action recognition using the degree of contour difference, and Yu et al. [19] proposed using an adaptive convolutional neural network to learn spatiotemporal features. However, these models extract the features of spatial key points poorly and largely ignore the interaction of some human key points, resulting in the loss of a large number of features, and human body mechanics are not taken into account. Liu et al. [28] proposed using an RNN to capture the global dependence of human motion from 3D skeletal motion sequences. However, this cannot extract features between pairs of skeletal joints in the human body. Cheng et al. [30] proposed a GCN-based method to capture the geometric spatial dependence of joints in the three-dimensional skeleton and achieved good results. Peng et al. [31] proposed a dynamic GCN to improve the capability of the model. Liu et al. [32] proposed multi-scale spatio-temporal GCNs to enhance the ability of the model to extract spatio-temporal features. However, their performance in extracting key features that are far away from each other is still poor. Spatiotemporal feature fusion is still the bottleneck of existing methods. Some methods extract only temporal features or only spatial features. Even for models that extract both, the performance of the fused features is not encouraging and can even be lower than that of a single branch, because the two kinds of features represent different semantics. In our work, our model obtains excellent results in extracting features of spatial key points, not only extracting features among a large number of key points but also extracting the internal relations of a single key point.

3. Method

In this part, we first preprocess the input sequence, including downsampling, noise reduction, and removing a few frames, to adapt it to our model. After reshaping, the processed input sequences are mapped to higher-dimensional spaces in the Embedding layer to realize position encoding and retain the local context. The mapped sequences then enter our Screening Self-attention (SSA) module, which produces a large number of features, and subsequently the Frames and Keypoint-Distilling (FKD) module, where these features are distilled into a few more important ones. In fact, the data passes through the SSA and FKD modules twice to obtain the most important features. Then, the data flows into the Dim-reduction Fusion (DRF) module for dimensionality reduction, and the temporal and spatial features of the two streams are fused to obtain the final feature. The 3D skeleton action recognition task is completed with a KNN classifier.

At the temporal level, our work maps the spatial key points of each frame to a higher-dimensional space, achieves position encoding in high-dimensional time, and retains the local context. Next, we calculate partial dot products over the whole time frame sequence to obtain a large number of temporal features. Moreover, we selectively calculate the attention probability distribution $p\left(\textbf{k}_{j}\mid\textbf{q}_{i}\right)$ to distinguish the dot-product pairs that make up our main attention, where $\textbf{q}_{i}$ and $\textbf{k}_{j}$ stand for the $i$-th row of Q and the $j$-th row of K, respectively. We use Screening Self-attention (SSA) to replace the full self-attention mechanism, computing the relationships between selected frames with the least amount of computation while making sure that our model does not miss the most important frames in the entire sequence. It reduces the memory usage and time complexity of each layer from $\mathcal{O}\left(L^{2}\right)$ to $\mathcal{O}(L\log L)$ (L is the sequence length). We then use Frames Distilling to highlight high-scoring attention; by eliminating low-scoring attention, our work avoids the effect of ineffective temporal features on training, reduces memory usage, and thus dramatically reduces the total space complexity from $\mathcal{O}(L\log L)$ to $\mathcal{O}\left(2\left(1-1/2^{M}\right)L\log L+c\right)$ (M indicates the number of stacked layers). In addition, the distillation module greatly increases the input sequence length that the model can handle: no matter how long the sequence is, its length L eventually becomes L/4, giving superior performance in capturing long-term dependencies.

Figure 3.

Detail diagram of our time flow branch, which has an Embedding-encoder-decoder architecture. Embedding consists of two independent parts: scalar projections (blue blocks) and local timestamps (yellow blocks). The 75-dimensional key points enable position encoding and preserve local context. The middle gray module is an encoder comprising the above six modules and realizes operations such as feature extraction and distillation. The encoder passes the learned fusion features (purple blocks) to the decoder to generate a new sequence, and the loss is calculated between the generated sequence and the forward evolution sequence to complete the human action recognition task.

At the spatial level, we map the spatial key points of each frame to a higher-dimensional space, realize position encoding in the high-dimensional space, and retain the local context. Next, we calculate partial dot products over the whole key point sequence to obtain a large number of spatial features. We calculate not only the dot products between key points that are far apart or close together but also the dot products of different coordinate dimensions for different key points and of different coordinates for the same key point, and we selectively compute the attention probability distribution $p\left(\textbf{k}_{j}\mid\textbf{q}_{i}\right)$ to distinguish the dot-product pairs that make up our main attention. We use Screening Self-attention (SSA) to replace the full self-attention mechanism, computing the relationships between selected key point coordinates with the least amount of computation while ensuring that the more effective key point coordinates in the spatial sequence are not missed. It reduces the memory usage and time complexity of each layer from $\mathcal{O}\left(N^{2}\right)$ to $\mathcal{O}(N\log N)$ (N is the number of key point coordinates). We use Keypoint Distilling to highlight the key point coordinate attention with high scores. By eliminating the key point coordinate attention with low scores, we greatly reduce the impact of invalid spatial features on training. Moreover, the distillation module greatly increases the number of input key points that the model can handle: no matter how many key point coordinates there are, their number N eventually becomes N/4, giving superior performance in handling a large number of key point dependencies.

The output after processing by the above modules is divided into two parts: spatial flow output and time flow output. They respectively obtain the output of the encoder through the Dim-reduction Fusion (DRF) module. In the Dim-reduction Fusion (DRF) module, we use the two-way flow, which not only reduces the dimension of the time feature and the space feature but also reduces the redundancy of the time feature and the space feature after distillation and integrates them together. It fits our current model better than other neural networks, and this module improved the overall model score by 4 percentage points. Finally, we improve the feature extraction ability of the encoder combined with the fixed-state decoder.

Time frame and key point data preprocessing: We preprocess the input data so that it fits our model better. The input form for the data used in this article is as follows.

In the time flow branch, we are given an input skeleton sequence $\mathcal{X}=\left\{\mathcal{X}^{t}\mid t=1,\ldots,K\right\}$ , where K represents the number of input frames. Every frame contains H joints, and the $h^{\text{th}}$ joint in the $t^{\text{th}}$ frame is denoted as $v_{t,h}=\left[x_{t,h},y_{t,h},z_{t,h}\right]$ . The $t^{\text{th}}$ frame is then denoted as

$\displaystyle\mathcal{X}^{t}=\left\{v_{t,h}=\left[x_{t,h},y_{t,h},z_{t,h}\right]\mid t\in\{1,\ldots,K\};h=1,\ldots,H\right\}.$ (1)

In the space flow branch, we are given an input skeleton sequence $\mathcal{X}=\left\{\mathcal{X}^{n}\mid n=1,\ldots,N\right\}$ , where N represents the number of input key point coordinates. Each key point coordinate contains T frames, and the $t^{\text{th}}$ frame of the $n^{\text{th}}$ key point coordinate is denoted as $v_{n,t}$ . The $n^{\text{th}}$ key point coordinate is then denoted as

$\displaystyle\mathcal{X}^{n}=\left\{v_{n,t}\mid n\in\{1,\ldots,N\};t=1,\ldots,T\right\}.$ (2)

In order to enhance the learning ability of the model, we randomly zero two frames of each action of the training data and keep the test data unchanged. In this way, the training difficulty is improved, the features encoded by the encoder are stronger, and the robustness of the model is improved.
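The two input layouts and the frame-zeroing augmentation can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function names are hypothetical, and the shapes (K = T = 50 frames, H = 25 joints, hence N = 3H = 75 key point coordinates) are assumptions chosen to match the 75-dimensional key points mentioned in the caption of Figure 3.

```python
import numpy as np

def make_streams(seq):
    """Build both views of one skeleton sequence of shape (K, H, 3).

    time stream : one row per frame, all joint coordinates flattened
    space stream: one row per key point coordinate, one column per frame
    """
    K, H, _ = seq.shape
    time_stream = seq.reshape(K, H * 3)   # (K, 3H), rows follow Eq. (1)
    space_stream = time_stream.T          # (3H, K), rows follow Eq. (2)
    return time_stream, space_stream

def zero_random_frames(seq, num_frames=2, rng=None):
    """Return a copy with `num_frames` randomly chosen frames set to zero."""
    rng = np.random.default_rng() if rng is None else rng
    out = seq.copy()
    idx = rng.choice(seq.shape[0], size=num_frames, replace=False)
    out[idx] = 0.0
    return out, idx

rng = np.random.default_rng(0)
seq = rng.normal(size=(50, 25, 3))        # synthetic stand-in for a real sequence
time_stream, space_stream = make_streams(seq)
augmented, zeroed = zero_random_frames(seq, rng=rng)   # training data only
print(time_stream.shape, space_stream.shape, sorted(zeroed))
```

Only training sequences would be passed through `zero_random_frames`; test sequences stay unchanged, as described above.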

Hyper-parameter search: We use hyperparameter search so that the network can achieve the best performance. In small-scale cases, hyperparameters can be tuned one at a time with the variable control method, which adjusts the value of one variable while keeping the others unchanged and repeats the experiment. The same can be done with more variables, but they interact with each other, so to obtain a more accurate model and better results, we use the standard grid search method. Each combination of hyperparameter values requires end-to-end training of the model. Our hyperparameters, such as the embedding dimension, the number of convolution layers, and the number of RNN cells, form an essentially small search space, and we want to make sure that we find the best option, so we use grid search, which tries all possible combinations to obtain the optimal solution within that space.
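A grid search of this kind can be sketched in a few lines. The search space values and the scoring function below are hypothetical placeholders: in practice `score_fn` would train the model end to end for one configuration and return its validation accuracy.

```python
from itertools import product

def grid_search(space, score_fn):
    """Evaluate every combination in `space` and return the best one."""
    names = list(space)
    best_cfg, best_score = None, float("-inf")
    for values in product(*(space[name] for name in names)):
        cfg = dict(zip(names, values))
        score = score_fn(cfg)             # one full training run per combination
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical search space; the actual candidate values are not given in the text.
space = {
    "d_model": [256, 512],
    "conv_layers": [1, 2, 3],
    "rnn_cells": [1024, 2048],
}

def toy_score(cfg):
    # Stand-in objective so the sketch runs without training a network.
    return (-abs(cfg["d_model"] - 512)
            - abs(cfg["conv_layers"] - 2)
            - abs(cfg["rnn_cells"] - 2048) / 1024)

best_cfg, best_score = grid_search(space, toy_score)
print(best_cfg, best_score)
```

The number of training runs is the product of the list lengths (12 here), which is why this approach is only practical for a small search space.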

Temporal and Spatial Sequence embedding: We use the Embedding layer to preserve the local context. Specifically, we are given an input skeleton sequence $\mathcal{X}=\left\{\mathcal{X}^{t}\mid t=1,\ldots,K\right\}$ , and the feature dimension after mapping is $d_{\text{model}}$ . We first use fixed-position embedding to preserve local context:

$\displaystyle\text{PsEb}_{(pos,2n)}=\sin\left(pos/\left(2L_{x}\right)^{2n/d_{\text{model}}}\right),$
$\displaystyle\text{PsEb}_{(pos,2n+1)}=\cos\left(pos/\left(2L_{x}\right)^{2n/d_{\text{model}}}\right),$ (3)

where $n\in\left\{1,\ldots,\left\lfloor d_{\text{model }}/2\right\rfloor\right\}$ , $L_{x}$ is the length of the input sequence, pos represents the position of the frame in the sequence, $d_{\text{model }}$ is the dimension of the frame vector (512 in this article), $2n$ represents the even dimension in $d_{\text{model}}$ , and $(2n+1)$ represents the odd dimension.

We project the scalar context $\textbf{x}_{i}^{t}$ into $d_{\text{model }}$ -dim vector $\textbf{u}_{i}^{t}$ . In this process, we use 1-D convolutional filters (kernel width $=$ 3, stride $=$ 1):

$\displaystyle\textbf{X}_{\text{feed }[i]}^{t}=\alpha\textbf{u}_{i}^{t}$ (4)

where $\alpha$ is the factor balancing the magnitude between the scalar projection and local embeddings. In our model, $\alpha=$ 1 because the sequence input has been normalized. Finally, we have the feeding vector:

$\displaystyle\mathcal{X}_{\text{feed}[i]}^{t}=\textbf{X}_{\text{feed}[i]}^{t}+\text{PsEb}_{\left(L_{x}\times(t-1)+i,\right)}$ (5)

where $i\in\left\{1,\ldots,L_{x}\right\}$ , $t\in\left\{1,\ldots,K\right\}$ .
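Equations (3) to (5) can be illustrated with a short NumPy sketch. Several details are assumptions made only for this example: the index n runs from 0 so that the even and odd columns fill exactly $d_{\text{model}}$ dimensions, the 1-D convolution uses zero padding to keep the sequence length, the convolution weights are random rather than learned, and a single sequence (t = 1) with 75 input channels is embedded. The function names are hypothetical.

```python
import numpy as np

def fixed_position_embedding(num_pos, d_model, L_x):
    """Sinusoidal embedding of Eq. (3) with base 2 * L_x (d_model assumed even)."""
    pos = np.arange(num_pos)[:, None]              # (num_pos, 1)
    n = np.arange(d_model // 2)[None, :]           # (1, d_model / 2), zero-based
    angle = pos / (2 * L_x) ** (2 * n / d_model)
    pe = np.zeros((num_pos, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions
    return pe

def scalar_projection(x, W):
    """1-D convolution over time with kernel width 3 and stride 1.

    x: (L, c_in), W: (3, c_in, d_model). Zero padding is an assumption.
    """
    L = x.shape[0]
    xp = np.pad(x, ((1, 1), (0, 0)))
    return sum(xp[k:k + L] @ W[k] for k in range(3))

rng = np.random.default_rng(0)
L_x, c_in, d_model = 50, 75, 512
x = rng.normal(size=(L_x, c_in))                   # one normalized input sequence
W = rng.normal(scale=0.02, size=(3, c_in, d_model))
alpha = 1.0                                        # Eq. (4), input already normalized
pe = fixed_position_embedding(L_x, d_model, L_x)
feed = alpha * scalar_projection(x, W) + pe        # Eq. (5) for t = 1
print(feed.shape)
```

Using the base $2L_{x}$ ties the wavelength range of the embedding to the input length, so positions within one sequence remain distinguishable.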

Screening Self-attention: We propose Screening Self-attention so that the network can better represent the skeletal action with less computation. The full self-attention in [22] is defined on queries, keys, and values.

The formula for this is $\mathcal{A}(\textbf{Q},\textbf{K},\textbf{V})=\operatorname{Softmax}(\textbf{Q K}^{\top}/\sqrt{d})\textbf{V}$ , where $\textbf{Q}\in\mathbb{R}^{L_{Q}\times d},\textbf{K}\in\mathbb{R}^{L_{K}\times d}$ , $\textbf{V}\in\mathbb{R}^{L_{V}\times d}$ and $d$ is the input dimension. For the sake of further discussion, following the formulation in [33], the $i$ -th query’s attention is defined as a kernel smoother in a probability form:

$\displaystyle\mathcal{A}\left(\textbf{q}_{i},\textbf{K},\textbf{V}\right)=\sum_{j}\frac{k\left(\textbf{q}_{i},\textbf{k}_{j}\right)}{\sum_{l}k\left(\textbf{q}_{i},\textbf{k}_{l}\right)}\textbf{v}_{j}=\mathbb{E}_{p\left(\textbf{k}_{j}\mid\textbf{q}_{i}\right)}\left[\textbf{v}_{j}\right]$ (6)

where $\textbf{q}_{i},\textbf{k}_{i},\textbf{v}_{i}$ are the $i$ -th rows of $\textbf{Q},\textbf{K},\textbf{V}$ , respectively, $k(\textbf{q}_{i},\textbf{k}_{j})$ is chosen as the kernel $\exp(\textbf{q}_{i}\textbf{k}_{j}^{\top}/\sqrt{d})$ , and $p\left(\textbf{k}_{j}\mid\textbf{q}_{i}\right)=k\left(\textbf{q}_{i},\textbf{k}_{j}\right)/\sum_{l}k\left(\textbf{q}_{i},\textbf{k}_{l}\right)$ . Full self-attention calculates the probability $p\left(\textbf{k}_{j}\mid\textbf{q}_{i}\right)$ for every query-key pair and then combines the values into the output. It requires a huge amount of computation and occupies a very large amount of memory, which has become one of the bottlenecks of recognition ability.

However, in practice only a few frames and key points in the whole sequence have effective attention, and most frames have very limited or even ineffective attention with the other parts of the sequence, so we select only a small number of frames and key points to calculate the attention, which greatly reduces the computational complexity:

$\displaystyle\mathcal{A}(\textbf{Q},\textbf{K},\textbf{V})=\operatorname{Softmax}\left(\frac{\overline{\textbf{Q}}\textbf{K}^{\top}}{\sqrt{d}}\right)\textbf{V},$ (7)

where $\overline{\textbf{Q}}$ contains only the selected queries, namely those with the largest values of $M(\textbf{q},\textbf{K})$. The measure $M(\textbf{q},\textbf{K})$ is obtained by rearranging the Kullback-Leibler divergence:

$\displaystyle M\left(\textbf{q}_{i},\textbf{K}\right)=L_{K}\ln\sum_{j=1}^{L_{K}}e^{\frac{\textbf{q}_{i}\textbf{k}_{j}^{\top}}{\sqrt{d}}}-\sum_{j=1}^{L_{K}}\frac{\textbf{q}_{i}\textbf{k}_{j}^{\top}}{\sqrt{d}}$ (8)

Let the sampling factor be $a$ and set $b=a\cdot\ln L_{Q}$, the number of selected queries. Only $\mathcal{O}\left(\ln L_{Q}\right)$ dot products then need to be calculated per key, so the amount of computation is greatly reduced and the layer memory becomes $\mathcal{O}\left(L_{K}\ln L_{Q}\right)$. We take the Top-$b$ selection from a multi-head perspective to make sure that no critical information is missed. To evaluate the measure itself cheaply, we extract only $B=L_{K}\ln L_{Q}$ dot-product pairs to calculate the approximation $\bar{M}\left(\textbf{q}_{i},\textbf{K}\right)$; since some of the operators in $\bar{M}\left(\textbf{q}_{i},\textbf{K}\right)$ are insensitive to 0 and numerically stable, we set the other dot products to 0. In our time flow model, $L_{Q}=L_{K}=L=50$, so the time and space complexity drop from $\mathcal{O}(L^{2})$ to $\mathcal{O}(L\ln L)$.
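The selection step of Eqs. (7) and (8) can be sketched as follows. This is a simplified illustration under stated assumptions. For clarity it evaluates $M$ exactly on all pairs instead of on a sampled subset, so it does not reproduce the cost saving itself. The sampling factor `a=5` is a hypothetical value. The treatment of unselected queries (they receive the mean of V) is an assumption, since the text does not specify it.

```python
import numpy as np

def sparsity_measure(Q, K):
    """M(q_i, K) of Eq. (8): L_K * ln-sum-exp of the scores minus their sum."""
    d = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d)                               # (L_Q, L_K) scaled scores
    m = s.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(s - m).sum(axis=1))      # stable ln sum_j exp(s_ij)
    return K.shape[0] * lse - s.sum(axis=1)

def screening_attention(Q, K, V, a=5):
    """Attend with only the top-b queries, b = ceil(a * ln L_Q)."""
    L_Q, d = Q.shape
    b = min(L_Q, int(np.ceil(a * np.log(L_Q))))
    top = np.argsort(sparsity_measure(Q, K))[-b:]          # indices of the selected queries
    # Assumption: unselected queries fall back to the mean of V.
    out = np.tile(V.mean(axis=0), (L_Q, 1))
    s = Q[top] @ K.T / np.sqrt(d)                          # only b x L_K dot products
    s -= s.max(axis=1, keepdims=True)
    p = np.exp(s)
    p /= p.sum(axis=1, keepdims=True)
    out[top] = p @ V
    return out, top

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((50, 64)) for _ in range(3))
out, top = screening_attention(Q, K, V)
print(out.shape, len(top))                                 # (50, 64) 20
```

With $L_{Q}=50$ and the assumed factor of 5, 20 of the 50 queries are kept, so the softmax is evaluated on a 20 x 50 score matrix rather than 50 x 50.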

Figure 4.

A detailed introduction to the encoder architecture. The dark gray square represents Screening Self-attention, which consists of the above four parts. The red layer represents the n-head attention mechanism, and the remaining three layers are other operations. The light gray square represents Self-attention Distilling, which contains two layers of different operations to distill the output of the dark gray square. The red square on the far right is a two-layer bidirectional RNN, and each layer consists of 2048 basicRNN units.

Frames and Keypoint-Distilling: We propose Frames and Keypoint-Distilling to reduce the effect of ineffective features on training. Since the feature map of the encoder contains redundant combinations of the values v, we use a distillation operation to filter and concentrate the features, so that a sequence of length L is distilled to L/2 and, likewise, N key points are distilled to N/2. We repeat the operation twice, so the sequence length is finally distilled to L/4 and the number of key point coordinates to N/4. After each distillation, the condensed feature map is passed on and Screening Self-attention continues, as shown in Fig. 3. Inspired by dilated convolution [34, 35], our "distilling" procedure forwards from the $j$-th layer to the $(j+1)$-th layer as:

$\displaystyle\textbf{X}_{j+1}^{t}=\text{MaxPool}\left(\text{ELU}\left(\text{Conv1d}\left(\left[\textbf{X}_{j}^{t}\right]_{\text{AB}}\right)\right)\right)$ (9)

where $\left[\textbf{X}_{j}^{t}\right]_{\text{AB}}$ represents the attention block, as shown in Figs 4 and 5. It is made up of n-head Screening Self-attention and other operations. Conv1d(·) performs a 1-D convolution (kernel width $=$ 3) along the time dimension, followed by the $\operatorname{ELU}(\cdot)$ activation function [36]. After each stacked layer we add a max-pooling layer with stride 2, which downsamples $\textbf{X}^{t}$ to half its length, so the space-time complexity of the next layer after distillation is $\mathcal{O}\left(\frac{L}{2}\ln\frac{L}{2}\right)$. Accordingly, the last of the $M$ stacked layers costs $\mathcal{O}\left(\frac{L}{2^{M-1}}\ln\frac{L}{2^{M-1}}\right)$. The total memory usage therefore goes down to:

$\displaystyle\mathcal{O}\left(\sum_{i=1}^{M}\frac{L}{2^{i-1}}\ln\frac{L}{2^{i-1}}\right)=\mathcal{O}\left(2\left(1-1/2^{M}\right)L\ln L+c\right).$ (10)

where $c$ is a small negative term and $M$ is the number of stacked layers.
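The distilling step of Eq. (9) and the sum in Eq. (10) can be illustrated with a short NumPy sketch. The helper names `distill` and `stacked_cost`, the zero padding, and the pooling window of 2 are assumptions made for the illustration; the text only fixes the kernel width (3) and the pooling stride (2).

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def distill(x, w, bias):
    """One distilling step: Conv1d (kernel 3) -> ELU -> max-pool (stride 2).

    x: (L, d_in) features, w: (3, d_in, d_out) filter, bias: (d_out,).
    """
    L = x.shape[0]
    xp = np.pad(x, ((1, 1), (0, 0)))                     # zero-pad the time axis
    conv = sum(xp[k:k + L] @ w[k] for k in range(3)) + bias
    act = elu(conv)
    half = L // 2
    # Assumption: pooling window of 2; a trailing odd frame is dropped.
    return act[:2 * half].reshape(half, 2, -1).max(axis=1)

def stacked_cost(L, M):
    """Sum on the left-hand side of Eq. (10), without constant factors."""
    return sum((L / 2**i) * np.log(L / 2**i) for i in range(M))

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 8))
w = 0.1 * rng.standard_normal((3, 8, 8))
y1 = distill(x, w, np.zeros(8))                          # shape (25, 8): L -> L/2
y2 = distill(y1, w, np.zeros(8))                         # shape (12, 8): about L/4

# Remainder c of Eq. (10) for L = 50, M = 3; here it equals -L ln 2.
c = stacked_cost(50, 3) - 2 * (1 - 1 / 2**3) * 50 * np.log(50)
```

In this example the remainder `c` is negative, in line with the description of $c$ under Eq. (10).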

Dim-reduction Fusion (DRF): We propose a bidirectional flow so that the network can better capture long-term dependencies in the action sequence. Specifically, the network is a multilayer bidirectional recurrent network built from BasicRNN cells; its input is Feature0, the output of the embedding layer and the self-attention layers.

The final state of each encoder branch is $F_{Ti}=\left\{\overrightarrow{F_{T}},\overleftarrow{F_{T}}\right\}$, where $\overrightarrow{F_{T}}$ is the forward hidden state and $\overleftarrow{F_{T}}$ is the backward hidden state. The final state of the encoder is the sum over the two branches, $F_{T}=F_{T1}+F_{T2}$.
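As a rough sketch of how the branch states are formed and fused, the code below runs a basic tanh RNN forwards and backwards over each branch's features, concatenates the two final states, and sums the results of the two branches. It uses a single layer and tiny sizes for brevity (the paper uses two layers of 1024 units), and the names `init_cell`, `final_state`, and `branch_state` are hypothetical.

```python
import numpy as np

def init_cell(d_in, n, rng):
    """Uniformly initialised parameters (U, W, b) of one basic RNN cell."""
    return (rng.uniform(-0.1, 0.1, (d_in, n)),
            rng.uniform(-0.1, 0.1, (n, n)),
            np.zeros(n))

def final_state(x, U, W, b):
    """Run a basic tanh RNN over x (L, d_in) and return the last hidden state."""
    h = np.zeros(W.shape[0])
    for x_t in x:
        h = np.tanh(x_t @ U + h @ W + b)
    return h

def branch_state(x, fwd, bwd):
    """F_Ti: final forward state concatenated with final backward state."""
    return np.concatenate([final_state(x, *fwd), final_state(x[::-1], *bwd)])

rng = np.random.default_rng(0)
n = 4                                              # toy number of units per direction
feat_time = rng.standard_normal((12, 16))          # distilled time-branch features
feat_space = rng.standard_normal((6, 16))          # distilled space-branch features
F_T1 = branch_state(feat_time, init_cell(16, n, rng), init_cell(16, n, rng))
F_T2 = branch_state(feat_space, init_cell(16, n, rng), init_cell(16, n, rng))
F_T = F_T1 + F_T2                                  # fused encoder state, shape (2 * n,)
```

Concatenating the two directions doubles the state width, which matches the 2048-unit decoder paired with 1024-unit bidirectional layers in Table 1.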

Fixed States decoder: We use a fixed-state decoder to improve the overall feature extraction capability of the encoder. The decoder has an internal input and an external input. The external input is conditional: at each time step it is the decoder's output from the previous time step. The internal input, which is usually the hidden state of the previous step, is replaced by the encoder's final state. Our decoder consists of RNN cells, that is,

$\displaystyle\begin{aligned}h_{t}&=f\left(Ux_{t}+Wh_{t-1}+b\right),\quad h_{t-1}\rightarrow F_{T},\\ y_{t}&=\text{softmax}\left(Vh_{t}\right),\\ x_{t+1}&=y_{t},\end{aligned}$ (11)

where $x_{t}$ is the input, $h_{t}$ the hidden state, and $y_{t}$ the output at time step $t$, and every $h_{t-1}$ term is replaced by $F_{T}$. The final state of the encoder $F_{Ti}$ is fed into the decoder as its initial state $D_{0}$, i.e., $D_{0}=F_{Ti}$. The decoder then generates the new sequence $\hat{X}=\left\{\hat{x}_{1},\hat{x}_{2},\ldots,\hat{x}_{T}\right\}$, which is compared with the input sequence $X$ itself using the mean square error (MSE) $L=\frac{1}{T}\sum_{t=1}^{T}\left(x_{t}-\hat{x}_{t}\right)^{2}$.
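A minimal sketch of the recurrence in Eq. (11) is given below. It assumes $f=\tanh$, a zero vector as the first external input, and small toy dimensions; none of these choices are stated in the text, and the function names are illustrative only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fixed_state_decode(F_T, x1, U, W, V, b, T):
    """Unroll Eq. (11): the recurrent input is always F_T, never h_{t-1}."""
    x_t, outputs = x1, []
    for _ in range(T):
        h_t = np.tanh(U @ x_t + W @ F_T + b)     # h_{t-1} is replaced by F_T
        y_t = softmax(V @ h_t)
        outputs.append(y_t)
        x_t = y_t                                # x_{t+1} = y_t
    return np.stack(outputs)

def mse(X, X_hat):
    """L = (1/T) * sum_t ||x_t - x_hat_t||^2."""
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

rng = np.random.default_rng(0)
C, n, T = 6, 8, 5                                # toy sizes: channels, state width, steps
U = rng.uniform(-0.1, 0.1, (n, C))
W = rng.uniform(-0.1, 0.1, (n, n))
V = rng.uniform(-0.1, 0.1, (C, n))
F_T = rng.standard_normal(n)                     # stands in for the encoder final state
X = rng.uniform(-1, 1, (T, C))                   # sequence to be reconstructed
X_hat = fixed_state_decode(F_T, np.zeros(C), U, W, V, np.zeros(n), T)
loss = mse(X, X_hat)
```

Because the decoder carries no hidden state of its own from step to step, all sequence information has to come from `F_T`, which is the stated motivation for weakening the decoder.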

Table 1

The MD-STA network components in detail

MD-STA
Embedding	1 $\times$ 3 Conv1d ( $d=$ 512)
Screening Self-attention	Multi-head Screening Attention ( $h=$ 8, $d=$ 64)
	Add, LayerNorm, Dropout ( $p=$ 0.1)
	Pos-wise FFN ( $d_{inner}=$ 2048), GELU
	Add, LayerNorm, Dropout ( $p=$ 0.1)
Distilling	1 $\times$ 3 conv1d, ELU
	Max pooling (stride $=$ 2)
Recurrent neural network	bidirectional_dynamic_rnn ( $n=$ 1024, BasicRNNCell)
	bidirectional_dynamic_rnn ( $n=$ 1024, BasicRNNCell)
Fixed States decoder	tf.nn.dynamic_rnn ( $n=$ 2048, BasicRNNCell)

Figure 5.

A detailed introduction to the Multiple keypoint-Distilling-attention architecture. The red squares represent Screening Self-attention, and the green squares represent keypoint Distilling.

4. Experimental results and datasets

4.1 Implementation details

Following [37, 38, 39, 40, 41, 42, 43], our data preprocessing is view-invariant. Action sequences are captured from different views by depth cameras, e.g., Microsoft Kinect, and 3D human joint positions are extracted from a single depth image by a real-time human skeleton tracking framework [44]. We align the action sequences with a view-invariant transformation, which maps keypoint coordinates from the original coordinate system into a view-invariant coordinate system, $X^{V}\rightarrow X$. The transformed skeleton joint coordinates are given by

$\displaystyle x_{t}^{h}=R^{-1}\left(x_{t}^{h}-d_{R}\right),\forall h\in H,\forall t\in T,$

where $x_{t}^{h}\in R^{3\times 1}$ are the coordinates of the $h$-th joint of the $t$-th frame, $R$ is the rotation matrix, and $d_{R}$ is the origin of rotation. These formulas follow [12]. All body keypoint sequences are down-sampled to at most 50 frames, and the coordinates are normalized to the range $[-1,1]$. This retains the original information while reducing the amount of computation. Restricting the pre-processed data to a fixed range also removes the adverse effects of outlier samples.

Using a hyper-parameter search, we set the following architecture.

Encoder: We first preserve the local context with a fixed position embedding. 1-D convolutional filters (kernel width $=$ 3, stride $=$ 1) are then applied to the original input data, mapping it from $C_{in}(=75)$ to $d_{\text{model}}(=512)$. The final embedding $\mathcal{X}_{\text{feed}[i]}^{t}$ is obtained by adding the two parts and is used as input to the Attention Block (orange block). This representation then passes through multiple Attention Blocks, each with multiple Screening Self-attention heads. The output of each block is followed by a 1-D convolutional filter (kernel width $=$ 3) along the time dimension with the ELU activation function and a max-pooling layer with stride 2, which reduces the representation of the first block from $L$ to $L/2$. After repeated operations, the final shrunken representation is Feature0. Feature0 enters the recurrent network (2-layer Bi-BasicRNN with $N=$ 1024 units in each layer) and is turned into the encoder final state $F_{Ti}=\left\{\overrightarrow{F_{T}},\overleftarrow{F_{T}}\right\}$. The operation in the space flow is similar to that in the time flow. The sum of the $F_{Ti}$ of the two paths, $F_{T}$, is the final feature.

Decoder: 1-layer Uni-BasicRNN with $N=$ 2048 units, so that it is compatible with the dimensions of the encoder final state $F_{T}$. In all BasicRNNs, parameters are initialized from a random uniform distribution.
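The preprocessing steps described in this subsection (view-invariant transform, down-sampling to at most 50 frames, normalization to $[-1,1]$) can be sketched as below. How $R$ and $d_{R}$ are obtained is not covered here, so they are passed in as inputs. Uniform temporal sub-sampling and scaling by the largest absolute coordinate are assumptions, since the text does not name the exact schemes.

```python
import numpy as np

def preprocess(seq, R, d_R, max_frames=50):
    """seq: (T, H, 3) joint coordinates; R: (3, 3) rotation; d_R: (3,) origin."""
    # x <- R^{-1} (x - d_R) for every joint h and frame t
    aligned = (seq - d_R) @ np.linalg.inv(R).T
    T = aligned.shape[0]
    if T > max_frames:
        # Assumption: uniform temporal sub-sampling.
        idx = np.linspace(0, T - 1, max_frames).round().astype(int)
        aligned = aligned[idx]
    # Assumption: divide by the largest absolute coordinate to land in [-1, 1].
    scale = np.abs(aligned).max()
    return aligned / scale if scale > 0 else aligned

theta = np.pi / 6                                   # toy rotation about the z axis
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
rng = np.random.default_rng(0)
seq = rng.standard_normal((120, 25, 3))             # 120 frames, 25 joints
out = preprocess(seq, R, d_R=seq[0, 0])
flat = out.reshape(out.shape[0], -1)                # (50, 75): C_in = 25 joints x 3
```

Flattening 25 joints with 3 coordinates each gives the 75 input channels that the embedding layer maps to 512.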

4.2 Datasets

We use three different datasets for training, evaluating, and comparing our system with related approaches.

NTU RGB+D 60 [45]: The dataset contains 60 classes with a total of 56,880 samples, of which 40 are daily movements, 9 are health-related movements, and 11 are mutual (two-person) movements. They were performed by 40 people aged from 10 to 35. The dataset was captured by Microsoft's Kinect v2 sensor from three different camera angles and provides depth information, 3D skeleton information, RGB frames, and infrared sequences. Two benchmarks are used to divide the data into training and test sets. Cross-Subject divides by person ID: the training set contains 40,320 samples and the test set 16,560 samples. Cross-View divides by camera: the samples collected by camera 1 form the test set (18,960 samples), and those collected by cameras 2 and 3 form the training set (37,920 samples). We test our method on both the cross-view and cross-subject protocols.

NTU RGB+D 120 [46]: This dataset extends NTU RGB+D 60 to a total of 113,945 samples over 120 classes, performed by 106 volunteers and captured with 32 different camera setups. Like NTU RGB+D 60, its files are named in the SsssCcccPpppRrrrAaaa format (for example, S001C001P001R001A003.skeleton), where the three digits after S give the setup number (the camera setting), those after C the camera ID, those after P the performer (person) ID, those after R the repetition number (1 or 2), and those after A the action label. The data contain a large number of samples with missing skeletons, which we removed for training and testing. The original paper on this dataset recommends two benchmarks: (1) the cross-subject (X-Sub) benchmark and (2) the cross-setup (X-Setup) benchmark, on which we test our method.
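As an illustration of the naming scheme, the following sketch parses a file name into its five fields and applies the cross-view rule stated for NTU RGB+D 60 (camera 1 for testing, cameras 2 and 3 for training). The function names and the second and third example file names are made up for the illustration.

```python
import re

# SsssCcccPpppRrrrAaaa: setup, camera, person, repetition, action
NAME = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")

def parse_name(filename):
    """Split an NTU RGB+D file name into its five integer fields."""
    m = NAME.match(filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename}")
    setup, camera, person, repeat, action = map(int, m.groups())
    return {"setup": setup, "camera": camera, "person": person,
            "repeat": repeat, "action": action}

def cross_view_split(filenames):
    """Cross-view rule: camera 1 -> test set, cameras 2 and 3 -> training set."""
    train, test = [], []
    for name in filenames:
        (test if parse_name(name)["camera"] == 1 else train).append(name)
    return train, test

info = parse_name("S001C001P001R001A003.skeleton")
# info["action"] is 3 and info["camera"] is 1

files = ["S001C001P001R001A003.skeleton",
         "S001C002P001R001A003.skeleton",   # hypothetical example name
         "S001C003P002R002A010.skeleton"]   # hypothetical example name
train, test = cross_view_split(files)
```

A cross-subject or cross-setup split would follow the same pattern with the `person` or `setup` field; the specific ID lists are defined by the dataset papers and are not reproduced here.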

Multiview Activity II (UWA3D) [47]: This dataset contains 30 human actions, each performed 4 times by 10 subjects. 15 joints are recorded, and each action is observed from four views: frontal, left side, right side, and top. The dataset is challenging because of the many views and the self-occlusions that result from considering only part of them. In addition, many actions are highly similar; e.g., "drinking" and "phone answering" share many nearly identical key points and differ only subtly in the moving ones. We tested our approach on one of these views.

4.3 Evaluation

We use a k-nearest-neighbors (KNN) classifier to evaluate our method on the action recognition task. Specifically, we apply the KNN classifier (with $k=$ 1) to the features that the trained network produces for all sequences in the training set, each labeled with its class. Each test sequence is then assigned the class of its nearest training feature, with cosine similarity as the distance metric. Notably, the KNN classifier does not require learning extra weights for this assignment.
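The evaluation protocol can be sketched in a few lines of NumPy. This is an illustrative 1-nearest-neighbor classifier under cosine similarity on toy 2-D features; the function name `knn1_cosine` and the data are made up for the example.

```python
import numpy as np

def knn1_cosine(train_feats, train_labels, test_feats):
    """Assign each test feature the label of its most cosine-similar training feature."""
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sim = b @ a.T                                  # (n_test, n_train) cosine similarities
    return train_labels[sim.argmax(axis=1)]

train_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
train_labels = np.array([0, 1])
test_feats = np.array([[0.9, 0.1], [0.2, 0.8]])
test_labels = np.array([0, 1])

pred = knn1_cosine(train_feats, train_labels, test_feats)   # array([0, 1])
accuracy = (pred == test_labels).mean()                     # 1.0
```

Since no parameters are fitted, the resulting accuracy depends only on how well the encoder features separate the classes.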

4.4 Accuracy comparison

We compare our method with previous skeleton-based unsupervised action recognition methods on the NTU RGB+D 60, NTU RGB+D 120, and UWA3D datasets. The results are shown in Tables 2, 3, and 4, respectively. On all four benchmarks of the NTU RGB+D 60 and NTU RGB+D 120 datasets, our method is far superior to other methods and reaches the state of the art. In particular, on the X-View benchmark our method achieves an accuracy of 88.4%. On the UWA3D dataset, our method differs from the current state-of-the-art method by only 0.012. We think the main reason is that the UWA3D dataset is too small, which significantly limits what our model can extract during training. The small number of samples also leads to large errors, and the UWA3D skeleton has only 15 key points, which strongly affects our extraction of spatial features. Even so, our method matches or surpasses the other methods and is second only to RGCA [49].

Table 2
Comparison with unsupervised learning methods on NTU RGB+D 60 dataset

Methods	X-Sub(%)	X-View(%)
LongT GAN [4]	39.1	48.1
$\textit{MS}^{2}L$ [15]	52.6	/
P&C Rand [12]	39.6	56.4
P&C [12]	50.7	76.3
CSS [16]	52.3	62.1
CAE+ [9]	58.5	64.8
Hierarchical Transformer [14]	69.3	72.8
RGCA [49]	54.4	79.2
GLTA-GCN [48]	61.2	81.2
Ours	82.7	88.4

Table 3

Comparison with unsupervised learning methods on NTU RGB+D 120 dataset

Methods	X-Sub(%)	X-Setup(%)
P&C [12]	42.7	41.7
CSS [16]	44.3	43.5
CAE+ [9]	48.6	49.2
GLTA-GCN [48]	49.1	51.1
Ours	60.8	65.9

Table 4

Comparison with unsupervised learning methods on UWA3D dataset

Methods	V3(%)	V4(%)
CAE+ [9]	25.1	22.8
LongT GAN [4]	53.4	59.9
P&C [12]	59.9	63.1
RGCA [49]	60.7	65.5
Ours	59.9	/

4.5 Dual stream visualization results

t-SNE visualization results: We used all the categories of the three datasets for t-SNE visualization. The embedding dimension was set to 2, PCA initialization was used, and the random seed was 42. The results are shown in Fig. 6. The classes of the NTU60 dataset form clearly separated clusters, which is consistent with its high accuracy. The other datasets are less cleanly separated, in line with their lower accuracy, but they are still largely clustered successfully.

Figure 6.

$t$ -SNE visualization of learned features on the three datasets (from left to right): NTU RGB+D 60 (60 actions); NTU RGB+D 120 (120 actions); UWA3D (30 actions).

Confusion matrix visualization results: We generated the confusion matrices of the three datasets, and the results are shown in Fig. 7. The results on the NTU60 dataset are the best. For the NTU120 dataset, the various single-person actions are largely recognized, but the recognition of two-person interactions is weaker. The reason is that we extract features for two-person interactions in the same way as for single-person actions and do not specially model the relationships among the 50 key points of the two people; this is something we need to improve in future work. In the UWA3D dataset, similar actions are poorly distinguished, such as drinking and making a phone call, or walking and walking irregularly. As mentioned in the dataset introduction, the dataset is challenging, and there is a high similarity between actions.
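For reference, a confusion matrix of the kind plotted in Fig. 7 can be computed from true and predicted labels as in the sketch below. The labels are toy values, not results from the paper, and the helper name is illustrative.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts samples of true class i that were predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

y_true = np.array([0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2])
cm = confusion_matrix(y_true, y_pred, n_classes=3)
# cm is [[1, 1, 0],
#        [0, 2, 0],
#        [0, 0, 1]]
per_class_recall = cm.diagonal() / cm.sum(axis=1)   # [0.5, 1.0, 1.0]

# Most frequent confusion: largest off-diagonal entry, as (true, predicted).
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
worst = np.unravel_index(off_diag.argmax(), off_diag.shape)
```

Large off-diagonal entries identify pairs of actions that are confused with each other, such as the similar UWA3D actions discussed above.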

Figure 7.

Confusion matrices for testing MD-STA performance on the three datasets (from left to right): NTU RGB+D 60 (60 actions); NTU RGB+D 120 (120 actions); UWA3D (30 actions).

4.6 Time branch visualization results

t-SNE visualization results: We used all categories of the three datasets for t-SNE visualization on the time branch. The embedding dimension, initialization method, and random seed were the same as for the dual-stream model, and the results are shown in Fig. 8. On the NTU60 dataset, the single time branch is clearly inferior to the dual-stream model.

Figure 8.

t-SNE visualization of learned features on the three datasets (from left to right): NTU RGB+D 60 (60 actions); NTU RGB+D 120 (120 actions); UWA3D (30 actions).

Confusion matrix visualization results: We generated the confusion matrix diagram of the three data sets in the time branch, and the results are shown in Fig. 9.

Table 5

Model complexity

Methods	Training time	Training memory
Transformer	$\mathcal{O}\left(L^{2}\right)$	$\mathcal{O}\left(L^{2}\right)$
LogTrans	$\mathcal{O}(L\log L)$	$\mathcal{O}\left(L^{2}\right)$
MD-TA	$\mathcal{O}(L\log L)$	$\mathcal{O}(L\log L)$

Figure 9.

Confusion matrices for testing MD-STA performance on the three datasets (from left to right): NTU RGB+D 60 (60 actions); NTU RGB+D 120 (120 actions); UWA3D (30 actions).

Analysis of time and space complexity visualization results: To verify that the Self-attention Distilling (SAD) module reduces the amount of computation and the time and space complexity, we plotted the time spent on each round of training for the three datasets. The experiments were run on an Intel(R) Core(TM) i7-10700 CPU @ 2.90 GHz (16 CPUs) with an NVIDIA GeForce RTX 2060 GPU. The results are shown in Fig. 10. Compared with other methods without the SAD module, our method leads by a large margin in computational speed. Our model also has lower time and space complexity than other models, as shown in Table 5. These results indicate that the SAD module reduces the computational load of the model and its time and space complexity.

Figure 10.

Chart of training time per round on the three datasets (from left to right): NTU RGB+D 60 (60 actions); NTU RGB+D 120 (120 actions); UWA3D (30 actions).

4.7 Ablation studies

To verify the effectiveness of each component in our proposed framework, we conduct ablation studies on NTU-60. All the experiments are conducted in the context of the skeleton-based action recognition downstream task.

Effectiveness of spatio-temporal fusion features: We conducted experiments on different branches, and the experimental results are shown in Table 6 and Fig. 11. The time branch outperforms the space branch, and the dual-branch model outperforms either single branch. The results show that the two-flow network structure of our model is necessary.

Table 6
Comparisons on NTU RGB+D 60 dataset on different model flows

Methods	X-Sub(%)	X-View(%)
MD-SA	72.4	80.5
MD-TA	76.7	86.0
MD-STA	82.7	88.4

Table 7

Comparisons on NTU RGB+D 60 dataset on different model structures

	Screening attention	Distilling	Dim-reduction fusion	X-Sub(%)	X-View(%)
EPG	$\checkmark$	–	–	48.4	52.1
EPDG	$\checkmark$	$\checkmark$	–	72.0	78.6
MD-TA	$\checkmark$	$\checkmark$	$\checkmark$	76.7	86.0

Figure 11.

Confusion matrices for testing MD-STA performance on the different branches (from left to right): MD-SA; MD-TA; MD-STA.

The effectiveness of each module: To avoid the influence of spatiotemporal fusion on the individual modules, we conducted experiments on different model structures in the single time branch, and the experimental results are shown in Table 7 and Fig. 12. EPDG outperforms EPG, which shows that our Distilling module is necessary, and MD-TA outperforms all other structures, which shows that the Dim-reduction Fusion (DRF) module is also required. The results show that each module of our model is necessary for the action recognition task.

The effectiveness of different attentions: Table 8 shows the performance of the model when different attention mechanisms are used in the time branch. The model using ScreenAttention is more accurate than the model using FullAttention, although the gap is small (1.1 points on X-Sub and 1.9 points on X-View). The gap is small because FullAttention computes the relationship between all sequence frames and so compensates in accuracy through a much larger amount of computation. The results show that modeling with ScreenAttention is effective.

Table 8

Comparisons on NTU RGB+D 60 dataset with different attentions

Method	X-Sub(%)	X-View(%)
FullAttention	75.6	84.1
ScreenAttention	76.7	86.0

The effectiveness of different layers: To evaluate the effect of the number of RNN layers, we compare models with different numbers of bidirectional RNN layers. Table 9 shows that adding a third bidirectional RNN layer lowers X-View accuracy (85.7% versus 86.0%), although the reported X-Sub accuracy is higher (79.2% versus 76.7%). Our model achieves its best X-View result when the number of bidirectional layers is set to 2. We infer that the loss comes from the gap between the pre-training task and the final action recognition task, and that too many bidirectional RNN layers lead to vanishing gradients, which reduces the performance of the action recognition task. A single-layer bidirectional RNN performs much worse, since one layer of RNN units does not provide enough feature extraction ability. The results show that modeling with a two-layer bidirectional RNN is effective.

Table 9

Comparisons on NTU RGB+D 60 dataset with different layers in Dim-reduction Fusion(DRF) module

Layers	X-Sub(%)	X-View(%)
1layer bi-basicRNN	59.2	65.7
2layer bi-basicRNN	76.7	86.0
3layer bi-basicRNN	79.2	85.7

Table 10

Comparisons on NTU RGB+D 60 dataset with different distillation amplitude and distillation times

Sequence length	Primary distillation	Double distillation
L/9	55.0	55.2
L/4	70.1	82.7
L/3	58.5	/
L/2	47.3	/

Figure 12.

t-SNE visualization of learned features on NTU RGB+D 60 for three structures (from left to right): EPG; EPDG; MD-STA.

The effectiveness of distilling intensity: To further explore how distilling intensity affects the final performance, we conduct ablation experiments with different distillation amplitudes and numbers of distillation passes. The recognition results are shown in Table 10. The best performance (82.7%) is achieved with double distillation to a sequence length of L/4. We attribute this to the fact that stronger distillation loses effective features, while weaker distillation retains redundant, invalid features; in neither case is the skeleton action sequence characterized as well.

5. Conclusion

In this paper, we propose a new skeleton-based unsupervised model for action recognition. Compared with previous approaches, our system achieves better performance because the training strategy combines a powerful encoder that learns more efficient spatiotemporal fusion features, a distilled (shortened) sequence, and a weak decoder that strengthens the training of the encoder, so that the network learns more separable representations. Experimental results show that our unsupervised model effectively learns different action features on the three benchmark datasets and outperforms previous unsupervised methods on the NTU RGB+D benchmarks.

Our method does not yet extract hierarchical temporal features of skeleton sequences. A sequence of length L could be divided into several segments in order to extract the connections between segments, or the spatial key points could be grouped into body parts in order to extract geometric spatial features informed by human skeletal anatomy. Both kinds of features are becoming important for 3D skeleton-based action recognition. In future work, hierarchical temporal features will be added to the Embedding layer, and a module that extracts geometric spatial features based on the human skeleton will be added to the Screening Self-attention layer, so that the model can achieve better results.

Declarations

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under grant 61771322 and the Shenzhen Science and Technology Program under Grant JCYJ20220531100814033.

References

Gao

Zhang

, I know the relationships: Zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 33, 2019, pp. 8303–8311.

Yang

Chen

Dong

, W2vv++ fully deep learning for ad-hoc video search, in: Proceedings of the 27th ACM international conference on multimedia, 2019, pp. 1786–1794.

Dong

Yang

Wang

, Dual encoding for video retrieval by text, IEEE Transactions on Pattern Analysis and Machine Intelligence 44(8) (2021), 4065–4080.

Zheng

Wen

Liu

Long

Dai

Gong

, Unsupervised representation learning with long-term dynamics for skeleton based action recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, 2018.

Nie

Liu

, Unsupervised 3d human pose representation with viewpoint and pose disentanglement, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIX 16, Springer, 2020, pp. 102–118.

Ahn

Kim

Hong

B.C.

, STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 3330–3339.

Yang

Liu

M.H.

Kot

A.C.

, Skeleton cloud colorization for unsupervised 3d action representation learning, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13423–13433.

Kim

Chang

H.J.

Kim

Choi

J.Y.

, Global-local motion transformer for unsupervised skeleton-based action learning, in: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV, Springer, 2022, pp. 209–225.

Rao

Cheng

, Augmented skeleton based contrastive action learning with momentum lstm for unsupervised action recognition, Information Sciences 569 (2021), 90–109.

10.

Thoker

F.M.

Doughty

Snoek

C.G.

, Skeleton-contrastive 3D action representation learning, in: Proceedings of the 29th ACM international conference on multimedia, 2021, pp. 1655–1663.

11.

Guo

Liu

Chen

Liu

Wang

Ding

, Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp. 762–770.

12.

Liu

Shlizerman

, Predict & cluster: Unsupervised skeleton based action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9631–9640.

13.

Lin

, Self-supervised 3d skeleton action representation learning with motion consistency and continuity, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 13328–13338.

14.

Chen

Zhao

Yuan

Tian

Xia

Geng

Han

Metaxas

D.N.

, Hierarchically Self-supervised Transformer for Human Skeleton Representation Learning, in: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI, Springer, 2022, pp. 185–202.

15.

Lin

Song

Yang

Liu

, Ms2l: Multi-task self-supervised learning for skeleton based action recognition, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2490–2498.

16.

Gao

Yang

, Contrastive self-supervised learning for skeleton action recognition, in: NeurIPS 2020 Workshop on Pre-registration in Machine Learning, PMLR, 2021, pp. 51–61.

17.

Cheriet

Dentamaro

Hamdan

Impedovo

Pirlo

, Multi-Speed Transformer Network for Neurodegenerative disease assessment and activity recognition, Computer Methods and Programs in Biomedicine (2023), 107344.

18.

Ding

Zhang

, Skeleton-based human motion prediction via spatio and position encoding transformer network, in: International Conference on Artificial Intelligence, Virtual Reality, and Visualization (AIVRV 2022), Vol. 12588, SPIE, 2023, pp. 186–191.

19.

Gedamu

Gao

Yang

Shen

H.T.

, Relation-mining self-attention network for skeleton-based human action recognition, Pattern Recognition 139 (2023), 109455.

20.

Bahdanau

Cho

Bengio

, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473, 2014.

21.

Luong

M.-T.

Pham

Manning

C.D.

, Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025, 2015.

22.

Vaswani

Shazeer

Parmar

Uszkoreit

Jones

Gomez

A.N.

Kaiser

Ł.

Polosukhin

, Attention is all you need, Advances in neural information processing systems 30 2017.

23.

Shou

Zareian

Mansour

Vetro

Chang

S.-F.

, CDSA: cross-dimensional self-attention for multivariate, geo-tagged time series imputation, arXiv preprint arXiv:1905.09904, 2019.

24.

Child

Gray

Radford

Sutskever

, Generating long sequences with sparse transformers, arXiv preprint arXiv:1904.10509, 2019.

25.

Jing

Wang

Tan

, Skeleton-based action recognition with hierarchical spatial reasoning and temporal stack learning network, Pattern Recognition 107 (2020), 107511.

26.

Liu

Chen

, Enhanced skeleton visualization for view invariant human action recognition, Pattern Recognition 68 (2017), 346–362.

27.

Zhang

Lan

Xing

Zeng

Xue

Zheng

, View adaptive neural networks for high performance skeleton-based human action recognition, IEEE transactions on pattern analysis and machine intelligence 41(8) (2019), 1963–1978.

28.

Liu

Wang

Duan

L.-Y.

Abdiyeva

Kot

A.C.

, Skeleton-based human action recognition with global context-aware attention LSTM networks, IEEE Transactions on Image Processing 27(4) (2017), 1586–1599.

29.

Dedeoğlu

Töreyin

B.U.

Güdükbay

Çetin

A.E.

, Silhouette-based method for object classification and human action recognition in video, in: Computer Vision in Human-Computer Interaction: ECCV 2006 Workshop on HCI, Graz, Austria, May 13, 2006. Proceedings 9, Springer, 2006, pp. 64–77.

30.

Cheng

Zhang

Chen

Cheng

, Skeleton-based action recognition with shift graph convolutional network, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 183–192.

31.

Peng

Hong

Chen

Zhao

, Learning graph convolutional network for skeleton-based human action recognition by neural searching, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 34, 2020, pp. 2669–2676.

32.

Liu

Zhang

Chen

Wang

Ouyang

, Disentangling and unifying graph convolutions for skeleton-based action recognition, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 143–152.

33.

Tsai

Y.-H.H.

Bai

Yamada

Morency

L.-P.

Salakhutdinov

, Transformer Dissection: A Unified Understanding of Transformer’s Attention via the Lens of Kernel, arXiv preprint arXiv:1908.11775, 2019.

34.

Koltun

Funkhouser

, Dilated residual networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 472–480.

35.

Gupta

Rush

A.M.

, Dilated convolutions for modeling long-distance genomic dependencies, arXiv preprint arXiv:1710.01278, 2017.

36.

Clevert

D.-A.

Unterthiner

Hochreiter

, Fast and accurate deep network learning by exponential linear units (elus), arXiv preprint arXiv:1511.07289, 2015.

37. L. Xia, C.-C. Chen and J.K. Aggarwal, View invariant human action recognition using histograms of 3D joints, in: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, IEEE, 2012, pp. 20–27.

38. R. Vemulapalli, F. Arrate and R. Chellappa, Human action recognition by representing 3D skeletons as points in a Lie group, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 588–595.

39. Y. Du, W. Wang and L. Wang, Hierarchical recurrent neural network for skeleton based action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1110–1118.

40. W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen and X. Xie, Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30, 2016.

41. M. Jiang, J. Kong, G. Bebis and H. Huo, Informative joints based human action recognition using skeleton contexts, Signal Processing: Image Communication 33 (2015), 29–40.

42. J. Liu, A. Shahroudy, D. Xu and G. Wang, Spatio-temporal LSTM with trust gates for 3D human action recognition, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part III 14, Springer, 2016, pp. 816–833.

43. S. Song, C. Lan, J. Xing, W. Zeng and J. Liu, An end-to-end spatio-temporal attention model for human action recognition from skeleton data, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31, 2017.

44. J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman and A. Blake, Real-time human pose recognition in parts from single depth images, in: CVPR 2011, IEEE, 2011, pp. 1297–1304.

45. A. Shahroudy, J. Liu, T.-T. Ng and G. Wang, NTU RGB+D: A large scale dataset for 3D human activity analysis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1010–1019.

46. J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan and A.C. Kot, NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding, IEEE Transactions on Pattern Analysis and Machine Intelligence 42(10) (2019), 2684–2701.

47. H. Rahmani, A. Mahmood, D.Q. Huynh and A. Mian, HOPC: Histogram of oriented principal components of 3D pointclouds for action recognition, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part II 13, Springer, 2014, pp. 742–757.

48. H. Qiu, Y. Wu, M. Duan and C. Jin, GLTA-GCN: Global-Local Temporal Attention Graph Convolutional Network for Unsupervised Skeleton-Based Action Recognition, in: 2022 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2022, pp. 1–6.

49. H. Yao, S.-J. Zhao, C. Xie, K. Ye and S. Liang, Recurrent graph convolutional autoencoder for unsupervised skeleton-based action recognition, in: 2021 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2021, pp. 1–6.
