Abstract
With the advent of cost-efficient depth cameras, many effective feature descriptors have been proposed for action recognition from depth sequences. However, most of them are based on a single feature and thus cannot capture the action information comprehensively; for example, some feature descriptors can represent the area where the motion occurs but cannot describe the order in which the action is performed. In this paper, a new feature representation scheme combining different feature descriptors is proposed to capture various aspects of action cues simultaneously. First, a depth sequence is divided into a series of sub-sequences using a motion energy based spatial-temporal pyramid. For each sub-sequence, on the one hand, the depth motion maps (DMMs) based completed local binary pattern (CLBP) descriptors are calculated through a patch-based strategy. On the other hand, each sub-sequence is partitioned into spatial grids and the polynormal descriptors are obtained for each of the grid sequences. Then, the sparse representation vectors of the DMMs based CLBP and the polynormals are calculated separately. After pooling, the final representation vector of the sample is generated as the input of the classifier. Finally, two different fusion strategies are applied to combine the two descriptors. Extensive experiments on two benchmark datasets show that the proposed method performs better than each single-feature recognition method.
Introduction
After decades of exploration, research on video-based human action recognition has achieved considerable results, which have been widely applied in intelligent surveillance [1], auxiliary health care [2], human-computer interaction [3] and robotics [4]. Earlier research focused on recognizing human actions from videos captured by RGB cameras. For instance, based on RGB videos, Bobick et al. [5] used the motion energy image (MEI) and the motion history image (MHI) to represent where the motion occurred. Laptev et al. [6] extended the concept of spatial interest points to the spatial-temporal domain, thereby generating a compact representation of action videos in which occlusions and cluttered backgrounds are also present. Sun et al. [7] proposed to model the spatial-temporal context information in a three-level hierarchical way. Di et al. [8] combined both local and global silhouette-based representations to enhance discrimination. However, due to the limitations of RGB videos or images, these works suffered from some common problems, such as variation of lighting conditions and cluttered backgrounds.
With the development of depth cameras, recent research has focused on using depth sequences for action recognition. Compared with RGB videos, depth sequences have two advantages. First, depth data are more tolerant to changes in lighting conditions. Second, since the depth map records the depth information of the scene, the contrast between the foreground and the background is more distinct, which makes it easier to detect and extract the human body. For example, Chen et al. [9] combined the depth motion maps (DMMs) with the local binary pattern (LBP) and tried two fusion strategies, achieving state-of-the-art results on their own dataset. Liu et al. [10] presented a multiple scale energy-based global ternary image (E-GTI) representation to differentiate similar actions and distinguish actions with different speeds. Xu et al. [11] proposed a multilevel frame select sampling (MFSS) method to produce three levels of temporal samples from the depth sequences, and used the proposed motion and static mapping (MSM) method to obtain the representation of the MFSS sequences. To accumulate the motion information concisely, Shekar et al. [12] extracted the Undecimated Dual Tree Complex Wavelet Transform features from the DMMs.
Most of the methods described above are based on a single feature or descriptor for action recognition. Although many features or descriptors can capture motion and shape cues at the same time, they are unable to fully extract the action information in the depth sequence. Fusing different features or descriptors is an effective way to solve this problem. In this paper, a recognition scheme that fuses the DMMs based CLBP descriptor and the polynormal descriptor is proposed. The work mainly includes the following. First, the DMMs based CLBP is combined with the patch-based strategy so as to extract more details of the texture information on the DMMs. Second, the DMMs based CLBP descriptor and the polynormal descriptor are fused to represent the characteristics of the action more comprehensively. Both the DMMs based CLBP and the polynormal are capable of simultaneously describing the motion and shape characteristics of the action, but the CLBP focuses on detecting the area where the movement occurs, while the polynormal focuses on describing the process and direction of the movement. Therefore, by integrating the two descriptors, the characteristics of actions can be represented in a comprehensive way. Third, two different fusion strategies are applied to the recognition scheme and their influence on the recognition accuracy is investigated.
The remainder of this paper is organized as follows. In Section 2, recent research on action recognition from depth sequences is briefly reviewed. In Section 3, the two descriptors and other feature description strategies are presented. In Section 4, the process of sparse representation, pooling and fusion is described. Experiments and evaluation on two benchmark datasets are reported in Section 5. Finally, the conclusion of the work is given in Section 6.
Related work
Depth sequences are not only unaffected by changes in lighting conditions and cluttered backgrounds, but also contain 3D structure information of the action subjects, which has led many researchers to shift their attention from RGB videos to depth sequences. For example, Yang et al. [13] introduced the Depth Motion Maps to accumulate global activities through the entire depth sequence, and the Histograms of Oriented Gradients (HOG) were calculated from them as the representation of the action video. Chen, Zhang, et al. [14] extracted the features of 2D gradient autocorrelation from the DMMs and 3D gradient autocorrelation from the original depth sequence, and conducted feature fusion at the decision layer to make use of their complementary advantages. Bulbul et al. [15] extracted the Gradient Local Auto-Correlations (GLAC) features from the Motion History Images (MHIs) and the Static History Images (SHIs) respectively, which are generated through the 3D Motion Trail Model (3DMTM) beforehand. To make the representation more robust to speed differences between action performers, Chen et al. [16] created a multi-temporal DMMs representation by extracting the shape and motion cues from clips of the depth sequence at different lengths, and the temporal information between frames was recovered by introducing a weighting function to the depth maps. For the same purpose, Liu et al. [17] proposed an adaptive hierarchical strategy that divides the depth sequence into several levels of subsequences, where the number of frames per subsequence differs between levels and adjacent subsequences within each level overlap. Furthermore, the length of the subsequences and the extent of their overlap are both adjustable. Then, they calculated the DMMs on each subsequence and encoded their texture features with the Gabor filter for classification.
The approaches introduced above have not taken the distribution of the 4D points into account, thus potentially losing much discriminative information. The surface normal vectors in the 4D space can effectively describe the shape and motion information in the depth maps. Oreifej and Liu [18] extracted the 4D normal features in the depth sequence and obtained their distribution in 4D space, i.e. the histogram of oriented 4D surface normals (HON4D), by projecting them onto the vertices of a regular polychoron. Also using the 4D normal feature, Yang and Tian [19] concatenated the 4D normal vectors of all points in the neighbourhood of each point in the spatial-temporal sequence so as to compose a descriptor vector, i.e. the polynormal, which is then represented through sparse coding. To represent geometrical features extracted from the depth motion space, Slama et al. [20] presented an original method where sequence features are temporally modeled as subspaces lying on the Grassmann manifold. To capture shape and motion cues from noisy depth data, Liu et al. [21] designed a background modeling method suitable for depth sequences and introduced a two-layer Bag of Visual Words model to separately code the motion-based and the shape-based Spatial-temporal Interest Points.
Theoretical fundamentals and the proposed method
ME based temporal pyramid
The temporal pyramid was first proposed by Laptev et al. [22]. They adopted a rough chronological ordering that evenly subdivides the sequence according to frame indices to form the temporal pyramid. Since then, this method has been widely used in action recognition. For example, Wang et al. [23] used the temporal pyramid to combine cues from the temporal context. However, the speed of an action may vary between subjects, so evenly dividing the video according to the frame index is not flexible enough. To solve this problem, the adaptive temporal pyramid based on motion energy (ME) [19] is adopted.
Here the modified ME formula is used for faster calculation:
As can be seen from the calculation process of the motion energy, the ME of each frame reflects the amount of accumulated movement of human body when the action reaches that frame. Therefore, compared to the frame number, it is more accurate to partition the depth sequence based on the motion energy, making it more robust to differences in execution speed of the action.
The normalized motion energy vector is evenly divided into several parts and the corresponding frame indices of the split points are the places where the depth sequence is divided. Instead of the 3-level pyramid in [19], a 2-level temporal pyramid, i.e. {T0 - T1, T1 - T2, T2 - T3, T3 - T4, T4 - T5, T5 - T6}, {T0 - T6}, is employed in this work, as shown in Fig. 1. Such a partitioning scheme is adopted in order to partition the video action in a more detailed and matched manner, while avoiding the increase of computational overhead.

Fig. 1. The ME based temporal pyramid.
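The partitioning step described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the accumulated motion energy of every frame is taken as a given input (the ME formula itself is not reproduced here), and the function name `split_by_motion_energy` as well as the toy energy values are hypothetical.

```python
from bisect import bisect_left


def split_by_motion_energy(cumulative_me, num_segments):
    """Return frame indices T0..Tk that cut a sequence into segments
    of (approximately) equal accumulated motion energy.

    cumulative_me: non-decreasing list with one value per frame (assumed input).
    """
    total = cumulative_me[-1]
    boundaries = [0]
    for k in range(1, num_segments):
        # k-th split point of the evenly divided energy range
        # (equivalent to dividing the normalised energy vector evenly).
        target = total * k / num_segments
        # First frame whose accumulated energy reaches the split point.
        boundaries.append(bisect_left(cumulative_me, target))
    boundaries.append(len(cumulative_me) - 1)
    return boundaries


# Toy sequence: slow start, fast middle, slow end (hypothetical values).
energy = [0, 1, 2, 3, 10, 20, 30, 40, 50, 57, 58, 59, 60]
fine_level = split_by_motion_energy(energy, 6)    # T0..T6
coarse_level = split_by_motion_energy(energy, 1)  # the whole sequence
print(fine_level, coarse_level)
```

In this toy example the split points cluster in the fast middle part of the sequence, which is the behaviour that distinguishes an energy-based division from an even division by frame index.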
DMMs
In order to represent the shape and motion information of the action in the depth sequence, Yang et al. [13] proposed the depth motion maps, namely DMMs. Specifically, they first projected the depth sequence onto three orthogonal Cartesian planes, and thus three projection subsequences were generated, corresponding to the front view (f), the side view (s) and the top view (t) respectively. After that, the differences between the values of corresponding pixels in each pair of adjacent frames are accumulated for each of the three projection subsequences. To reduce the amount of calculation, Chen et al. [24] modified the process of generating the DMMs, where the DMMs are calculated according to the following formula:

Fig. 2. Generation of the DMMs.
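The accumulation can be illustrated with a short sketch. It assumes the commonly used form in which the absolute differences between consecutive projected maps are summed over the sequence, and it takes the projection onto a view as already done; the function name and the toy maps are hypothetical.

```python
def depth_motion_map(projected_maps):
    """Accumulate absolute frame-to-frame differences of one projected view.

    projected_maps: list of equally sized 2D lists, one per frame.
    """
    rows, cols = len(projected_maps[0]), len(projected_maps[0][0])
    dmm = [[0] * cols for _ in range(rows)]
    # Walk over every pair of adjacent frames.
    for prev, curr in zip(projected_maps, projected_maps[1:]):
        for y in range(rows):
            for x in range(cols):
                dmm[y][x] += abs(curr[y][x] - prev[y][x])
    return dmm


# Three tiny 2x3 "front view" maps (hypothetical values).
front_view = [
    [[0, 0, 5], [0, 0, 5]],
    [[0, 3, 5], [0, 0, 5]],
    [[0, 1, 5], [2, 0, 5]],
]
dmm_f = depth_motion_map(front_view)
print(dmm_f)
```

Pixels that never change (the constant column of 5s) contribute nothing, so the map highlights only the regions where motion occurred.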
Local Binary Pattern (LBP) devised by Ojala et al. [25] is an effective descriptor of images’ texture feature. Given a center pixel t_c, its neighbouring pixels are equally spaced on a circle with a radius r, with the center t_c. Let the coordinates of t_c be (0,0) and m neighbours
As can be seen, the LBP only exploits the sign information of the difference values but loses the magnitude information. However, like the sign, the magnitude contains some other important texture information. Accordingly, Guo et al. [26] introduced the completed local binary pattern (CLBP), where the sign and the magnitude of the differences are separately used to produce their respective binary number, namely CLBP-Sign (CLBP_S) and CLBP-Magnitude (CLBP_M). Their calculation process is shown in Fig. 3. The CLBP_S is actually the traditional LBP descriptor and the CLBP_M is defined as follows:

Fig. 3. (a) a 3×3 sample block; (b) the difference between each neighbour and the center pixel; (c) the sign; (d) the magnitude.
Additionally, the difference between the center pixel t_c and the average of all pixels in the map, namely CLBP-Center (CLBP_C), also contains texture information and it is defined as follows:
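As a hedged illustration of the three CLBP components for a single pixel, the sketch below follows the usual CLBP formulation: the sign bits form the classic LBP code, the magnitude bits are obtained by thresholding the absolute differences against their mean over the whole map, and the center bit compares the center pixel with the mean grey value of the map. The two global means are passed in as assumed inputs, and the function name and sample values are hypothetical.

```python
def clbp_codes(center, neighbours, mean_magnitude, mean_intensity):
    """Return (CLBP_S, CLBP_M, CLBP_C) for one pixel.

    neighbours: grey values of the m neighbours on the circle.
    mean_magnitude: average |difference| over the whole map (CLBP_M threshold).
    mean_intensity: average grey value of the whole map (CLBP_C threshold).
    """
    diffs = [n - center for n in neighbours]
    # Sign component: identical to the classic LBP code.
    clbp_s = sum((1 if d >= 0 else 0) << i for i, d in enumerate(diffs))
    # Magnitude component: threshold |d| against the global mean magnitude.
    clbp_m = sum((1 if abs(d) >= mean_magnitude else 0) << i
                 for i, d in enumerate(diffs))
    # Center component: one bit comparing the center with the global mean.
    clbp_c = 1 if center >= mean_intensity else 0
    return clbp_s, clbp_m, clbp_c


# m = 4 neighbours at radius r = 1, the setting reported for the experiments.
codes = clbp_codes(center=25, neighbours=[9, 38, 25, 54],
                   mean_magnitude=15.0, mean_intensity=30.0)
print(codes)
```

With m = 4 each of CLBP_S and CLBP_M takes one of 16 values, which keeps the per-pixel descriptor small.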
The calculation of CLBP descriptors is usually conducted on the entire map in present recognition approaches. For the purpose of extracting more detailed texture features, the CLBP is combined with the patch-based strategy [27] in this paper, which is illustrated in Fig. 4.

Fig. 4. Illustration of the patch-based CLBP calculation. s denotes the stride of the patch movement and w represents the width of the patch.
As shown in the figure, the patch-based strategy introduces a sliding window, i.e. the patch, on the DMM and calculates the CLBP descriptors on each patch. The overlap between two adjacent patches is controlled by the stride, and a set of patch-based CLBP histograms is generated to describe the corresponding DMMs in a more detailed way.
For a depth motion map with the size of M × N, if the width of the patch is set to w, then the total number of patches is (M + s - w)(N + s - w)/s², and the number of pixels contained in each patch is m = w × w. The CLBP of the ith pixel in each patch is a 3-dimensional vector c_i = [CLBP_S, CLBP_M, CLBP_C]^T. The CLBP vector of each pixel in the patch is concatenated to obtain the feature descriptor vector
For the three depth motion maps generated from a depth sequence sample under different views, the CLBP descriptor corresponding to each patch is calculated according to the above-mentioned patch-based strategy for further representation.
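The patch count given above can be checked by enumerating the sliding-window positions directly. The sketch below uses square patches and hypothetical sizes chosen so that the stride divides M - w and N - w exactly; when it does not, the enumeration simply drops the incomplete patches at the border, which is an assumption of this sketch.

```python
def patch_origins(height, width, patch, stride):
    """Top-left corners of all square patches that fit inside the map."""
    return [(y, x)
            for y in range(0, height - patch + 1, stride)
            for x in range(0, width - patch + 1, stride)]


M, N, w, s = 10, 8, 4, 2  # hypothetical sizes; s divides M - w and N - w
origins = patch_origins(M, N, w, s)
expected = (M + s - w) * (N + s - w) // (s * s)
# One [CLBP_S, CLBP_M, CLBP_C] triple per pixel of the patch.
descriptor_length = 3 * w * w
print(len(origins), expected, descriptor_length)
```

Concatenating the 3-dimensional CLBP vectors of the w × w pixels yields a descriptor of length 3m per patch, which is what the last line computes.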
The 4D normal
Given a sequence of depth maps {I_1, I_2, ..., I_N}, every pixel in the sequence can be treated as a point in the 4D space and the value of a pixel z on a map can be regarded as a function of the abscissa x, the ordinate y and the frame index t, i.e. z = f(x, y, t). Thus, all the pixels jointly compose a surface S in the 4D space and the normal of point (x, y, t, z) in the 4D surface S is calculated as:
The orientation of the normals in the whole surface reflects the shape of the surface, hence it is an effective feature that embodies the shape and motion information of the action. For further processing, the normal vectors are normalized first.
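A minimal sketch of the normal computation is given below. It assumes the usual gradient-based form, in which the normal is built from the partial derivatives of z with respect to x, y and t (up to an overall sign convention), and it approximates those derivatives by central differences; both choices, as well as the synthetic data, are assumptions of the sketch.

```python
import math


def normal_4d(depth, t, y, x):
    """Unit 4D surface normal at an interior point of depth[t][y][x]."""
    # Central differences along x, y and t.
    dz_dx = (depth[t][y][x + 1] - depth[t][y][x - 1]) / 2.0
    dz_dy = (depth[t][y + 1][x] - depth[t][y - 1][x]) / 2.0
    dz_dt = (depth[t + 1][y][x] - depth[t - 1][y][x]) / 2.0
    n = (-dz_dx, -dz_dy, -dz_dt, 1.0)
    # Normalise to unit length, as required for further processing.
    length = math.sqrt(sum(c * c for c in n))
    return tuple(c / length for c in n)


# Synthetic sequence z = 2x + 2t on a 3x3x3 grid (hypothetical data).
depth = [[[2 * x + 2 * t for x in range(3)] for y in range(3)]
         for t in range(3)]
n = normal_4d(depth, 1, 1, 1)
print(n)
```

For this planar surface the gradient is (2, 0, 2), so the unnormalised normal has length 3 and the unit normal has components of magnitude 2/3, 0, 2/3 and 1/3.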
In order to present the correlation between the normals of adjacent points and make the distribution of normal orientations less affected by noise, Yang et al. [19] gathered the normals of all points in a local spatial-temporal neighbourhood to create a robust descriptor of normal feature, namely polynormal. In other words, the polynormal corresponding to a point on the 4D surface concatenates the normal vectors of all points in the neighbourhood of this point, which can be expressed by the following formula:

Fig. 5. Illustration of generating the polynormal P_ω of point ω. The left half of the diagram is a short depth sequence. An Nx×Ny×Nt neighbourhood Ω of point ω is taken first, as shown in the right half of the diagram. Then the corresponding 4D normals n of all points in the neighbourhood are calculated, and all the normals are concatenated to get the polynormal vector P_ω of point ω. In this figure, Nx = Ny = Nt = 3 and thus L = 27.
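The concatenation can be sketched as follows, assuming the normals have already been computed for every point. The traversal order of the neighbourhood is an assumption of this sketch (any fixed order works as long as it is used consistently), and the dummy normal field is hypothetical.

```python
def polynormal(normals, t, y, x, nx=3, ny=3, nt=3):
    """Concatenate the normals in an nx*ny*nt neighbourhood centred at (t, y, x).

    normals[t][y][x] is a 4-tuple; the center must lie far enough from the border.
    """
    vec = []
    for dt in range(-(nt // 2), nt // 2 + 1):
        for dy in range(-(ny // 2), ny // 2 + 1):
            for dx in range(-(nx // 2), nx // 2 + 1):
                vec.extend(normals[t + dt][y + dy][x + dx])
    return vec


# Dummy field of identical normals on a 3x3x3 grid (hypothetical data).
field = [[[(0.0, 0.0, 0.0, 1.0) for _ in range(3)] for _ in range(3)]
         for _ in range(3)]
p = polynormal(field, 1, 1, 1)
print(len(p))
```

With Nx = Ny = Nt = 3 the neighbourhood holds L = 27 normals, so the polynormal has 4 × 27 = 108 entries.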
Because all the action information is concentrated in the area where the action is performed, the pixel values of the background part are invalid for discrimination of the action category. Accordingly, the depth sequence is first cropped through extracting the largest bounding box of the human body [43]. Additionally, this process also helps to reduce the impact of scale differences on recognition results. To further reduce the computational overhead, the cropped depth sequence is partitioned into spatial grids. In this way, combined with the temporal pyramid division in Section 3.1, a depth sequence is partitioned into a set of space-time cells, and on each cell the polynormals are calculated. The bounding box and space-time cells are illustrated in Fig. 6.

Fig. 6. Illustration of the bounding box and space-time cells.
The existing multi-feature fusion methods are mainly divided into two categories, namely feature layer fusion and decision layer fusion [44].
Feature layer fusion is fusion at the stage of feature description or representation, i.e. before the classification stage. Depending on where it takes place, it can be further subdivided into low-level feature fusion and middle-level feature fusion, which correspond to the feature description stage and the feature representation stage respectively. Most existing methods belong to middle-level feature fusion, and one important reason is that low-level fusion easily leads to overfitting.
Decision layer fusion is to fuse at the classification stage. After inputting various feature descriptors or feature representation vectors into a dedicated classifier for classification, the classification results of each classifier are synthesized according to specific fusion rules, thus getting the final classification result. The following introduces one of the most used fusion rules, i.e. the “sum” rule.
If a depth video sample set contains M action categories and N types of feature representation vectors are obtained from a sample in this set, the output of the classifier corresponding to the ith type of feature representation vector, i.e. the probability that the feature vector belongs to the category ω_k, is denoted P(ω_k | X_i). Under the “sum” rule, these probabilities are summed over the N classifiers for each category, and the sample is assigned to the category ω_j with j = argmax_k Σ_{i=1..N} P(ω_k | X_i), k = 1, ..., M.
In general, compared to other methods, the advantages of low-level feature fusion are that there is no need to perform multiple feature representations and multiple classifications, which reduces the calculation overhead and shortens the recognition time. The advantage of the middle-level feature fusion compared to the decision-level fusion is that there is no need to perform multiple classifications, which reduces the time required for classification. The disadvantage of feature layer fusion is that it easily leads to overfitting. The decision layer fusion can avoid overfitting, but the training and testing of multiple classifiers usually increase the hardware overhead and recognition time.
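Decision layer fusion with the “sum” rule can be sketched in a few lines. The sketch assumes each classifier outputs one probability per category and that all classifiers are weighted equally; the function name and the probability values are hypothetical.

```python
def sum_rule(probabilities):
    """probabilities[i][k] = P(omega_k | X_i) from the i-th classifier.

    Returns the index of the category with the largest summed probability,
    together with the summed scores.
    """
    num_classes = len(probabilities[0])
    totals = [sum(p[k] for p in probabilities) for k in range(num_classes)]
    best = max(range(num_classes), key=lambda k: totals[k])
    return best, totals


# Two classifiers (one per descriptor), three categories; hypothetical outputs.
clbp_probs = [0.50, 0.40, 0.10]
polynormal_probs = [0.20, 0.60, 0.20]
label, totals = sum_rule([clbp_probs, polynormal_probs])
print(label, totals)
```

Here the first classifier alone would choose category 0, but the summed evidence favours category 1, which shows how the rule lets one descriptor correct the other.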
Sparse representation
The descriptor vectors generated in the above steps describe the low-level features of the video action. In order to extract the high-level semantic information, they must be further represented. Here the sparse coding and dictionary learning (SCDL) [28] is used to obtain the sparse representation of the descriptors. The SCDL method has been widely applied in action recognition, where it has achieved state-of-the-art results [29, 30]. In addition to producing a compact representation vector, it can also remove the noise in the video effectively. Given a set of patch-based CLBP descriptors c = {c_1, c_2, ..., c_N}, its sparse representation can be obtained by solving:
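To make the coding step concrete, the sketch below solves a sparse coding problem for a single descriptor with a fixed dictionary. It assumes the standard ℓ1-regularised least-squares objective and uses the iterative shrinkage-thresholding algorithm (ISTA); the dictionary-learning half of SCDL is omitted, and the objective, solver, function names and values are assumptions of this sketch rather than the exact formulation of [28].

```python
def soft_threshold(v, thr):
    """Shrink every entry of v towards zero by thr."""
    return [max(abs(x) - thr, 0.0) * (1 if x >= 0 else -1) for x in v]


def sparse_code(c, dictionary, lam, step, iterations=200):
    """ISTA for min_a 0.5 * ||c - D a||^2 + lam * ||a||_1 with fixed D.

    dictionary: list of atoms, each a list with len(c) entries.
    step: gradient step, at most 1 / (largest eigenvalue of D^T D).
    """
    a = [0.0] * len(dictionary)
    dim = len(c)
    for _ in range(iterations):
        # Reconstruction D a and residual D a - c.
        recon = [sum(a[j] * dictionary[j][i] for j in range(len(a)))
                 for i in range(dim)]
        residual = [recon[i] - c[i] for i in range(dim)]
        # Gradient of the quadratic term: D^T (D a - c).
        grad = [sum(atom[i] * residual[i] for i in range(dim))
                for atom in dictionary]
        a = soft_threshold([a[j] - step * grad[j] for j in range(len(a))],
                           step * lam)
    return a


# With an orthonormal dictionary the solution is soft-thresholding of c.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
alpha = sparse_code([3.0, -0.2, 1.0], identity, lam=0.5, step=1.0)
print(alpha)
```

The small second entry of the descriptor is suppressed to zero, which illustrates why sparse coding tends to discard low-amplitude noise while keeping the dominant components.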
Since the dimensions of the feature descriptor vectors and the corresponding sparse representation vectors are too high and their number is too large, the maximum pooling and average pooling methods are used successively to statistically represent the data distribution in the sparse vectors of the subsequences, which creates a more compact representation vector while retaining the most discriminative information. In addition, because the subsequences obtained through the ME based temporal pyramid division contain different numbers of frames, the number of polynormal descriptor vectors calculated from them is also unequal. This poses an obstacle to the final classification stage because the classifier used in this work requires the input feature representation vectors to have the same dimensions. The pooling described below reduces the representation vectors of the same map or frame to a single vector, thereby unifying the dimensions of the representation vectors to be input into the classifier.
For the sparse representation vector α_i corresponding to the CLBP descriptor c_i, maximum pooling is applied, as shown in the following formula:
For the sparse representation vector β_i corresponding to the polynormal descriptor p_i, the spatial maximum pooling is applied first:
Then the temporal average pooling is performed:
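The two pooling stages for the polynormal codes can be sketched as below. The sketch takes the element-wise maximum over all sparse vectors of a frame and then the element-wise mean over the frames of a subsequence; whether the maximum is taken over signed or absolute values is not specified here, so using the signed values is an assumption, and the sample codes are hypothetical.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equally long vectors."""
    return [max(column) for column in zip(*vectors)]


def average_pool(vectors):
    """Element-wise mean over a list of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# codes_per_frame[f] holds the sparse vectors of frame f (hypothetical values);
# the two frames deliberately contain different numbers of vectors.
codes_per_frame = [
    [[0.0, 0.8, 0.1], [0.4, 0.0, 0.3]],
    [[0.2, 0.0, 0.0], [0.0, 0.2, 0.5], [0.6, 0.0, 0.1]],
]
spatial = [max_pool(frame) for frame in codes_per_frame]  # one vector per frame
subsequence_vector = average_pool(spatial)                # one per subsequence
print(spatial, subsequence_vector)
```

Regardless of how many descriptors each frame contributes, or how many frames a subsequence contains, the result has the length of a single sparse vector, which is what allows the representation vectors to be fed to one classifier.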
So far, with regard to each subsequence of a video sample, the corresponding CLBP sparse representation vectors
First, feature layer fusion is conducted, that is, the above two vectors are concatenated to form the total feature representation vector of the subsequence:
Then, the feature representation vectors of each subsequence are concatenated to compose the ultimate representation vector of the sample:
Under this fusion strategy, the pipeline of the proposed recognition method is illustrated in Fig. 7.

Fig. 7. The pipeline of the proposed method under the feature layer fusion.
In decision layer fusion, the two feature descriptors corresponding to each subsequence are sparsely represented and pooled separately, according to the method described above. Max pooling is chosen because it performed better in comparative experiments. For each sample, the final representation vectors corresponding to the two feature descriptors are calculated and input into independent SVM classifiers for classification. The “sum” rule is used to obtain the final classification result. The overall process is shown in Fig. 8.

Fig. 8. The pipeline of the proposed method under the decision layer fusion.
MSRAction3D dataset
MSRAction3D dataset [37] is one of the most popular depth sequence datasets. It comprises 20 action categories, and each action was performed 2 or 3 times by 10 different subjects. This is a challenging dataset because some of the actions are very similar and different subjects perform the same action at considerably different speeds.
In order to unify the sizes of the DMMs generated from different depth sequences, the size of each DMM was set to the maximum of all sizes. Thus, the DMMf, DMMs and DMMt were resized to 245×137, 245×172 and 172×137 respectively. The patch sizes of the DMMf, DMMs, and DMMt were set to 50×28, 50×35 and 35×28 respectively. Accordingly, the strides between two adjacent patches were set to (25, 14), (25, 18), and (18, 14) respectively. In the calculation of the CLBP, a set of relatively optimal values was found, i.e. m = 4 and r = 1 through observing the influences on the recognition accuracy by different values of m and r. For generating polynormals, 4×3 grids were obtained from the bounding box of the human body, and Nx, Ny, Nt were all set to 3. The values of the parameters were assigned based on experience.
For this dataset, 10 rounds of 5-fold cross-validation are used for training and testing according to the action subject. For example, in a certain round, the samples of the action subjects of No. 1, 2, 4, 5, 6, 7, 9, 10 are taken as the training set, and the samples of the action subjects of No. 3, 8 are taken as the test set. For SVM, some parameter values are assigned by 5-fold cross-validation optimization, and others are set by default.
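The subject-wise protocol can be sketched as follows. The sketch assumes that each of the 10 rounds draws a fresh random grouping of the 10 subjects into 5 test pairs, so that no subject appears in both the training and the test set of a fold; the seeding scheme and the function name are hypothetical.

```python
import random


def subject_folds(subjects, num_folds, seed):
    """Split subject IDs into num_folds disjoint test groups."""
    shuffled = list(subjects)
    random.Random(seed).shuffle(shuffled)
    size = len(shuffled) // num_folds
    return [sorted(shuffled[i * size:(i + 1) * size])
            for i in range(num_folds)]


subjects = range(1, 11)          # the 10 performers of MSRAction3D
rounds = []
for round_id in range(10):       # 10 rounds, each with its own grouping
    for test_subjects in subject_folds(subjects, 5, seed=round_id):
        train_subjects = [s for s in subjects if s not in test_subjects]
        rounds.append((train_subjects, test_subjects))
print(len(rounds), rounds[0])
```

Each of the resulting 50 splits trains on 8 subjects and tests on the remaining 2, matching the example above in which subjects 3 and 8 form the test set.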
The confusion matrices of the experimental results using the two fusion strategies are shown in Fig. 9(a) and 9(b) respectively.

Fig. 9. The confusion matrices of the recognition results on the MSRAction3D dataset. (a) feature layer fusion; (b) decision layer fusion.
The proposed method was compared with several existing methods in Table 1. As can be seen from the table, the recognition accuracy of the proposed method is 1.70% higher than that of DMM-LBP-DF, which adopts only the patch-based LBP feature. This is because the proposed strategy extracts richer texture features by using CLBP instead of LBP and additionally describes the process of the human body’s movement, whereas the feature descriptors of DMM-LBP-DF only indicate the area where the motion occurs. Likewise, the proposed method outperforms the Super Normal Vector by 1.61% in recognition rate, since it captures additional motion distribution information and more prominent shape cues than the latter, which employs the polynormal descriptor alone.
Table 1. Recognition accuracy comparison between the proposed method and previous approaches on MSRAction3D dataset
At the same time, our method achieves better results than some state-of-the-art methods such as [49] and [40]. Although the advantages of deep learning methods in the field of action recognition have become more and more obvious recently, it can be seen from Table 1 that the recognition rate of our method is higher than that of the Deep Convolutional Neural Networks methods in [40] and [45]. This is mainly because the learning process of deep learning methods often relies on a sufficiently large training dataset, but the MSRAction3D dataset only contains a small number of samples, on which deep learning methods are prone to overfitting. In contrast, the approach adopted in this paper, which manually extracts features and carefully constructs feature descriptors, often does not require a very large dataset, so it is more advantageous when the number of samples is small.
In addition, the recognition accuracy with decision layer fusion is 0.25% higher than with feature layer fusion, which may indicate that decision layer fusion performs better, although the difference is not significant.
MSRGesture3D dataset [42] is another widely used dataset for action recognition based on depth maps. It contains 12 gestures from American Sign Language, and every gesture is performed 3 times by each of the 10 subjects. It is considered challenging due to the problem of self-occlusion [9].
Correspondingly, the sizes of the DMMf, DMMs and DMMt were 254×285, 254×67 and 67×285 respectively. The corresponding patch sizes were set to 50×56, 50×13 and 13×56 respectively, and thus the strides were set to (25, 28), (25, 7), and (7, 28) respectively. On this dataset, the leave-one-subject-out cross validation [41] was applied. The values of the other parameters were the same as those in the experiment on the MSRAction3D dataset. The confusion matrices of the experimental results are shown in Fig. 10(a) and 10(b).

Fig. 10. The confusion matrices of the recognition results on the MSRGesture3D dataset. (a) feature layer fusion; (b) decision layer fusion.
The comparison between the proposed method and the same methods mentioned in Section 5.1 is shown in Table 2.
Table 2. Recognition accuracy comparison between the proposed method and previous approaches on MSRGesture3D dataset
As can be seen, the proposed method achieves better performance than the DMM-LBP-DF and the Super Normal Vector, with increases of 2.16% and 2.02% in recognition rate, respectively. The reason is the same as in the analysis performed in Section 5.2. Similarly, on this dataset, the method based on decision layer fusion is 0.35% higher than the method based on feature layer fusion, which further suggests that the former is a better fusion strategy for the recognition method proposed in this paper. Additionally, our method also shows better performance than some state-of-the-art methods on this dataset.
Conclusion
In this paper, a new scheme for action recognition from depth sequences has been proposed. The DMMs based CLBP descriptor and the polynormal descriptor were fused to capture the shape and motion cues of human actions comprehensively. To cope with the variations in speed of actions of the same kind performed by different subjects, a motion energy based spatial-temporal pyramid was used to subdivide the depth sequence. The CLBP was combined with the patch-based strategy so as to extract more texture features from the DMMs. Moreover, the SCDL was used to separately encode the two low-level features to obtain their respective more compact feature representations, while retaining their higher-order statistics. Through spatial max pooling and temporal average pooling, the dimensions of the two different kinds of representation vectors were unified to form the final representation vector. Therefore, the proposed representation framework is also applicable to the fusion of other kinds of features. The SVM was chosen as the classifier, and different fusion strategies were used to conduct an extensive evaluation of the proposed method on two public benchmark datasets. Experimental results showed that the proposed method is superior to each single feature based method and most of the existing recognition methods.
Acknowledgments
This work is supported in part by the Science and Technology Plan Project of Hunan Province, China under Grant 2016WK2023, and in part by the Changsha Science and Technology Plan Project under Grant KQ1901139.
