Abstract
This paper introduces a method for recognizing human actions in depth videos. We first generate the Motion History Images (MHIs) and Static History Images (SHIs) of an action video using the 3D Motion Trail Model (3DMTM). We then extract Gradient Local Auto-Correlations (GLAC) features from both the MHIs and the SHIs to characterize the action video. Next, we concatenate the MHI-based GLAC features with the SHI-based GLAC features to obtain a single action representation vector. The feature vectors computed for all action samples are then passed to the l2-regularized Collaborative Representation Classifier (l2-CRC) to recognize multiple human actions. Experimental evaluations on three action datasets, MSR-Action3D, DHA and UTD-MHAD, show that the proposed recognition system considerably outperforms state-of-the-art approaches. In addition, a computational efficiency test indicates that the system is compatible with real-time operation.
Keywords
Introduction
Human Action Recognition (HAR) is one of the most challenging research domains in computer vision. It is widely used in video surveillance, health care monitoring, video analysis, assistive living, robotics, telemedicine, content-based video search, video gaming, human-computer interaction [1–4], and a variety of other systems that involve interfaces between people and electronic devices. Human action recognition remains a difficult problem because of variations in human body sizes and motions. The difficulty increases when the same action is performed differently by different subjects, or even by the same subject at different times.
Traditionally, research on human action recognition has focused on learning and classifying actions in videos recorded with conventional RGB sensors [5–8]. In practice, this kind of sensor has intrinsic drawbacks for action recognition, such as sensitivity to illumination changes, occlusions, and background clutter [1]. Moreover, with RGB data the significant points of an object are identified mainly from its texture rather than from its geometric structure [9]. Furthermore, human actions captured by conventional RGB sensors do not encode 3D action information directly.
However, with the availability of inexpensive depth sensors (e.g., Microsoft Kinect), the above difficulties have been reduced significantly in the action classification task. Such a sensor delivers depth images (maps) that record the distance between the surfaces of scene objects and the sensor's viewpoint [10, 11]. Moreover, human localization and segmentation are simpler in depth images than in RGB images [12]. In addition, human skeleton information can be obtained from depth maps as a supplementary cue for action classification [13]. Overall, depth sensors exhibit several advantages over RGB sensors.
The above discussion indicates that a depth sensor based action recognition system can outperform RGB sensor based systems. Hence, this paper addresses depth sensor based human action recognition. We first consider both static and motion posture characteristics to represent a depth action sequence. The GLAC [14] features of the resulting static and motion posture images are then calculated and fused to obtain an effective action recognition method. More specifically, the 3D Motion Trail Model (3DMTM) is used to form Motion History Images (MHIs) and Static History Images (SHIs) from the front, side and top projection views of an action video. The GLAC features are computed on the MHIs and SHIs and fused in a complementary way to generate an action feature vector. The resulting feature vector is the input of the l2-regularized Collaborative Representation Classifier (l2-CRC), which recognizes the human action. The pipeline of the proposed recognition algorithm is shown in Fig. 1.
The main contributions of this paper are summarized as follows:

Framework of the proposed action recognition system.
We compute MHIs and SHIs corresponding to each depth sequence through the 3DMTM algorithm. The MHIs and SHIs are generated from the front, side and top projection views of an action video.
The GLAC features are extracted from the MHIs and SHIs individually. The generated GLAC features on MHIs and SHIs are concatenated to construct a single feature vector.
The generated feature vector is fed to l2-CRC to recognize action in the video.
The recognition system is extensively evaluated on three publicly available datasets, namely MSRAction3D [15], DHA [16], and UTD-MHAD [17]. We compare the recognition results with handcrafted feature based methods as well as deep learning methods. The overall experimental assessment indicates that the proposed approach outperforms both groups of methods.
The rest of the paper is organized as follows. Section 2 reviews the related work. In Section 3, we describe the proposed recognition method consisting of feature extraction strategies. The results obtained are discussed in Section 4. Finally, Section 5 concludes the paper.
Human action recognition from depth video sequences has attracted great interest among computer vision researchers owing to the emergence of depth video sensors. Many feature extraction strategies have been proposed to represent depth video sequences, e.g., 3D point clouds [15], projected depth images [18], spatio-temporal interest points [19], and skeleton joints [20]. In [15], a collection of 3-dimensional points extracted from depth maps was sampled to describe the 3-dimensional structure of body postures. A Gaussian mixture model (GMM) was used to robustly capture the statistical distribution of the points. The method did not produce promising results because of its high computational complexity and its loss of spatial information among significant points. In contrast, Vieira et al. constructed silhouettes in 3-dimensional Euclidean space using Space-Time Occupancy Patterns (STOP) [21, 22]. In a cell-structured spatio-temporal depth volume, they labeled a cell with 1, 0, or a fraction when the cell was filled, unfilled, or partially filled, respectively. The filled and partially filled cells were detected through an ad hoc parameter. A benefit of STOP is that it preserves spatial and temporal contextual information between space and time cells, which helps to handle intra-class variations. In contrast to simple occupancy patterns, a Haar feature vector was computed in [23] on a uniform grid in the 4D depth volume. The computational complexity of both of these methods was very high.
A filtering technique for identifying spatio-temporal interest points (STIPs) in depth action sequences (called DSTIP) was presented in [19]; it emphasizes action-consistent interest points while suppressing noise in depth action videos. Building on motion energy images (MEI) and motion history images (MHI) [20], depth motion maps (DMMs) [18, 25] were constructed to represent each action video compactly. Histogram of oriented gradients (HOG) features [24], local binary pattern (LBP) features [27], and other shape and texture features were extracted from those DMMs to characterize actions more accurately [26, 28–30]. 2D and 3D auto-correlation features were also computed from DMMs and fused to enhance the discriminatory power of the recognition system [31, 32]. To further improve DMM based recognition, multi-temporal DMMs were computed and texture features were extracted from them in [33]. Furthermore, 3D histograms of texture (3DHoTs) were used to capture dominant features from a depth action sequence for human action recognition [34]. Because complex joint shape-motion cues cannot be captured at the pixel level from depth images, a histogram of oriented 4D normals was used in [35]. In [36], a framework combining salient depth map (SDM) and binary shape map (BSM) feature vectors was proposed. As another approach, an action recognition algorithm based on locality-constrained linear coding (LLC) was introduced in [37]. Additionally, an action recognition process using hierarchical 3D kernel descriptors was proposed in [38].
Skeleton joints can be extracted from depth frames, and several action recognition systems have been developed based on those joints. For instance, the pairwise differences of the 3D joint positions of a subject within a depth frame and the temporal differences corresponding to each depth frame were computed to represent human actions [39, 40]. Histograms of 3D joint locations (HOJ3D) were also employed to represent actions [41]. Furthermore, to improve skeleton joint based recognition, a genetic evolutionary algorithm was applied to select the optimal subset of skeleton joints [42]. A non-parametric moving pose (MP) approach for low-latency action recognition was reported in [43]; it combines pose information with differential quantities (speed and acceleration) of the skeleton joints within a short temporal window around the current frame. A local skeleton descriptor that encodes the relative positions of joint quadruples of the human skeleton was used in [44]. In [45], another joint representation and recognition model was described that combines multi-perspective and multi-modality projections of color and depth frame sequences. An actionlet ensemble model for action classification, which is robust to noise, was presented in [46]. A skeletal representation based on the 3-dimensional geometric relationships among different body segments was discussed in [47]; this work represented human actions as curves in a Lie group.
To diversify joint based methods, 3D joint features were combined with color and depth sequence based features, as reported in [48, 49].
To further improve human action recognition as a whole, a fusion framework combining two sensors of different modalities, a depth sensor (Kinect) and a wearable inertial sensor (accelerometer), was proposed in [1].
Beyond handcrafted feature based methods, deep learning methods learn action characteristics from raw action data and compute high-level semantic action representations, as in other domains (e.g., speech [63, 64]). In [50], Wang et al. introduced a deep model, which exhibited superior performance in action classification in [34]. The DMM-Pyramid based deep architecture also obtained promising results in depth action classification [51].
This survey of depth image based action recognition motivated us to develop an action recognition system that operates on depth images, which preserve abundant discriminative information. In this paper, we focus mainly on dominant feature extraction and action representation. More precisely, this paper uses both motion and static postures to represent an action, whereas the previously reported methods consider the motion posture only. Motion posture images can fail to capture most of the moving regions when the motion posture update function is applied inappropriately. Additionally, information about motionless pose history regions, monotonous activities and monotonous static poses is lost in motion posture frames. Thus, motion and static posture images are needed together to address the inter-class similarity and intra-class variation issues. Sometimes the motion posture images do not contain enough information, or they increase the intra-class variation among subjects performing the same action, while the corresponding static body postures of those subjects can help to reduce it. Similarly, the motion posture images can increase the inter-class similarity because of the way a subject moves, whereas the motionless pose regions can decrease it. Overall, the proposed method addresses the inter-class similarity and intra-class variation problems in recognizing human actions in depth action sequences. Besides effectiveness, we also test the computational efficiency of the algorithm to assess its suitability for real-time operation.
Proposed recognition system
In this section, we present our approach through a detailed discussion of the feature extraction, action representation and classification techniques.
Feature extraction
To encode the action features, the MHI and SHI are first derived from a depth action sequence, and the GLAC features are then extracted from the resulting images, as discussed below. The well-known Motion History Image (MHI) summarizes body-segment movements (present in the depth frames) by condensing the depth sequence into a gray-level image [20]. However, when the update function is applied inappropriately, the MHI can fail to capture most of the moving regions. Additionally, information about motionless pose history regions, monotonous activities and monotonous static poses is lost in the MHI [53]. Thus, we use the Static History Image (SHI) to capture these components, which are complementary to the MHI. To generate the MHI and SHI for each action video, we use the 3-dimensional Motion Trail Model (3DMTM) [53]. The model provides a set of motion history images $\{MHI_{XOY}, MHI_{YOZ}, MHI_{XOZ}\}$ and a set of static posture history images $\{SHI_{XOY}, SHI_{YOZ}, SHI_{XOZ}\}$ corresponding to the three 2D Euclidean planes. The motion update function $\varphi_M(x, y, t)$ and the static posture update function $\varphi_S(x, y, t)$ specify the moving and motionless regions of the subject, respectively, while an activity is performed. These two functions are applied to each depth frame of the depth action video:
where $(x, y)$ is the pixel location and $t$ is the time index. Furthermore, $d_t = (d_1, d_2, d_3, \ldots, d_T)$ is the depth map sequence, whereas $P_t = (P_1, P_2, P_3, \ldots, P_T)$ is the difference image sequence, which holds the absolute difference between pairs of depth maps. Both update functions require a threshold, $\varsigma_M$ and $\varsigma_S$ respectively, to separate motion information from static information between consecutive depth maps. The depth motion history image $F_M(x, y, t)$ is then obtained using the motion update function $\varphi_M(x, y, t)$:
Furthermore, the static posture history image (SHI) $F_S(x, y, t)$ is generated with the static posture update function $\varphi_S(x, y, t)$ to account for the motionless regions over the entire depth video. It is obtained in the same way as the MHI:
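As a rough illustration of how the two history images can be accumulated, the sketch below assumes the usual MHI recurrence: a pixel is reset to the sequence length T whenever its update function fires and otherwise decays by one (clamped at zero). The specific update rules (difference above a threshold for motion, depth minus difference above a threshold for static regions), the function name `history_images`, and the threshold values are assumptions made for this example and are not taken from the text.

```python
import numpy as np

def history_images(depth_seq, thr_motion=10.0, thr_static=50.0):
    """Accumulate an MHI and an SHI for one projected depth sequence.

    depth_seq: array of shape (T, H, W).
    thr_motion, thr_static: illustrative placeholder thresholds (assumed).
    """
    depth_seq = np.asarray(depth_seq, dtype=float)
    T = depth_seq.shape[0]
    mhi = np.zeros(depth_seq.shape[1:])
    shi = np.zeros(depth_seq.shape[1:])
    for t in range(1, T):
        # Absolute difference between consecutive depth maps (P_t)
        diff = np.abs(depth_seq[t] - depth_seq[t - 1])
        moving = diff > thr_motion                    # assumed motion update rule
        static = (depth_seq[t] - diff) > thr_static   # assumed static update rule
        # Reset to T where the rule fires, otherwise decay by one (not below 0)
        mhi = np.where(moving, T, np.maximum(mhi - 1, 0))
        shi = np.where(static, T, np.maximum(shi - 1, 0))
    return mhi, shi
```

In a full pipeline this function would be called once per projection view (front, side, top), giving three MHIs and three SHIs per video.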
It should be noted that the 3DMTM also provides the average motion history image (AMHI) and the average static posture history image (ASHI). In this work, we consider only the MHI and SHI, since the AMHI and ASHI reduce the recognition accuracy [54]. Next, we extract gradient and curvature features (which are properties of human contours) from the MHI and SHI images using the Gradient Local Auto-Correlation (GLAC) feature descriptor [14]. For an intuitive explanation, let $I(x, y)$ be an MHI or SHI. Then the magnitudes of the gradient vectors and the corresponding orientation angles at each point of $I(x, y)$ are given by
where
For $K \in \{0, 1\}$, the GLAC features are written as follows:
Since there are four configurations of $(\mathbf{r}, \mathbf{r} + \mathbf{a}_1)$, as depicted in Fig. 2, the dimension of the above GLAC features ($\mathbf{F}_0$ and $\mathbf{F}_1$) is $d = D + 4D^2$. Thus, the obtained $d$-dimensional feature vector corresponding to $MHI_{XOY}$ is denoted by

Configuration patterns for K ∈ {0, 1}.
From above, three feature vectors
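To make the descriptor concrete, the following is a simplified sketch of 0th- and 1st-order GLAC features computed on a single image. It assumes hard assignment of each gradient orientation to one of D bins, four displacement patterns (horizontal, two diagonals, vertical), and the minimum of the two gradient magnitudes as the pair weight. The function name `glac` and these simplifications are assumptions for illustration; the descriptor in [14] may differ in its details.

```python
import numpy as np

def glac(image, D=8, dr=1):
    """Simplified GLAC: returns a vector of length D + 4 * D**2."""
    img = np.asarray(image, dtype=float)
    gy, gx = np.gradient(img)                       # derivatives along rows, columns
    mag = np.hypot(gx, gy)                          # gradient magnitude
    theta = np.mod(np.arctan2(gy, gx), 2 * np.pi)   # orientation in [0, 2*pi)
    bins = np.floor(theta / (2 * np.pi) * D).astype(int) % D
    H, W = img.shape
    f = np.zeros((H, W, D))                         # one-hot orientation code per pixel
    f[np.arange(H)[:, None], np.arange(W)[None, :], bins] = 1.0

    # 0th order: magnitude-weighted orientation histogram
    f0 = (mag[..., None] * f).sum(axis=(0, 1))

    # 1st order: joint orientation statistics of pixel pairs (r, r + a1)
    shifts = [(0, dr), (dr, dr), (dr, 0), (dr, -dr)]  # assumed (dy, dx) patterns
    f1 = []
    for dy, dx in shifts:
        y0, y1 = 0, H - dy
        x0, x1 = max(0, -dx), W - max(0, dx)
        ref = (slice(y0, y1), slice(x0, x1))                       # positions r
        nbr = (slice(y0 + dy, y1 + dy), slice(x0 + dx, x1 + dx))   # positions r + a1
        w = np.minimum(mag[ref], mag[nbr])
        f1.append(np.einsum('yx,yxi,yxj->ij', w, f[ref], f[nbr]).ravel())
    return np.concatenate([f0] + f1)
```

With D = 8 the output has 8 + 4 · 64 = 264 entries, in agreement with the dimension formula d = D + 4D².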
Action classification
The l2-regularized Collaborative Representation Classifier (l2-CRC) has been used successfully to recognize human actions in [25, 28]. As a result, after representing an action by combining the
Here
where
In accordance with [60] vector
After that by utilizing the class information of all the learning instances,
where $q_j$ is defined by
Algorithm 1 describes our recognition system concisely.
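A collaborative representation classifier of this kind can be sketched as follows. The example assumes the plain ridge (l2) form, in which the test vector is coded over all training samples with an identity regularizer weighted by μ, and the predicted class is the one whose samples reconstruct the test vector with the smallest residual. The function name `crc_predict`, the identity regularizer (instead of any weighted variant), and the residual rule are assumptions for this illustration.

```python
import numpy as np

def crc_predict(train, labels, query, mu=1e-4):
    """train: (d, n) matrix with one training sample per column;
    labels: length-n class labels; query: (d,) test feature vector."""
    A = np.asarray(train, dtype=float)
    y = np.asarray(query, dtype=float)
    labels = np.asarray(labels)
    # Ridge solution: alpha = (A^T A + mu * I)^(-1) A^T y
    alpha = np.linalg.solve(A.T @ A + mu * np.eye(A.shape[1]), A.T @ y)
    classes = np.unique(labels)
    # Class-wise reconstruction residuals; the smallest one decides the label
    residuals = [np.linalg.norm(y - A[:, labels == c] @ alpha[labels == c])
                 for c in classes]
    return classes[int(np.argmin(residuals))]
```

Because the coefficient vector has a closed form, the matrix that maps a test vector to its coefficients depends only on the training set and could be precomputed once, which keeps the per-sample classification cost low.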
The recognition approach is extensively assessed, including a comparison with state-of-the-art approaches, on three benchmark datasets, i.e., MSRAction3D [15], DHA [16], and UTD-MHAD [17].
Results on the MSRAction3D Dataset
In the first experimental setup, the 20 actions are split into three action sets, as listed in Table 1. Three test cases, namely test one, test two and the cross subject test, are conducted for every action subset [15, 18]. In all the experiments, the parameters of the MHI and SHI based GLAC descriptors are set to D = 8 and Δr = 1, selected on the training examples through 5-fold cross validation. The number of spatial bins is tuned in the same manner. The l2-CRC parameter μ is set to 0.0001, selected through 5-fold cross validation over the range 0.00001 to 10 on the training samples. To reduce the computational cost of the algorithm, Principal Component Analysis (PCA) is employed to reduce the dimension of the action vector (in this setting, the dimension of the feature vector is 3168). The PCA transform matrix is computed from the training feature set and then applied to the test feature set. In all the experiments, the principal components that account for 99% of the total variance are retained. The feature dimension used in each test case is reported in the corresponding accuracy table.
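The dimensionality reduction step can be sketched as below: the projection is fitted on the training features only, the smallest number of principal components reaching 99% of the variance is kept, and the same projection is applied to the test features. The function names `fit_pca` and `apply_pca` are hypothetical, and the SVD-based computation is one common way to obtain the components, not necessarily the one used in the experiments.

```python
import numpy as np

def fit_pca(train_feats, var_keep=0.99):
    """train_feats: (n_samples, d). Returns the training mean and a (d, k) projection."""
    X = np.asarray(train_feats, dtype=float)
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)        # cumulative explained variance
    k = int(np.searchsorted(ratio, var_keep) + 1)     # smallest k reaching var_keep
    return mean, vt[:k].T

def apply_pca(feats, mean, components):
    """Project features (training or test) with the transform fitted on training data."""
    return (np.asarray(feats, dtype=float) - mean) @ components
```

Fitting on the training set alone keeps the test samples out of the model building step, which matches the procedure described above.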
Three subsets of the MSR-Action 3D dataset
Recognition results on the three action sets in test one
The recognition performances by employing the
Recognition results on the three action sets in Test Two

Class specific accuracies in test one.

Class specific accuracies in test two.
Recognition results on the three action sets in Cross Subject Test

Class specific accuracies in cross subject test.
The performance of the proposed system is also assessed through an extensive comparison with other methods that were evaluated on the MSR-Action3D dataset under the same experimental settings. The comparison of the average recognition accuracy (%) for the three test cases is shown in Table 5. The highest recognition result is highlighted in bold. The proposed approach clearly outperforms all the approaches listed in the table. It should be observed that our algorithm attains the state-of-the-art accuracy of
Average accuracy comparison on the MSR Action 3D dataset on the first setting
In the second experimental setup, the experimental settings reported in [12, 22] are used. All twenty action classes are used without forming action subsets; the samples from one half of the actors (1, 3, 5, 7, 9) are used for model building, and the samples from the remaining actors are used for model assessment. The optimal parameters D = 8, Δr = 1, b_S = 1 × 3 and μ = 0.0001 are determined in the same manner. The dimension of the feature vector is reduced from 4752 to 85 by the same technique as in the first experimental setup. The comparison results of the second experimental setup are shown in Table 6. Note that we compare our system with the deep learning system described in [51]. The accuracies of the compared methods are taken from their respective papers. The recognition results in Table 6 indicate that our system is superior even to the deep learning approach.
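The reported feature dimensions can be checked with a short calculation. The sketch assumes that one GLAC vector of length D + 4D² is computed per spatial bin and that the vectors of all bins, all three projection views and both image types (MHI and SHI) are concatenated. The helper names are hypothetical, and the concatenation scheme is an assumption that happens to reproduce the stated numbers.

```python
def glac_dim(D, n_configs=4):
    """Length of one GLAC vector: D + n_configs * D^2."""
    return D + n_configs * D * D

def action_vector_dim(D, spatial_bins, n_views=3, n_image_types=2):
    """Assumed concatenation: one GLAC vector per bin, view and image type."""
    rows, cols = spatial_bins
    return glac_dim(D) * rows * cols * n_views * n_image_types

print(glac_dim(8))                   # 264
print(action_vector_dim(8, (1, 3)))  # 4752
```

With D = 8 and b_S = 1 × 3 this gives 264 · 3 · 3 · 2 = 4752, the dimension stated for the second setup. Under the same assumption, the 3168-dimensional vector of the first setup is consistent with two spatial bins per image (264 · 2 · 3 · 2 = 3168).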
Recognition accuracy comparison on the MSR Action 3D dataset on the second setting
Recognition accuracy comparison on the DHA dataset
The
Recognition accuracy comparison on the UTD-MHAD dataset
It is worth mentioning that, according to the comparison table, our algorithm achieves 5.1% higher recognition accuracy than the best existing algorithm, reported in [34] (whose accuracy is 84.4%). To further analyze our algorithm, the confusion matrix is depicted in Fig. 8. The matrix shows the class-specific accuracy and the misclassifications for each individual action class.
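For reference, class-specific and overall accuracies can be read from a confusion matrix as in the sketch below. It assumes that rows hold the true classes and columns the predicted classes; the function names and the small example matrix are hypothetical and do not correspond to any reported result.

```python
def class_accuracies(confusion):
    """confusion[i][j]: number of samples of true class i predicted as class j."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(confusion)]

def overall_accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total

# Hypothetical two-class example
example = [[9, 1],
           [2, 8]]
print(class_accuracies(example))  # [0.9, 0.8]
```

The off-diagonal entries of each row show which classes a given action is confused with.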
Besides the overall accuracy, we evaluate the performance using the

Confusion matrix on the MSRAction3D dataset on setting 2.

Confusion matrix on the DHA dataset.

Confusion matrix on the UTD-MHAD dataset.
Statistical measures of the proposed method on MSRAction3D dataset
Statistical measures of the proposed method on DHA dataset
Statistical measures of the proposed method on UTD-MHAD dataset
The computational efficiency of the algorithm is measured through the running time of its major components and through their computational complexity. Since the running time varies from machine to machine, the computational complexity is also taken into account to assess the efficiency of the algorithm.
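A running-time measurement of the mean ± std form can be sketched as follows. The helper `time_component` is hypothetical; it times repeated calls of one pipeline component with a monotonic clock and reports the mean and sample standard deviation in milliseconds. The number of repetitions is an assumption.

```python
import statistics
import time

def time_component(fn, *args, repeats=10):
    """Return (mean, std) of the running time of fn(*args) in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)

# Example usage with a stand-in workload
mean_ms, std_ms = time_component(sorted, list(range(10000)), repeats=5)
```

Timing each component separately makes it possible to see which stage dominates the total processing time per action sample.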
The proposed approach is tested on a CPU platform with an Intel i5-7500 quad-core CPU @ 3.41 GHz and 16 GB of RAM. The processing time of the proposed approach depends on five major components, namely 3DMTM based MHI/SHI generation,
Running time (mean±std) of the major components of the algorithm
The computational complexity of our method is similar to that of the method reported in [25] and lower than that of the other methods listed in Table 13. Although our method has the same complexity as the method described in [25], it outperforms that method by 6.8% in recognition accuracy under the same experimental setup (setting 1) on the MSRAction3D dataset. The proposed system is therefore computationally more efficient than the other methods included in Table 13.
Comparison of computational complexity of our method and other methods
In this work, we have introduced an efficacious feature representation strategy through concatenating two sets of features consisting of MHI based GLAC (denoted by
