Human action recognition using MHI and SHI based GLAC features and Collaborative Representation Classifier

Abstract

This paper introduces a method for recognizing human actions in depth action videos. We first generate the Motion History Images (MHIs) and Static History Images (SHIs) corresponding to an action video by utilizing the so-called 3D Motion Trail Model (3DMTM). We then extract Gradient Local Auto-Correlations (GLAC) features from the MHIs as well as the SHIs to characterize the action video. Next, we concatenate the set of MHI-based GLAC features with the set of SHI-based GLAC features to obtain a single action representation vector. The feature vectors computed for all action samples are then passed to the l2-regularized Collaborative Representation Classifier (l2-CRC) to recognize multiple human actions. Experimental evaluations on three action datasets, MSR-Action3D, DHA and UTD-MHAD, show that the proposed recognition system outperforms the state-of-the-art approaches. In addition, the computational efficiency test indicates the real-time compatibility of the system.

Keywords

Human action recognition; l2-regularized Collaborative Representation Classifier; motion history images; static history images; 3D Motion Trail Model

1 Introduction

Human Action Recognition (HAR) is one of the most challenging research domains of computer vision. It is widely used in video surveillance, health care monitoring, video analysis, assistive living, robotics, telemedicine, content-based video searching, video gaming, human-computer interaction [1–4] and a variety of schemes that involve interfaces between individuals and electrical devices. Human action recognition remains a difficult problem because of variations in human body sizes and motions. In addition, the complexity rises when the same action is performed differently by multiple subjects, or even by the same subject at different times.

Traditionally, research on human action recognition has focused on learning and classifying actions in videos recorded with conventional RGB sensors [5–8]. In practice, this kind of sensor has intrinsic defects for action recognition, such as sensitivity to illumination changes, occlusions, and background clutter [1]. Moreover, in RGB images the identification of an object’s significant points relies on the object’s texture rather than on its geometric structure [9]. Furthermore, human actions captured by conventional RGB sensors cannot encode the 3D action information directly.

However, with the availability of inexpensive depth sensors (e.g., Microsoft Kinect), the above difficulties have been alleviated significantly in the action classification task. Such a sensor delivers depth images/maps that preserve the distance between the surface of a scene object and the sensor’s viewpoint [10, 11]. Moreover, human localization and segmentation are simpler in depth images than in RGB images [12]. Besides, human skeleton information can be obtained from depth maps as a supplementary cue for action classification [13]. Overall, depth sensors exhibit several advantages over RGB sensors.

The above discussion indicates that a depth sensor based action recognition system can outperform RGB sensor based systems. Hence, this paper addresses depth sensor based human action recognition. We first consider the static and motion posture characteristics to represent a depth action sequence. More specifically, the 3D Motion Trail Model (3DMTM) is utilized to form Motion History Images (MHIs) and Static History Images (SHIs) from the front, side and top projection views of an action video. The GLAC [14] features are calculated on the MHIs and SHIs and fused in a complementary way to generate an action feature vector. The resulting feature vector is the input of the l2-regularized Collaborative Representation Classifier (l2-CRC), which recognizes the human action. The pipeline of our proposed recognition algorithm is shown in Fig. 1.

Fig.1

Framework of the proposed action recognition system.

The main contributions of this paper are summarized as follows:

We compute MHIs and SHIs corresponding to each depth sequence through the 3DMTM algorithm. The MHIs and SHIs are generated from the front, side and top projection views of an action video.

The GLAC features are extracted from the MHIs and SHIs individually. The generated GLAC features on MHIs and SHIs are concatenated to construct a single feature vector.

The generated feature vector is fed to l2-CRC to recognize action in the video.

The recognition system is extensively evaluated on three publicly available datasets, namely MSRAction3D [15], DHA [16], and UTD-MHAD [17]. We compare the recognition results with handcrafted feature based methods as well as deep learning methods. The overall experimental assessment indicates that the proposed approach outperforms both groups of methods.

The rest of the paper is organized as follows. Section 2 reviews the related work. In Section 3, we describe the proposed recognition method consisting of feature extraction strategies. The results obtained are discussed in Section 4. Finally, Section 5 concludes the paper.

2 Related work

Human action recognition from depth video sequences has attracted great interest from computer vision researchers since the emergence of depth video sensors. Research in this area has produced many feature extraction strategies to represent depth video sequences, e.g., 3D point clouds [15], projected depth images [18], spatio-temporal interest points [19], and skeleton joints [20]. In [15], a collection of 3-dimensional points extracted from depth maps was sampled to describe the 3-dimensional structure of body postures. A Gaussian mixture model (GMM) was utilized to capture the statistical distribution of the points robustly. The method could not exhibit promising results because of its high computational complexity and the loss of spatial information among significant points. In contrast, Vieira and others constructed silhouettes in 3-dimensional Euclidean space by employing Space-Time Occupancy Patterns (STOP) [21, 22]. In a cell-structured spatio-temporal depth volume, they labeled a cell with 1, 0 or a fraction when the cell was filled, unfilled or partially filled, respectively. The fully and partially filled cells are detected through an ad hoc parameter. A benefit of STOP is that it preserves spatial and temporal contextual information between space-time cells, which helps to control intra-class variation. In contrast to simple occupancy patterns, a Haar feature vector was utilized in [23] on an even grid in the 4D depth volume. The computational complexity was very high for both of those methods.

Again, a filtering technique for identifying space-time interest points (STIPs) in depth action sequences (called DSTIP) was presented in [19]; it emphasizes action-consistent interest points while removing noise from depth action videos. Building on motion energy images (MEI) [20] and motion history images (MHI), depth motion maps (DMMs) [18, 25] were constructed to represent each action video compactly. Besides, histogram of oriented gradients (HOG) features [24], local binary pattern (LBP) features [27] and other shape and texture features were extracted from those DMMs to characterize actions more accurately [26, 28–30]. 2D and 3D auto-correlation features were also captured from DMMs and fused to enhance the discriminatory power of the recognition system [31, 32]. Also, to enhance the DMM based recognition system, multi-temporal DMMs were computed and texture features were extracted from them in [33]. Furthermore, 3D histograms of texture (3DHoTs) were used to capture dominant features from a depth action sequence for human action recognition [34]. Because such features fail to capture the complex joint shape-motion cues at pixel level in depth images, the histogram of oriented 4D normals was used in [35]. In [36], a new framework was proposed that combines salient depth map (SDM) and binary shape map (BSM) feature vectors. As another approach, a locality-constrained linear coding (LLC) based action recognition algorithm was introduced in [37]. Additionally, an action recognition process using hierarchical 3D kernel descriptors was proposed in [38].

Skeleton joints can be extracted from depth frames, and some action recognition systems have been developed based on those joints. For instance, the pairwise differences of the 3D joint positions of a subject in a depth frame and the temporal differences corresponding to each depth frame were computed to represent human actions [39, 40]. Histograms of 3D joint locations (HOJ3D) were also used to represent actions [41]. Furthermore, to improve skeleton joint based recognition, a genetic-based evolutionary algorithm was applied to select the optimal subset of skeleton joints [42]. A non-parametric moving pose (MP) approach for low-latency action recognition was reported in [43]; it combines pose information with differential quantities (speed and acceleration) of the skeleton joints inside a small temporal window around the current frame. Again, a local skeleton descriptor that encodes the relative position of joint quadruples of the human skeleton was used in [44]. In [45], another joint representation and recognition model was described that combines multi-perspective and multi-modality projections for color and depth frame sequences. An actionlet ensemble model for action classification was presented in [46]; the recognition system based on this model was robust to noise. A skeletal representation based on 3-dimensional geometric relationships among different body segments was discussed in [47]. In fact, this work represented human actions as curves in a Lie group.

To bring diversity to joint based methods, 3D joint features were combined with color and depth sequence based features, as reported in [48, 49].

To further improve human action recognition overall, a fusion framework using two sensors of different modalities, a depth sensor (a Kinect sensor) and a wearable inertial sensor (accelerometer), was proposed in [1].

Beyond the handcrafted feature based methods, deep learning methods learn high-level semantic action representations directly from raw action data, as in other domains (e.g., speech [63, 64]). In [50], Wang et al. introduced a deep model, which exhibited superior performance in action classification in [34]. The DMM-Pyramid based deep architecture also obtained promising results in depth action classification [51].

From this survey of depth image oriented action recognition, we have been motivated to develop an action recognition system based on depth images, which preserve abundant discriminative information. In this paper, we mainly focus on dominant feature extraction and action representation. More precisely, this paper uses both motion and static postures to represent an action, whereas the previously reported methods consider the motion posture only. Indeed, with an inappropriate motion posture update function, the motion posture images can fail to capture most of the moving regions. Additionally, information about motionless pose history regions, monotonous activities and monotonous static poses is discarded in the motion posture frames. Thus, the motion and static posture images are needed together to address the inter-class similarity and intra-class variation issues. Sometimes, the motion posture images do not contain enough information, or they enlarge the intra-class variation among subjects performing the same action, while the similar static body postures of those subjects can help to reduce it. Similarly, the motion posture images can increase the inter-class similarity because of the way subjects move, whereas the motionless pose regions can decrease it. Overall, our proposed method addresses the inter-class similarity and intra-class variation problems in recognizing human actions in depth action sequences. Besides effectiveness, we also test the computational efficiency of the algorithm with a view to real-time operation.

3 Proposed recognition system

In this section, we present our approach by a comprehensive discussion on feature extraction, action representation and classification techniques.

3.1 Feature extraction

To encode the action features, the MHI and SHI are first derived from a depth action sequence and then the GLAC features are extracted from the resulting images. We discuss this in the following text. The well-known Motion History Image (MHI) gathers the description of body-segment movements (present in the depth frames) by condensing the depth sequence into a gray-level picture [20]. However, with an inappropriate update function, the MHI can fail to capture most of the moving regions. Additionally, information about motionless pose history regions, monotonous activities and monotonous static poses is discarded in the MHI pattern [53]. Thus, we consider the Static History Image (SHI) to capture these components, which are complementary to the MHI. To generate the MHI and SHI images for each action video, we work with the 3-dimensional Motion Trail Model (3DMTM) [53]. The model provides a set of motion history images {MHI_XOY, MHI_YOZ, MHI_XOZ} and a set of static posture history images {SHI_XOY, SHI_YOZ, SHI_XOZ} corresponding to three 2D Euclidean planes. The motion update function φ_M(x, y, t) and the static posture update function φ_S(x, y, t) specify the regions where a subject is moving and motionless, respectively, while performing an activity. These two functions are evaluated for each depth frame in the depth action video: $\begin{matrix} φ_{M}(x, y, t) = \begin{cases} 1 & \text{if } P_{t} > ζ_{M} \\ 0 & \text{otherwise} \end{cases} \\ φ_{S}(x, y, t) = \begin{cases} 1 & \text{if } d_{t} - P_{t} > ζ_{S} \\ 0 & \text{otherwise} \end{cases} \end{matrix}$ (1)

where (x, y) represents the pixel location and t is the time (frame) index. Furthermore, (d₁, d₂, d₃, …, d_T) is the depth map sequence with d_t its t-th frame, whereas (P₁, P₂, P₃, …, P_T) is the sequence of difference images, where P_t holds the absolute difference between a pair of depth maps. Moreover, the two update functions require two threshold values, ζ_M and ζ_S, which separate motion and static information between sequential depth maps. The depth motion history image F_M(x, y, t) is obtained by employing the motion update function φ_M(x, y, t): $F_{M}(x, y, t) = \begin{cases} T & \text{if } φ_{M}(x, y, t) = 1 \\ F_{M}(x, y, t - 1) - 1 & \text{otherwise} \end{cases}$ (2)

Furthermore, the static posture history image (SHI) F_S(x, y, t) is created by the static posture update function φ_S(x, y, t) to account for the motionless regions over the entire depth video. It is obtained in the same way as the MHI: $F_{S}(x, y, t) = \begin{cases} T & \text{if } φ_{S}(x, y, t) = 1 \\ F_{S}(x, y, t - 1) - 1 & \text{otherwise} \end{cases}$ (3)
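The update rules in Equations (1–3) can be sketched for a single projection view as follows. This is only an illustrative sketch and not the authors' implementation: the function name `motion_static_history` is hypothetical, P_t is assumed to be the absolute difference between consecutive depth maps, the history images are assumed to start at zero, and the decayed values are clipped at zero (the equations above do not state a lower bound).

```python
import numpy as np

def motion_static_history(depth_seq, zeta_m, zeta_s):
    """Return (MHI, SHI) for one projection view of a depth sequence.

    depth_seq: array of shape (T, H, W) holding the depth maps d_1..d_T.
    zeta_m, zeta_s: thresholds of the motion / static update functions.
    """
    d = np.asarray(depth_seq, dtype=np.float64)
    T = d.shape[0]
    f_m = np.zeros(d.shape[1:])  # F_M before the first update (assumed zero)
    f_s = np.zeros(d.shape[1:])  # F_S before the first update (assumed zero)
    for t in range(1, T):
        p_t = np.abs(d[t] - d[t - 1])      # difference image P_t (assumed consecutive frames)
        phi_m = p_t > zeta_m               # Eq. (1), motion update function
        phi_s = (d[t] - p_t) > zeta_s      # Eq. (1), static posture update function
        # Eqs. (2) and (3): reset to T where the update fires, otherwise decay by 1.
        # Clipping at zero is an assumption of this sketch.
        f_m = np.where(phi_m, T, np.maximum(f_m - 1, 0))
        f_s = np.where(phi_s, T, np.maximum(f_s - 1, 0))
    return f_m, f_s
```

In the full pipeline the same routine would be applied to each of the three projected sequences (XOY, YOZ, XOZ), giving three MHIs and three SHIs per video.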

It should be noted that the 3DMTM also provides the average motion history image (AMHI) and the average static posture history image (ASHI). In this work, we only consider the MHI and SHI, as the AMHI and ASHI reduce the recognition accuracy [54]. Next, we extract gradient and curvature features (which are properties of human contours) from the MHI and SHI images using the Gradient Local Auto-Correlation (GLAC) feature descriptor [14]. For an intuitive explanation, let I(x, y) be an MHI/SHI. Then the magnitude of the gradient vector and the corresponding orientation angle at each point of I(x, y) are given by $m = \sqrt{\left(\frac{\partial I}{\partial x}\right)^{2} + \left(\frac{\partial I}{\partial y}\right)^{2}}$ and $θ = \arctan\left(\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}\right)$ respectively. The orientation θ of each gradient is coded into D bins by assigning voting weights to the nearest bins, which yields a D-dimensional sparse vector g known as the gradient orientation vector (G-O vector for short). Using the G-O vector g and the gradient magnitude m, the Kth order auto-correlation function of gradients in a neighborhood can be expressed by:

$\begin{matrix} F(d_{0}, \dots, d_{K}, b_{1}, \dots, b_{K}) = \\ \int w[m(r), m(r + b_{1}), \dots, m(r + b_{K})] \, g_{d_{0}}(r) \, g_{d_{1}}(r + b_{1}) \dots g_{d_{K}}(r + b_{K}) \, dr \end{matrix}$ (4)

where b₁, b₂, …, b_K are translation vectors about the position vector r = (x, y), g_d indicates the d-th component of g and w represents a weighting function. In the experiments reported in Section 4, we use K ∈ {0, 1}, b_{1x}, b_{1y} ∈ {±Δr, 0} and w(·) = min(·).

For K ∈ {0, 1}, the GLAC features are written as follows:

$0^{\text{th}} \text{ order: } F_{0} = \sum_{r \in I} m(r) \, g_{d_{0}}(r)$ (5)

$1^{\text{st}} \text{ order: } F_{1} = \sum_{r \in I} \min[m(r), m(r + b_{1})] \, g_{d_{0}}(r) \, g_{d_{1}}(r + b_{1})$ (6)

Since there are 4 individual configurations of (r, r + b₁), as depicted in Fig. 2, the dimension of the above GLAC features (F₀ and F₁) is d = D + 4D². The d-dimensional feature vector obtained from MHI_XOY is denoted by GMHI_XOY. Although the dimension of the GLAC feature vector is high, generating it is not costly because g is sparse: Equations (5) and (6) are evaluated only over the small number of non-zero components of g. A deeper discussion of the GLAC descriptor is presented in [14].
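A minimal sketch of Equations (5) and (6) is given below to make the dimension d = D + 4D² concrete. Several details are assumptions of this sketch rather than statements from the paper: the function name `glac` is hypothetical, derivatives are taken with `np.gradient`, the orientation is quantized over the full range [0, 2π) with linear voting to the two nearest bins, and the four displacement patterns are chosen as (0, Δr), (Δr, Δr), (Δr, 0) and (Δr, −Δr) in (row, column) order because Fig. 2 is not reproduced here. Spatial binning is omitted.

```python
import numpy as np

def glac(image, D=8, delta_r=1):
    """0th- and 1st-order GLAC feature of a 2D image, length D + 4*D**2."""
    img = np.asarray(image, dtype=np.float64)
    gy, gx = np.gradient(img)                      # dI/dy (rows), dI/dx (columns)
    m = np.hypot(gx, gy)                           # gradient magnitude
    theta = np.mod(np.arctan2(gy, gx), 2 * np.pi)  # orientation in [0, 2*pi)

    # Sparse G-O vector g: each orientation votes into its two nearest bins.
    pos = theta / (2 * np.pi) * D
    lo = np.floor(pos).astype(int) % D
    frac = pos - np.floor(pos)
    H, W = img.shape
    g = np.zeros((H, W, D))
    rows, cols = np.indices((H, W))
    g[rows, cols, lo] += 1.0 - frac
    g[rows, cols, (lo + 1) % D] += frac

    f0 = np.einsum("hw,hwd->d", m, g)              # Eq. (5)

    # Four displacement patterns (dy, dx); assumed for this sketch.
    shifts = [(0, delta_r), (delta_r, delta_r), (delta_r, 0), (delta_r, -delta_r)]
    f1 = []
    for dy, dx in shifts:
        # Restrict to positions where both r and r + b_1 lie inside the image.
        y0, y1 = max(0, -dy), H - max(0, dy)
        x0, x1 = max(0, -dx), W - max(0, dx)
        w = np.minimum(m[y0:y1, x0:x1], m[y0 + dy:y1 + dy, x0 + dx:x1 + dx])
        a = g[y0:y1, x0:x1]
        b = g[y0 + dy:y1 + dy, x0 + dx:x1 + dx]
        f1.append(np.einsum("hw,hwi,hwj->ij", w, a, b).ravel())  # Eq. (6)
    return np.concatenate([f0] + f1)
```

With D = 8 the returned vector has 8 + 4 · 64 = 264 entries per image. Because most entries of g are zero, a production implementation would typically loop only over the non-zero bins instead of forming dense D × D products.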

Fig.2

Configuration patterns for K ∈ {0, 1}.

3.2 Action Representation

From the above, three feature vectors GMHI_XOY, GMHI_YOZ and GMHI_XOZ are obtained by applying GLAC to the set of MHIs. These feature vectors are concatenated end to end to form a vector GMHI = [GMHI_XOY; GMHI_YOZ; GMHI_XOZ] that characterizes a human action. Similarly, another feature vector GSHI = [GSHI_XOY; GSHI_YOZ; GSHI_XOZ] is constructed from the set of SHIs to represent the action from a complementary perspective. Finally, these two complementary feature vectors are combined to represent the action and to enhance the discriminatory power of the recognition method.

3.3 Action classification

The l2-regularized Collaborative Representation Classifier (l2-CRC) has been used successfully to recognize human actions in [25, 28]. Therefore, after representing an action by combining the GMHI and GSHI feature vectors, we adopt an l2-CRC classifier to classify the action representation vector. For an intuitive description of the l2-CRC, let us consider an action dataset with C classes. Arranging the training samples column-wise gives a dictionary P = [P₁, P₂, …, P_C] = [p₁, p₂, …, p_n] ∈ R^{D×N}, where D denotes the number of components in a sample and N = n is the total number of training samples. Also, P_j ∈ R^{D×M_j} (j = 1, 2, …, C) collects the M_j training samples belonging to the jth class, and p_i ∈ R^D (i = 1, 2, …, n) is an individual training sample. Now, any query action sample S ∈ R^D can be expressed using the matrix P as follows: $S = P β$ (7)

Here β is an N × 1 vector of coefficients corresponding to the training samples. Solving Equation (7) is not trivial since the system is usually underdetermined [55]. Generally, the solution is obtained as follows: $\hat{β} = \underset{β}{\arg\min} \left\{ \| S - P β \|_{2}^{2} + μ \| A β \|_{2}^{2} \right\}$ (8)

where A is the Tikhonov regularization matrix [59] and μ is the regularization parameter. The term involving A injects prior information about the solution, following the method described in [56–58]: training samples that are very different from the query sample are assigned less weight than training samples that are very similar to it. Finally, the diagonal matrix A ∈ R^{N×N} is defined as follows: $A = \begin{bmatrix} \| S - p_{1} \|_{2} & 0 & \dots & 0 \\ 0 & \| S - p_{2} \|_{2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \| S - p_{n} \|_{2} \end{bmatrix}$ (9)

Algorithm 1: Proposed algorithm to address the human action recognition issue

1 Input: A depth sequence, from which the MHIs (MHI_XOY, MHI_YOZ, MHI_XOZ) and SHIs (SHI_XOY, SHI_YOZ, SHI_XOZ) are calculated based on Equations (1–3).

2 The GLAC features are extracted from all the MHIs and SHIs via Equations (5 and 6) to form GMHI = [GMHI_XOY; GMHI_YOZ; GMHI_XOZ] and GSHI = [GSHI_XOY; GSHI_YOZ; GSHI_XOZ].

3 The GMHI and the GSHI are combined to obtain a single vector.

4 Pass the training feature set $P = \{p_{i}\}_{i = 1}^{n}$, the class labels c_i for class partition, the test sample S, μ, and C (number of action classes).

5 Calculate $\hat{β}$ using Equation (10).

6 for all j ∈ {1, 2, …, C} do

partition P_j, $\hat{β}_{j}$

calculate $q_{j} = \| S - P_{j} \hat{β}_{j} \|_{2}$

end for

Decide class(S) through Equation (11).

7 Output: class(S)

In accordance with [60], the vector $\hat{β}$ is evaluated by $\hat{β} = (P^{T} P + μ A^{T} A)^{- 1} P^{T} S$ (10)

Then, by utilizing the class labels of all the training samples, $\hat{β}$ can be split into C subsets $\hat{β} = [\hat{β}_{1}; \hat{β}_{2}; \hat{β}_{3}; \dots; \hat{β}_{C}]$, where $\hat{β}_{j}$ (j = 1, 2, …, C) holds the coefficients of the jth class. After splitting $\hat{β}$, the class label of the new sample S is evaluated as $class(S) = \underset{j \in \{1, 2, \dots, C\}}{\arg\min} \{ q_{j} \}$ (11)

where q_j is defined by $q_{j} = \| S - P_{j} \hat{β}_{j} \|_{2}$. (12)

Algorithm 1 describes our recognition system concisely.
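The classification steps of Equations (9–12) can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `l2_crc_classify` is hypothetical, A is taken to be the N × N diagonal matrix of distances between the query and each training sample, and the linear system is solved with `np.linalg.solve` instead of forming the explicit inverse in Equation (10), which gives the same result but is numerically safer.

```python
import numpy as np

def l2_crc_classify(P, labels, S, mu=1e-4):
    """Classify query S (length D) given dictionary P (D x N) and N class labels."""
    P = np.asarray(P, dtype=np.float64)
    S = np.asarray(S, dtype=np.float64)
    labels = np.asarray(labels)
    # Eq. (9): diagonal of A holds the distance from S to each training sample.
    a = np.linalg.norm(P - S[:, None], axis=0)
    # Eq. (10): (P^T P + mu * A^T A) beta = P^T S, solved without an explicit inverse.
    beta = np.linalg.solve(P.T @ P + mu * np.diag(a ** 2), P.T @ S)
    # Eq. (12): class-wise reconstruction residuals q_j.
    classes = np.unique(labels)
    residuals = [
        np.linalg.norm(S - P[:, labels == c] @ beta[labels == c]) for c in classes
    ]
    # Eq. (11): the class with the smallest residual wins.
    return classes[int(np.argmin(residuals))], beta
```

For example, with a dictionary whose first two columns lie near one axis (class 0) and last two columns near another (class 1), a query close to the first axis is reconstructed almost entirely from the class 0 columns and is therefore assigned to class 0.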

4 Experiment

The recognition approach is extensively assessed, including comparison with state-of-the-art approaches, on the three benchmark datasets, i.e., MSRAction3D [15], DHA [16], and UTD-MHAD [17].

4.1 Results on the MSRAction3D Dataset

The MSRAction3D dataset [15] has 20 actions, and each action was performed 2 or 3 times by 10 distinct persons facing the depth sensor. The twenty human action classes are: “high wave (1)”, “horizontal wave (2)”, “hammer (3)”, “hand catch (4)”, “forward punch (5)”, “high throw (6)”, “draw x (7)”, “draw tick (8)”, “draw circle (9)”, “hand clap (10)”, “two hand wave (11)”, “side boxing (12)”, “bend (13)”, “forward kick (14)”, “side kick (15)”, “jogging (16)”, “tennis swing (17)”, “tennis serve (18)”, “golf swing (19)”, “pick up and throw (20)”. All of these action types are considered in the context of gaming and include a variety of movements of the arms, legs, torso, etc. The dataset exhibits inter-class similarity among several action categories; for example, draw x and draw tick look alike apart from a small difference in the motion of one hand. To carry out an extensive assessment of the introduced approach, we adopt the same experimental protocols as [12, 22]. Specifically, two different experimental settings (described in detail in the following sections) are established to evaluate our method.

First Experimental Setup and Results:

In the first experimental setup, the 20 actions are split into three different human action sets, which are listed in Table 1. Three different test cases, namely test one, test two and the cross subject test, are conducted for every action subset [15, 18]. In all the experiments, we set the optimal parameters D = 8 and Δr = 1 for the MHI and SHI based GLAC descriptors by applying 5-fold cross validation to the training samples. The number of spatial bins is also tuned in the same manner. The l2-CRC parameter μ is set to 0.0001 through 5-fold cross validation over the range 0.00001∼10 on the training samples. To reduce the computational cost of the algorithm, Principal Component Analysis (PCA) is employed to reduce the dimension of the obtained action vector (in this setting, the dimension of the feature vector is 3168). The PCA transform matrix is learned from the training feature set and then applied to the test feature set. In all the experiments, the principal components that account for 99% of the total variance are retained. The feature dimension utilized in each test case is reported in the corresponding accuracy table.
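The PCA step described above (fit on the training features only, keep the components covering 99% of the variance, then apply the same transform to the test features) can be sketched as below. The function names `fit_pca` and `apply_pca` are hypothetical, and the SVD-based formulation is one common way to compute PCA; the paper does not specify which implementation was used.

```python
import numpy as np

def fit_pca(train, variance_kept=0.99):
    """Fit PCA on training rows; keep the fewest components reaching the variance share."""
    train = np.asarray(train, dtype=np.float64)
    mean = train.mean(axis=0)
    # Rows of vt are principal directions; squared singular values are
    # proportional to the variance explained by each direction.
    _, s, vt = np.linalg.svd(train - mean, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(ratio, variance_kept)) + 1, len(s))
    return mean, vt[:k].T

def apply_pca(features, mean, components):
    """Project rows of features with a transform learned on the training set."""
    return (np.asarray(features, dtype=np.float64) - mean) @ components
```

A typical use would be `mean, comps = fit_pca(train_features)` followed by `apply_pca(test_features, mean, comps)`, so that no statistics of the test set leak into the transform.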

Table 1
Three subsets of the MSR-Action 3D dataset

Label	AS1	Label	AS2	Label	AS3
2	Horizontal wave	1	High wave	6	High throw
3	Hammer	4	Hand catch	14	Forward kick
5	Forward punch	7	Draw x	15	Side kick
6	High throw	8	Draw tick	16	Jogging
10	Hand clap	9	Draw circle	17	Tennis swing
13	Bend	11	Two hand wave	18	Tennis serve
18	Tennis serve	14	Forward kick	19	Golf swing
20	Pickup throw	12	Side boxing	20	Pickup throw

In test one, 1/3 of the samples of every subset are used for model learning, and the rest are used for evaluation. The individual and average recognition results for the AS1, AS2 and AS3 action sets are reported in Table 2. The proposed recognition method (i.e., fusion of GMHI and GSHI features) achieves an average recognition accuracy of 99.34%.

Table 2

Recognition results on the three action sets in test one

		GMHI	GSHI	GMHI + GSHI	Feature dimension
Test One	AS1	98.7	90.7	100	70
	AS2	96.7	94.7	98.69	37
	AS3	99.33	94.00	99.33	66
	Average	98.24	93.13	99.34	–

The recognition performance obtained with the GMHI features alone and with the GSHI features alone is also shown in the table. Their classification rates are lower than those obtained with the proposed fusion method; only on the AS3 action set do the GMHI features match the accuracy of the fusion method (99.33%). As Fig. 3 shows, the GMHI features give promising recognition performance for several specific actions for which the GSHI features do not, and vice versa. For example, in Fig. 3, the GMHI features achieve higher recognition accuracy than the GSHI features for the high wave action in AS2. On the other hand, the hand catch action in AS2 is recognized more accurately with the GSHI features than with the GMHI features. Thus, the two descriptors are complementary, and merging the GMHI and GSHI vectors improves the overall recognition results compared with using either the GMHI features or the GSHI features alone.

In Test Two, the training stage uses 2/3 of the action samples of each subset and the evaluation uses the remaining samples. The recognition accuracies of the GMHI and GSHI methods and of our approach are reported in Table 3. The fusion approach matches or outperforms the GMHI and GSHI methods on all action sets. It achieves 100% recognition accuracy on all three action sets, which neither GMHI nor GSHI achieves alone; the GMHI features attain 100% classification accuracy only on the AS3 action set. The class-specific classification accuracies are shown in Fig. 4 to further examine the advantage of the introduced fusion system.

Table 3

Recognition results on the three action sets in Test Two

		GMHI	GSHI	GMHI + GSHI	Feature Dimension
Test Two	AS1	98.6	95.9	100	151
	AS2	98.7	97.3	100	137
	AS3	100	97.3	100	149
	Average	99.1	96.83	100

Fig.3

Class specific accuracies in test one.

Fig.4

Class specific accuracies in test two.

In Test Three/Cross Subject Test, action samples performed by one half of the performers (1, 3, 5, 7, 9) are used for model learning and the samples from the remaining performers are used for system evaluation. Table 4 shows that the proposed strategy achieves higher recognition accuracies than the GMHI and GSHI methods on all action sets. As an example, on the AS2 action set the introduced algorithm achieves over 7% higher recognition accuracy than the GMHI method and over 14% higher recognition accuracy than the GSHI method. The class-specific accuracies for all action sets with all employed methods are included in Fig. 5.

Table 4

Recognition results on the three action sets in Cross Subject Test

		GMHI	GSHI	GMHI + GSHI	Feature Dimension
Cross Subject Test	AS1	96.23	87.74	98.11	40
	AS2	86.73	80.53	94.69	33
	AS3	93.75	87.5	99.11	65
	Average	92.23	85.25	97.30

Fig.5

Class specific accuracies in cross subject test.

The performance of the introduced system is also assessed through an extensive comparison with other methods that were evaluated on the MSR-Action3D dataset under the same experimental settings. The comparison of the average recognition accuracy (%) for the three test cases is shown in Table 5. The highest recognition result is highlighted in bold face. The proposed approach matches or exceeds all the approaches listed in the table, attaining accuracies of 99.34%, 100% and 97.3% in test one, test two and the cross subject test respectively. In particular, for the most challenging cross subject test, the proposed approach gives a 2.4% improvement over the second highest accuracy in the table (94.9% for DMM-LBP-FF [26]). In addition, the recognition system shows superiority over the deep learning systems reported in [50]. The recognition results obtained with the GMHI and the GSHI features are also reported here. It is worth mentioning that the integration of two complementary features sometimes aggravates the inter-class similarity and intra-class variation issues because of human body postures. For example, in Fig. 3, the fused features cannot increase the accuracy for the hand catch action due to this issue. However, this case is atypical in feature integration. Thus, we can be convinced that feature fusion works better.

Table 5

Average accuracy comparison on the MSR Action 3D dataset on the first setting

Methods	Test One (%)	Test Two (%)	Cross Subject Test (%)
Bag of 3D Points [15]	91.6	94.2	74.7
DMM-HOG [18]	95.8	97.4	91.6
DMM [25]	97.4	99.1	90.5
DMM-LBP-FF [26]	98.7	100	94.9
DMM-LBP-DF [26]	98.2	100	94.7
STOP [22]	96.8	98.3	87.5
HOJ3D [41]	96.2	97.2	79.0
Skeletons Lie Group [47]	–	–	92.5
Evolutionary Joint Selection [42]	–	–	93.2
MS [50]	93.6	94.3	86.3
SMF [50]	96.7	98.7	89.1
BDL [50]	94.1	95.6	87.6
SMF-BDL [50]	97.3	99.1	90.8
Our Method (GMHI + GSHI)	99.34	100	97.3

Second Experimental Setup and Results

The experimental setting reported in [12, 22] is utilized here. All twenty action classes are used without forming action subsets; the samples from one half of the actors (1, 3, 5, 7, 9) are used for model building and the samples from the remaining actors are used for model assessment. The optimal parameters D = 8, Δr = 1, b_S = 1 × 3 and μ = 0.0001 are determined in the same manner. The dimension of the feature vector is reduced from 4752 to 85 by employing the same technique as discussed in the first experimental setup. The comparison results of the second experimental setup are given in Table 6. Notice that we compare our system with the deep structured learning system described in [51]. The recognition accuracies of the compared methods are taken from their respective papers. The results in Table 6 indicate that our system is superior even to the deep learning approach.
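The cross-subject protocol used in this setting amounts to partitioning samples by performer ID. A minimal sketch is shown below; the function name `cross_subject_split` and the assumption that a subject ID is available for every sample are illustrative and not taken from the paper.

```python
def cross_subject_split(subject_ids, train_subjects=(1, 3, 5, 7, 9)):
    """Return (train_idx, test_idx) index lists from per-sample subject IDs."""
    train_set = set(train_subjects)
    # Samples of the listed subjects go to training, all others to testing.
    train_idx = [i for i, s in enumerate(subject_ids) if s in train_set]
    test_idx = [i for i, s in enumerate(subject_ids) if s not in train_set]
    return train_idx, test_idx
```

Because no performer appears in both partitions, the test accuracy reflects generalization to unseen subjects, which is why this setting is harder than splits that mix samples of the same performer.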

Table 6

Recognition accuracy comparison on the MSR Action 3D dataset on the second setting

Method	Accuracy (%)
Random Occupancy Pattern [23]	86.5
DMM-HOG [18]	88.7
HON4D [35]	88.9
DSTIP [19]	89.3
Moving Pose [43]	91.7
Actionlet Ensemble [46]	88.2
Skeletons Lie group [47]	89.5
Skeletal Quads [44]	89.9
Super Normal Vector [12]	93.1
2D-CNN [51]	91.2
3D-CNN [51]	86.1
HOG3D+LLC [37]	90.9
Hierarchical 3D Kernel [38]	92.7
DMM-LBP-FF [26]	91.9
DMM-LBP-DF [26]	93.0
Our Method (GMHI + GSHI)	94.5

4.2 Results on the DHA dataset

The DHA dataset was first introduced by Lin et al. [16]. It extends the action categories of the Weizmann dataset [61], which is widely used for human action recognition from RGB action sequences. The DHA dataset consists of twenty-three (23) action classes: actions 1 to 10 follow the descriptions in the Weizmann dataset [62], actions 11 to 16 are added classes, and the remaining actions (17 to 23) belong to the group of game actions. The 23 action classes are: “arm-curl (1)”, “arm-swing (2)”, “bend (3)”, “front-box (4)”, “front-clap (5)”, “golf-swing (6)”, “jack (7)”, “jump (8)”, “kick (9)”, “leg-curl (10)”, “leg-kick (11)”, “one-hand-wave (12)”, “pitch (13)”, “pjump (14)”, “rod-swing (15)”, “run (16)”, “skip (17)”, “side (18)”, “side-box (19)”, “side-clap (20)”, “tai-chi (21)”, “two-hand-wave (22)”, “walk (23)”. The dataset contains 483 depth action sequences in total, with each action performed by 21 subjects (12 males and 9 females). Some action classes in the dataset are similar to one another. For example, the golf-swing and rod-swing actions share motion segments in which the hands swing from one side up to the other. Other similar pairs include leg-curl and leg-kick, and run and walk. All 23 actions are used; samples from the odd-numbered actors (1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21), about one half of all actors, are used for model learning, and the samples from the remaining actors are used for evaluation. The parameters D = 8, Δr = 1, b_S = 1 × 2 and μ = 0.0001 are determined to be optimal. The feature dimension is reduced from 3168 to 252 by the PCA algorithm. The recognition accuracy comparison is given in Table 7. Our method achieves 99.1% recognition accuracy, which is a remarkable outcome on this dataset.
The recognition system achieves 100% recognition accuracy on 21 of the 23 actions. It achieves 91% on the run and side-clap actions, each of which is confused with another action in 9% of cases: run is confused with rod-swing, and side-clap is confused with side. Nevertheless, the proposed system outperforms all existing methods reported in Table 7.

Table 7
Recognition accuracy comparison on the DHA dataset

Method	Accuracy (%)
D-STV/AS [16]	86.8
SDM-BSM [36]	89.5
D-DMHI-PHOG [45]	92.4
DMPP-PHOG [45]	95.0
DMM-LBP-DF [26]	91.3
DMMs-FV [33]	95.4
3DHoT-MBC [34]	96.7
Our Method (GMHI + GSHI)	99.1

4.3 Results on UTD-MHAD dataset

The UTD-MHAD action dataset has 27 different actions performed by 8 subjects (4 females and 4 males). Each subject repeats each action 4 times. The dataset contains 861 depth action sequences after removing 3 corrupted sequences. The twenty-seven (27) action categories are: “right arm swiping to the left (1)”, “right arm swiping to the right (2)”, “right hand wave (3)”, “two hand front clap (4)”, “right arm throw (5)”, “cross arms in the chest (6)”, “basketball shoot (7)”, “right hand draw x (8)”, “right hand draw circle (clockwise) (9)”, “right hand draw circle (counter clockwise) (10)”, “draw triangle (11)”, “bowling (right hand) (12)”, “front boxing (13)”, “baseball swing from right (14)”, “tennis right hand forehand swing (15)”, “arm curl (two arms) (16)”, “tennis serve (17)”, “two hand push (18)”, “right hand knock on door (19)”, “right hand catch an object (20)”, “right hand pick up and throw (21)”, “jogging in place (22)”, “walking in place (23)”, “sit to stand (24)”, “stand to sit (25)”, “forward lunge (left foot forward) (26)”, “squat (two arms stretch out) (27)”. All action sequences are recorded with a Kinect depth device and a wearable inertial device. For actions 1 to 21, the inertial device is worn on the subject's right wrist, whereas for actions 22 to 27 it is worn on the subject's right thigh. Note that the dataset covers a wide range of human action categories, for instance sport actions (e.g., bowling), hand gestures (e.g., draw x), daily activities (e.g., knock on door) and training exercises (e.g., arm curl). All 27 action classes are used in the experimental evaluation. Samples captured from 50% of the actors (1, 3, 5, 7) are used for model learning, and the samples from the remaining actors are used for system evaluation. The parameters D = 8, Δr = 1, b_S = 3 × 5 and μ = 0.0001 are determined to be optimal.
The dimension of the action feature vector is reduced from 23760 to 94 by the PCA algorithm. The comparison between our system and other existing systems is shown in Table 8.

Table 8
Recognition accuracy comparison on the UTD-MHAD dataset

Method	Accuracy (%)
Adaboost.M2 [52]	83.0
DMM-HOG [18]	81.5
Kinect [17]	66.1
Inertial [17]	67.2
Kinect &Inertial [17]	79.1
3DHoT-MBC [34]	84.4
Our Method (GMHI + GSHI)	89.5

It is worth mentioning that, according to Table 8, our algorithm achieves 5.1 percentage points higher recognition accuracy than the best existing algorithm, 3DHoT-MBC [34], which attains 84.4%. To examine the results in more detail, the confusion matrix is depicted in Fig. 8. The matrix shows the class-specific accuracy and the misclassifications for each action class.

4.4 Statistical inference

Besides the overall accuracy, we evaluate the performance using the recall, specificity, precision and F1-score measures on the MSR-Action3D, DHA [16] and UTD-MHAD [17] datasets to further examine our method. The measures are computed over all actions of each dataset rather than over a subset of actions; that is, they are calculated from the confusion matrices of the MSR-Action3D, DHA and UTD-MHAD datasets (Figs. 6–8, respectively). The results for the three datasets are reported in Tables 9–11, respectively. Each table lists the class-specific value of every measure as well as its average over the classes. The statistical analysis indicates that our approach identifies nearly all negative samples correctly (the average specificity exceeds 0.99 on all three datasets), whereas positive samples are sometimes misrecognized. Consequently, the overall accuracy falls short of 100% on these datasets.
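The four measures can be derived from a confusion matrix with the standard one-vs-rest definitions, as in the minimal sketch below. The function names, the row/column orientation (rows are ground truth, columns are predictions) and the toy counts are assumptions made for illustration; they are not taken from the paper or from Figs. 6–8.

```python
def per_class_measures(cm):
    """Return recall, specificity, precision and F1-score for every class.

    cm[i][j] is the number of samples of true class i predicted as class j
    (rows = ground truth, columns = prediction; this orientation is assumed).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    measures = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # class k samples that were missed
        fp = sum(cm[i][k] for i in range(n)) - tp  # other classes labelled as k
        tn = total - tp - fn - fp
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        measures.append({"recall": recall, "specificity": specificity,
                         "precision": precision, "f1": f1})
    return measures


def macro_average(measures, key):
    """Unweighted mean of one measure over all classes."""
    return sum(m[key] for m in measures) / len(measures)


# Toy 3-class confusion matrix (made-up counts, not taken from Figs. 6-8).
toy_cm = [[5, 7, 0],
          [0, 12, 0],
          [0, 0, 12]]
for k, m in enumerate(per_class_measures(toy_cm)):
    print(k, {name: round(value, 4) for name, value in m.items()})
```

For class 0 of the toy matrix, the recall is 5/12 ≈ 0.4167 with a precision of 1.0, which gives F1 = 10/17 ≈ 0.5882. The "Average" rows of the tables are consistent with the unweighted mean computed by `macro_average`.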

Fig. 6

Confusion matrix on the MSR-Action3D dataset under setting 2.

Fig. 7

Confusion matrix on the DHA dataset.

Fig. 8

Confusion matrix on the UTD-MHAD dataset.

Table 9

Statistical measures of the proposed method on MSRAction3D dataset

Action Name	Recall	Specificity	Precision	F1-score
High Wave	0.4167	1.0000	1.0000	0.5882
Horizontal Wave	1.0000	1.0000	1.0000	1.0000
Hammer	1.0000	0.9880	0.8000	0.8889
Hand catch	0.9167	0.9880	0.7857	0.8462
Forward punch	1.0000	1.0000	1.0000	1.0000
High throw	1.0000	1.0000	1.0000	1.0000
Draw x	0.7692	0.9840	0.7143	0.7407
Draw tick	0.9333	0.9877	0.8235	0.8750
Draw circle	0.8000	1.0000	1.0000	0.8889
Hand clap	1.0000	1.0000	1.0000	1.0000
Two hand wave	1.0000	1.0000	1.0000	1.0000
Side boxing	1.0000	0.9959	0.9375	0.9677
Bend	1.0000	1.0000	1.0000	1.0000
Forward kick	1.0000	1.0000	1.0000	1.0000
Side kick	1.0000	1.0000	1.0000	1.0000
Jogging	1.0000	0.9959	0.9375	0.9677
Tennis swing	1.0000	1.0000	1.0000	1.0000
Tennis serve	1.0000	1.0000	1.0000	1.0000
Golf swing	1.0000	1.0000	1.0000	1.0000
Pickup throw	1.0000	1.0000	1.0000	1.0000
Average	0.9418	0.9970	0.9499	0.9382

Table 10

Statistical measures of the proposed method on DHA dataset

Action Name	Recall	Specificity	Precision	F1-Score
Arm-curl	1.0000	1.0000	1.0000	1.0000
Arm-swing	1.0000	1.0000	1.0000	1.0000
Bend	1.0000	1.0000	1.0000	1.0000
Front-box	1.0000	1.0000	1.0000	1.0000
Front-clap	1.0000	1.0000	1.0000	1.0000
Golf-swing	1.0000	1.0000	1.0000	1.0000
Jack	1.0000	1.0000	1.0000	1.0000
Jump	1.0000	1.0000	1.0000	1.0000
Kick	1.0000	1.0000	1.0000	1.0000
Leg-curl	1.0000	1.0000	1.0000	1.0000
Leg-kick	1.0000	1.0000	1.0000	1.0000
One-hand wave	1.0000	1.0000	1.0000	1.0000
Pitch	1.0000	1.0000	1.0000	1.0000
Pjump	1.0000	1.0000	1.0000	1.0000
Rod-swing	0.9000	1.0000	1.0000	0.9474
Run	1.0000	0.9952	0.9091	0.9524
Skip	0.9000	1.0000	1.0000	0.9474
Side	1.0000	1.0000	1.0000	1.0000
Side-box	1.0000	0.9952	0.9091	0.9524
Side-clap	1.0000	1.0000	1.0000	1.0000
Tai-chi	1.0000	1.0000	1.0000	1.0000
Two-hand-wave	1.0000	1.0000	1.0000	1.0000
Walk	1.0000	1.0000	1.0000	1.0000
Average	0.9913	0.9996	0.9921	0.9913

Table 11

Statistical measures of the proposed method on UTD-MHAD dataset

Action Name	Recall	Specificity	Precision	F1-Score
Right arm swiping to the left	0.7500	0.9900	0.7500	0.7500
Right arm swiping to the right	0.9375	0.9875	0.7500	0.8333
Right hand wave	0.6250	1.0000	1.0000	0.7692
Two hand front clap	1.0000	1.0000	1.0000	1.0000
Right arm throw	1.0000	1.0000	1.0000	1.0000
Cross arms in the chest	1.0000	1.0000	1.0000	1.0000
Basketball shoot	0.7500	1.0000	1.0000	0.8571
Right hand draw x	1.0000	0.9899	0.8000	0.8889
Right hand draw circle (clockwise)	0.7500	1.0000	1.0000	0.8571
Right hand draw circle (counter clockwise)	0.7500	0.9900	0.7500	0.7500
Draw triangle	1.0000	0.9899	0.8000	0.8889
Bowling (right hand)	1.0000	1.0000	1.0000	1.0000
Front boxing	1.0000	1.0000	1.0000	1.0000
Baseball swing from right	1.0000	0.9950	0.8889	0.9412
Tennis right hand forehand swing	0.7500	1.0000	1.0000	0.8571
Arm curl (two arms)	0.9375	0.9900	0.7895	0.8571
Tennis serve	1.0000	0.9975	0.9412	0.9697
Two hand push	0.6875	1.0000	1.0000	0.8148
Right hand knock on door	0.8125	0.9875	0.7222	0.7647
Right hand catch an object	1.0000	0.9925	0.8421	0.9143
Right hand pick up and throw	0.6875	1.0000	1.0000	0.8148
Jogging in place	1.0000	0.9925	0.8421	0.9143
Walking in place	0.8000	1.0000	1.0000	0.8889
Sit to stand	0.9375	0.9950	0.8824	0.9091
Stand to sit	1.0000	0.9950	0.8889	0.9412
Forward lunge (left foot forward)	1.0000	0.9950	0.8889	0.9412
Squat (two arms stretch out)	1.0000	1.0000	1.0000	1.0000
Average	0.8954	0.9958	0.9087	0.8934

4.5 Computational efficiency

The computational efficiency of the algorithm is assessed through the running time and the computational complexity of its major components. Because the running time varies from machine to machine, the computational complexity is also taken into account to characterize the efficiency of the algorithm.

Running time:

The proposed approach is tested on a CPU platform with an Intel i5-7500 quad-core CPU @ 3.41 GHz and 16 GB of RAM. The processing time of the proposed approach depends on five major components: 3DMTM-based MHI/SHI generation, GMHI feature extraction, GSHI feature extraction, PCA-based dimensionality reduction and l₂-CRC classification. The average running time (in milliseconds) of the five components per action sample is given in Table 12. The times are measured on the MSR-Action3D dataset [15], where each action sample contains 40 depth frames on average. Note that the total running time for 40 frames is less than one second, namely 630.5 ± 42.53 milliseconds. Consequently, our recognition method can process more than 40 frames per second and is therefore suitable for real-time operation.

Table 12
Running time (mean±std) of the major components of the algorithm

Major Components	Running Time (ms)
3DMTM based MHI/SHI generation	606.2±40.2/action sample (40 frames)
GMHI feature extraction	12.2±1.4/action sample (40 frames)
GSHI feature extraction	10.7±0.8/action sample (40 frames)
PCA based dimensionality reduction	0.3±0.07/action sample (40 frames)
l₂-CRC Classification	1.1±0.06/action sample (40 frames)
Total Running Time	630.5±42.53/ 40 frames
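The throughput implied by Table 12 can be checked with a few lines of arithmetic. The sketch below copies the mean running times from the table; the variable names are illustrative only.

```python
# Mean running times (ms) per 40-frame action sample, copied from Table 12.
component_ms = {
    "3DMTM based MHI/SHI generation": 606.2,
    "GMHI feature extraction": 12.2,
    "GSHI feature extraction": 10.7,
    "PCA based dimensionality reduction": 0.3,
    "l2-CRC classification": 1.1,
}
FRAMES_PER_SAMPLE = 40

total_ms = sum(component_ms.values())
frames_per_second = FRAMES_PER_SAMPLE / (total_ms / 1000.0)
share_mhi_shi = component_ms["3DMTM based MHI/SHI generation"] / total_ms

print(f"total: {total_ms:.1f} ms")                          # total: 630.5 ms
print(f"throughput: {frames_per_second:.1f} frames/s")      # throughput: 63.4 frames/s
print(f"MHI/SHI share of total time: {share_mhi_shi:.1%}")  # MHI/SHI share of total time: 96.1%
```

The mean times sum to the reported total of 630.5 ms, which corresponds to roughly 63 frames per second, and MHI/SHI generation accounts for about 96% of the total time.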

Computational Complexity:

The computational complexity of our method is the same as that of the method reported in [25], and it is lower than that of the other methods listed in Table 13. Although the two methods have the same complexity, our approach outperforms the method of [25] by 6.8 percentage points in recognition accuracy under the same experimental setup (setting 1, cross-subject test) on the MSR-Action3D dataset. The proposed system is thus computationally more efficient than the other methods included in Table 13.

Table 13

Comparison of computational complexity of our method and other methods

Method	Components	Complexity	Total Complexity
Bag of 3D Points [15]	Bi-gram Maximum Likelihood Decoding (BMLD)	O(J × K_h D²); J = iteration number, K_h = instance number, D = state dimension	O(J × K_h D²)
DMM-HOG [18]	Support Vector Machine (SVM)	O(r³); r = training instance number	O(r³)
STOP [21]	Principal Component Analysis (PCA), Action Graph Classifier	O(m³ + m² r), O(n_c + r²); m = feature vector dimension, r = training instance number, n_c = class number	O(m³ + m² r) + O(n_c × r²)
DMM [25]	Principal Component Analysis (PCA), l₂-regularized Collaborative Representation Classifier (l₂-CRC)	O(m³ + m² r), O(n_c × r); m = feature vector dimension, r = training instance number, n_c = class number	O(m³ + m² r) + O(n_c × r)
EigenJoints [39]	Principal Component Analysis (PCA), Naïve-Bayes-Nearest-Neighbor (NBNN)	O(m³ + m² r), O(r × n_c × n_d × log(n_c × n_d)); m = feature vector dimension, r = training instance number, n_c = class number, n_d = descriptor number	O(m³ + m² r) + O(r × n_c × n_d × log(n_c × n_d))
HOJ3D [41]	Fisher’s Linear Discriminant Analysis (LDA), Hidden Markov Models (HMM)	O(K_h MP + P³), O(N_h H²); P = min(K_h, M), N_h = states number, H = size of instance sequence, M = feature number	O(K_h MP + P³) + O(N_h H²)
Our method	Principal Component Analysis (PCA), l₂-regularized Collaborative Representation Classifier (l₂-CRC)	O(m³ + m² r), O(n_c × r); m = feature vector dimension, r = training instance number, n_c = class number	O(m³ + m² r) + O(n_c × r)
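To make the expressions in Table 13 more concrete, the sketch below evaluates the leading terms of three of the listed totals. The problem sizes (m, r, n_c) are hypothetical values chosen only for illustration, and the function names are not from the paper. Big-O expressions hide constant factors, so the numbers indicate growth trends rather than measured cost.

```python
def pca_crc_ops(m: int, r: int, n_c: int) -> int:
    """Leading terms for PCA + l2-CRC: O(m^3 + m^2 r) + O(n_c * r)."""
    return m**3 + m**2 * r + n_c * r


def pca_action_graph_ops(m: int, r: int, n_c: int) -> int:
    """Leading terms for PCA + action graph: O(m^3 + m^2 r) + O(n_c * r^2)."""
    return m**3 + m**2 * r + n_c * r**2


def svm_ops(r: int) -> int:
    """Leading term listed for the SVM in Table 13: O(r^3)."""
    return r**3


# Hypothetical sizes: feature dimension, training instances, classes.
m, r, n_c = 100, 300, 20
print("PCA + l2-CRC      :", pca_crc_ops(m, r, n_c))           # 4006000
print("PCA + action graph:", pca_action_graph_ops(m, r, n_c))  # 5800000
print("SVM               :", svm_ops(r))                       # 27000000
```

The two PCA-based totals share the O(m³ + m² r) term and differ only in the classifier term, which is linear in r for l₂-CRC and quadratic in r for the action graph classifier. Which total is largest in practice depends on the size of m relative to r.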

5 Conclusion

In this work, we have introduced an effective feature representation strategy that concatenates two sets of features: MHI-based GLAC features (denoted by GMHI) and SHI-based GLAC features (denoted by GSHI). The GMHI features capture comprehensive texture features from the MHIs of a depth action sequence, while the GSHI features extract discriminative texture information from the SHIs. Experiments on the public MSR-Action3D, DHA and UTD-MHAD datasets are carried out to assess the proposed framework extensively. The MSR-Action3D dataset is tested under two experimental settings. In setting one, the method achieves 97.3% accuracy on the most challenging cross-subject test. In setting two, the obtained recognition accuracy is 94.5%. In both settings, the recognition results are compared with those of methods based on hand-crafted features and on deep learning models, and the proposed method outperforms them. In addition, our method achieves 99.1% and 89.5% recognition accuracy on the more challenging DHA and UTD-MHAD datasets, respectively, and outperforms all the compared methods. The proposed method is also examined on the three datasets using statistical measures, whose outcomes indicate the effectiveness of the framework beyond classification accuracy. Overall, the experimental results on these datasets demonstrate that the proposed framework, with the combination of GMHI and GSHI features, consistently surpasses the state-of-the-art action recognition systems. Besides effectiveness, computational efficiency is also taken into account. The recorded processing time of the major components suggests that the algorithm is computationally inexpensive and suitable for real-time applications. Since the processing time depends on the machine configuration, the computational complexity of the major components is also analyzed.
The computational complexity of the method is low compared with other existing methods. It should be noted that the proposed algorithm considerably outperforms the method with the same computational complexity [25]. However, the proposed method has some limitations. For instance, the 3DMTM method takes comparatively long to compute the MHIs/SHIs, which is a barrier to outpacing other real-time methods. Although the integration of GMHI and GSHI features helps to achieve superiority over other methods, their combination does not always overcome inter-class similarity and intra-class variation, as can be seen from the class-specific accuracies on MSR-Action3D under the first experimental setup. Sometimes the concatenation itself increases inter-class similarity and intra-class variation and leads to inferior results. In future work, a more efficient MHI/SHI construction method can be developed. In addition, the action representation strategy can be improved to mitigate inter-class similarity and intra-class variation.

References

1. Chen, Jafari and Kehtarnavaz, Improving human action recognition using fusion of depth camera and inertial sensors, IEEE Transactions on Human-Machine Systems 45(1) (2015), 51–61.

2. Chen, Liu, Jafari and Kehtarnavaz, Home-based senior fitness test measurement system using collaborative inertial and depth sensors, In: EMBC, 2014, pp. 4135–4138.

3. Theodoridis, Agapitos, Hu and S.M. Lucas, Ubiquitous robotics in physical human action recognition: A comparison between dynamic ANNs and GP, In: ICRA, 2008, pp. 3064–3069.

4. Chen, Kehtarnavaz and Jafari, A medication adherence monitoring system for pill bottles based on a wearable inertial sensor, In: EMBC, 2014, pp. 4983–4986.

5. A.A. Chaaraoui, Climent-Pérez and Flórez-Revuelta, A review on vision techniques applied to human behaviour analysis for ambient-assisted living, International Journal of Expert Systems with Applications 39(12) (2012), 10873–10888.

6. Poppe, A survey on vision-based human action recognition, Journal on Image and Vision Computing 28(6) (2010), 976–990.

7. Wiliem, Madasu, Boles and Yarlagadda, An Update-Describe Approach for Human Action Recognition in Surveillance Video, In Proceedings of the International Conference on Digital Image Computing: Techniques and Applications, Sydney, Australia, 2010, pp. 270–275.

8. Wang and Schmid, Action Recognition with Improved Trajectories, In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 2013, pp. 3551–3558.

9. Chen, Wei and Ferryman, A survey of human motion analysis using depth imagery, Pattern Recognition Letters (2013), 1995–2006.

10. Shotton, Fitzgibbon, Cook, Sharp, Finocchio, Moore and Blake, Real-time human pose recognition in parts from single depth images, Communications of the ACM 56(1) (2013), 116–124.

11. H.-M. Zhu and C.-M. Pun, Human Action Recognition with Skeletal Information from Depth Camera, In Proceedings of the IEEE International Conference Information and Automation, Yinchuan, China, 2013, pp. 1082–1085.

12. Yang and Tian, Super normal vector for action recognition using depth sequences, In: CVPR, 2014, pp. 804–811.

13. Shotton, Fitzgibbon, Cook, Sharp, Finocchio, Moore and Blake, Real-time human pose recognition in parts from single depth images, In: CVPR, 2011, pp. 1297–1304.

14. Kobayashi and Otsu, Image feature extraction using gradient local auto-correlations, In: Forsyth, Torr and Zisserman (eds.), ECCV 2008, Part I, LNCS, vol. 5302, Springer, Heidelberg, 2008, pp. 346–358.

15. Li, Zhang and Liu, Action recognition based on a bag of 3D points, In: CVPRW, 2010, pp. 9–14.

16. Y.C. Lin, M.C. Hu, W.H. Cheng, Y.H. Hsieh and H.M. Chen, Human action recognition and retrieval using sole depth information, in Proc ACM MM, 2012, pp. 1053–1056.

17. Chen, R. Jafari and N. Kehtarnavaz, UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor, in Proc IEEE Int Conf Image Process, 2015, pp. 168–172.

18. Yang, Zhang and Tian, Recognizing actions using depth motion maps-based histograms of oriented gradients, In: ACM Multimedia, 2012, pp. 1057–1060.

19. Xia and J.K. Aggarwal, Spatio-temporal depth cuboid similarity feature for action recognition using depth camera, In: CVPR, 2013, pp. 2834–2841.

20. A.F. Bobick and J.W. Davis, The recognition of human movement using temporal templates, IEEE Trans Pattern Anal Mach Intell 23(3) (2001), 257–267.

21. A.W. Vieira, E.R. Nascimento, G.L. Oliveira, Liu and M.F. Campos, STOP: Space-Time Occupancy Patterns for 3D Action Recognition from Depth Map Sequences, In Proceedings of the Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 2012, pp. 252–259.

22. A.W. Vieira, E.R. Nascimento, G.L. Oliveira, Liu and M.F. Campos, On the improvement of human action recognition from depth map sequences using space-time occupancy patterns, Pattern Recognition Letters 36 (2014), 221–227.

23. Wang, Liu, Chorowski, Chen and Wu, Robust 3D action recognition with random occupancy patterns, in Proc Eur Conf Comput Vis, 2012, pp. 872–885.

24. Dalal and Triggs, Histograms of oriented gradients for human detection, In: CVPR, 2005, pp. 886–893.

25. Chen, Liu and Kehtarnavaz, Real-time human action recognition based on depth motion maps, J Real-Time Image Process (2013), 1–9. doi: 10.1007/s11554-013-0370-1.

26. Chen, Jafari and Kehtarnavaz, Action recognition from depth sequences using depth motion maps-based local binary patterns, In: WACV, 2015, pp. 1092–1099.

27. Ojala, Pietikäinen and Mäenpää, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans Pattern Anal Mach Intell 24(7) (2002), 971–987.

28. Farhad, Jiang and Ma, Human Action Recognition Based on DMMs, HOGs and Contourlet Transform, In: Proceedings of IEEE International Conference on Multimedia Big Data, Beijing, China, 2015, pp. 389–394.

29. Farhad, Jiang and Ma, Real-Time Human Action Recognition Using DMMs-Based LBP and EOH Features, In Proceedings of the International Conference on Intelligent Computing, Fuzhou, China, 2015.

30. M.F. Bulbul, Jiang and Ma, DMMs-based multiple features fusion for human action recognition, International Journal of Multimedia Data Engineering and Management (IJMDEM) 6(4) (2015), 23–39.

31. Chen, Hou, Zhang, Jiang and Yang, Gradient local auto-correlations and extreme learning machine for depth-based activity recognition, In International Symposium on Visual Computing, Springer International Publishing, 2015, pp. 613–623.

32. Chen, Zhang, Hou, Jiang, Liu and Yang, Action recognition from depth sequences using weighted fusion of 2D and 3D auto-correlation of gradients features, Multimedia Tools and Applications 76(3) (2017), 4651–4669.

33. Chen, Liu, Zhang, Han, Jiang and Liu, 3D action recognition using multi-temporal depth motion maps and Fisher vector, in Proc Int Joint Conf Artif Intell, 2016, pp. 3331–3337.

34. Zhang, Yang, Chen, Yang, Han and Shao, Action Recognition using 3D histograms of texture and a multi-class boosting classifier, IEEE Transactions on Image Processing 26(10) (2017).

35. Oreifej and Liu, HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences, in Proc IEEE Conf Comput Vis Pattern Recognit, 2013, pp. 716–723.

36. Liu, Tian, Liu and Tang, SDM-BSM: A fusing depth scheme for human action recognition, in Proc ICIP, 2015, pp. 4674–4678.

37. Rahmani, Huynh Du, Mahmood and Ajmal, Discriminative human action classification using locality-constrained linear coding, PRL (2015).

38. Kong, Satarboroujeni and Fu, Hierarchical 3d kernel descriptors for action recognition using depth sequences, In FG, 2015, pp. 1–6.

39. Yang and Tian, EigenJoints-Based Action Recognition Using Naïve-Bayes-Nearest-Neighbor, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, 2012, pp. 14–19.

40. Wang, Liu, Wu and Yuan, Mining Actionlet Ensemble for Action Recognition with Depth Cameras, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, 2012, pp. 1290–1297.

41. Xia, C.-C. Chen and Aggarwal, View invariant human action recognition using histograms of 3d joints, In CVPR Workshops, 2012, pp. 20–27.

42. A.A. Chaaraoui, J.R. Padilla-López, Climent-Pérez and Flórez-Revuelta, Evolutionary joint selection to improve human action recognition with rgb-d devices, Expert Systems with Applications 41(3) (2014), 786–794.

43. Zanfir, Leordeanu and Sminchisescu, The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection, In ICCV, 2013, pp. 2752–2759.

44. Evangelidis, Singh and Horaud, Skeletal quads: Human action recognition using joint quadruples, In ICPR, 2014, pp. 4513–4518.

45. Gao, Zhang, G.P. Xu and Y.B. Xue, Multi-perspective and multi-modality joint representation and recognition model for 3D action recognition, Neurocomputing 151 (2015), 554–564.

46. Wang, Liu, Wu and Yuan, Learning actionlet ensemble for 3D human action recognition, TPAMI 36(5) (2014), 914–927.

47. Vemulapalli, Arrate and Chellappa, Human action recognition by representing 3D skeletons as points in a lie group, In: CVPR, 2014, pp. 588–595.

48. Luo, Wang and Qi, Spatio-temporal feature extraction and representation for RGB-D human action recognition, Pattern Recognition Letters (2014), 139–148.

49. Rahmani, Mahmood, D.Q. Huynh and Mian, Real-Time Action Recognition Using Histograms of Depth Gradients and Random Decision Forests, In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, RI, 2014, pp. 626–633.

50. Wang, Zhang and Yang, Boosting-like deep convolutional network for pedestrian detection, in Proc Chin Conf Biometric Recognit, 2015, pp. 581–588.

51. Yang and Yang, DMM-pyramid based deep architectures for action recognition with depth cameras, in Proc Asian Conf Comput Vis, 2014, pp. 37–49.

52. Freund and R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J Comput Syst Sci 55(1) (1997), 119–139.

53. Liang and Zheng, Three dimensional motion trail model for gesture recognition, in Computer Vision Workshops (ICCVW), 2013 IEEE International Conference on, 2013, pp. 684–691.

54. Chen, Zhang and Liang, Action recognition using motion history image and static history image-based local binary patterns, International Journal of Multimedia and Ubiquitous Engineering 12(1) (2017), 203–214.

55. Wright, Ma, Mairal, Sapiro, Huang and Yan, Sparse representation for computer vision and pattern recognition, Proceedings of the IEEE 98(6) (2010), 1031–1044.

56. Chen, Tramel and J.E. Fowler, Compressed sensing recovery of images and video using multi hypothesis predictions, In: Proceedings of the 45th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, 2011, pp. 1193–1198.

57. Chen, Li, E.W. Tramel and J.E. Fowler, Reconstruction of hyperspectral imagery from random projections using multi hypothesis prediction, IEEE Transactions on Geoscience and Remote Sensing 52(1) (2014), 365–374.

58. Chen and J.E. Fowler, Single-image super-resolution using multi hypothesis prediction, In: Proceedings of the 46th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, 2012, pp. 608–612.

59. Tikhonov and Arsenin, Solutions of ill-posed problems, Mathematics of Computation 32(144) (1978), 1320–1322.

60. Golub, P.C. Hansen and O’Leary, Tikhonov regularization and total least squares, SIAM Journal on Matrix Analysis and Applications 21(1) (1999), 185–194.

61. Blank, Gorelick, Shechtman, Irani and Basri, Actions as space-time shapes, in Proc 10th IEEE Int Conf Comput Vis, Beijing, China, 2005, pp. 1395–1402.

62. Gorelick, Blank, Shechtman, Irani and Basri, Actions as space-time shapes, TPAMI 29(12) (2007), 2247–2253.

63. Ali, S.N. Tran, Benetos and Garcez, Speaker recognition with hybrid features from a deep belief network, Neural Computing and Applications 29(6) (2018), 13–19.

64. Ali, Ahmad, Zhou, Iqbal and Muhammad Ali, DWT features performance analysis for automatic speech recognition of Urdu, SpringerPlus 3(204) (2014). doi: 10.1186/2193-1801-3-204.
