Real time human action recognition from RGB clips using local motion histogram

Abstract

This paper proposes a method for human action recognition from RGB video clips. The method captures local motion information from short video clips by accumulating motion over spatial patches of different shapes and sizes. The motion information is then transformed into motion histograms, and all the histograms are concatenated to form the proposed feature vector. A bagging ensemble technique, in the form of a random forest, is used for classification. The idea is further extended to a real-time human action recognition mechanism. To show the robustness and efficiency of the proposed algorithm, it is evaluated on the publicly available Joint-annotated Human Motion Data Base (JHMDB) [29] and University of Rzeszów (UR) Fall detection [19] datasets. The results are also compared with other state-of-the-art methods.

Keywords

Human action recognition, histogram, random forest, RGB camera, real time, fall detection

1. Introduction

Recognizing human actions from video has been an active research area for years. Detecting human actions in video streams is challenging because of variations in pose, different camera calibrations, a wide variety of (loose and fitted) apparel, cluttered backgrounds and varying lighting conditions. Camera movement makes it even more difficult. Potential applications include care of elderly people and patients, kids' daycare, group monitoring, sports training, surveillance, and abnormal behavior detection. Image-based features have made considerable advances in recent years [1, 2, 3], but they are not yet mature enough for many practical applications. On the other hand, certain kinds of movement are characteristic of humans, so classification accuracy can potentially be improved by considering motion information. Many researchers in this area have assumed that the camera and the background are essentially static. This greatly simplifies the problem because the mere presence of motion already provides a strong cue for human presence. For example, Srivastava and Biswas [4] captured motion in the form of histograms and used it for human action detection. Henawy et al. [5] presented a process for modeling and classifying multidimensional time-evolving data. The authors proposed a stabilized higher-order linear dynamical system for classification. A bag-of-words approach is used to classify multidimensional signals, which could be video signals as well. Ganapathi and Prakash [6] presented a global 3D descriptor for human recognition. The descriptor is calculated by concatenating the histograms of point clouds in specific annular regions. Song et al. [7] proposed a learning process based on regularized cross-entropy for action recognition. The authors also proposed a process for generating temporal action proposals for action detection. Zhang et al. [8] proposed 3D histograms of texture for feature extraction from depth data.
They also proposed a classifier named multi-class boosting. Chen et al. [9] presented a descriptor that is calculated in three stages. In the first stage, depth motion maps are calculated. Texture histograms of depth patches are calculated in stage two, while a compact representation of the features is extracted with the help of the Fisher kernel in stage three. Colque et al. [10] presented the Histograms of Optical Flow Orientation and Magnitude (HOFM) descriptor, which is used to detect anomalous activities in videos. Khan et al. [11] presented a model to detect human falls when fall data is not available but sufficient data is available for activities of daily living (ADL). Wearable devices and an X-factor hidden Markov model (XHMM) are used in the proposed model. Kianoush et al. [12] proposed an approach for detecting human falls in industrial environments without wearable devices. The authors used human-induced diffraction for fall detection. Casilari et al. [13] presented a survey of the available wearable-sensor approaches and datasets for fall detection. Chun and Lee [14] worked on human action recognition from multiple camera views and proposed an optical-flow-based histogram for it. Zhang and Parker [15] proposed CoDe4D features for action recognition. CoDe4D is extracted from RGB-D data as a multi-channel orientation histogram. To make the method view independent, Dogan et al. [16] proposed 3D volume motion templates (VMTs), which the authors rotate with respect to a canonical orientation. Luo et al. [17] proposed modeling motion dynamics parameters as a motion descriptor for action recognition. The model is built with histograms of oriented gradients and robust linear dynamical systems. Zhou and Zhang [18] used a multiple-instance formulation to find elementary actions with stable states, and proposed encoding the movements of local parts of a human action. Kwolek and Kepski [19] used accelerometric data and RGB-D data for detecting human falls.
The authors track the accelerometric and depth data for a potential human fall. Whenever a set threshold is crossed, the potential fall is verified in the synchronized RGB data. Cheng et al. [20] presented a framework for activity recognition that uses surface electromyography and accelerometer signals. To detect the boundaries of an activity, the authors used a histogram of negative entropy. Mukherjee et al. [21] used a histogram of oriented optical flow and a bag-of-words approach to compute the descriptor. The authors also proposed a graph-theoretic technique for human action recognition.

In this paper we describe our methods for efficient human action recognition in real time. The rest of the paper is organized as follows. Section 2 reviews some related work in the area. The proposed method is discussed in Section 3. Sections 4 and 5 describe the classifier, our experiments and the datasets of interest, together with comparisons with other state-of-the-art methods, and Section 5.1 gives the real-time implementation details. The conclusion is given in Section 6.

Figure 1.

(a) Video volume’s frames for pour activity, (b) transformed video sub volume’s frames and (c) MPM for one sub volume of frames.

2. Related work

Colque et al. [10] presented a model for detecting anomalies in human actions. The authors used orientation, velocity, and entropy to compute an optical flow histogram as a feature. Martínez et al. [22] presented an algorithm for online action recognition and reported the average accuracy on a per-frame basis. The algorithm calculates flow orientation histograms over small, overlapping patches of variable size for each frame, and an SVM is used for action classification. Ni et al. [23] introduced a framework for Motion Part Regularization (MPR). The framework mines discriminative groups of dense trajectories, and the representative groups are used for action detection. Ma et al. [24] designed hierarchical spatio-temporal trees. These trees are used to build vocabularies of similar actions for partial or complete human motion, and the vocabularies are used for classification. Fernando et al. [25] enumerated human actions in chronological order by exploiting temporal ordering in videos. The enumeration is done using a ranking-based learning framework that records the significant information. Zhang et al. [26] proposed a noise-immune learning framework for human action recognition, designed with the help of relative attributes of human actions in videos. Vrigkas et al. [27] presented a learning-based framework in which a Gaussian mixture model is used to cluster motion curves and the longest common subsequence is used to calculate the feature vector. Viola et al. [28] improved the performance of a pedestrian detection system with low-resolution cameras and in bad weather conditions by using a detector that combines appearance and motion.

Researchers in human action recognition have contributed a great deal by proposing various descriptors and classifiers. There is still much scope for further work because results are not yet adequate in practical scenarios. In this paper we present a simple but effective motion-histogram-based approach that detects motion along vertical and horizontal patches of video sub volumes. This helps make the feature vector less sensitive to the placement of the camera relative to the object or actor. Detailed descriptions are given in the following section.

3. Proposed method

The activity video clips in various datasets differ in length for the following reasons: (i) different actors perform the same activity at different speeds, and (ii) different activities naturally differ in length.

So every video clip is divided into video sub volumes of fixed and equal length. That is, every video sub volume has a fixed number $S$ of frames of frame size $M\times N$. A typical video sub volume has $S\times M\times N$ pixel values; if these were treated directly as a feature vector, it would contain many redundant values and have a very high dimension. To reduce the size of each video sub volume while retaining the relevant information, the temporal information in the sub volume is captured through inter-frame motion accumulation. The inter-frame motion for the $(i,j)^{\rm th}$ pixel between two consecutive frames $f_{k}$ and $f_{k+1}$ is denoted $d_{k}(ij)$ and calculated as in Eq. (2), where the intensity of the $(i,j)^{\rm th}$ pixel of frame $f_{k}$ is represented by $f_{k}(i,j)$. For a typical color frame, a pixel intensity is represented by the vector $[f_{k}^{r}(i,j),f_{k}^{g}(i,j),f_{k}^{b}(i,j)]$, where $f_{k}^{r}(i,j)$, $f_{k}^{g}(i,j)$ and $f_{k}^{b}(i,j)$ are the red, green and blue intensities of the $(i,j)^{\rm th}$ pixel of frame $f_{k}$ respectively. This vector is converted to a scalar quantity $f_{k}(i,j)$ by Eq. (1)

$\displaystyle f_{k}(i,j)=\sqrt{{f_{k}^{r}(i,j)}^{2}+{f_{k}^{g}(i,j)}^{2}+{f_{k}^{b}(i,j)}^{2}}$ (1)

$\displaystyle d_{k}(ij)=\left|f_{k+1}(i,j)-f_{k}(i,j)\right|$ (2)

where $i=1,2,3,\ldots,M$; $j=1,2,3,\ldots,N$; $k=1,2,\ldots,S-1$.
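As an illustration of Eqs. (1) and (2), the following Python sketch computes the scalar intensity of each pixel and the absolute inter-frame differences for one sub volume. It is not the authors' Matlab implementation: NumPy, the function names and the toy data are assumptions made for this example only.

```python
import numpy as np

def frame_magnitude(frames_rgb):
    """Eq. (1): collapse the trailing RGB axis to one scalar intensity per pixel."""
    rgb = frames_rgb.astype(np.float64)   # avoid uint8 overflow when squaring
    return np.sqrt((rgb ** 2).sum(axis=-1))

def interframe_motion(sub_volume):
    """Eq. (2): absolute difference between consecutive frames.

    sub_volume has shape (S, M, N, 3); the result has shape (S - 1, M, N).
    """
    magnitudes = frame_magnitude(sub_volume)
    return np.abs(np.diff(magnitudes, axis=0))

# Toy sub volume (hypothetical data): S = 3 frames of size 2 x 2.
clip = np.zeros((3, 2, 2, 3), dtype=np.uint8)
clip[1, 0, 0] = (3, 4, 0)                 # magnitude sqrt(9 + 16) = 5 in frame 2 only
d = interframe_motion(clip)
print(d.shape)                            # (2, 2, 2)
print(d[:, 0, 0])                         # [5. 5.]
```

The pixel that changes in the middle frame contributes a difference of 5 to both $d_{1}$ and $d_{2}$, while the static pixels contribute zero.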

The motion between two consecutive frames for one pixel is calculated by Eq. (2), and the process is repeated for all pixels, so the size of $d_{k}$ is $M\times N$. If a video sub volume has $S$ frames, then $[d_{1},d_{2},\ldots,d_{S-2},d_{S-1}]$ are calculated. Further, to accumulate all the motion at a pixel over the video sub volume, the motion projection matrix (MPM) $P$ is calculated as per Eq. (3)

$\displaystyle P(i,j)=\max(d_{1}(ij),d_{2}(ij),\ldots,d_{S-2}(ij),d_{S-1}(ij))$ (3)

where $i$ and $j$ take their values from $1$ to $M$ and from $1$ to $N$ respectively. From here onward, an MPM is the representative matrix for a video sub volume. A typical MPM is depicted in Fig. 1c. Although an MPM stores all the motion in a particular video sub volume, it cannot be used as a feature vector as it stands, because some motion in one area of the MPM (say the upper left) could not be distinguished from the same amount of motion in some other area (say the lower right). We therefore propose patch-wise motion matching in the MPM with the help of motion histograms.
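Eq. (3) amounts to a per-pixel maximum over the stack of inter-frame motion maps. A minimal Python sketch follows, assuming NumPy and a hypothetical function name; the three small motion maps are made-up values chosen only so the result can be checked by hand.

```python
import numpy as np

def motion_projection_matrix(motion_maps):
    """Eq. (3): per-pixel maximum over the S - 1 inter-frame motion maps."""
    return motion_maps.max(axis=0)

# Three hypothetical 2 x 2 motion maps d_1, d_2, d_3 (so S = 4).
d = np.array([
    [[0, 2], [0, 0]],
    [[7, 1], [0, 0]],
    [[3, 0], [0, 4]],
], dtype=float)

P = motion_projection_matrix(d)
print(P)   # P is [[7, 2], [0, 4]]
```

Each entry of $P$ keeps the strongest motion seen at that pixel anywhere in the sub volume, so a pixel that never moves stays at zero.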

There are multiple ways to divide the MPM into patches: patches could be of fixed or variable size and of different shapes. To keep the proposed algorithm computationally economical, fixed-size rectangular patches are chosen. Splitting the MPM into grid patches, as shown in Fig. 2a, would record the motion in very specific regions. Recording motion specific to such small areas would lead to an over-fitted classifier.

So we propose dividing the MPM into $C$ equal-sized vertical (column) patches and, independently, into $R$ equal-sized horizontal (row) patches. Every vertical patch is insensitive to whether motion occurs in the upper or the lower region, while it is sensitive to whether motion occurs in the left or the right region. The same holds for horizontal patches with the roles swapped: they are sensitive to whether motion occurs in the upper or the lower region of the video sub volume, and insensitive to whether it occurs on the left or on the right. Another challenge is that similar motions performed “near to camera” and “away from camera” should not be treated as different. The proposed approach also addresses this challenge, by the reasoning explained above.
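The sensitivity argument above can be checked on a toy example. The sketch below, assuming NumPy, a hypothetical helper name and an arbitrary choice of 4 histogram bins, places the same motion blob at the top left and at the bottom left of a small MPM and compares the patch histograms.

```python
import numpy as np

def patch_histograms(P, n_patches, axis, bins=4, value_range=(0.0, 1.0)):
    """Histogram each of n_patches equal strips of P taken along `axis`.

    axis=1 gives vertical (column) patches, axis=0 gives horizontal (row)
    patches. The size of P along `axis` must be divisible by n_patches.
    """
    strips = np.split(P, n_patches, axis=axis)
    return [np.histogram(s, bins=bins, range=value_range)[0] for s in strips]

# Two toy 4 x 4 MPMs with identical motion in the left half,
# once in the top row and once in the bottom row.
top = np.zeros((4, 4))
top[0, 0] = 1.0
bottom = np.zeros((4, 4))
bottom[3, 0] = 1.0

cols_top = patch_histograms(top, 2, axis=1)
cols_bottom = patch_histograms(bottom, 2, axis=1)
rows_top = patch_histograms(top, 2, axis=0)
rows_bottom = patch_histograms(bottom, 2, axis=0)

same_cols = all(np.array_equal(a, b) for a, b in zip(cols_top, cols_bottom))
same_rows = all(np.array_equal(a, b) for a, b in zip(rows_top, rows_bottom))
print(same_cols, same_rows)   # True False
```

The column-patch histograms are identical for the two MPMs, whereas the row-patch histograms differ, which is the complementary behavior the two patch types are meant to provide.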

Combining these two types of motion information (horizontal and vertical patch motion) into one feature vector should intuitively train a classifier that is neither under-fitted nor over-fitted. The results in Section 4 of this paper support this intuition.

Figure 2.

Segmenting $P$ for feature vector calculation and matching (a) grid patch splitting and feature matching, (b) column patch splitting and feature matching.

In a typical MPM, motion is stored in the form of pixel intensity: a black pixel represents no motion at that pixel location, while gray and white pixels represent low- and high-magnitude motion respectively. The motion information in a single patch is therefore a set of pixel intensity values. It is converted to an intensity histogram with $B$ bins, which makes the feature vector fixed in size and smaller than if raw pixel values were stored. The histograms for horizontal patches are represented by $[H_{h_{1}},H_{h_{2}},\ldots,H_{h_{R}}]$, where $H_{h_{i}}$ represents the $i^{\rm th}$ horizontal patch histogram, while vertical patch histograms are represented by $[H_{v_{1}},H_{v_{2}},\ldots,H_{v_{C}}]$, where $H_{v_{i}}$ represents the $i^{\rm th}$ vertical patch histogram. The proposed feature vector $H$ is obtained by simply concatenating all the histograms, as shown in Eq. (4).

$\displaystyle H=[H_{h_{1}},H_{h_{2}},\ldots,H_{h_{R}},H_{v_{1}},H_{v_{2}},\ldots,H_{v_{C}}]$ (4)

The proposed feature vector $H$ has $B\times(C+R)$ integer values.
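A minimal Python sketch of this feature construction is given below. It assumes NumPy, a hypothetical function name, and a histogram range equal to the largest value Eq. (1) can produce for 8-bit RGB input; the paper does not state the bin range, so that choice is an assumption. The random matrix is a stand-in for an MPM, not real data.

```python
import numpy as np

# Largest value Eq. (1) can produce for 8-bit RGB input (assumed bin range).
MAX_MAGNITUDE = 255.0 * np.sqrt(3.0)

def mpm_feature_vector(P, R, C, B, value_range=(0.0, MAX_MAGNITUDE)):
    """Concatenate B-bin histograms of R row patches and C column patches of P."""
    row_patches = np.array_split(P, R, axis=0)   # horizontal patches
    col_patches = np.array_split(P, C, axis=1)   # vertical patches
    hists = [np.histogram(p, bins=B, range=value_range)[0]
             for p in row_patches + col_patches]
    return np.concatenate(hists)

rng = np.random.default_rng(0)
P = rng.uniform(0.0, 255.0, size=(120, 160))     # stand-in MPM, not real data
H = mpm_feature_vector(P, R=10, C=10, B=10)
print(H.shape)                                    # (200,) = B * (C + R)
```

Every pixel is counted once in a row patch and once in a column patch, so the entries of $H$ sum to twice the number of pixels in the MPM.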

Figure 3.

MPM’s horizontal and vertical patch splitting with histogram name indication.

Fig. 3 depicts the horizontal and vertical patches of a typical MPM, together with the histogram nomenclature used.

4. Classification

Most of the available human action datasets have multiple action classes, so any classifier deployed for human action recognition must support multiclass classification. Multiclass behavior could be obtained by combining multiple binary classifiers (one per class) in a one-class-versus-all-others approach. However, this approach leads to imbalanced data from each binary classifier's point of view, because almost every available human action dataset has the same number of video clips for every action. For example, with 21 classes of roughly equal size, each binary classifier would see about 20 negative clips for every positive clip.

Random forest is a multi-class classifier consisting of multiple decision trees. Every tree of the forest is trained on some randomly chosen features from the set of all features. To decrease the correlation between the trees, bagging (bootstrap aggregation) is used. After training, when a test sample is presented to the random forest, it collects the decision (vote) of every decision tree, and the class that receives the most votes is declared the output class of the random forest.
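The two ingredients described above, bootstrap sampling and majority voting, can be sketched with the Python standard library alone. This is only an illustration of the mechanism, not a full random forest: tree training and random feature selection are omitted, and the function names, the sample indices and the per-tree votes are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement (the 'bootstrap' in bagging)."""
    return [rng.choice(samples) for _ in samples]

def majority_vote(tree_votes):
    """Return the class label that received the most votes."""
    label, _ = Counter(tree_votes).most_common(1)[0]
    return label

rng = random.Random(0)                        # fixed seed, for repeatability only
training_ids = list(range(10))                # stand-in indices of training samples
bag = bootstrap_sample(training_ids, rng)     # training set for one tree

votes = ["jump", "run", "jump", "walk", "jump"]   # hypothetical per-tree outputs
winner = majority_vote(votes)
print(len(bag), winner)                       # 10 jump
```

Because each tree sees a different bootstrap sample, the trees make partly independent errors, and the vote averages those errors out.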

So we chose random forest for human action classification in the proposed method. The publicly available JHMDB [29] and UR Fall detection [19] datasets have been chosen for evaluating it. We chose classification accuracy as the performance measure of our experiments. The classification accuracy is calculated as in Eq. (5)

Classification accuracy $\displaystyle=\frac{\text{Number of correct predictions}}{\text{Total number of predictions}}.$ (5)
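For concreteness, Eq. (5) can be written as a short Python function. The function name and the four example labels are hypothetical and serve only to show the computation.

```python
def classification_accuracy(y_true, y_pred):
    """Eq. (5): fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

truth = ["fall", "no_fall", "fall", "no_fall"]
predicted = ["fall", "no_fall", "no_fall", "no_fall"]
accuracy = classification_accuracy(truth, predicted)
print(accuracy)   # 0.75
```

The accuracies reported in the tables of this paper are this fraction expressed as a percentage.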

4.1 JHMDB dataset [29]

The JHMDB dataset consists of video clips of 21 different human activities. It contains more than 30 clips per action class, and each clip contains 15 to 40 frames. The dataset is composed of video clips from different sources such as movie clips, YouTube videos and Google videos. The activities and the dataset complexity in terms of body part visibility in the video clips are shown as a horizontal bar chart in Fig. 4.

Figure 4.

Actions of the JHMDB [29] dataset and video distribution on the basis of the visibility of the actors' body parts.

Table 1

Classification accuracies of the experiments on JHMDB dataset on different parameter values

Exp. No.	$S$	$T$	$R=C$	Classification accuracy (%)
1	5	100	10	51.75
2	7	100	10	51.43
3	5	50	10	51.1
4	3	50	10	50.16
5	3	100	5	49.52

The videos are recorded with different types of cameras, resolutions and lighting conditions in unconstrained environments. Occlusion is also present in the videos to varying degrees. Figure 5 depicts some of the actions of the JHMDB dataset: columns 1–4 of the figure show the catch, jump, wave and push activities respectively, while the rows show snapshots of the actions at two different time instances.

Table 2

Comparison of proposed method with Jhuang et al. [29] methods

Sr No	Researcher	Method	Feature used in method
			Traj.	HOG	HOF	HOLM
1	Jhuang et al. [29]	1) baseline	40.0	32.9	40.1
		2) of pmask	38.5	31.9	46.0
		3) pf pmask	36.4	32.8	48.0
		4) pf Dmask	38.0	32.2	46.4
		5) pf pmask of outside pmask	43.0	36.1	44.1
2	Proposed	Row and col wise splitting				51.75

Figure 5.

Snapshots of the catch, jump, wave and push actions of the JHMDB dataset, arranged column wise. The upper and lower snapshots of every column are taken at two different time instances.

We have performed various experiments on the JHMDB dataset for different values of $S$ (number of frames in a video sub volume), $R$ (number of horizontal patches in the MPM), $C$ (number of vertical patches in the MPM) and the number of decision trees $T$ in the random forest. The classification accuracy for each parameter setting is given in the fifth column of Table 1. In our experiments the values of $R$ and $C$ are chosen to be equal.

The experiments are listed in decreasing order of accuracy. The highest accuracy is achieved with the parameters of experiment 1. The detailed discussion of JHMDB results in the rest of the paper refers to experiment 1. The level of challenge in JHMDB is quite high due to its variability. For example, there is no restriction on camera movement, camera angle, distance between camera and actor, or the angle of light sources with respect to the object.

Figure 6.

Activity-wise classifier accuracy for the JHMDB dataset.

The proposed algorithm works well on the challenging JHMDB dataset for many activities, while some activities are significantly misclassified. The pull-up and shoot bow activities are classified with good accuracies of 89% and 83% respectively. Four more activities are classified with an accuracy of more than 70%. The sit, run, walk and kick ball activities are significantly misclassified. The reason for this could be a high degree of similarity among some activities, a high degree of occlusion, and so on. The proposed algorithm achieves an overall classification accuracy of 51.75%. Activity-wise classifier accuracy is shown in Fig. 6. To improve the readability of the accuracy graph in Fig. 6, the activities are renamed: brush hair is renamed ‘a’, cart wheel ‘b’, catch ‘c’, and so on. The activity sequence is the same as depicted in Fig. 4.

Table 3

Experimental results for optimal parameter values on UR fall dataset

Experiment number	Number of frames in a bundle (B)	Number of bins for histogram calculation	Number of trees in random forest	Number of rows and column segment in a motion projection matrix $(R,C)$	Accuracy
1	5	10	50	10, 15	95.52
2	5	15	50	10, 15	95.32
3	5	15	50	18, 24	94.93
4	5	15	100	18, 24	94.35

Figure 7.

Frames of the fall from sitting and fall from standing activities of the UR Fall dataset, and of the bend and pick and look under furniture activities of the ADL dataset (column wise).

Results of the proposed method are compared with other state-of-the-art techniques in Table 2. It can be seen that our approach performs better than all the histogram- and trajectory-based approaches of Jhuang et al. [29].

5. UR fall and ADL dataset [19]

This dataset contains 70 clips, of which 30 clips show two types of fall activity: fall from standing and fall from sitting. The remaining 40 clips are activities of daily living (ADL). ADL includes bending and picking up an item, looking below the furniture, and sitting on a chair.

Every activity is performed once or twice by 5 persons. The ADL activities are partially similar to the fall activity, so detecting a fall in a query clip is challenging on this dataset. Figure 7 depicts some of the activities of the UR Fall detection and ADL datasets; every column represents an activity (fall from sitting, fall from standing, bend and pick, look under furniture). The first row of the figure shows the starting gesture of the activity, while the second row shows one of its last gestures.

As per our observation, falling is an instantaneous, short-duration activity (less than a second) that involves quick movement of the subject. Such high-magnitude movement within a short duration can be distinguished from the low-magnitude movement in ADL. Real-time fall detection could serve a meaningful purpose in real-world scenarios. For instance, one could raise an alarm on detection of a falling action.

Using our scheme, we can check whether contiguous bundles are being labelled as falling and raise an alarm in real time. In our experiments we found that three consecutive motion projection matrices classified as fall are sufficient for accurate detection of a fall with minimal false positives.
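The alarm rule described above can be sketched as a run-length check over the stream of per-MPM labels. The Python function below is an illustrative assumption: its name, the label strings and the example stream are hypothetical, and only the threshold of three consecutive fall labels comes from the text.

```python
def fall_alarm_index(labels, needed=3):
    """Return the index of the bundle at which the alarm fires, or None.

    The alarm fires as soon as `needed` consecutive bundles are labelled "fall".
    """
    run = 0
    for i, label in enumerate(labels):
        run = run + 1 if label == "fall" else 0   # reset on any non-fall bundle
        if run >= needed:
            return i
    return None

# Hypothetical classifier outputs for seven consecutive bundles.
stream = ["no_fall", "fall", "no_fall", "fall", "fall", "fall", "no_fall"]
alarm_at = fall_alarm_index(stream)
print(alarm_at)   # 5
```

The isolated fall label at index 1 does not trigger the alarm, which is how the rule suppresses single-bundle false positives.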

In the UR fall dataset we manually label as falling the frames in which a fall is actually observed, and label the rest of the frames in each clip as non-falling. We then apply the proposed algorithm to calculate the feature vectors as per Algorithm 1; the whole process is depicted in Fig. 9. To show the effectiveness of the proposed algorithm, three different data splittings have been chosen for training and testing. In all of the following data configurations a 70%–30% training–testing split is used.

(a) Training and testing from UR fall dataset only
(b) Training from UR fall dataset and testing with ADL dataset
(c) Training and testing from UR fall dataset and ADL dataset collectively

To choose the optimal values of the parameters (bundle size, number of bins, number of trees in the random forest, and number of horizontal and vertical patches) in the proposed algorithm, some experiments are done on the UR fall dataset under splitting (a). The results are shown in Table 3. It is observed that accuracy drops as the number of horizontal and vertical patches in the motion projection matrices (MPMs) is increased. Similarly, increasing the number of bins for histogram calculation also decreases accuracy. In the fall detection scenario the background and camera are static and the motion is due only to the subject under observation, so higher parameter values lead to a dip in accuracy.

Table 4
Accuracies of proposed algorithm under splitting (b) and (c)

Training testing splitting	Number of frames in a bundle (B)	Number of bins for histogram calculation	Number of trees in random forest	Number of rows and column segment in a motion projection matrix $(R,C)$	Accuracy (%)
(b)	5	10	100	10, 15	98.93
(c)	5	10	100	10, 15	97.52

Table 5
Comparison of proposed algorithm accuracy with [19] for fall detection and not-fall detection

Method	Data used	Fall detection accuracy	Classifier	Non-fall detection accuracy
Kwolek and Kepski [19]	RGB-D data	90.00%	SVM	99.67%
	RGB-D $+$ accelerometer data	98.33%
Proposed algorithm	RGB data only	97.52%	Random forest	98.93%

For splittings (b) and (c), parameter values are chosen as per experiment 1 in Table 3, except for the number of trees in the random forest. The number of trees is increased to 100 because the size and variety of the training samples increase in splittings (b) and (c).

In splitting (b), training is done on the UR fall detection dataset only and testing is performed on the ADL dataset, which has activities close to falling but no actual falls. The accuracy achieved by the proposed algorithm (98.93%) is at an acceptable level. It also indicates a very low false alarm rate for falls, which is an important aspect of a fall detection system.

In splitting (c), all the activities from both datasets (UR Fall and ADL) are divided in a 70–30 ratio for training and testing. The accuracy achieved in this configuration is 97.52%, which is slightly lower than in the previous splitting (b). It is a bit lower because the classifier is trained on additional activities (from ADL), which was not the case in splitting (b).

Figure 8.
Processing time (in seconds) for every MPM of the UR fall detection and ADL dataset.

Kwolek and Kepski [19] prepared the UR fall and ADL dataset. They also proposed a system for fall detection using synchronized accelerometer, depth and RGB data. Depth data requires an MS Kinect or a similar device, and accelerometer data requires the subject to wear a device, which is an overhead for elderly people.

We compare the fall detection and non-fall detection accuracy of the proposed algorithm with [19] in Table 5. In the fall detection section of Table 5, the proposed algorithm performs much better than [19] when only RGB-D data is used, and slightly worse than [19] when accelerometer data is used along with RGB-D data. In the non-fall detection section of the table, where only ADL data is used for testing, the accuracy of the proposed algorithm is comparable, even though it uses only RGB data while [19] uses RGB-D data along with accelerometer data.

5.1 Real time implementation details


CCTV cameras normally record at 10–15 frames per second, which means that a new frame arrives every 0.07–0.10 seconds. So any real-time algorithm should keep its average per-frame processing time below 0.10 seconds.

The proposed real-time algorithm is implemented in Matlab R2015a on Windows 7. The hardware is an Intel i3 2.20 GHz processor with 2 GB RAM. To show the usefulness of the framework in real time, the processing time is calculated as the sum of the time for calculating the feature vector of one motion projection matrix (MPM) and the time taken to classify a query sample, as shown in Eq. (6).

Processing time (per MPM) $\displaystyle=\text{feature vector calculation time for an MPM}+\text{classifier response time for a feature vector}$ (6)

Figure 9.
Flow chart for real time fall detection.

Processing time per MPM (in seconds) is depicted in Fig. 8. The $X$-axis refers to all the motion projection matrices in the entire fall dataset (30 videos in all). The $Y$-axis shows the processing time of each MPM. For example, the processing time for the $7^{\rm th}$ MPM is 0.483 seconds and for the $10^{\rm th}$ MPM is 0.476 seconds. The average processing time per MPM is 0.47 seconds. Since every MPM covers 5 frames in the experiment, the processing time per frame is about 0.095 seconds in this setup. This processing time could be significantly improved with the help of GPUs and embedded technologies.
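The per-frame arithmetic can be reproduced with a few lines of Python. The function names are hypothetical; the inputs are the figures quoted in this section (0.47 seconds per MPM, 5 frames per MPM, 10–15 frames per second).

```python
def frame_interval(fps):
    """Seconds between consecutive frames at the given frame rate."""
    return 1.0 / fps

def per_frame_time(time_per_mpm, frames_per_mpm):
    """Average processing time attributed to a single frame."""
    return time_per_mpm / frames_per_mpm

budget = frame_interval(10)            # 0.10 s at 10 frames per second
cost = per_frame_time(0.47, 5)         # average MPM time spread over 5 frames
print(round(budget, 3), round(cost, 3), cost < budget)   # 0.1 0.094 True
```

With the rounded average of 0.47 seconds the result is 0.094 seconds per frame, close to the 0.095 seconds quoted above and within the 0.10 second budget. At 15 frames per second the interval is about 0.067 seconds, so by this arithmetic the measured cost fits the 10 frames per second case only.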

6. Conclusion

This paper proposed a method for human action recognition from RGB video clips. First, motion is accumulated over short video clips in the form of a motion projection matrix (MPM); the MPM is then divided into horizontal and vertical patches independently. Every patch is converted to a histogram, and all the histograms are concatenated to make a feature vector. A bagging ensemble technique, in the form of a random forest, is used for classification. The performance is evaluated on the publicly available human action dataset JHMDB [29]. We also extend the idea to real-time fall detection, and the real-time algorithm is presented in the paper. To evaluate the proposed algorithm we performed classification (fall detection) on the UR fall dataset and compared the results with other state-of-the-art methods. The performance of our scheme, based on RGB data alone, is comparable to that of [19], which employs RGB-D and accelerometer data.

References

1. Aggarwal, Ryoo. Human activity analysis: A review. ACM Computing Surveys (CSUR). 2011 Apr 1; 43(3): 16.
2. Dalal, Triggs. Histograms of oriented gradients for human detection. In: International Conference on Computer Vision & Pattern Recognition (CVPR’05), IEEE Computer Society. 2005 Jun 20; 1: 886-893.
3. Leibe, Seemann, Schiele. Pedestrian detection in crowded scenes. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), IEEE. 2005 Jun 20; 1: 878-885.
4. Srivastava, Biswas. Human activity recognition using local motion histogram. In: International Conference on Next Generation Computing Technologies. Springer, Singapore. 2017 Oct 30; 908-917.
5. El-Henawy, Ahmed, Mahmoud. Action recognition using fast HOG3D of integral videos and Smith–Waterman partial matching. IET Image Processing. 2018 Jan 4; 12(6): 896-908.
6. Ganapathi, Prakash. 3D ear recognition using global and local features. IET Biometrics. 2018 Jan 22; 7(3): 232-41.
7. Song, Lan, Xing, Zeng, Liu. Spatio-temporal attention-based LSTM networks for 3D action recognition and detection. IEEE Transactions on Image Processing. 2018 Jul; 27(7): 3459-71.
8. Zhang, Yang, Chen, Yang, Han, Shao. Action recognition using 3D histograms of texture and a multi-class boosting classifier. IEEE Transactions on Image Processing. 2017 Oct; 26(10): 4648-60.
9. Chen, Liu, Zhang, Han, Kehtarnavaz. Multi-temporal depth motion maps-based local binary patterns for 3-D human action recognition. IEEE Access. 2017; 5: 22590-604.
10. Colque, Caetano, de Andrade, Schwartz. Histograms of optical flow orientation and magnitude and entropy to detect anomalous events in videos. IEEE Transactions on Circuits and Systems for Video Technology. 2017 Mar; 27(3): 673-82.
11. Khan, Karg, Kulić, Hoey. Detecting falls with x-factor hidden markov models. Applied Soft Computing. 2017 Jun 1; 55: 168-77.
12. Kianoush, Savazzi, Vicentini, Rampa, Giussani. Device-free RF human body fall detection and localization in industrial workplaces. IEEE Internet of Things Journal. 2017 Apr; 4(2): 351-62.
13. Casilari, Santoyo-Ramón, Cano-García. Analysis of public datasets for wearable fall detection systems. Sensors. 2017; 17(7): 1513.
14. Chun, Lee. Human action recognition using histogram of motion intensity and direction from multiple views. IET Computer Vision. 2016 Feb 12; 10(4): 250-7.
15. Zhang, Parker. Code4d: color-depth local spatio-temporal features for human activity recognition from rgb-d videos. IEEE Transactions on Circuits and Systems for Video Technology. 2016 Mar; 26(3): 541-55.
16. Dogan, Eren, Wolf, Baskurt. Activity recognition with volume motion templates and histograms of 3d gradients. In: 2015 IEEE International Conference on Image Processing (ICIP), IEEE. 2015 Sep 27; 4421-4425.
17. Luo, Yang, Tian, Yuan, Maybank. Learning human actions by combining global dynamics and local appearance. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2014 Dec 1; 36(12): 2466-82.
18. Zhou, Zhang. Human action recognition with multiple-instance Markov model. IEEE Transactions on Information Forensics and Security. 2014 Oct; 9(10): 1581-91.
19. Kwolek, Kepski. Human fall detection on embedded platform using depth maps and wireless accelerometer. Computer Methods and Programs in Biomedicine. 2014 Dec 1; 117(3): 489-501.
20. Cheng, Chen, Shen. A framework for daily activity monitoring and fall detection based on surface electromyography and accelerometer signals. IEEE Journal of Biomedical and Health Informatics. 2013 Jan; 17(1): 38-45.
21. Mukherjee, Biswas, Mukherjee. Recognizing human action at a distance in video by key poses. IEEE Transactions on Circuits and Systems for Video Technology. 2011 Sep; 21(9): 1228-41.
22. Martínez, Manzanera, Romero. Spatio-temporal multi-scale motion descriptor from a spatially-constrained decomposition for online action recognition. IET Computer Vision. 2017 May 4; 11(7): 541-9.
23. Ni, Moulin, Yang, Yan. Motion part regularization: Improving action recognition via trajectory selection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015; 3698-3706.
24. Ma, Sigal, Sclaroff. Space-time tree ensemble for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015; 5024-5032.
25. Fernando, Gavves, Oramas, Ghodrati, Tuytelaars. Modeling video evolution for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015; 5378-5387.
26. Zhang, Wang, Xiao, Zhou, Liu. Robust relative attributes for human action recognition. Pattern Analysis and Applications. 2015 Feb 1; 18(1): 157-71.
27. Vrigkas, Karavasilis, Nikou, Kakadiaris. Matching mixtures of curves for human action recognition. Computer Vision and Image Understanding. 2014 Feb 1; 119: 27-40.
28. Viola, Jones, Snow. Detecting pedestrians using patterns of motion and appearance. International Journal of Computer Vision. 2005 Jul 1; 63(2): 153-61.
29. Jhuang, Gall, Zuffi, Schmid, Black. Towards understanding action recognition. In: Proceedings of the IEEE International Conference on Computer Vision. 2013; 3192-3199.
