Abstract
This paper proposes a method to human action recognition from RGB video clips. The method is based on capturing the local motion information from smaller size video clips. Local motion information is captured through accumulation of motion in different shape and size of patches of spatial domain. The motion information is then transformed to motion histograms. Further, all the histograms are concatenated to make the proposed feature vector. Bagging ensemble technique, in form of random forest, is used for classification. The idea is further extended to real time human action recognition mechanism. To show the robustness and efficiency of proposed algorithm, it is performed on publicly available human action datasets Joint-annotated Human Motion Data Base (JHMDB) [29] and University of Rzeszów (UR) Fall detection dataset [19]. The results are also compared with other state of art methods.
Introduction
Recognizing Human actions from video has been an active research area for years. It is a challenge to detect human actions in video streams due to variations in pose, different camera calibration, wide variety of (loose and fit) apparels, clutter background and various lightning conditions. Camera movement makes it even more difficult. Potential applications includes elderly and patients care, kids daycare, group monitoring, sports training, surveillance, and abnormal behavior detection. Image based features have made considerable advances in recent years [1, 2, 3], but they are too mature to many practical applications. On the other side, certain kinds of movement are humans characteristic, so classification accuracy can potentially be improved by considering motion information. Many researchers in this area assumed that the camera and the background are essentially static. This greatly simplifies the problem because the mere presence of motion already provides a strong cue for human presence. For example, Srivastava and Biswas [4] captured the motion in form of histograms and it is used for human action detection. Henawy et al. [5] presented a process for multidimensional time evolving data modeling and classification. The authors are proposed a stabilized higher order linear dynamical system for classification. Bag of words approach is used for classification of multidimensional signals which could be video signals as well. Ganapathi and Prakash [6] presented a global 3D descriptor for human recognition. The proposed descriptor is calculated by concatenating the histograms of point clouds in specific annular region. Song et al. [7] proposed a learning process with the help of regularized cross-entropy for action recognition. The authors also proposed a process for action temporal proposals generation for action detection. Zhang et al. [8] proposed a 3D histograms of texture for feature extraction from depth data. A classifier is also proposed named multi-class boosting. Chen et al. [9] presented a descriptor, which is calculated in three stages. On first stage depth motion maps is calculated. Texture histograms of depth patches are calculated in stage two while compact representation of features are extracted with the help of Fisher kernel in stage three. Colque et al. [10] presented a descriptor Histograms of Optical Flow Orientation and Magnitude (HOFM). The HOFM is used to detect anomalous activities in videos. Khan et al. [11] presented a model to detect human fall when fall data is not available but sufficient data is available for ADL. Wearable devices and X-factor hidden Markov model (XHMM) are used in proposed model. Sanaz et al. [12] proposed an approach for detecting human fall in industrial environment without wearable devices. The authors used human induced diffraction for fall detection. Casilari et al. [13] present a survey on available wearable sensor approaches and datasets to detect the same. Chun and Lee [14] worked for multi camera view for human action. They proposed an optical flow based histogram for human action recognition. Zhang and Parker [15] proposed CoDe4D features for action recognition. The CoDe4D is extracted, from RGB-D data, as multi-channel orientation histogram. To make the method view independent, Dogan et al. [16] proposed 3D volume motion templates (VMTs). A rotation with respect to a canonical orientation is made by the authors. Luo et al. [17] proposed to model the motion dynamics parameters as motion descriptor for action recognition. The model is designed with histograms of oriented gradients and robust linear dynamical systems. Zhou and Zhang [18] used multiple-instance formulation to find elementary actions with stable states. Encoding of local parts movements of human action is proposed. Kwolek and Kepski [19] used accelerometric data and RGB-D data for detecting human fall. The authors track the accelerometric and depth data for potential human fall. And whenever it crosses a set threshold, it is verified for fall detection in synchronized RGB data. Cheng et al. [20] presented a framework for activity recognition. The framework is using surface electromyography and accelerometer signals for activity recognition. To detect the boundaries of the activity the authors used histogram of negative entropy. Mukherjee et al. [21] used histogram of oriented optical flow and a bag of word approach to compute the descriptor. A graph theoretic technique is proposed for human actions recognition by the authors.
In this paper we describe our endeavor to propose methods for efficient human action recognition in real time. Rest of the paper is designed as follows: Section 2 consists of some related work in the area. The proposed methods are discussed in the Section 3. In Section 4, there are descriptions of our experiments, datasets of interest and real time implementation details of proposed method. Section 4 also contains results comparisons with other state of art methods. The conclusion is given in Section 5.
(a) Video volume’s frames for pour activity, (b) transformed video sub volume’s frames and (c) MPM for one sub volume of frames.
Colque et al. [10] presented a model for grabbing anomalies in human actions. The authors used orientation, velocity, and entropy to compute optical flow histogram as a feature. Martínez et al. [22] presented an algorithm for online action recognition and shown the average accuracy on per frame basis. Flow orientation histograms of variable but small sized and overlapped patches for each frame is calculated in the proposed algorithm. SVM is used for action classification. Ni et al. [23] introduced a framework for Motion Part Regularization (MPR). The framework is used for mining dense trajectories discriminative groups. The representative group is used for action detection. Ma et al. [24] designed a hierarchical spatio temporal trees. These trees is used for making vocabulary of similar actions for partial or complete human motion. The vocabularies are used for classification. Fernando et al. [25] enumerated human actions in chronological order by exploiting temporal ordering in videos. The enumeration is done using ranking learning framework for recording of significant information. Zhang et al. [26] proposed a noise immune learning framework for human action recognition. The framework is designed with the help of relative attributes of human action in videos. Vrigkas et al. [27] presented a learning-based framework. Gaussian mixture model is used for clustering of motion curves and longest common subsequence is used to calculate feature vector. Viola et al. [28] improved a pedestrian detection system by improving its performance in low resolution camera and bad weather conditions. To increase the system performance, Viola et al. [28] used a detector by combining appearance and motion.
The researchers (of human action recognition area) have contributed a lot in terms of proposing various descriptors and classifiers. There are still many scope for further work because results are not at par in practical scenarios. In this paper we present a simple but effective motion histogram based approach for detection of motion along vertical and horizontal patches of video sub volumes. This helps in increasing immunity of feature vector with respect to the relative placement of camera with object or actor. The detail descriptions are given in following section.
Proposed method
The activity video clips in various datasets are of different length in nature due to following reasons (i) various actors performed an (same) activity with different speed, (ii) various activities are of different length in nature.
So every video clip is divided in to fixed and equal length video sub volumes. That is, if a video volume is divided in to
where
The motion between two consecutive frames for one pixel is calculated by Eq. (2) and the process will repeat for all the pixels, so the size of
where
There could be multiple ways to divide the MPM in to patches. Patches could be of fixed or variable size, could be of different shapes. But to make proposed algorithm computationally economical fixed sized and rectangular shaped patches are chosen. Splitting MPM in to grids patches, as shown in Fig. 2a, would record the motion in very specific region. Small area specific motion recording would lead to over-fitted classifier.
So, we propose, the MPM division into equal sized C vertical (or column) patches and equal sized R horizontal (or row) patches independently. Every vertical patch is immune to differentiate upper region motion with lower region motion while it (vertical patch) is sensitive to differentiate between motions in left region with motion in right region. The same would be true for horizontal patches by swapping upper and lower to left and right. Now vertical patches recording the motion which is sensitive to left and right region while immune to the motion performed in upper and lower region of video sub volume. And horizontal patches recording the motion which is sensitive to upper and lower region while immune to the motion performed in left and right region of video sub volume. Another challenge to this approach is not to differentiate between similar motion performed “near to camera” and “away from camera”. But the proposed approach would also take care of the mentioned challenge due to the reasoning explained above.
Combining these two types of motion information (horizontal and vertical patch motion) as a feature vector would intuitively lead to train a classifier neither under fitted nor over fitted manner. The intuition is reflected to be correct in Section 4 of this paper.
Segmenting 
Motion storage in a typical MPM is done in form of pixel intensity. That is a black pixel intensity value represents no motion at the pixel location while gray and white pixel intensity values represent low and high magnitude motions respectively at the pixel location. The motion information in a single patch is stored in form of pixel intensity values. The motion information is converted to intensity histogram with B number of bins to make the size of feature vector fixed and smaller in comparison to if it were stored in form of raw pixel values. The histograms for horizontal patches are represented by
The proposed feature vector
MPM’s horizontal and vertical patch splitting with histogram name indication.
In Fig. 3 horizontal and vertical patches of a typical MPM are depicted. It also contains the used nomenclature of histograms.
Most of the available human action dataset having multiple action classes. So any deputed classifier for human action must having property of multiclass classification. Multiclass classification behavior could obtained by combining multiple binary classifier (one for each class) with one class versus all other classes approach. But the approach would lead to imbalance data from a binary classifier point of view because almost every available human action dataset have same number of video clips for every action of the dataset.
Random forest is a multi-class classifier having multiple decision trees. Every tree of the forest is trained by some (randomly chosen) features from the set of all features. To decrease the correlation between such trees, bagging (Bootstrap Aggregation) is used in random forest. After training of random forest, when a test class comes before random forest, it collects the decision (or vote) of all decision trees and the class which received maximum votes, declared as output class of the random forest.
So we chose random forest for human action classification in proposed method. Publicly available datasets named JHMDB [29] and UR Fall detection [19] have been chosen for evaluating our proposed method. We further chosen classification accuracy as an performance measure of our experiments. The classification accuracy is calculated as in Eq. (5)
The JHMDB dataset consists of video clips of 21 different human activities. The dataset consists of more than 30 clips per action class and each clip containing 15 to 40 number of frames. The dataset is composed of video clips from different sources like movie clips, youtube videos and google videos. The description of all the activities and dataset complexity in terms of body part visibility (in video clips) are shown as horizontal bar chart in Fig. 4.
Action of HMDB [29] dataset and video distribution on the basis of body parts visibility of actors.
Classification accuracies of the experiments on JHMDB dataset on different parameter values
The videos are recorded with different types of cameras, resolutions, lighting conditions and unconstrained environment. Occlusion is also presented in videos with different magnitudes. Figure 5 depicts some of the actions of JHMDB dataset, columns 1–4 of the figure showing catch, jump, wave and push activity respectively. While rows of the figure show snap shots of actions at two different time instances.
Comparison of proposed method with Jhuang et al. [29] methods
Snapshots of catch, jump, wave and push actions of JHMDB dataset in column wise manner. Upper and lower snapshots of every columns are taken at two different time instance.
We have performed various experiments on the JHMDB dataset for different values of
Experiments are enumerated in order to their accuracies. Highest accuracy is achieved when the parameters are taken as shown in experiment 1. Detailed result discussion for JHMDB in the rest of the paper is done in context of experiment 1. The level of challenge in JHMDB is quite high due to its variability. For example there is no restriction on camera movements, camera angle, distance of camera with actor and angle of light sources with respect to object.
Activity wise classifier accuracy for JHMBD dataset.
The proposed algorithm worked well on challenging dataset JHMDB for many activities while some activities are significantly misclassified. Pull-up and shoot bow activities are classified with significantly good accuracy of 89% and 83% respectively. There are four more activities which have been classified with accuracy more than 70%. Activities of sit, run, walk andkick ball are significantly misclassified. The reason for significantly misclassification could be high degree of closeness among some activities or high degree of occlusion etc. The overall classification accuracy 51.75% is achieved for proposed algorithm. Activity wise classifier accuracy is shown in Fig. 6. To increase the visibility of accuracy graph in Fig. 6, the activities are being renamed as brush hair activity is renamed as ‘a’, cart wheel activity is renamed as ‘b’ similarly catch activity is renamed as ‘c’ and so on. Activity sequence is taken same as depicted in Fig. 4.
Experimental results for optimal parameter values on UR fall dataset
Frames of Fall from Sitting, Fall from standing from UR Fall Dataset, while Bend and pick, Look under furniture (column wise) activities of ADL dataset.
Results of proposed method are compared with other state of art techniques in Table 2. It can be seen that our approach is performing better than all the histogram and trajectory based approaches of Jhuang et al. [29].
This dataset contains 70 clips, in which 30 clips are of two types of fall activity, one is fall from standing, second is fall from sitting. The remaining 40 clips are ADL (activity of daily living). ADL includes bend and pick an item, look below the furniture and sit on chair activities.
Every activity is performed once or twice by 5 persons. The ADL activities are partially similar to fall activity so it is a challenging dataset to detect fall activity in query clip. Figure 7 depicts some of the activities of UR-Fall detection and ADL dataset, every column represent an activity (Fall from Sitting, Fall from standing, Bend and pick, Look under furniture). First row of the figure shows starting gesture of the activity while second row represents one of the last gesture of the activity.
As per our observation, falling is an instant/short duration activity (less than a second), which involves quick movement of an object in a very short duration. Detection of the high magnitude movement in short duration could be distinguished from low magnitude movement in ADL. Real time Fall detection could serve a meaningful purpose in the real world scenario. For instance, one could raise an alarm on detection of a falling action.
Using our scheme, we could sense if contiguous bundles are being labelled falling, and raise an alarm in real time. In our experiments we found that sensing three continuous fall detection in motion projection matrices are sufficient for accurate detection of fall activity with minimum false positives.
In the UR-fall dataset we manually label the frames as falling in which actually falling is observed, and the rest of the frames are labeled as non-falling in each clip. Further we apply proposed algorithm to calculate the feature vectors as per Algorithm 1 and the whole process is depicted in Fig. 9. To show the effectiveness of the proposed algorithm, three different data-splitting has been chosen for training and testing purpose. In all the following data configurations 70%–30% splitting are done for training-testing.
Training and testing from UR fall dataset only Training from UR fall dataset and testing with ADL dataset Training and testing from UR fall dataset and ADL dataset collectively
To choose the optimal value of parameters (bundle size, number of bins, Number of trees in random forest, number of horizontal and vertical patches) in the proposed algorithm, some experiments are done on UR fall dataset as (a) splitting. The results are shown in Table 3. It is observed that as the number of horizontal and vertical patches are increased in Motion Projection Matrices (MPM), accuracy drops. Similarly increasing the number of bins for histogram calculation also decreases accuracy. In the fall detection scenario, background and camera is static while the motion is only due to object under observation, so higher value of parameters lead to dipping in accuracy.
Accuracies of proposed algorithm under splitting (b) and (c)
Comparison of proposed algorithm accuracy with [19] for fall detection and not-fall detection
For splitting (b) and (c), parameter values are chosen as per experiment number 1, except the number of trees in random forest in Table 3. The number of trees is increased to 100, because size and variety of training samples are increased in splitting (b) and (c).
In splitting (b) training is done by UR fall detection dataset only and testing is performed on ADL dataset which have activities close to falling but not actual falling. The accuracy (98.93%) shown by proposed algorithm is significantly acceptable level. It also shows very low false alarm about falling, which is also an important aspect of fall detection system.
In splitting (c) all the activities from both the datasets (UR Fall and ADL) are divided in to 70–30 ratio for training and testing purpose. The accuracy achieved in this configuration is 97.52 which is slightly lower than previous (c) splitting. It is a bit lower because the classifier is trained for additional activities (from ADL) which was not the case in splitting (c).
Processing time (in seconds) for every MPMs for UR fall detection and ADL dataset.
Kwolek and Kepski [19] prepared the UR fall and ADL dataset. They also proposed a system for fall detection using synchronized accelerometer, depth and RGB data. For depth data, a MS Kinect or similar device is required and for accelerometer data, object have to wear a device. Wearing a device is an overhead for elderly people.
We compare accuracy of fall detection and non-fall detection of proposed algorithm with [19] in Table 5. In fall detection section (of Table 5), proposed algorithm is performing much better with respect to [19] when only RGB-D data is used. The performance of proposed algorithm is slightly lower than [19] when accelerometer data along with RGB-D data is used. Further on non-fall detection section in the table, when only ADL data is used for testing purpose, accuracy of the proposed algorithm is comparable. The accuracy is comparable because proposed algorithm is using only RGB data while the [19] is using RGB-D data along with accelerometer data.
CCTV cameras normally maintain its recording at 10–15 frames per second, which means at every 0.07–0.10 seconds a new frame arrives. So any real time algorithm should keep its average processing (per frame) time below 0.10 seconds.
Proposed real time algorithm is implemented in Matlab R2015a, on operating system Windows 7. Hardware specifications are 2 GB RAM, Intel i3 2.20 GHz processor. To show the usefulness of the framework in real time processing time is calculated. The processing time is calculated as sum of time of calculating feature vector for one motion projection matrix MPM, and time elapsed in classifying a query class. The equation is shown below as Eq. (6).
Processing time (per MPM)
Flow chart for real time fall detection.
Processing time per MPM (in second) is depicted in Fig. 8. The
This paper proposed a for human action recognition from RGB video clips. Firstly motion was accumulated in smaller video clips in form of motion projection matrix (MPM), then MPM is divided into horizontal and vertical patches independently. Every patches are converted to histogram followed by concatenation of all histograms to make a feature vector. Bagging ensemble technique, in form of random forest, is used for classification. The performance is evaluated on publicly available human action datasets JHMDB [29]. We also extend the idea for real time performance on fall detection. The real time algorithm is presented in the paper. For evaluation purpose of proposed algorithm we performed classification (fall detection) on UR fall dataset and compare the results with other state of art methods. It is seen that performance of our scheme, based on just RGB data gives comparable results to [19] who employ RGB-D and accelerometer data.
