Abstract
Over the last three decades, human activity recognition has advanced rapidly, driven by less expensive RGB-D cameras and the growing volume of video data. As the number of surveillance cameras increases, manual annotation becomes impractical, and the need for automatic recognition and annotation of video arises. In this paper, we introduce a computationally efficient and storage-efficient method for recognizing human activities from depth videos, together with a new frame selection method based on the mean value of the motion energy. We extract normal vectors from the points on the boundary curve. Polynormals are then obtained by sequentially concatenating the normals from a neighborhood of each point on the boundary curve. The polynormals from a spatio-temporal cuboid constructed from the input video are pooled to form the Super Normal Vectors. These Super Normal Vectors are the final feature vectors, which are given as input to the classifier, a LIBLINEAR SVM. The results on the MSRAction3D dataset show that the proposed algorithm is fast and that its accuracy is comparable with that of existing methods. The proposed method achieves an accuracy of 88% when all frames are used and 89.82% when the frame selection method is applied. The proposed method is also tested on the UTD-MHAD dataset.
Introduction
Nowadays, owing to new cost-effective cameras and the growth of the internet, large amounts of video data are produced and shared online or uploaded to social media sites. To use these videos effectively, annotation is needed, but the sheer quantity makes manual annotation infeasible. Automatic monitoring of surveillance cameras is also needed in order to respond to unusual events, many of which require fast and immediate action. We therefore propose a new human activity recognition algorithm that is faster than many state-of-the-art methods. The aim of human activity recognition algorithms is to classify videos into categories based on the activities performed in them. Researchers in computer vision and artificial intelligence show great interest in human activity recognition algorithms. Human activity recognition has many applications, such as sign language recognition [1], human-computer interaction [1], video surveillance [1], content-based video retrieval [2], and video summarization [3]. Medical healthcare [1] and social care also benefit from human activity recognition algorithms. In medical and healthcare settings, these algorithms can be used to detect falls of elderly people living alone at home [4] and to monitor the activities of patients. In social care, they can be used to detect abnormal activities in public places such as metro stations and air terminals.
Based on their level of difficulty, activities can be grouped into four categories: (i) gestures, (ii) actions, (iii) interactions, and (iv) group activities [5]. Gestures are motions of a part of the body, for example, moving a leg or raising a hand. Actions are performed by a single person and combine several gestures, for example, waving, walking, or punching. Interactions are activities in which a person interacts with an object or with another person, such as kicking a ball or a fight between two people. Group activities are performed by two or more people, for example, a group of people fighting or playing football.
For every computer vision algorithm, the initial step is data acquisition. Based on the data acquisition method, human activity recognition algorithms fall into two categories: sensor-based and vision-based. In the first category, the person has to wear sensors, which may be accelerometers, gyroscopes, or goniometers. In the second category, actions are recognized from videos captured by a visible-light camera or a depth camera. Videos recorded with an ordinary visible-light camera are affected by noise, illumination variation, and similar factors. A depth camera can be used to address these issues. The three main benefits of a depth camera are: (i) it provides the 3D structure of the scene, which eases the recovery of postures and the recognition of the activity; (ii) it works even in darkness, because it uses infrared light; and (iii) it is cost effective. Owing to these advantages, researchers are now showing greater interest in depth videos than in ordinary RGB videos. In this paper, we present a computationally efficient method for human activity recognition based on Super Normal Vectors extracted from the points on the boundary curves.
Section 2 reviews techniques for human activity recognition and related work. Section 3 explains the proposed methodology in detail and highlights the contributions of the proposed work. Section 4 presents the evaluation protocol, the datasets used, and the outcome of the experiments, and discusses the evaluation measures along with a quantitative and qualitative analysis of the proposed work. Section 5 concludes the paper by highlighting the advantages of the proposed method and pointing to the future scope of human activity recognition.
Literature Survey
Traditional video descriptors based on optical flow and gradients, designed for RGB videos, are not suitable for representing depth videos and hence cannot be used to classify them. Algorithms designed for recognizing human activities from depth videos are therefore based on skeleton joints, projected depth maps, cloud points, normal vectors, and so on. Lu Xia et al. [6] presented an algorithm for human activity recognition based on spatio-temporal interest points and depth cuboid similarity features. In [7], Eweiwe et al. used the relative locations of the human joints, their correlations, and their velocities for activity recognition. In this method, joint features are extracted and converted to the spherical coordinate system to make them more robust to the foreshortening effect. These features are then added to a histogram. In the cross-subject test they obtained an accuracy of 89.3% on the MSRAction3D dataset. Several recently developed methods exploit normal vectors, which provide information about shape and structure. In [8], Oreifej et al. proposed an activity descriptor based on the histogram of normals. The temporal order is also encoded, achieving a satisfactory recognition accuracy of 88.89%.
In another human activity recognition method [1], extended surface normals of cloud points from a 3×3×3 neighborhood are concatenated to form the polynormals. The polynormals are encoded using an online sparse coding technique with a small dictionary of size 100×81. Polynormals are then extracted from an adaptive spatio-temporal grid constructed from the motion energy, and spatial average pooling and temporal max pooling are applied to obtain the Super Normal Vectors. This pooling stage is a time-consuming step in the algorithm. The computed Super Normal Vector is classified using a LIBLINEAR SVM, with a testing accuracy of 93.09% on the MSRAction3D dataset. Yang et al. [9] replaced the adaptive spatio-temporal grid with an adaptive spatio-temporal cell and obtained an accuracy of 93.45%.
In [10], Chengjin Li et al. extended the concept of Super Normal Vectors to obtain a better result. They computed the polynormal by concatenating three levels of surface normals. The bottom layer contains normals from the 3×3×3 neighborhood, the middle layer contains normals from the 2×2×3 neighborhood, and the top layer contains normals from the 1×1×3 neighborhood. Chengjin Li et al. [10] use a more discriminative sparse dictionary learning technique, called the sparse constraint dictionary, and achieve an accuracy of 94.55%. A method based on 3D point clouds [11] achieves an accuracy of 91.64%. In this method, the eigenvalues of a 3×3 matrix are taken at each point. The Histogram of Oriented Principal Components (HOPC) is a series of eigenvectors arranged in descending order of their eigenvalues. In [12], Wanqing Li et al. used a bag of 3D points for human activity recognition.
A fast method for human activity recognition is introduced in [13], where depth motion maps are constructed by accumulating the motion energy of the projected depth maps along three planes: front, side, and top. In [14], Rui Yang et al. introduced two architectures, the first based on a 2D-CNN and the second on a 3D-CNN. In the first method, the complex human activity recognition problem is reduced to a simple image classification task by stacking together the Depth Motion Maps (DMMs) of all frames. In the second method, a DMM cuboid is constructed to learn more temporal information than the first. On the MSRAction3D dataset, they obtained an accuracy of 92.21% for the first method but only 86.08% for the second. To reduce the computational complexity and improve the recognition accuracy, Cheche Xie et al. introduced a key frame extraction method in [15]. In this method, frames are extracted based on structural distance similarity using a self-adapted search algorithm. A human activity recognition method based on depth map and posture data with a deep CNN model, introduced by Aouaidjia Kamel et al. in [16], uses two descriptors: the Depth Motion Image (DMI) and the Moving Joint Descriptor (MJD). The DMI represents the depth map sequence and the MJD represents the body posture sequence. Another advantage of this method is that it uses only the front projection instead of all three projections. The accuracy they obtained on the MSRAction3D dataset is 94.51%.
In [17], the Depth Motion Map (DMM) is also used for human activity recognition. The depth video is first projected onto three Cartesian planes, and the differences between adjacent frames are accumulated to form the depth motion map. The accuracy obtained on the MSRAction3D dataset is 92.3%.
Our proposed method is also based on Super Normal Vectors. We compute the normals for the points on the boundary curve of the person performing the activity. We then pool the polynormals from an adaptive spatio-temporal cuboid to form the Super Normal Vectors.
Human Activity Recognition based on Adaptive Cuboid and Super Normal Vector – Proposed Method
Our aim is to design and develop a space-efficient and time-efficient system for human activity recognition that works efficiently with depth videos, so that it can be used in real-time applications. In [1], X. Yang et al. computed polynormals from the cloud points in the depth sequence. Since each frame contains a large number of cloud points, extracting the normals from the cloud points and concatenating them over a neighborhood is a time-consuming process. In another work [19], J. Park et al. showed that the surface normals at the points on boundary curves are sufficient for modeling objects. Applying this insight to human activity recognition, we reduce the computation time by considering only the points on the boundary curve. Like many existing human activity recognition algorithms, the proposed algorithm has three steps: data acquisition, feature extraction, and classification.
The feature extraction step is further divided into the following steps: computing the motion energy, computing the polynormals, learning the dictionary, constructing the adaptive spatio-temporal cuboid, and aggregating the Super Normal Vectors. The proposed method is summarized in the following diagram.

Block diagram of proposed method.
Different people perform the same activity at different speeds, so dividing the time axis into equal parts is not flexible enough. To overcome this difficulty, it is desirable to pool together the features that share the same activity status. Therefore, we divide the motion energy axis, rather than the time axis, evenly into four segments.
First, we construct the projected maps by projecting each frame of the depth sequence onto three Cartesian planes: front, top, and side. Thresholding is then applied to the difference between the projected maps of two neighboring frames to form a binary map. Finally, we compute the motion energy by summing the non-zero elements of the binary maps, following the formula in [1].
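The motion energy formula itself is not reproduced in this text. The following is a minimal sketch of the computation as described above, assuming the projected maps are already available as 2D lists and that the energy is accumulated over frames as in [1]. The function names, the data layout, and the `threshold` parameter are illustrative assumptions, not the authors' implementation.

```python
def changed_pixels(prev_map, next_map, threshold):
    """Count pixels whose absolute difference exceeds the threshold.

    This equals the number of non-zero entries of the binary map.
    """
    return sum(
        1
        for row_prev, row_next in zip(prev_map, next_map)
        for a, b in zip(row_prev, row_next)
        if abs(b - a) > threshold
    )


def motion_energy(projected_views, threshold):
    """Accumulated motion energy for every frame of a depth sequence.

    projected_views: three lists (front, side, top), each holding one 2D
    projected map per frame (assumed layout).
    Returns a list whose entry t is the number of non-zero binary-map
    pixels accumulated over all views up to frame t; entry 0 is 0.
    """
    num_frames = len(projected_views[0])
    energy = [0]
    for t in range(1, num_frames):
        step = sum(
            changed_pixels(view[t - 1], view[t], threshold)
            for view in projected_views
        )
        energy.append(energy[-1] + step)
    return energy
```

Because the energy is accumulated, it is non-decreasing in the frame index, which is what allows frame numbers to be read off at fixed energy levels later on.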
A depth video can be considered as a hypersurface, and this hypersurface can be represented by a collection of points of dimension m. All the m-dimensional points in the depth video should therefore satisfy the hypersurface equation given in [1].
For depth sequences the dimension is 4, i.e., x, y, t, and z, so m = 4. Each cloud point in the depth video should therefore satisfy the corresponding four-dimensional equation in [1].
The extended surface normal can then be computed using the formula in [1].
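The equations referred to above are not reproduced in this text. As a hedged reconstruction, the standard formulation for a depth sequence z = f(x, y, t), which matches the description of the normal as three gradient components and the scalar −1, can be written as follows. The symbols S, f, and n are notation assumed here.

```latex
% Hypersurface traced by the depth sequence z = f(x, y, t) in 4D space
S(x, y, t, z) = f(x, y, t) - z = 0

% Extended surface normal: the gradient of S
\mathbf{n} = \nabla S
  = \left( \frac{\partial z}{\partial x},\;
           \frac{\partial z}{\partial y},\;
           \frac{\partial z}{\partial t},\;
           -1 \right)^{T}
```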
The components of the surface normal are the gradients along the three directions and a scalar (−1). This scalar aligns the orientation of the normal with the direction of the gradient. The distribution of normal orientations provides more information about geometric cues than traditional gradient orientations. The normal vectors also carry information about motion cues. To preserve the correlation of each normal with its neighbors, we form clusters of normals from a 3×3×3 spatio-temporal neighborhood. These clusters of normals are called polynormals, and they are more robust to noise. The polynormals are represented using the formula in [1].
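A minimal sketch of the normal and polynormal computation described above, assuming the depth video is a nested list indexed as `depth[t][y][x]`, gradients are approximated by central differences, and all four components of each normal are kept. These choices and the function names are illustrative assumptions. The selection of boundary-curve points is not shown, so the caller is expected to pass only the points of interest.

```python
def gradient(depth, t, y, x):
    """Central-difference estimates of (dz/dx, dz/dy, dz/dt).

    Indices are clamped at the volume borders.
    """
    num_t, num_y, num_x = len(depth), len(depth[0]), len(depth[0][0])

    def at(tt, yy, xx):
        tt = min(max(tt, 0), num_t - 1)
        yy = min(max(yy, 0), num_y - 1)
        xx = min(max(xx, 0), num_x - 1)
        return depth[tt][yy][xx]

    dz_dx = (at(t, y, x + 1) - at(t, y, x - 1)) / 2.0
    dz_dy = (at(t, y + 1, x) - at(t, y - 1, x)) / 2.0
    dz_dt = (at(t + 1, y, x) - at(t - 1, y, x)) / 2.0
    return dz_dx, dz_dy, dz_dt


def extended_normal(depth, t, y, x):
    """Extended normal: the three gradient components and the scalar -1."""
    dz_dx, dz_dy, dz_dt = gradient(depth, t, y, x)
    return (dz_dx, dz_dy, dz_dt, -1.0)


def polynormal(depth, t, y, x):
    """Concatenate the normals of the 3x3x3 spatio-temporal neighborhood."""
    components = []
    for dt in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                components.extend(
                    extended_normal(depth, t + dt, y + dy, x + dx)
                )
    return components
```

With four components per normal, this sketch yields a vector of 27 × 4 = 108 values per point. Restricting the calls to boundary-curve points is what reduces the number of polynormals compared with processing every cloud point.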
After obtaining the polynormals, we encode them using the sparse coding technique introduced in [21]. In traditional methods, the low-level features are discarded in the recognition stage once the features have been coded. In the proposed method, we take the difference between the whitened polynormal and the dictionary word so that the low-level features are retained. The whitening transformation converts the covariance matrix into the identity matrix. The sparse coding (basis pursuit) problem is solved using the formulation in [1].
If the entries of the dictionary D are allowed to grow arbitrarily large, the coefficients α become arbitrarily small. To avoid this, each dictionary word d_k is constrained so that d_k^T d_k is less than or equal to 1.
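The objective itself is not reproduced in this text. In the standard dictionary learning form, with an l1 sparsity penalty and the norm constraint on the dictionary words described above, it can be written as follows. The symbols N (number of polynormals), K (number of dictionary words), λ (sparsity weight), and p_i (the i-th whitened polynormal) are notation assumed here.

```latex
\min_{D,\,\alpha} \;
  \frac{1}{N} \sum_{i=1}^{N}
  \left( \frac{1}{2} \left\| p_i - D\alpha_i \right\|_2^2
         + \lambda \left\| \alpha_i \right\|_1 \right)
\quad \text{subject to} \quad
  d_k^{T} d_k \le 1, \;\; k = 1, \dots, K
```

Without the constraint, scaling D up by any factor and α down by the same factor would leave the reconstruction term unchanged while shrinking the penalty, which is the degenerate case the constraint rules out.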
The extracted polynormals capture only the local discriminative properties required for classifying the video. A spatio-temporal cuboid is therefore constructed from the depth sequence and combined with the polynormals to incorporate the global temporal order and spatial layout. Unlike traditional methods, which use a grid covering the entire frame, we use a smaller grid that is just large enough to cover the person in the video.
To construct the temporal cuboid, we first normalize the motion energy computed in the previous step to the range [0, 1]. A plot is then drawn with the normalized motion energy along the y-direction and the frame index along the x-direction. We then divide the motion energy axis evenly into four segments with normalized energy ranges [0–0.25, 0.25–0.5, 0.5–0.75, 0.75–1], and the frame numbers corresponding to the energies 0, 0.25, 0.5, 0.75, and 1 are denoted t0, t1, t2, t3, and t4, respectively. A temporal cuboid is constructed by taking the four successive segments {t0–t1, t1–t2, t2–t3, t3–t4} of the video based on the motion energy. Our adaptive spatio-temporal cuboid consists of this temporal cuboid and a 4×3 spatial grid.
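The temporal partition described above can be sketched as follows, assuming the accumulated motion energy is given as a list with one value per frame. The function name and the rule of taking the first frame that reaches each energy level are illustrative assumptions.

```python
def temporal_boundaries(energy, num_segments=4):
    """Frame indices t0..tN at which the normalized energy reaches k/N.

    energy: accumulated motion energy, one value per frame.
    Returns num_segments + 1 frame indices.
    """
    lowest, highest = min(energy), max(energy)
    span = highest - lowest
    if span == 0:
        raise ValueError("motion energy is constant; cannot normalize")
    # Min-max normalization to the range [0, 1].
    normalized = [(e - lowest) / span for e in energy]
    boundaries = []
    for k in range(num_segments + 1):
        level = k / num_segments
        # First frame whose normalized energy reaches this level.
        index = next(i for i, v in enumerate(normalized) if v >= level)
        boundaries.append(index)
    return boundaries
```

For example, with `energy = [0, 1, 2, 3, 4, 8, 12, 16]` the boundaries are `[0, 4, 5, 6, 7]`: the slow first part of the activity is covered by one long segment, while the fast later part is split into short segments.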
Aggregating the super normal vectors
For each visual word in the constructed dictionary, we compute the weighted difference between the polynormal and the visual word, using the sparse coefficient as the weight. We then apply average pooling at the spatial level and max pooling at the temporal level to these differences [1].
The resulting vectors are termed Super Normal Vectors and are fed to the classification stage as input.
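A minimal sketch of the aggregation for a single spatial cell, assuming the polynormals, their sparse codes, and the dictionary are given as plain lists. The function name and data layout are illustrative. Keeping the entry of largest magnitude during temporal max pooling is an assumption made here; a plain element-wise maximum is a drop-in alternative.

```python
def aggregate_cell(polynormals, codes, dictionary, temporal_bins):
    """Pool weighted differences into one descriptor for a spatial cell.

    polynormals:   list of vectors p_i.
    codes:         list of sparse coefficient vectors, codes[i][k] being
                   the weight of visual word k for polynormal i.
    dictionary:    list of visual words d_k.
    temporal_bins: list of index lists, one per temporal segment.
    """
    dim = len(dictionary[0])
    descriptor = []
    for k, word in enumerate(dictionary):
        pooled = [0.0] * dim
        for indices in temporal_bins:
            if not indices:
                continue
            # Spatial average pooling of the weighted differences.
            average = [0.0] * dim
            for i in indices:
                weight = codes[i][k]
                for j in range(dim):
                    average[j] += weight * (polynormals[i][j] - word[j])
            average = [v / len(indices) for v in average]
            # Temporal max pooling: keep the larger-magnitude entry.
            pooled = [
                a if abs(a) > abs(b) else b
                for a, b in zip(average, pooled)
            ]
        descriptor.extend(pooled)
    return descriptor
```

Under this sketch, the final feature vector would be the concatenation of such descriptors over all cells of the 4×3 spatial grid.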
Traditional classifiers may not achieve good accuracy in our setting because the number of classes and the number of polynormals are very high. We therefore use the technique introduced in [20] by R. Fan et al., which can classify large sparse datasets effectively. The accuracy we obtained is 88%. To remove the frames that lead to false classification, we introduce a frame selection method, which also reduces the computation time and storage requirements of the algorithm.
Frame selection
A depth video comprises many frames, and many of them may be redundant. To identify the irrelevant frames, we propose a new frame selection method based on the motion energy.
The indices of the frames selected for feature extraction are computed using the formula
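The selection formula is not reproduced in this text. Since the method is described as being based on the mean value of the motion energy and as removing frames with low motion energy, one plausible reading is sketched below: a frame is kept when its per-frame motion energy is at least the mean over the sequence. This rule and the function name are assumptions for illustration and may differ from the authors' exact formula.

```python
def select_frames(frame_energy):
    """Indices of frames whose motion energy is at least the mean.

    frame_energy: per-frame (not accumulated) motion energy values.
    The mean-threshold rule is an assumed reading of the method.
    """
    if not frame_energy:
        return []
    mean_energy = sum(frame_energy) / len(frame_energy)
    return [i for i, e in enumerate(frame_energy) if e >= mean_energy]
```

For example, `select_frames([1, 5, 9, 2, 8])` keeps frames 1, 2, and 4, since the mean energy is 5.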
The proposed method is summarized in the following algorithm.
The contributions of the proposed work are summarized here. We propose a new human activity recognition algorithm that extracts Super Normal Vectors from boundary curves. The computation time is reduced by a factor of 2.5 compared with SNV [1], and the storage requirement is reduced by a factor of 1.7 compared with SNV [1]. We also propose a new frame selection technique based on modeling the motion energy.
Result analysis and discussions
MSRAction3D Dataset
We tested our method on the MSRAction3D dataset, which contains depth videos of single-person actions performed by 10 subjects facing the camera. It contains 20 activity classes, and each activity is performed 2 or 3 times by each subject. The videos are recorded with the camera placed in front of the subject. The dataset contains a total of 567 videos, captured with the Microsoft Kinect camera.
Experimental setup for MSRAction3D dataset
To compare our results with other methods, we use the same experimental setup as [1]. Videos of the odd-numbered subjects (1, 3, 5, 7, 9) are used for training and videos of the remaining subjects (2, 4, 6, 8, 10) are used for testing. The training set therefore contains 292 videos and the test set contains 275 videos. Some activity recognition algorithms evaluated on MSRAction3D use three subsets, so we also tested the proposed method on these subsets. The first Activity Subset (AS1) contains videos of activity classes 2, 3, 5, 6, 10, 13, 18, 20; the second (AS2) contains videos of activity classes 1, 4, 7, 8, 9, 11, 12, 14; and the third (AS3) contains videos of activity classes 6, 14, 15, 16, 17, 18, 19, 20. There are 226, 231, and 226 videos in AS1, AS2, and AS3, respectively. AS1 and AS2 group actions with similar movements, whereas AS3 groups complex actions together.
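The cross-subject protocol and the three activity subsets can be sketched as follows. The sketch assumes that each file name encodes the action, subject, and trial in the form `a01_s01_e01` (for example `a01_s01_e01_sdepth.bin`); this naming pattern and the function names are assumptions and should be adapted to the local copy of the dataset.

```python
import re

# Activity classes of the three subsets, as listed in the text.
ACTION_SUBSETS = {
    "AS1": {2, 3, 5, 6, 10, 13, 18, 20},
    "AS2": {1, 4, 7, 8, 9, 11, 12, 14},
    "AS3": {6, 14, 15, 16, 17, 18, 19, 20},
}

# Assumed naming pattern: a<action>_s<subject>_e<trial>.
NAME_PATTERN = re.compile(r"a(\d+)_s(\d+)_e(\d+)")


def parse_name(filename):
    """Return (action, subject) parsed from a file name."""
    match = NAME_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    action, subject, _trial = (int(g) for g in match.groups())
    return action, subject


def cross_subject_split(filenames, subset=None):
    """Odd subjects go to training, even subjects to testing.

    subset: None for the full dataset, or "AS1", "AS2", "AS3".
    """
    allowed = ACTION_SUBSETS[subset] if subset else None
    train, test = [], []
    for name in filenames:
        action, subject = parse_name(name)
        if allowed is not None and action not in allowed:
            continue
        (train if subject % 2 == 1 else test).append(name)
    return train, test
```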

Confusion matrix of MSRAction3D.
UTD-MHAD Dataset
The UTD-MHAD dataset contains 27 activity classes, each performed by 4 females and 4 males. Each activity is repeated 4 times by each subject. As with the MSRAction3D dataset, the camera used is the Microsoft Kinect. The camera is mounted on a tripod at a distance of 3 m from the subject, which ensures that the subject's whole body is visible.
Experimental setup for UTD-MHAD Dataset
We use the same experimental setup as [18]. Videos of odd-numbered subjects are used for training and videos of even-numbered subjects are used for testing. The proposed method is also tested on two subsets, one containing videos of action classes 1–12 and 14–20 and the other containing videos of action classes 13 and 21–27.
Accuracy of proposed system on MSRAction3D
Accuracy of the proposed system on UTD-MHAD dataset
The results are summarized in the following tables. They show that the proposed method is efficient in both time and space, with accuracy comparable to that of existing methods.
Space comparison
We computed the SNV descriptor for the whole dataset of 567 depth videos. The space required for storing these descriptors is 1.22 GB for the proposed method without frame selection and 1.20 GB with frame selection, which is much less than that required by SNV [1]. The details of the space comparison are shown in the table.
Space comparison of proposed system with SNV [1]
Time comparison of proposed system with SNV [1]

Time comparison.
Since time is a major constraint in human activity recognition algorithms, we measured the time consumed by the proposed method. The total time required is approximately 228 minutes without frame selection and approximately 134 minutes with frame selection. The time required for computing the motion energy is not included in the comparison. The time was measured on a system with an Intel i7 processor and 16 GB of RAM. The time consumed by the major steps is shown in the table.
Our aim is to recognize human activities faster without sacrificing accuracy. Without frame selection, the accuracy of the proposed method is only 88%. When the frames with low motion energy values are removed, we obtain a higher accuracy of 89.82%. This accuracy is 0.36% lower than that of SNV [1] when using sampled cloud points, but the computation time is reduced by a factor of 1.7 compared with SNV [1]. The table compares the accuracy we obtained with two other methods on the MSRAction3D dataset.
Conclusion
We have proposed a fast and storage-efficient method for recognizing human activities using Super Normal Vectors extracted from the boundary curves of the person performing the activity. The objective of this work is to recognize human activities from Super Normal Vectors extracted from depth videos with minimal time and space requirements. We also proposed a frame selection technique based on the motion energy, and adopted an adaptive spatio-temporal cuboid that handles variations in the speed of actions within the same category effectively. When applied to the MSRAction3D dataset, the proposed method produces results in much less time than the existing methods while consuming much less memory, showing that the time and space requirements are substantially reduced.
