Survey and analysis of human activity recognition in surveillance videos

Abstract

In computer vision, recognizing human activity or behavior is a core and challenging problem. This article provides a concise study of human activity recognition systems in the area of visual surveillance. These systems are used for the analysis and understanding of human behavior. The study starts with a description of various emerging video processing domains, followed by a general process of human action recognition. The article then covers techniques for detecting humans in images and video. Finally, the article provides a survey of the different features and models used in activity recognition systems and an overview of benchmark datasets for video surveillance. From this state-of-the-art survey, researchers can outline promising directions of research.

Keywords

Visual surveillance, Bag-of-Visual-Words, human activity recognition, motion features

1. Introduction

Visual surveillance is becoming vital for protecting people and property in an era when crime rates are increasing [15]. As high-quality cameras are now easily available, many surveillance cameras are already installed around us, but there is a lack of manpower to monitor the activities they record around the clock [22]. Moreover, such surveillance systems produce a lot of video data, which leads to increased storage requirements [18] and an additional financial burden. Hence, we require an efficient visual surveillance system that helps to reduce manpower, human errors, and storage cost [6]. Such problems motivate us to study visual surveillance among other emerging video processing domains like video summarization, video retrieval, video labeling, video clustering, and video classification. In this article, the terms video surveillance and visual surveillance are used interchangeably.

Surveillance cameras can be far more useful if, instead of passively recording footage, they are used to detect events requiring attention as they happen and to take action in real time. Most surveillance systems use closed-circuit television (CCTV) cameras. Thermal cameras, also called infrared cameras, are used for night surveillance. Real-time surveillance requires live streaming of data, which must be transferred frame by frame from the surveillance device to the processing device. In addition, the optimal frame rate for a surveillance task depends on criteria such as the scene under surveillance, movement relative to the camera frame, camera sensor features, and available light. The preferred standard frame rate is 25–30 frames per second (fps), which requires a bandwidth of 7–8 Mbps [45] for data transfer. As we reduce the frame rate, less bandwidth is required.

Moreover, video processing in surveillance systems inherits the difficult challenges of computer vision applications, such as occlusion (overlapping regions) handling, poor illumination or luminance, camera calibration, sensor noise, and dynamic background changes [6, 8, 22]. Poor illumination or luminance and noise in the data can be handled by using more reliable features.

A few surveys exist on Human Action Recognition (HAR). However, these surveys do not cover all the aspects of HAR systems for neophytes. Hence, we write this article to provide a soup-to-nuts survey of human activity recognition methodologies for person detection, feature identification, and the models used for HAR. Next, we mention important situations in which this work can become a crucial guide for conducting research: (i) Choosing a particular video processing domain and understanding the interrelation among domains. (ii) Understanding the exhaustive taxonomy of HAR. (iii) Deciding which HAR system classification to work on, based on the number of people involved in the activity. (iv) Choosing the best person detection technique for image or video data. (v) Deciding the best approach for action recognition. (vi) Selecting the best video processing technique for each step of HAR. (vii) Understanding the feature detector-descriptor approach thoroughly. (viii) Selecting the appropriate activity classification model. (ix) Choosing the best tools and libraries for HAR system implementation. For researchers who have to satisfy formal requirements, extensibility, availability of source code, proper documentation, and many other criteria are important for making the final decision. (x) Deciding on datasets for either a controlled or an uncontrolled environment based on different criteria, such as the number of actors involved, resolution, and the scenario considered.

This article is divided into ten sections. Section 2.1 presents a survey of various video processing domains and a general process of HAR; Section 2.2 presents survey work done by researchers on action recognition. In Section 3, we present the taxonomy of Human Action Recognition (HAR), which provides an overall idea of the survey carried out. In Section 4, we present the classification of HAR systems based on the number of people involved in the activity. Subsequently, the article presents human detection techniques in Section 5. Sections 5.1 and 5.2 survey the techniques used by researchers to detect humans in image and video data, respectively. Moreover, Section 5.2 considers techniques for video data captured from static or moving cameras, from which one can select the best person detection technique for one's work environment. Section 6 presents a survey and analysis of HAR. Specifically, Sections 6.1 and 6.2 present a survey of 15 selected papers covering major problems in HAR using video processing techniques and 3-D space-time volume features, respectively. Section 6.3 presents the types of features used by researchers for HAR, along with the feature detector-descriptor approach used for activity recognition. With this knowledge, a researcher new to this domain can understand the HAR system classification and choose the best features for their system. Furthermore, the article presents a survey of activity classification models in Section 7. We present a survey and analysis of the tools and technologies available for video processing, along with action video datasets for controlled and uncontrolled environments, in Section 8. Finally, Section 9 presents proposed work and methodology based on our findings, and Section 10 summarizes the work in the form of a conclusion.

Table 1
Interrelated video processing steps in various related domains

Video processing domain	SU	OI	OT	FE	BoVW	MG	Additional steps required
Video surveillance	✓	✓	✓	✓	✓	✓	Activity modeling or anomaly detection
Video summarization	✓	✓	✓	✓	✓	–	Shot clustering, video skimming, highlights extraction
Video retrieval	✓	✓	–	✓	✓	✓	Video annotation
Video labeling	✓	✓	✓	✓	–	–	Video indexing
Video clustering	✓	✓	–	✓	✓	✓	Similarity computation, clustering of key frames
Video classification	✓	✓	–	✓	✓	✓	Key frames extraction
Video compression	–	✓	✓	–	✓	✓	Quantization, run length encoding, entropy coding

Figure 1.

Types of emerging video processing domains.

2. Background knowledge and related work

In this section, we present an overall gist of video processing domains.

2.1 Background knowledge

This subsection describes the importance of human action recognition (HAR), followed by the general process of HAR.

2.1.1 Video processing domains

The research community is pursuing research in many video processing domains. These emerging domains are shown in Fig. 1.

The first domain is video surveillance, which is the means of watching video over electronic equipment such as closed-circuit television (CCTV). Video surveillance is chiefly used for monitoring the behavior and activities of people in order to protect, direct, and manage them. Nowadays, surveillance is used by government agencies at many places for public safety, helping the surveillance authority to detect criminals and gather adequate evidence. For example, in Indian Railways, video surveillance is an important security requirement at the waiting hall, railway yard, reservation counter, parking area, platforms, and main entrance/exit of a railway station, to capture images and examine human behavior. Proprietary firms also provide surveillance for the prevention of crime, investigation of crime, and gathering of intelligence [20].

The second domain, video summarization, is the process of creating and presenting a concise view of an entire video. This technique extracts features from frames and then clusters the features in order to group frames with similar content. The most representative sample from each cluster is selected as a key frame, and the key frames compose the final video summary. The third important domain is video retrieval [40], which is the problem of retrieving a relevant video from a large video collection over the World Wide Web. It is a content-based visual information retrieval (CBVIR) problem. Video retrieval is mainly a two-step process: the first step extracts representative features from video frames, and the second finds an appropriate similarity (homogeneity) model to select matching video frames from the whole collection of videos.

The fourth domain is video labeling or annotation, which is the process of adding semantics and descriptors to video content to enrich the results of video search. Due to the explosive growth of video data, efficient retrieval of and access to the desired video data is essential. Hence, video labeling is becoming an essential step for many computer vision applications, annotating the data automatically rather than manually, which incurs intensive labor cost. The next two domains are video clustering and video classification. Both methods are used in pattern classification. Video clustering is an unsupervised learning method in which grouping is done based on similarity, and which can predict the cluster of any new sample. Video classification, in contrast, is a supervised learning method that requires a labeled training set with examples from each category; it decides the class of a new sample based on its similarity to the parametric model. The next domain is video compression. Video compression methods (codecs) encode video signals so as to dramatically reduce storage and bandwidth requirements by discarding insignificant information.

There are other video processing domains that are also popular in the research community, such as video steganography, which is a technique for secure data transmission; video mosaicking, which transfers maximum information from video frames into a large-view image; and Photosynth, which is a technique to combine a collection of images from different viewpoints into an extended panoramic image. However, a detailed review of these domains is out of the scope of this article.

Although the objective of each video processing technique is different, they may share some common intermediate processing steps. To emphasize the similarity, Table 1 shows the steps of the different video processing domains. We have identified the following processing steps across domains: scene understanding (SU), object identification (OI), object tracking (OT), feature extraction (FE), Bag-of-Visual-Words (BoVW), and model generation (MG). For each video processing domain, the common steps are marked with a tick. The additional steps required for each domain are presented in a separate column. Using the information in Table 1, a researcher can understand the interrelation among the domains and apply knowledge from one domain to another. Of these video processing domains, our focus is on visual surveillance.

Figure 2.

General process of human (object) action recognition.

2.1.2 General process of human action recognition (HAR)

In general, a video processing task follows four steps, namely object segmentation, object classification, object tracking, and activity recognition, as shown in Fig. 2. We describe each step in detail below.

Object or motion segmentation

In computer vision, object segmentation deals with identifying moving objects such as humans, birds, vehicles, or animals in video data. Moving objects can be extracted by motion segmentation. A few approaches to motion segmentation are background subtraction [2, 6, 9, 12, 19], foreground extraction [12], temporal differencing [5, 18], and optical flow [1].

Object classification

The second step of HAR is object classification, which is the process of classifying the object of interest. In video surveillance, object classification is carried out using shape-based [1, 2, 8], color-based [14], and motion-based techniques [6, 8, 15]. Researchers have categorized object or motion segmentation and object classification as a low-level vision problem.

Object tracking

Object tracking is the process of estimating the path followed by a particular object in the image plane. This path is the trajectory of the object. Object tracking can be difficult due to illumination changes, complex object shapes, and full or partial occlusion. Many well-known trackers are used, such as Mean-shift tracking [2], the Kalman filter [6, 9, 18], the KLT tracker [3, 15], and optical flow [1, 8, 17]. Researchers categorize the object tracking step as an intermediate-level vision problem.

Figure 3.

Taxonomy of human activity recognition (HAR).

Activity recognition

In computer vision, one of the most challenging tasks is learning the behavior of moving objects and understanding activity from visual surveillance. Hence, this step is categorized as a high-level vision problem. Research in this domain mainly focuses on detecting dangerous human behavior or suspicious human activities. If abnormal behavior is detected, an alert message can be generated.

2.2 Related survey work

This subsection presents survey work done by researchers in different domains, such as action classification, abnormal activity detection, and the types of datasets available in visual surveillance.

Chaquet et al. [21] have presented a survey of the video datasets available for human action and activity recognition. Their work contains a precise survey of controlled and uncontrolled datasets and video repositories that are available online. They have considered different characteristics of datasets, such as the number of actions, the number of views, indoor or outdoor scenes, and static or dynamic camera movement. Ko [22] has reviewed behavior analysis in video surveillance for homeland security. The paper briefly describes techniques for object and motion detection, object classification, object tracking, and motion extraction. In Section VII of that work, the person identification techniques are missing. However, the major existing behavior understanding methods are discussed.

Ke et al. [24] have presented a review of video-based HAR and discussed significant application domains. They have surveyed object segmentation, feature extraction, activity detection, and action classification techniques. Poppe [33] has also presented a survey of vision-based HAR, discussing the challenges and characteristics of the video processing domain and common datasets. The author has also reviewed different image representation and action classification techniques.

A review of intelligent multi-camera video surveillance is presented by Wang [35], which covers multi-camera calibration, tracking, and activity analysis. The author has worked on re-identification of the same person and presented a survey of object re-identification techniques, followed by HAR in multiple camera views. Vezzani et al. [36] have presented a survey of people re-identification in surveillance and forensics. The authors have discussed work done in people re-identification and machine learning along with the application scenarios, followed by the datasets available for people re-identification.

Dawn and Shaikh [39] have presented a survey of HAR with spatiotemporal interest point (STIP) detectors. They have reviewed the components of STIP-based HAR techniques, followed by some popular feature detectors and descriptors. They have then presented a performance comparison of STIP-based methods on the Weizmann [51] and KTH [50] datasets, and given a brief overview of related datasets available for HAR.

3. Taxonomy of HAR

Figure 4.

Classification of human activity recognition (HAR).

Fig. 3 gives an overall glimpse of the survey that we carried out. Our survey is branched into five main parts, namely Types of Activities; Human Detection Techniques for Types of Input; Approaches for HAR; Classification of Models; and Tools and Video Dataset.

HAR systems are classified based on the number of people involved in the activity: single person activity, two-people interaction or crowd behavior, and abnormal activity. The survey carried out for types of activity is presented in Section 4. One of the foremost steps in HAR is to detect a human in image or video data. The survey of different techniques for human detection is presented in Section 5.

Researchers follow different approaches to the activity recognition task: video processing techniques, feature-based methods, and body modeling. The first approach uses background modeling and tracks the bounding-box region. In Section 6.1, we present an extensive survey of work done in HAR using various video processing techniques. The second approach is a feature detector-descriptor approach for finding the trajectory of a person using local sparse features from the 3-D space-time video volume. Spatio-temporal interest points (STIPs) are extracted from the video data. We present a concise survey of work done in HAR using STIPs from the 3-D space-time video volume in Section 6.2. The third approach models a 3-D human body shape and identifies human activities using wearable sensors. Activity recognition using 3-D human body shape modeling and wearable sensors gives good accuracy, but the computation of 3-D joints is very tedious and the use of sensors incurs extra cost. A detailed survey of body modeling and wearable sensors is out of the scope of this article. The feature detector-descriptor approach is widely used by researchers because STIPs have been shown to be invariant to rotation, scale, and translation. Hence, the detector-descriptor approach provides better accuracy for action recognition. The survey of the detector-descriptor approach is presented in Section 6.3.

To classify actions, different classification models are used. We present the survey of classification models in Section 7. There are different tools and languages that support vision processing. The survey of tools and datasets is presented in Section 8.

4. HAR systems classification based on the number of people involved in activity

Human activity recognition systems are classified into three categories, as shown in Fig. 4.

4.1 Single person activity recognition

In this subsection, we present an overview of the various types of single person activity recognition.

Trajectory:

A trajectory [43, 44, 45] is a path that a person follows as a function of time. The trajectory of a tracked person in a scene is used to analyze the activity or behavior of that person. Wang et al. [30, 34] have used dense trajectory features to recognize human actions and have shown promising results. We describe trajectories in more detail in Section 6.3.1.

Falling down detection:

Another well-known topic in single person activity recognition is falling down detection [2, 9]. Fall detection is essential for security and safety, mainly for the elderly who live alone at home and also for child-care systems. Yang et al. [2] have worked on home safety. In 2016, Zulkifley et al. [9] extracted unique features that can detect falling down and aggressive behavior.

Human pose estimation:

Human pose estimation [5, 8, 18] is also a popular topic in the computer vision research community. The estimation of human postures such as standing, sitting, or sleeping conveys significant information for determining human activity (e.g., a full cross-legs posture in a series of frames defines the running activity). Sivarathinabala and Abirami [8] have proposed a human pose recognition system.

4.2 Two or multiple people interaction and crowd behavior

Two or multiple people interaction and crowd behavior [3, 8, 11, 13, 18, 19, 30, 34, 36, 40, 41, 43, 47, 48] have drawn attention recently due to the need for environment security in several settings. Recognizing two-person activities and crowd behavior such as punching, kicking, shaking hands, hugging, kissing, running, or pushing is of paramount importance for establishing the normal behavior of people, which in turn helps to analyze the behavior of individuals or the activities happening in a crowd. Lao et al. [1] have worked on crowd behavior actions such as people counting using the Viola-Jones face detector and running detection using optical flow. A review of crowd analysis is presented by Kumar and Sureshkumar [12]. The authors have tackled the three most important issues, namely abnormal crowd detection, crowd tracking, and crowd behavior understanding, using the Gaussian Mixture Model (GMM), blob analysis, and Histogram of Oriented Gradients (HOG). Laptev et al. [44] have worked on human actions in realistic video settings using spatio-temporal features. The authors have used the Bag-of-Features method to extract features and a non-linear support vector machine (SVM) to classify activities. Work on a two-person interaction detection system using improved dense trajectories and 3D spatio-temporal interest points (STIP) is presented by Shu et al. [46]. The authors have used HOG, Histogram of Flow (HOF), and Motion Boundary Histogram (MBH) methods for feature extraction.

4.3 Abnormal activity recognition

An anomaly (rare event) is an unusual or abnormal activity [3, 12, 36]. Such activities are not observed frequently and depend very much on the context and the surrounding environment; otherwise, an activity is considered normal or usual. A suitable approach is to build a model by training on normal actions and to consider a new observation unusual or abnormal if it deviates very much from the trained model. Al-Nawashi et al. [5] have worked on an intelligent surveillance system for abnormal HAR in an academic environment. The authors have used the temporal differencing method to detect moving objects, followed by the binary statistical operations erosion and dilation to remove noise. Using a shape model, they have extracted the human motion region and contour coordinates, which are fed to an SVM to classify the activity as normal or abnormal. The authors have implemented the intelligent surveillance system using MATLAB application tools. Elarbi-Boudihir and Al-Shalfan [18] have modeled a system to detect abnormal activity. They have used the temporal differencing method for object detection. To track an object, binary statistical operations and the Kalman filter method are used. The authors have used an SVM classifier to classify the behavior and a C++ machine learning API to implement the scenario.

5. Survey and analysis of person detection

In this section, we present a classification of the techniques for detecting a human in image and video data. This classification is shown in Fig. 5.

Figure 5.

Person detection techniques from images and video.

5.1 Techniques to detect a person from images

5.1.1 Haar-like features

Haar-like features are image features used for object recognition. Haar wavelets are used in real-time face detection. The method works only with image intensities (i.e., the RGB pixel values at every pixel of the image). A Haar-like feature considers adjacent rectangular regions at a specific location in a detection window. There are basically three types of features, as shown in Fig. 6. The sum of the pixels in the white area of the detection window is subtracted from the sum of the pixels in the black one. The features are extracted from the sub-windows of the sampled image, as shown in Fig. 7. The main advantage of Haar-like features is their fast calculation using an image representation technique called the integral image. The concept of the integral image can be found in [60].
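The integral-image computation described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the implementation of [1] or [60]: the function names, the toy window, and the choice of a two-rectangle (edge) feature with the left half as the white area are all hypothetical.

```python
import numpy as np

def integral_image(gray):
    """Summed-area table with an extra zero row and column prepended."""
    ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = gray.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle using only four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def two_rect_feature(ii, top, left, height, width):
    """Edge feature (assumed layout): left half white, right half black.

    Returns the black sum minus the white sum.
    """
    half = width // 2
    white = rect_sum(ii, top, left, height, half)
    black = rect_sum(ii, top, left + half, height, half)
    return black - white

# Toy 4x4 window: dark left half (0) and bright right half (10).
window = np.array([[0, 0, 10, 10]] * 4)
ii = integral_image(window)
feature = two_rect_feature(ii, 0, 0, 4, 4)
print(feature)
```

Because every rectangle sum costs four lookups regardless of its size, the feature can be evaluated at many positions and scales of the detection window at constant cost per rectangle.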

Figure 6.

Basic three types of Haar features.

Figure 7.

Features extracted from sub window of the sampled image [1].

Lao et al. [1] adapted the Viola-Jones face detector and proposed Haar-like rectangle features to detect a face. This technique can be utilized in a variety of applications, such as people counting and prohibiting particular persons from accessing a restricted area.

Figure 8.

The process of background subtraction.

5.1.2 Histogram of oriented gradients (HOG)

HOG [49] is a feature descriptor widely used in computer vision and image processing to detect humans. The technique counts occurrences of gradient orientations in small patches of an image. Figure 16 shows the HOG feature extraction method. The HOG person detector uses a sliding detection window that is moved around the image. At each position of the detector window, a HOG [49] descriptor is computed for the detection window, with the gradient orientations distributed over eight bins. The descriptor is then passed to a trained SVM, which classifies it as either a person or not a person. Nazare and Schwartz [6] adopt this technique for object detection. We discuss this method in detail in Section 6.3.2.
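To make the orientation-counting step concrete, the following is a simplified sketch of a HOG-style descriptor. It is an illustration under stated assumptions rather than the method of [49]: it uses eight unsigned orientation bins as mentioned above, an assumed 8x8 cell size, per-cell gradients, and a single global normalization, and it omits block normalization and bin interpolation. All function names are hypothetical.

```python
import numpy as np

def cell_histogram(cell, n_bins=8):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    gy, gx = np.gradient(cell.astype(float))
    magnitude = np.hypot(gx, gy)
    # Fold orientations into [0, 180) degrees (unsigned gradients).
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(orientation, bins=n_bins,
                           range=(0.0, 180.0), weights=magnitude)
    return hist

def hog_descriptor(window, cell_size=8, n_bins=8):
    """Concatenate the cell histograms of a window and L2-normalize them."""
    h, w = window.shape
    cells = [cell_histogram(window[r:r + cell_size, c:c + cell_size], n_bins)
             for r in range(0, h - cell_size + 1, cell_size)
             for c in range(0, w - cell_size + 1, cell_size)]
    descriptor = np.concatenate(cells)
    return descriptor / (np.linalg.norm(descriptor) + 1e-9)

# Toy 16x16 window whose intensity increases along x only.
window = np.tile(np.arange(16, dtype=float), (16, 1))
desc = hog_descriptor(window)
print(desc.shape)
```

In a detector, a vector like `desc` would be computed at each position of the sliding window and passed to the trained classifier.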

5.1.3 Principal component analysis (PCA)

PCA is useful for finding patterns in high-dimensional data and transforming the data in such a way that it highlights differences and minimizes similarities. More importantly, PCA is used for compressing data while retaining its principal orthogonal components. To recognize a human face, face templates can be used. The method places the templates at every location on the given image and compares the pixel values in the underlying image region. If the object is scaled, rotated, or skewed in the image, then every possible transformation of the templates is required, which takes a lot of memory. Hence, an efficient way to store the templates and search for a match is required. To reduce these costs, a dimensionality reduction method is used to store the image in terms of the eigenvectors of the covariance matrix of the face image vectors. These eigenvectors are also called eigenfaces, and they provide a means of data compression. However, this technique has problems with occlusion and outliers.
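The eigenface idea described above can be sketched with a singular value decomposition. This is a minimal illustration, assuming that each face is a flattened vector of pixel intensities; the function names and the synthetic data (random vectors standing in for face images) are hypothetical and serve only to show the projection and reconstruction steps.

```python
import numpy as np

def fit_eigenfaces(faces, n_components):
    """faces: (n_samples, n_pixels) matrix of flattened face images."""
    mean_face = faces.mean(axis=0)
    centered = faces - mean_face
    # The rows of vt are eigenvectors of the covariance matrix
    # of the face vectors, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_face, vt[:n_components]

def project(face, mean_face, eigenfaces):
    """Compress a face to a few coefficients."""
    return eigenfaces @ (face - mean_face)

def reconstruct(coeffs, mean_face, eigenfaces):
    """Approximate the face from its coefficients."""
    return mean_face + eigenfaces.T @ coeffs

rng = np.random.default_rng(0)
# Synthetic stand-in for a face set: 20 "images" of 64 pixels
# that lie in a 3-dimensional subspace.
basis = rng.normal(size=(3, 64))
faces = rng.normal(size=(20, 3)) @ basis

mean_face, eigenfaces = fit_eigenfaces(faces, n_components=3)
coeffs = project(faces[0], mean_face, eigenfaces)
restored = reconstruct(coeffs, mean_face, eigenfaces)
print(coeffs.shape, np.allclose(restored, faces[0]))
```

Here each 64-pixel vector is stored as three coefficients; on real face images the reconstruction is only approximate, and the number of retained eigenfaces trades storage against fidelity.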

AdaBoost [1] and support vector machines (SVM) [1, 3, 5, 11, 18] are gaining popularity for person classification in images. AdaBoost [1] is an adaptive boosting machine learning meta-algorithm. An AdaBoost model can be combined with subsequent weak-learner classifiers to enhance performance. However, the AdaBoost model is sensitive to outliers and noise. Morphological operators are applied to remove noise. We found that the combination of the HOG [49] feature vector with an SVM classifier, and that of Haar-like features with an AdaBoost classifier, gives the best accuracy.

Figure 9.

An example of person detection technique using GMM. Image source: [57].

5.2 Techniques to detect a person from video

A video is captured by closed-circuit TV (CCTV) cameras. These cameras can be of two types: a static camera, i.e., one located at a fixed position, and a moving camera. A number of cameras are deployed at airports, homes, giant corporate buildings, academic or government organizations, and hospitals. Cameras are also heavily used for traffic monitoring and border protection, to capture human motion. In this section, we present techniques to detect a person in video for both types of camera: static and moving.

5.2.1 Static camera

This subsection presents techniques to detect humans with a static camera, i.e., a fixed-location camera that does not tilt or pan. Videos are captured from a single viewpoint; hence, human or object detection techniques are simpler than those used for a moving camera.

Background subtraction

Background subtraction [2, 6, 10, 16, 20] is a popular method to detect objects across consecutive frames, as illustrated in Fig. 8. Each frame is compared with the previous frame pixel by pixel, and identical pixels are removed. By subtracting identical pixels, the method distinguishes the background from the foreground object in a frame. In [9], the authors have used background subtraction along with optical flow to mask the moving region in order to detect aggressive behavior. Kumar et al. [12] and Niu et al. [19] have also used this technique to segment the object. Moreover, image and video data may contain noise: random brightness and distorted color information in an image can destroy important information. To remove noise, the morphological operations erosion and dilation are applied to the segmented image. Erosion removes isolated single pixels. After erosion, a dilation operation is applied, which enlarges the object, fills the gaps within it, and cleans up the object of interest. The detected object can then be fed to the subsequent steps of video processing.
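The subtraction and clean-up steps above can be sketched as follows. This is a minimal illustration under our own assumptions and not the pipeline of any surveyed paper: the reference frame is taken to be a fixed background image, the threshold value and the 3x3 structuring element are arbitrary choices, and all function names are hypothetical.

```python
import numpy as np

def foreground_mask(frame, background, threshold=25):
    """Mark pixels whose intensity differs from the reference frame."""
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    return diff > threshold

def _neighbourhood(mask):
    """Stack the nine 3x3-shifted copies of a boolean mask."""
    padded = np.pad(mask, 1, constant_values=False)
    h, w = mask.shape
    return np.stack([padded[r:r + h, c:c + w]
                     for r in range(3) for c in range(3)])

def erode(mask):
    """Keep a pixel only if its whole 3x3 neighbourhood is foreground."""
    return _neighbourhood(mask).all(axis=0)

def dilate(mask):
    """Set a pixel if any pixel in its 3x3 neighbourhood is foreground."""
    return _neighbourhood(mask).any(axis=0)

background = np.full((8, 8), 50, dtype=np.uint8)
frame = background.copy()
frame[2:6, 2:6] = 200   # a 4x4 moving object
frame[0, 7] = 255       # an isolated noisy pixel

mask = foreground_mask(frame, background)
cleaned = dilate(erode(mask))   # erosion followed by dilation
print(int(mask.sum()), int(cleaned.sum()))
```

In this toy case the erosion removes the isolated noisy pixel and the dilation restores the object to its original 4x4 extent.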

Figure 10.

An example of the Lucas-Kanade-Tomasi (LKT) tracker. The red dots denote LKT feature points and the yellow line denotes the track produced by one LKT tracker. Image source: [24].

Gaussian mixture model (GMM)

GMM [11, 14] is an unsupervised foreground extraction technique. The method models the values of each pixel as a distribution. To deal with different background situations such as illumination changes, shadows, clutter, and other arbitrary changes, the method models a mixture of Gaussians instead of one Gaussian per pixel. Because it adapts to multimodal environments, GMM [14] has been applied in many fields. In 2016, Marsden et al. [3] used a Gaussian mixture based model to segment the foreground. The history of each pixel is modeled by a mixture of $K$ ($K\leqslant 5$) Gaussian distributions. For prediction, the $K$ Gaussian distributions are first normalized. When a new pixel is observed in an image sequence, a pixel value with high probability under the mixture most likely belongs to the background. Hence, a pixel is classified as foreground if its probability is lower than a predefined threshold. In 2016, Li et al. [4] worked on a non-tracking based Oriented GMM approach. The authors quantized the orientation of the optical flow and trained a GMM at each orientation. An example of GMM can be seen in Fig. 9.
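The per-pixel decision rule above can be sketched for a single grayscale pixel. This is a minimal illustration, not the model of [3] or [14]: the mixture parameters and the threshold are hypothetical values chosen for the example, and the online update of the weights, means, and variances is omitted.

```python
import numpy as np

def mixture_probability(x, weights, means, variances):
    """Probability density of intensity x under a 1-D Gaussian mixture."""
    weights = weights / weights.sum()   # normalize the K weights
    norm = 1.0 / np.sqrt(2.0 * np.pi * variances)
    density = norm * np.exp(-0.5 * (x - means) ** 2 / variances)
    return float(np.sum(weights * density))

def is_foreground(x, weights, means, variances, threshold=1e-3):
    """Foreground if the value is unlikely under the background mixture."""
    return mixture_probability(x, weights, means, variances) < threshold

# Hypothetical background model for one pixel with K = 3 components.
weights = np.array([0.6, 0.3, 0.1])
means = np.array([100.0, 120.0, 30.0])
variances = np.array([25.0, 36.0, 16.0])

background_like = is_foreground(102.0, weights, means, variances)
object_like = is_foreground(200.0, weights, means, variances)
print(background_like, object_like)
```

An intensity close to one of the learned modes is kept as background, while a value far from every mode falls below the threshold and is labeled foreground.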

Segmentation by tracking

Tracking methods are divided into three categories: point tracking, silhouette tracking, and kernel tracking. By following interest points, the silhouette of an object, or moving regions frame by frame, we can segment the motion of an object by tracking. The task of segmenting or tracking a video object arises in many applications, such as video conferencing, surveillance, and bank transaction monitoring.

5.2.2 Moving camera

With a moving camera, two types of motion are captured: the motion of the camera and the motion of the object. Therefore, to handle multiple motions, temporal differencing and optical flow based methods are used to detect a human in the video.

Temporal differencing:

The most suitable method for extracting the foreground with a moving camera is temporal differencing [7, 20]. In 2012, Elarbi-Boudihir and Al-Shalfan [18] detected moving objects in a video stream using the temporal differencing method. In 2016, Al-Nawashi et al. [5] also used temporal differencing and then located motion regions using a Gaussian function. In videos captured by a moving camera, the background changes over time. Hence, it is inappropriate to generate a background model in advance. Instead, the method detects the moving object by taking the difference of consecutive frames $t-1$ and $t$, which produces a better result. If $f_{n}$ is the intensity of the $n^{\rm th}$ frame of the shot, then the method uses the absolute difference function shown in Eq. (1).

$\displaystyle\Delta n=|f_{n}-f_{n-1}|$ (1)

The difference result is binarized in order to separate changed pixels and we get motion image $M_{n}$ shown in Eq. (2).

$\displaystyle M_{n}(u,v)=\begin{cases}f_{n}(u,v)&\text{if }\Delta n(u,v)\geqslant T\\ 0&\text{if }\Delta n(u,v)<T\end{cases}$ (2)

where $T$ is an appropriate threshold. If the frames contain noise, the morphological operators erosion and dilation are used to remove it. Connected component analysis is often applied afterwards to cluster the moving targets. This method gives the best result in the case of occlusion.
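Equations (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration that assumes grayscale frames stored as 2-D `uint8` arrays; the function name and the threshold used in the example are hypothetical, and the morphological clean-up step is omitted.

```python
import numpy as np

def motion_image(prev_frame, curr_frame, threshold):
    """Temporal differencing: keep the current intensity where the change is >= threshold."""
    # Cast to a signed type first so the subtraction of uint8 frames cannot wrap around.
    prev = prev_frame.astype(np.int32)
    curr = curr_frame.astype(np.int32)
    delta = np.abs(curr - prev)                          # Eq. (1)
    return np.where(delta >= threshold, curr_frame, 0)   # Eq. (2)

# Tiny synthetic example: one pixel changes strongly, one only slightly.
previous = np.zeros((4, 4), dtype=np.uint8)
current = previous.copy()
current[1, 2] = 200   # strong change, should survive the threshold
current[0, 0] = 5     # weak change, treated as noise
mask = motion_image(previous, current, threshold=20)
print(mask)
```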

Optical flow:

Another technique for segmenting motion with a moving camera is optical flow [4]. It denotes the displacement of the same scene point between frames taken at different points in time. The Kanade-Lucas-Tomasi (KLT) [24] feature tracker works effectively on local pixel values and robustly selects corner feature points of the reference image patch. An example is shown in Fig. 10. Optical flow is discussed in more detail in Section 6.3.2.

6. Survey and analysis of human action recognition

In this section, we present a survey of related problems in HAR, organized by the number of people involved in the activities and based on video processing techniques, followed by a survey of work done using 3-D space-time volume features.

6.1 Survey and analysis of related problems in HAR using video processing techniques

This subsection describes the major works that have been done by researchers in different domains of video surveillance, such as pose estimation, action recognition, and abnormality detection. We have highlighted the techniques and methodologies used by the researchers. The survey is organized by the number of people involved in an activity: single-person activities, two-person interactions, and crowd activities. Tables 2–4 present a survey of 15 selected major research papers on single-person activity, two-person activity, and crowd activity, respectively.

We prepared the survey based on the following criteria: object or motion detection techniques, object classification techniques, object tracking techniques, behavior analysis, features extracted, models used, the dataset used, actions targeted, challenges addressed, and accuracy achieved. From Tables 2–4 we can see that the majority of authors have used the background subtraction technique for object segmentation. If images or videos are captured from a single viewpoint, background subtraction proves to be less expensive and more efficient. This technique is discussed in detail in Section 5.2.1. To track an object, the majority of researchers have used the KLT tracker, because this method is robust to occlusion and also computationally efficient. Section 5.2.2 illustrates the KLT tracker with a figure. To classify actions, the support vector machine (SVM) is mostly used together with the Bag-of-Visual-Words (BoVW) feature extraction method. We discuss classification models in detail in Section 7. Using this tabular information, a novice researcher can thoroughly understand and outline the methodologies used by researchers for HAR.

6.2 Survey of related problems in HAR using 3-D space-time volume features

This subsection describes the major works that have been done by researchers for Human Activity Recognition (HAR) using 3-D spatio-temporal features. Human activity is described as a time series of human postures. To capture maximum information from videos, we require efficient feature representation techniques. We present a survey of work done by researchers using lower level features such as the spatio-temporal interest point (STIP). To detect interest points, various methods are used, namely the Harris corner detector, minimum eigenvalue, Fisher vector, and 3-D scale invariant feature transform (SIFT). To detect the motion of an object, trajectories are formed around the detected features. There are four types of trajectories, namely Dense Trajectories [34, 42, 47], KLT Trajectories [30], SIFT Trajectories [30], and dense cuboid [48]. We now present related work done by different researchers.

Sivarathinabala and Abirami [8] recognized human interactions using an improved spatio-temporal feature extraction technique. The spatio-temporal interest points are detected using the Harris corner detector. These points are then analyzed and tracked using optical flow to find the motion of an object. The computed feature descriptors are then fed to a Hierarchical-SVM (H-SVM) to recognize human interaction. The authors worked on the UT-interaction [28] and BIT interaction [29] datasets and considered five types of actions: running, handshake, high-five, push, and punch. With a Bag-of-Features extraction technique, they achieved 58.20% accuracy on the UT-interaction [28] dataset, and with their own method they achieved 90.1% accuracy.

Wang et al. [30] presented action recognition by dense trajectories. The feature trajectories are tracked using the KLT tracker and by matching SIFT descriptors. The authors sampled dense points from each frame and tracked them using the optical flow field. They have used

Table 2
Survey of HAR for single person activities

Ref. No.	Object segmentation	Object tracking	Behavior analysis	Features extraction technique	Classification model	Dataset	Human action targeted/Challenges addressed	Accuracy/frame-rate
[2]	Background subtraction (BS), connected component labeling (CCL)	Mean-shift tracking (MS Tracking)	Not available	Color histogram (CH)	Not available	10-minute video, Own dataset	People labeling, trajectory recording, posture analysis, falling down detection/Occlusion	92%/22 fps and 8 fps
[5]	Temporal-differencing (TD) erosion, dilation	Shape based model based on OMEGA equation	Obtained the pattern of S	Motion threshold	SVM	Own dataset	Running, jumping, falling, flying	Not available
[6]	BS, HOG	Kalman filter (KF)	HOG, LBP, SIFT	HOG, gray level co-occurrence matrix (GLCM)	Haar feature-based cascade classifier	Own dataset	Pedestrian detection, face recognition, re-identification	Not available/20 fps
[9]	BS	KF	Average velocity OF	Optical flow velocity, KF	Threshold	Own dataset	Sudden movement and falling down	65–75%/25 fps
[14]	GMM and code book, CCL	Color histogram and LBP and object matching	Cmotion (MHI) extraction and quantization in sliding window	Acceleration, axial ratio, context information histogram, estimated height	HMM, relevance vector machine (RVM)	Own dataset	Fall detection	Not available/ 26.03 fps
[15]	Viola and Jones face detection	KLT tracker, particle filter	CH, ratio histogram	Touching time between the face and hand samples, smoke detection	Decision tree (DT)	Own dataset	Smoke detection, drinking, handheld object detection.	Smoking – 88.02% drinking – 93.79%
[17]	MHI	Optical flow (OF)	Not available	Positions, differential information, directions, and speeds from the moving body, torso, left and right arm, left and right leg	HMM	Carnegie Mellon University motion capture database, UTD, VICON	Walking, sitting down, standing up, running	Not available/120 fps

Table 3

Survey of HAR for two-person interactions

Ref. No.	Object segmentation	Object tracking	Behavior analysis	Features extracted	Classification model	Dataset	Human action targeted/challenges addressed	Accuracy/ framerate
[8]	Blob	Optical flow	Not available	Harris corner detector, HOG descriptor	H-SVM	UT-interaction [28], BIT interaction [29]	Poses like handshaking, kicking, punching/occlusion	UT-90.1%, BIT-88.9%
[11]	Blob fusion algorithm, GMM, fuzzy models	Linear sum assignment problem (LSAP), KF, SVM kernel	Not available	Global color histogram (GCH), Local binary pattern (LBP), HOG	Not available	CAVIAR (Context-aware vision using image-based active recognition) [26]	Entry and exit, loitering event, unattended cash desk/Occlusion management	86.4%/5 fps
[13]	Motion interchange pattern (MIP)	Large displacement optical flow (LDOF)	Self-similarity matrix (SSM) based on HOLDOF	Torso (body excluding the head and neck and limbs)	Structured SVM (S-SVM)	TVHI	Handshake, hug, high five, kiss	Not available
[18]	TD, erosion and dilation	KF	Threshold	Corners, area, ratios	SVM	Not available	Not available	Not available

Table 4

Survey of HAR for crowd activities

Ref. No.	Object segmentation	Object tracking	Features extracted	Classification model	Dataset	Human action targeted/challenges addressed	Accuracy/ framerate
[1]	Viola-Jones face detector, bounding box	Optical flow (OF)	Average of sorted least mean square values of optical flow	– Adaboost – SVM	KTH [50], UCF sports [25]	Running/crowded, close shots and distance shots scene	Precision – 67.8% Recall – 90.1%/25 fps
[3]	Not available	KLT tracker	Crowd collectiveness, crowd conflicts, crowd density, mean motion speed	– GMM – SVM	UMN, violent-flows dataset	Abnormal crowd events	Accuracy: SVM: 85.53% GMM: 65.8%/40 fps
[12]	BS using adaptive thresholding	HOG	Not available	State-space demonstration	UMN for unusual crowd activity	Abnormal crowd	96%
[19]	BS, human silhouette	Second order Kalman filter	A unique label assigned to a person, height, centroid, average intensity, the value of region occupied	Kalman model	Recorded 3-minute video using a Samsung SCD miniDV camcorder	Multi person tracking/occlusion management	85%

trajectory, Histogram of Oriented Gradient (HOG), and Histogram of Optical Flow (HOF) feature descriptors. These descriptors are normalized with L2 normalization. Camera motion present in the optical flow is suppressed by applying the Motion Boundary Histogram (MBH), where derivatives are computed separately for the horizontal and vertical components. To generate a codebook, the Bag-of-Features method is used. The authors used a non-linear SVM with a $\chi^{2}$ kernel for classification, and worked on the KTH [50], YouTube [21], Hollywood2 [21], and UCF Sports [25] datasets. Using the trajectory descriptor, they achieved 90.2% accuracy on the KTH [50] dataset and 47.7% on the Hollywood2 dataset.

Kantorov and Laptev [42] presented work on efficient feature extraction, feature encoding, and classification for action recognition using the motion information available in video compression. They encoded feature vectors using Fisher vectors (FV), which are built on a Gaussian mixture model (GMM), and discriminated actions with a fast linear classifier. The authors showed that sparse MPEG flow can be used instead of dense optical flow, which speeds up the feature extraction process. Then HOG [49] is computed to discretize the MPEG flow into eight orientation bins and one no-motion bin. Similarly, MBHx and MBHy are discretized. For each descriptor, a vocabulary (codebook) is constructed using K-means. They also used a non-linear SVM with a $\chi^{2}$ kernel for classification, and applied their method to the Hollywood2 [21], UCF50 [25], UT-interaction [28], and HMDB51 [21] datasets.

Figure 11.

Types of features used for activity recognition.

Figure 12.

Feature extraction, a two-step process: Feature detection and feature descriptor.

Shi et al. [48] worked on Human Action Recognition (HAR) using a trajectory-based representation. In their paper, they represented actions by extracting spatio-temporal features. Spatio-temporal interest points (STIP) are detected using a cuboid detector with the scale-invariant feature transform (SIFT), and trajectories are built by matching STIPs in consecutive frames. The trajectory exploration is done using the KLT tracker. For each motion trajectory, HOG, HOF, and MBH descriptors are computed. They fixed the Bag-of-Visual-Words size to 3000. For classification of actions, they used a non-linear SVM classifier with a $\chi^{2}$ kernel. They evaluated their proposal on the KTH [50], Weizmann [51], and UCF Sports [25] datasets. For the KTH dataset [50], they obtained 95.36% accuracy using the HOF descriptor and 94.9% accuracy using the MBH descriptor.

Nour et al. [52] presented work on HAR based on the co-occurrence of visual words. In their work, the 3D-XYT volume is extracted using the 3D-SIFT technique. Then, for two-person interaction, a two-dimensional co-occurrence matrix is constructed. The experiment is performed on the UT-interaction dataset [27], which contains two sets, namely set 1 and set 2. The videos of set 1 are taken with slightly different zoom rates, and their backgrounds are mostly static with little camera jitter. The videos in set 2 are taken on a windy day, thus the background moves slightly and they contain more camera jitter. For set 1 they used a K-NN classifier with the Euclidean distance function and obtained 40.63% accuracy. For set 2 an SVM classifier with a polynomial kernel was used and obtained 66.67% accuracy. The experiment was carried out with two performers (actors).

Lo and Tsoi [54] presented work on HAR based on the motion boundary dense trajectory technique. In their approach, interest points are selected using the Harris corner condition. These points are then tracked frame by frame to extract a trajectory. The optical flow field is computed over a two-frame sequence $t$ and $t+1$ using median filtering. To form a trajectory descriptor, the tracked points of subsequent frames are concatenated temporally. Then HOG, HOF, and MBH descriptors are used to describe the shape of a trajectory or the local motion information. To convert the local descriptors of a video into a fixed-dimensional vector, the Bag-of-Features method is used and the codebook is constructed using the K-means clustering algorithm. For classification they used a non-linear SVM with a $\chi^{2}$ kernel and worked on datasets such as UCF-Sports [25], YouTube [21], Olympic Sports [21], HMDB51 [21], Hollywood2 [21], and UCF50 [25]. With motion boundary trajectories computed using the MBH descriptor approach, they obtained 93.5%, 63.8%, 64.4%, and 92.2% accuracy on the Olympic Sports [21], HMDB51 [21], Hollywood2 [21], and UCF50 [21] datasets, respectively.

6.3 Survey and analysis of types of features and feature detectors-descriptors used for activity detection

This subsection presents the types of features, followed by the feature detector-descriptor approach. It also presents the feature vector encoding methods used by researchers.

There are two types of features: low-level (local) features and high-level (holistic) features. This classification of feature types is presented in Fig. 11. Local features are the most commonly used features, as they are less sensitive to noise, viewpoint, and illumination changes. These features are extracted by detecting interest points in frames. The high-level (holistic) features capture structural information related to the action being performed. Once local interest points are detected, a descriptor is computed around each interest point. The descriptors capture local information, such as SIFT, color, trajectory, texture-based, or shape-based [53] information, around the detected points.

In computer vision, feature extraction is a two-step process, which is shown in Fig. 12. The first step is to apply a feature detector, and the second is to compute local descriptors for the detected features. The feature detector tries to locate the important key points in the sequence of frames. Once key points are identified, their characteristics are described by the feature descriptors. Next we describe the popular feature detector techniques, followed by feature descriptors [53].

In Section 6.3.1 we present a survey of feature detectors used for activity detection, and in Section 6.3.2 we present a survey of feature descriptors used for activity recognition. Initially, researchers used Motion History Images (MHI) [17] and Motion Energy Images (MEI) [13] as temporal image templates, which show how the object is moving and where the motion has occurred, respectively. These templates are matched against learned examples of the given motions using the nearest neighbor approach. However, these methods lose some useful information and require adequate training samples. Moreover, if one person partially occludes another, the motion and shape information may be lost. To overcome these problems, enhanced descriptors are now used, such as Histogram of Oriented Gradient (HOG) [49], Histogram of Optical Flow (HOF) [42, 43, 44, 49], and Motion Boundary Histogram (MBH) [41, 42, 43, 44].

6.3.1 Feature detector

In this subsection we present different feature detectors.

Space-time interest point (STIP)

Interest points are feature points or corner points, that is, points having a well-defined position in an image, as shown in Fig. 13. Spatio-temporal corners are located in regions that show a high variation of intensity in all directions $(x,y,t)$. Minimum eigenvalue and Harris3D [8, 18] are the most popular STIP detection techniques. Sivarathinabala and Abirami [8] worked on human interaction recognition with an improved spatio-temporal feature. The authors used the Harris corner detector to detect STIP points. Dawn and Shaikh [39] presented a review of HAR based on spatio-temporal approaches.
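To make the corner criterion concrete, the following sketch computes the Harris response for a single frame. It is only the 2-D spatial analogue of the spatio-temporal detectors discussed above, not the Harris3D detector itself; the 3×3 box window, the constant `k = 0.04`, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def box_sum_3x3(a):
    """Sum of each pixel's 3x3 neighborhood (edges are replicated)."""
    padded = np.pad(a, 1, mode="edge")
    h, w = a.shape
    return sum(padded[i:i + h, j:j + w] for i in range(3) for j in range(3))

def harris_response(image, k=0.04):
    """Harris corner response R = det(M) - k * trace(M)^2 at every pixel."""
    img = np.asarray(image, dtype=float)
    gy, gx = np.gradient(img)            # derivatives along rows (y) and columns (x)
    sxx = box_sum_3x3(gx * gx)           # entries of the structure tensor M
    syy = box_sum_3x3(gy * gy)
    sxy = box_sum_3x3(gx * gy)
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    return det - k * trace ** 2

# Synthetic frame: a bright square in the lower-right quadrant has one corner at (5, 5).
frame = np.zeros((10, 10))
frame[5:, 5:] = 1.0
response = harris_response(frame)
corner = np.unravel_index(np.argmax(response), response.shape)
print(corner)
```

The response is large and positive where intensity varies in both directions (a corner), negative along a straight edge, and zero in flat regions.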

Figure 13.

STIPs detection for two person Kick activity. The STIP detection is done on an image of UT-interaction dataset [21].

Figure 14.

Step by step trajectory formation process. The trajectory formation is done on an image of UT-interaction dataset [21].

Improved dense trajectories

A trajectory is the path followed by an object. In other words, it shows a pattern of motion. Figure 14 depicts the step by step trajectory formation process. For action recognition, motion is the most important cue to identify. Thus, dense trajectories are widely used to track the motion of a person [30]. These trajectories can be computed by tracking feature points through multiple frames using optical flow or the KLT tracker [15], or by matching Scale Invariant Feature Transform (SIFT) [6] descriptors. Then local descriptors like HOG and HOF are computed around the trajectory. Figure 15 presents the steps to compute a trajectory in detail. Initially, feature points are sampled on each spatial scale to guarantee that all spatial positions and scales are covered equally. The authors [41] used a sampling step size of $W=5$. The goal is then to track these points throughout the frames of the video. Points in homogeneous areas are impossible to track, so only the points for which the eigenvalues of the auto-correlation matrix are higher than some threshold $T$ are selected as the strongest feature points. The strongest feature points are tracked for up to $L$ ($L=14$ or $15$) frames using optical flow to form a trajectory. Then the shape and motion descriptors HOG, HOF, and MBH are computed around the trajectories. Finally, the feature descriptors are fed to the classifier to classify the actions.
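The tracking step can be sketched as follows, assuming the dense optical flow fields have already been computed and are given as arrays of per-pixel displacements. The function name and the array layout are hypothetical, and the median filtering of the flow field used in [41] is omitted for brevity.

```python
import numpy as np

def track_points(points, flows):
    """Follow points through a sequence of dense flow fields.

    points: (N, 2) array of (x, y) positions in the first frame.
    flows:  sequence of (H, W, 2) arrays holding the (dx, dy) displacement of each pixel.
    Returns an array of shape (len(flows) + 1, N, 2), one position per frame.
    """
    pts = np.asarray(points, dtype=float)
    trajectory = [pts.copy()]
    for flow in flows:
        h, w = flow.shape[:2]
        # Look up the flow at the nearest pixel, clamped to the image bounds.
        cols = np.clip(np.rint(pts[:, 0]).astype(int), 0, w - 1)
        rows = np.clip(np.rint(pts[:, 1]).astype(int), 0, h - 1)
        pts = pts + flow[rows, cols]
        trajectory.append(pts.copy())
    return np.stack(trajectory)

# Synthetic example: every pixel moves by (1.0, 0.5) per frame for three frames.
flow = np.zeros((10, 10, 2))
flow[..., 0] = 1.0
flow[..., 1] = 0.5
traj = track_points([[2.0, 3.0]], [flow, flow, flow])
print(traj[:, 0, :])
```

In the pipeline described above, the sequence of flow fields would have length $L$, so that each trajectory spans $L+1$ positions.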

Figure 15.

Extracting Dense Trajectories. Image source: [41]. Left: densely sampled feature points on each spatial scale. Middle: tracking of feature points separately on each spatial scale, using median filtering in a dense optical flow field, for $L$ frames. Right: an $N\times N$ pixel neighborhood divided into the $\eta\sigma\times\eta\sigma\times\eta\tau$ grid is used to compute the (HOG, HOF, MBH) descriptors along the trajectory.

Figure 16.

The process of the HOG feature extraction method: (a) the input image; (b) gradient map with gradient strength and direction of a sub-block of the input image; (c) accumulated gradient orientation; and (d) histogram of oriented gradients. Image source: [54].

Figure 17.

Histogram of oriented gradient (HOG) plotting on interest points. The HOG plots are computed on an image of UT-interaction dataset [21].

6.3.2 Local feature descriptors

Histogram of oriented gradients (HOG)

Yang et al. [49] proposed a HOG feature descriptor for robust object recognition. The HOG [24, 30, 34, 41, 42, 43, 44] is calculated by evaluating normalized local histograms of image gradient orientations in a dense grid, considering fine-scale gradients and fine orientation binning. The process of HOG feature extraction is illustrated in Fig. 16. An $n\times n$ detection window is moved over an image. Then a histogram of oriented gradients is computed for each $n\times n$ cell. The histograms of the cells are concatenated to form the histogram of one block. Furthermore, the histograms of the blocks are concatenated to form the histogram of an image, and the histograms of all images or frames are concatenated to form the final histogram of the video. Lastly, the descriptor is normalized to improve accuracy. The computed HOG is depicted in Fig. 17.

The magnitude and orientation of the image gradient at a point $I(x,y)$ along the trajectories are calculated by Eqs (3) and (4).

$\displaystyle M(x,y)=\sqrt{(I(x+1,y)-I(x-1,y))^{2}+(I(x,y+1)-I(x,y-1))^{2}}$ (3)

$\displaystyle\Theta(x,y)=\tan^{-1}\frac{I(x,y+1)-I(x,y-1)}{I(x+1,y)-I(x-1,y)}$ (4)
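Equations (3) and (4), together with the per-cell orientation histogram, can be sketched as below. The sketch makes a few assumptions that are not stated in the text: `arctan2` is used in place of $\tan^{-1}$ so that a zero denominator is handled, orientations are treated as unsigned and quantized into nine bins, border pixels are given a zero gradient, and the histogram is L2-normalized. The function names are hypothetical.

```python
import numpy as np

def gradient_magnitude_orientation(image):
    """Central-difference gradient magnitude (Eq. 3) and orientation (Eq. 4)."""
    img = np.asarray(image, dtype=float)   # indexed as img[y, x]
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # I(x+1, y) - I(x-1, y)
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # I(x, y+1) - I(x, y-1)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    orientation = np.arctan2(gy, gx)
    return magnitude, orientation

def cell_histogram(magnitude, orientation, n_bins=9):
    """Magnitude-weighted histogram of unsigned orientations for one cell."""
    angles = np.mod(orientation, np.pi)      # fold into [0, pi)
    bins = np.minimum((angles / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# Synthetic cell: intensity increases along x only, so all gradients point along x.
cell = np.tile(np.arange(5.0), (5, 1))
mag, ori = gradient_magnitude_orientation(cell)
hist = cell_histogram(mag, ori)
print(hist)
```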

Histogram of optical flow (HOF)

The HOF [42, 43, 44, 49] is a statistical representation of the orientation and magnitude of optical flow. Optical flow shows the direction of motion of the objects in an image (frame). It computes two-dimensional displacement vectors, assuming that neighboring pixels share the same motion and that intensity is preserved. If a pixel at location $(x,y,t)$ moves by $dx$ and $dy$ over the time $dt$ to the next frame, its intensity satisfies Eq. (5).

$\displaystyle I(x,y,t)=I(x+dx,y+dy,t+dt)$ (5)

By expanding the right-hand side of Eq. (5) in a Taylor series, we get Eq. (6).

$\displaystyle I(x+dx,y+dy,t+dt)=I(x,y,t)+\frac{\partial I}{\partial x}dx+\frac{\partial I}{\partial y}dy+\frac{\partial I}{\partial t}dt$ (6)

In Eq. (6), if we ignore the higher order terms, substitute Eq. (5), and divide by $dt$, we get Eq. (7).

$\displaystyle\frac{\partial I}{\partial x}u+\frac{\partial I}{\partial y}v+% \frac{\partial I}{\partial t}=0$ (7)

where $u=dx/dt$ and $v=dy/dt$ are the components of the velocity, that is, the optical flow of image $I$.

Equation (7) expresses the brightness constancy assumption: the total derivative of the intensity along the motion is zero, $dI/dt=0$. There are two unknowns $(u,v)$ and only one equation, Eq. (7). To solve for these two unknowns, the Lucas-Kanade [24] method is used.

The optical flow is computed for an image patch of $3\times 3$ pixels around a pixel. Each pixel of the patch contributes one instance of Eq. (7), giving nine linear equations in the two unknowns, which are solved in the least-squares sense. The flow orientations are weighted by the magnitude of the optical flow. Finally, the optical flow vectors are quantized into nine bins, one of which is an extra zero bin that collects all vectors whose magnitude is lower than a given threshold. This method is well suited for finding motion information.
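As an illustration of the two steps just described, the sketch below first solves the nine patch equations for $(u,v)$ by least squares and then bins a set of flow vectors into orientation bins plus a zero bin. It assumes the spatial and temporal derivatives are already available; the split into eight orientation bins and one zero bin, the magnitude threshold, and the function names are assumptions made for this example.

```python
import numpy as np

def lucas_kanade_patch(ix, iy, it):
    """Least-squares flow (u, v) for one patch from Ix*u + Iy*v + It = 0 at each pixel."""
    a = np.stack([np.ravel(ix), np.ravel(iy)], axis=1)   # 9 x 2 system matrix
    b = -np.ravel(it)
    solution, *_ = np.linalg.lstsq(a, b, rcond=None)
    return solution

def hof_histogram(u, v, n_orient=8, min_magnitude=0.5):
    """Magnitude-weighted orientation histogram with one extra bin for near-zero flow."""
    u = np.ravel(np.asarray(u, dtype=float))
    v = np.ravel(np.asarray(v, dtype=float))
    magnitude = np.hypot(u, v)
    moving = magnitude >= min_magnitude
    angles = np.mod(np.arctan2(v[moving], u[moving]), 2.0 * np.pi)
    bins = np.minimum((angles / (2.0 * np.pi) * n_orient).astype(int), n_orient - 1)
    hist = np.zeros(n_orient + 1)
    np.add.at(hist, bins, magnitude[moving])       # orientation bins, weighted by magnitude
    hist[n_orient] = np.count_nonzero(~moving)     # zero bin counts the slow vectors
    return hist

# Derivatives generated from a known flow (1.0, -0.5), so the solver should recover it.
rng = np.random.default_rng(0)
ix = rng.normal(size=(3, 3))
iy = rng.normal(size=(3, 3))
it = -(ix * 1.0 + iy * -0.5)
print(lucas_kanade_patch(ix, iy, it))
print(hof_histogram([1.0, -1.0, 0.1], [0.1, -0.1, 0.0]))
```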

Motion boundary histogram (MBH)

Optical flow captures nearly all kinds of motion. Video data may contain both camera motion and foreground motion. Motion boundary histograms (MBH) [41, 42, 43, 44] were proposed by Dalal et al. to suppress camera motion. The smooth variation caused by camera motion cancels out when local derivatives of the flow are computed. For human motion recognition, MBH is calculated from the gradient of the optical flow, separately for the horizontal (MBHx) and vertical (MBHy) flow components. Both histogram vectors are normalized separately using the L2 norm. MBH is more robust to camera motion than optical flow. Therefore, this descriptor is more discriminative for action recognition.
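The cancellation of camera motion can be checked with a small sketch. It assumes one flow component (for example the horizontal flow used for MBHx) is given as a 2-D array and uses `np.gradient` for the local derivatives; the function name and the synthetic flow values are hypothetical.

```python
import numpy as np

def motion_boundaries(flow_component):
    """Magnitude and orientation of the spatial gradient of one flow component."""
    f = np.asarray(flow_component, dtype=float)
    dy, dx = np.gradient(f)     # derivatives along rows (y) and columns (x)
    return np.hypot(dx, dy), np.arctan2(dy, dx)

# Constant horizontal flow models a uniform camera pan: its derivative is zero everywhere.
camera_only = np.full((6, 6), 3.0)
pan_magnitude, _ = motion_boundaries(camera_only)

# Adding an object that moves differently creates non-zero motion boundaries around it.
with_object = camera_only.copy()
with_object[2:4, 2:4] += 2.0
object_magnitude, _ = motion_boundaries(with_object)

print(pan_magnitude.max(), object_magnitude.max())
```

The resulting magnitude and orientation maps would then be accumulated into histograms in the same way as for HOG, once for each flow component.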

6.4 Feature vector encoding

To encode features, Bag-of-Features or Fisher vector (FV) techniques are used. Bag-of-Visual-Words (BoVW) [30, 34, 41, 42, 44, 46] is a very popular technique. A codebook is generated from a set of local descriptors, which are then summarized into a fixed-length vector. For each descriptor type, a fixed number of features is selected at random and clustered using the K-means clustering method. Descriptors are then assigned to the closest cluster using the Euclidean distance. The problem with the BoVW method is that it considers only the frequency of words, and the resulting representation is very sparse and lacks expressiveness. To address this issue, higher order visual information is captured by the Fisher Vector method, which is used in large-scale image classification [34, 47]. The Fisher Vector uses a Gaussian Mixture Model (GMM) [3, 4, 14] to encode the features.
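The assignment and summarization steps of BoVW can be sketched as follows. The codebook is assumed to have been learned beforehand with K-means; here a tiny hand-written codebook stands in for it, and the function name is hypothetical.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest visual word and return word frequencies."""
    d = np.asarray(descriptors, dtype=float)   # shape (n_descriptors, dim)
    c = np.asarray(codebook, dtype=float)      # shape (n_words, dim)
    # Euclidean distance between every descriptor and every visual word.
    distances = np.linalg.norm(d[:, None, :] - c[None, :, :], axis=2)
    words = np.argmin(distances, axis=1)
    hist = np.bincount(words, minlength=len(c)).astype(float)
    return hist / hist.sum()                   # fixed-length, normalized vector

# Hypothetical two-word codebook and four descriptors extracted from one video.
codebook = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [0.0, 0.0]]
encoding = bovw_histogram(descriptors, codebook)
print(encoding)
```

Whatever the number of descriptors in a video, the encoding has one entry per visual word, which is what allows a standard classifier to be applied afterwards.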

7. Survey and analysis of models used for activity classification

To achieve good performance, it is essential to select a good classification algorithm for the chosen features. This section presents well-known models that are used for activity classification. Activity recognition is treated as the problem of classifying video clips into different categories (labels). A generalized process of video classification is shown in Fig. 18. The extracted features are encoded using a feature encoding mechanism, which generates a codebook. The encoded features are then fed to the classification model. Finally, the model classifies the action performed in the video.

7.1 Support vector machine (SVM)

The support vector machine (SVM) [1, 3, 8, 13, 18] is a prominent class of discriminative classifier. An SVM uses a kernel function (basis function) that maps the original problem from a finite-dimensional space to a much higher dimensional space in order to make the separation of classes easier. The SVM with a Gaussian RBF kernel is widely used with Bag-of-Words features [41, 42, 44, 46], which are common in Human Activity Recognition (HAR). Wang et al. [30, 34] have also used a non-linear SVM with a $\chi^{2}$ kernel for action classification.
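For reference, one common form of the $\chi^{2}$ kernel between two histograms $x$ and $y$ is $K(x,y)=\exp\bigl(-\gamma\sum_{i}(x_{i}-y_{i})^{2}/(x_{i}+y_{i})\bigr)$. The sketch below computes the resulting kernel matrix for non-negative histograms; the scaling parameter `gamma` and the function name are assumptions, and the surveyed papers may normalize the distance differently.

```python
import numpy as np

def chi2_kernel(x, y, gamma=1.0, eps=1e-12):
    """Exponential chi-square kernel matrix between rows of x and rows of y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    num = (x[:, None, :] - y[None, :, :]) ** 2
    den = x[:, None, :] + y[None, :, :]
    # Bins that are empty in both histograms contribute nothing to the distance.
    terms = np.where(den > eps, num / np.maximum(den, eps), 0.0)
    return np.exp(-gamma * terms.sum(axis=2))

# Three toy BoVW histograms: the first two are identical, the third is disjoint.
histograms = [[0.5, 0.5, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0]]
kernel = chi2_kernel(histograms, histograms)
print(kernel)
```

Such a precomputed kernel matrix is what a kernel SVM implementation would take as input in place of the raw histograms.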

Figure 18.

Block diagram of video classification. Image reference: [58].

Table 5

Survey on image processing and computer vision tools and libraries

Tools	Supporting libraries	Advantages	Disadvantages
Python	• Python OpenCV binding. • SciPy (open source software) supports MATLAB-like capabilities. • NumPy. • SimpleCV. • Python Imaging Library (PIL). • Scikit-image.	• Flexible, general-purpose, object-oriented scripting language. • Much faster for prototyping and experimentation. • Lots of other Python libraries are available.	• Less control over memory allocation. • May run slower compared to a C program. • Comparatively not as easy as MATLAB.
MATLAB	• Computer vision system toolbox. • Closed source (Huge product license cost).	• Great prototyping/research language. • Built-in visualization tools available. • Full support and documentation. • Faster programming.	• May be slower than C for some code. • Poor general-purpose language. • Less efficient than OpenCV.
R-biOps package	• For UNIX and Windows: Install libjpeg dev and library. • Rvision, ROpenCVLite. • Other library: libtiff, libfftw.	• Preferred if your work is statistics-heavy. • Nice output (plotting).	• Limited availability of image/vision libraries. • No direct capture from video sources, AVIs, etc. • Drawing image is slow and requires patience.
OpenCV	• Computer vision lib, high GUI library, machine learning algorithms.	• Cross-platform. • Extensive library with broad applicability. • More efficient than MATLAB.	• Difficult to track memory leaks. • Slower programming than MATLAB.
Matrix imaging library (MIL)	• Image analysis, video analytics, machine vision, and medical imaging.	• Commercial/Industrial package. • Focus on machine vision and hardware acceleration.	• GPUs, FPGAs, custom image acquisition hardware. • Expensive.
Insight Toolkit (ITK)	• C++ libraries for medically oriented imaging research.	• Cross-platform. • Popular for medical imaging. • Segmentation and registration. • Open source (with Python bindings).	–

7.2 Gaussian mixture model (GMM)

For pattern classification, the Gaussian mixture model [3, 4, 14] is a powerful probabilistic model. Each human action is represented by a GMM obtained by clustering the motion in every training sequence. The GMM is trained using the Expectation-Maximization (EM) algorithm. The EM algorithm is initialized by running the K-means algorithm, which is employed with $N$ different random initializations. When EM converges, the cluster labels are obtained. A GMM is one of the fastest mixture models to learn: the method only maximizes the likelihood, and thus it is computationally efficient.
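A minimal sketch of the EM iteration for a one-dimensional GMM is given below. It assumes the initial means are supplied by the caller (in practice they would come from K-means, as described above), starts from equal weights and a shared variance, and runs a fixed number of iterations instead of testing for convergence; the function name and the synthetic data are hypothetical.

```python
import numpy as np

def fit_gmm_1d(x, means_init, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given means."""
    x = np.asarray(x, dtype=float)
    means = np.asarray(means_init, dtype=float).copy()
    k = len(means)
    weights = np.full(k, 1.0 / k)
    variances = np.full(k, np.var(x) + 1e-6)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample (in the log domain).
        diff = x[:, None] - means[None, :]
        log_pdf = -0.5 * (np.log(2.0 * np.pi * variances) + diff ** 2 / variances)
        log_resp = np.log(weights) + log_pdf
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances

# Synthetic data with two well-separated groups around 0 and 10.
samples = np.concatenate([np.linspace(-1.0, 1.0, 50), np.linspace(9.0, 11.0, 50)])
weights, means, variances = fit_gmm_1d(samples, means_init=[2.0, 8.0])
print(weights, means, variances)
```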

7.3 Hidden markov model (HMM)

Generative models include very powerful algorithms such as the Hidden Markov Model (HMM) and the Markov random process. HMM [17] has proved effective on sequential data, hence it works well for speech-related tasks. The feature vector (observation sequence) can be a discrete symbol or a continuous density. Based on that, an HMM can be either a Discrete HMM (DHMM) or a Continuous Density HMM (CDHMM). In HAR, one HMM is trained for every action. For a new test sequence, each model computes the probability of the sequence. Based on these similarity measures, the human action performed is classified.
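The classification scheme can be sketched with the forward algorithm for discrete HMMs: each action model scores the observation sequence, and the action with the highest likelihood is returned. The two action models, their parameters, and the function names below are hypothetical; for long sequences, a scaled or log-domain version of the recursion would be needed to avoid numerical underflow.

```python
import numpy as np

def sequence_likelihood(obs, start, trans, emit):
    """Forward algorithm: probability of an observation sequence under a discrete HMM."""
    alpha = start * emit[:, obs[0]]
    for symbol in obs[1:]:
        alpha = (alpha @ trans) * emit[:, symbol]
    return float(alpha.sum())

def classify(obs, models):
    """Return the name of the action model that gives the sequence the highest likelihood."""
    return max(models, key=lambda name: sequence_likelihood(obs, *models[name]))

# Hypothetical two-state models over two observation symbols (0 and 1).
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
models = {
    "walking": (start, trans, np.array([[0.9, 0.1], [0.8, 0.2]])),  # mostly emits symbol 0
    "running": (start, trans, np.array([[0.1, 0.9], [0.2, 0.8]])),  # mostly emits symbol 1
}
print(classify([0, 0, 1, 0], models))
print(classify([1, 1, 1, 0], models))
```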

Figure 19.

Video surveillance benchmark dataset for constrained and unconstrained environment.

8. Survey and analysis of tools and action video datasets

8.1 Survey and analysis of tools

In this section, we present tools and libraries which are available for vision processing tasks. Table 5 presents these tools and libraries along with their advantages and disadvantages. The most popular are the following: MATLAB (Matrix Laboratory), a high-performance language for technical computing [5]; the OpenCV (Open Source Computer Vision) library for real-time computer vision [6]; the biOps package in R, which provides some image processing capabilities; and the Python programming language, which is also used in computer vision and has gained popularity because of its good support as a scripting language. Other libraries are also available, such as the Matrix Imaging Library (MIL) and the Insight Toolkit (ITK), written in C and C++ [18]. Some other C++ libraries, such as CImg, vgui, and iLab Neuromorphic Vision, are also popular for vision processing. Moreover, BoofCV is an open source Java computer vision library written from scratch for real-time computer vision. The Adaptive Vision Library, available for C++ and .NET, was created for industrial image analysis applications. Many other software tools and packages are available and are considered in detail in [55, 59].

Figure 20. Block diagram of proposed work.

8.2 Survey of action video datasets for controlled and uncontrolled environment

Fig. 19 shows the video surveillance benchmark datasets for HAR in constrained and unconstrained environments. In an unconstrained (uncontrolled) environment, the data are collected outdoors using video surveillance cameras of varying quality against an uncontrolled background. In addition, the activities are performed by different performers (actors) who wear different clothing and carry out multiple activities. Datasets of this type include the UCF Sports action dataset [25], CAVIAR [26], the Human motion dataset [27], UT-interaction [28], PETS [56], and the BIT-interaction dataset [29]. In a controlled (constrained) environment, by contrast, the background is static with no change in illumination or appearance, and the same actors perform the different activities. The KTH [50] and Weizmann [51] datasets fall under the controlled environment category. In an uncontrolled environment, recognizing the performed action is challenging because of variation in camera position (different viewpoints) and changing backgrounds, actors, and clothing conditions. We have compared the video datasets on the following characteristics: number of actions performed; number of actors; resolution; frames per second (fps); total number of videos; scenarios considered; and type of HAR system based on the people involved. This comparison is depicted in Fig. 19.

9. Proposed work

Based on the analysis and findings of the preceding sections, we have prepared our proposed work in the same direction. In our work, we want to model activities in order to detect human aggression, which may cause serious problems; that is, we try to recognize unpleasant and harmful human behavior. We therefore consider three activities: punching, kicking, and pushing back. We will use the UT-interaction [28] and BIT-interaction [29] datasets to identify two-person interaction activities. In the UT-interaction dataset, the videos are divided into two sets. Set 1 consists of 10 video sequences taken in a parking lot; the videos are taken with slightly different zoom rates, and their backgrounds are mostly static with little camera jitter. Set 2 is taken on a lawn on a windy day; the background moves slightly (e.g. trees move), and the videos contain more camera jitter [28]. The BIT-interaction dataset contains 100 videos for each action. As a first step, we preprocess the video data so that all videos share the same codec settings, and frames are extracted from the videos at a fixed height and width. Once preprocessing is done, we follow the steps shown in Fig. 20. The following subsections present each step of the proposed work in detail.

9.1 Spatio-temporal interest points detection

Once the frames are ready, we detect the interest points. We will use spatio-temporal interest points (STIP); these local features are robust and less sensitive to noise. To detect the interest points, the minimum eigenvalue feature detection method [8, 18, 30, 46] will be used in our work. From all the detected points, we will select the strongest ones, which carry the most information. These are good features to track [48] because they are scale and rotation invariant. STIPs are described in Subsection 6.3.1.
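The minimum eigenvalue criterion of [48] scores each pixel by the smaller eigenvalue of the local 2x2 gradient (structure) tensor; flat regions and straight edges score near zero, while corners score high. The sketch below illustrates the idea in plain Python on a single grayscale frame. The function names, the window size, and the toy frame are assumptions made for illustration; a real pipeline would normally rely on an optimized library routine.

```python
import math

def min_eigen_response(img, y, x, half=1):
    """Smaller eigenvalue of the structure tensor in a window centred on (y, x)."""
    sxx = sxy = syy = 0.0
    for v in range(y - half, y + half + 1):
        for u in range(x - half, x + half + 1):
            ix = (img[v][u + 1] - img[v][u - 1]) / 2.0  # horizontal gradient
            iy = (img[v + 1][u] - img[v - 1][u]) / 2.0  # vertical gradient
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
    mean = (sxx + syy) / 2.0
    diff = (sxx - syy) / 2.0
    # Closed-form smaller eigenvalue of [[sxx, sxy], [sxy, syy]]
    return mean - math.sqrt(diff * diff + sxy * sxy)

def strongest_points(img, k, half=1):
    """Return up to k pixel positions (y, x) with the largest positive response."""
    height, width = len(img), len(img[0])
    margin = half + 1  # keep the window and its gradients inside the image
    scored = [
        (min_eigen_response(img, y, x, half), (y, x))
        for y in range(margin, height - margin)
        for x in range(margin, width - margin)
    ]
    scored.sort(reverse=True)
    return [point for score, point in scored[:k] if score > 0.0]

# Toy frame: a bright square whose top-left corner lies at row 5, column 5
frame = [[1.0 if y >= 5 and x >= 5 else 0.0 for x in range(10)] for y in range(10)]
print(strongest_points(frame, k=1))
```

Keeping only the top-scoring points corresponds to the selection of the strongest points mentioned above; the value of k is a tuning choice.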

9.2 Feature points tracking using optical flow

To find the pattern of motion, we observe the optical flow of the human performing an action, i.e. the distribution of apparent velocities. Two widely used methods for computing optical flow are Horn-Schunck and Lucas-Kanade. Horn-Schunck is a global method and is more sensitive to noise. Lucas-Kanade [1, 3, 8, 9, 24] is a local method which assumes that the flow is constant in the local neighborhood of a pixel; it is less sensitive to noise. The Lucas-Kanade method is applicable only when the motion between two images is small. We will use the Lucas-Kanade method to find the motion of an object.
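Under the local constancy assumption, the flow $(u, v)$ at a pixel is the least-squares solution of the brightness constancy equation $I_x u + I_y v + I_t = 0$ over a small window. The following sketch solves the resulting 2x2 normal equations for a single window. It is an illustration only: the function names, window size, and synthetic frames are assumptions, and practical trackers add image pyramids and iterative refinement.

```python
def lucas_kanade(prev, curr, cy, cx, half=2):
    """Estimate the flow (u, v) at pixel (cy, cx) from one window.

    u is the horizontal and v the vertical displacement, in pixels per frame.
    Returns None when the window has too little texture (aperture problem).
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # spatial gradient in x
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0  # spatial gradient in y
            it = curr[y][x] - prev[y][x]                  # temporal gradient
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-9:
        return None
    # Solve [[sxx, sxy], [sxy, syy]] [u, v]^T = [-sxt, -syt]^T by Cramer's rule
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v

def make_frame(dx, dy, size=11):
    """Synthetic smooth intensity pattern shifted by (dx, dy) pixels."""
    return [[(x - dx) ** 2 + (y - dy) ** 2 for x in range(size)] for y in range(size)]

flow = lucas_kanade(make_frame(0.0, 0.0), make_frame(0.2, -0.1), cy=5, cx=5)
print(flow)  # close to the true shift (0.2, -0.1)
```

The estimate is accurate here because the shift is a fraction of a pixel, which matches the small-motion condition stated above.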

9.3 Compute feature descriptor

Based on the survey carried out in Section 6.3.2, we will use improved dense trajectories to capture human motion. Trajectories provide the motion features. To compute a trajectory, feature points are tracked over the next $L$ frames (e.g. 14 or 15 frames) using optical flow. The tracked points of subsequent frames are then concatenated temporally over the $L$ frames to form a trajectory, which captures the motion well and has shown good results [34, 43, 44, 45]. We will also use the HOG feature descriptor to obtain spatial (pose) information, as it effectively captures shape. In addition, background complexity does not affect HOG performance, because the HOG descriptor [34, 43, 44, 45, 49] is robust to a wide range of poses, blurriness, sharpness, illumination changes, and color saturation. Another descriptor is the MBH $(x/y)$ feature, which effectively cancels camera motion and highlights object motion. Hence, the Motion Boundary Histogram (MBH) descriptor [41, 42, 43, 44] is more robust to camera motion than optical flow methods and is widely used when videos are captured in an uncontrolled environment.
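As an illustration of the trajectory step, the sketch below follows one point for $L$ frames and encodes the resulting path as the sequence of frame-to-frame displacements, normalized by the total path length, which is the trajectory shape descriptor used with dense trajectories [30, 41]. The `flow_at` callback, the constant toy flow, and the function names are hypothetical stand-ins for a real optical flow field.

```python
import math

def track(start, flow_at, length=15):
    """Follow one point for `length` frames.

    flow_at(t, x, y) is a hypothetical callback returning the flow (u, v)
    at position (x, y) in frame t. Returns length + 1 positions.
    """
    points = [start]
    for t in range(length):
        x, y = points[-1]
        u, v = flow_at(t, x, y)
        points.append((x + u, y + v))
    return points

def trajectory_descriptor(points):
    """Concatenate displacement vectors and divide by the total path length."""
    disp = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(points, points[1:])]
    total = sum(math.hypot(dx, dy) for dx, dy in disp)
    if total == 0.0:
        return None  # static point: no motion information to encode
    return [component / total for d in disp for component in d]

# Toy example: a constant flow of (1.0, 0.5) pixels per frame for L = 15 frames
path = track((10.0, 20.0), lambda t, x, y: (1.0, 0.5), length=15)
descriptor = trajectory_descriptor(path)
print(len(descriptor))  # 2 * L values
```

Static trajectories are discarded here because they carry no motion; HOG and MBH histograms would be computed separately in the volume around each retained trajectory.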

9.4 Feature vector encoding

The feature descriptors generated in the previous step are large. Therefore, to reduce the space (memory) and time required by machine learning operations, we need to limit the size of the feature descriptor. We will apply a feature subset selection method to avoid the curse of dimensionality. To convert the local descriptors of a video into a fixed-dimensional feature vector, we will project the original high-dimensional data into a lower-dimensional subspace using a random matrix [30, 34, 41, 48, 54] whose columns have fixed length. Once the features are selected, we will apply principal component analysis (PCA) to reduce the dimensionality of the data; PCA decomposes the data into its principal components (PCs) and stores them in order of decreasing variance. The PCs are the eigenvectors of the covariance matrix and hence are orthogonal. Furthermore, the first PC retains the maximum variation present in the original components. Using this method, we will generate the Bag-of-Visual-Words (BoVW) representation [30, 34, 41, 42, 44, 46], which can be fed to a classifier.
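Two of the steps above can be sketched in a few lines of plain Python: finding the first principal component as the leading eigenvector of the covariance matrix, and encoding a video as a normalized histogram of visual words. This is a simplified sketch under stated assumptions: power iteration stands in for a full eigendecomposition, the codebook is taken as given (it would typically be learned by k-means clustering on training descriptors), and all names and toy values are hypothetical.

```python
import math

def first_principal_component(rows, iters=200):
    """Leading eigenvector of the sample covariance matrix, by power iteration."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - mu[j] for j in range(d)] for r in rows]
    cov = [
        [sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0 / math.sqrt(d)] * d  # simple starting vector (assumed not orthogonal to the PC)
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

def nearest_word(codebook, descriptor):
    """Index of the codebook entry closest to the descriptor (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], descriptor)),
    )

def bovw_histogram(descriptors, codebook):
    """Fixed-length, L1-normalized histogram of visual-word occurrences."""
    hist = [0.0] * len(codebook)
    for descriptor in descriptors:
        hist[nearest_word(codebook, descriptor)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Toy data: points on the line y = 2x, and a two-word codebook
print(first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]]))
print(bovw_histogram([[1, 0], [0, 1], [9, 9], [0, 0]], [[0, 0], [10, 10]]))
```

Whatever the number of local descriptors in a video, the histogram has one bin per visual word, which is what makes it suitable as classifier input.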

9.5 Classification

The visual words generated in the previous step are fed to an SVM classifier to classify the action performed. We will use a non-linear SVM classifier with a multi-dimensional $\chi^{2}$ kernel [1, 3, 8, 13, 18, 30, 34, 41, 42] to classify actions such as punch, kick, and push back. The most prominent strength of the SVM classifier is its ability to classify high-dimensional data with good accuracy. It works on the principle of finding the maximum-margin hyperplane that separates the data of different classes. To find this hyperplane for data that are not linearly separable, a variety of kernel functions are used. To obtain accurate classification, the parameters of the kernel function need to be fine-tuned.
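A $\chi^{2}$ kernel compares two histograms bin by bin and is therefore a natural fit for BoVW features. The sketch below shows one common multi-channel form, $K(x, y) = \exp\left(-\sum_{c} D_c(x, y) / A_c\right)$, where $D_c$ is the $\chi^{2}$ distance in descriptor channel $c$ and $A_c$ is a per-channel scale. The channel names, the scale values, and the function names are assumptions for illustration; the scales are exactly the kind of kernel parameter that needs tuning.

```python
import math

def chi2_distance(x, y):
    """Chi-square distance between two non-negative histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)

def multichannel_chi2_kernel(xs, ys, scales):
    """Kernel value for two videos described by one histogram per channel.

    xs, ys : dict mapping channel name to histogram
    scales : dict mapping channel name to a positive normalizer A_c
    """
    return math.exp(-sum(chi2_distance(xs[c], ys[c]) / scales[c] for c in xs))

def gram_matrix(videos, scales):
    """Symmetric matrix of kernel values between all pairs of videos."""
    return [[multichannel_chi2_kernel(p, q, scales) for q in videos] for p in videos]

# Hypothetical two-channel BoVW histograms for three videos
videos = [
    {"hog": [0.7, 0.3], "mbh": [0.6, 0.4]},
    {"hog": [0.6, 0.4], "mbh": [0.7, 0.3]},
    {"hog": [0.1, 0.9], "mbh": [0.2, 0.8]},
]
scales = {"hog": 1.0, "mbh": 1.0}
for row in gram_matrix(videos, scales):
    print([round(value, 3) for value in row])
```

A matrix of this kind can be handed to any SVM implementation that accepts a precomputed kernel, with one binary classifier per action (punch, kick, push back) or a built-in multi-class scheme.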

10. Conclusion

The objective of this article was to explore the concepts of video surveillance and, specifically, human action recognition (HAR), because automated human activity recognition is becoming crucial in the modern era. The article discussed the concepts of human action recognition in depth. It presented a multi-dimensional taxonomy of HAR, followed by different techniques for detecting humans in image and video data. Furthermore, the article provided a survey and analysis of related problems in HAR using various video processing techniques, and broadly surveyed the work done in HAR using 3-D space-time features. Specifically, it presented the types of features used for HAR along with the feature detector-descriptor approach to modeling human activity. The article also discussed classification models used for human activity recognition. In addition, it presented current computer vision tools, followed by benchmark human activity datasets.

We have used the survey and study presented in this article to propose our research work. The article presented a detailed schematic diagram of the proposed work and discussed each step in detail. Our method differs from previous Bag-of-Visual-Words (BoVW) generation techniques: in our proposal, we select the strongest spatio-temporal interest points (STIP), which are scale, rotation, and viewpoint invariant, to reduce the level of noise and redundancy. Based on this survey, we have identified the best feature descriptors for the feature detector, which carry the maximum physical and temporal information in video data. Surveying and reviewing existing work helps us to understand and answer the following questions better: what the different video processing domains are, and which steps are involved in general video surveillance systems; which approaches exist for recognizing human activity; and how to detect and analyze actions and behavior for different HAR systems, such as single-person activity, two-person interaction, and crowd behavior. A careful study of this article can help newcomers to understand the field and apply this knowledge in promising directions of research.

References

1. Lao, Wang, Zhang. Human running detection: Benchmark and baseline. Computer Vision and Image Understanding. 2016 Dec 1; 153: 143-50.

2. Yang, Chou, Chang, Ssu-Yuan, Guo. A smart surveillance system with multiple people detection, tracking, and behavior analysis. In: 2016 International Symposium on VLSI Design, Automation and Test (VLSI-DAT). IEEE. 2016 Apr 25; 1-4.

3. Marsden, McGuinness, Little, O’Connor. Holistic features for real-time crowd behaviour anomaly detection. In: 2016 IEEE International Conference on Image Processing (ICIP). IEEE. 2016 Sep 25; 918-922.

4. Yang, Liao. An efficient anomaly detection approach in surveillance video based on oriented GMM. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2016 Mar 20; 1981-1985.

5. Al-Nawashi, Al-Hazaimeh, Saraee. A novel framework for intelligent surveillance system based on abnormal human activity detection in academic environments. Neural Computing and Applications. 2017 Dec 1; 28(1): 565-72.

6. Nazare Jr, Schwartz. A scalable and flexible framework for smart video surveillance. Computer Vision and Image Understanding. 2016 Mar 1; 144: 258-75.

7. Devanne, Berretti, Pala, Wannous, Daoudi, Del Bimbo. Motion segment decomposition of RGB-D sequences for human behavior understanding. Pattern Recognition. 2017 Jan 1; 61: 222-33.

8. Sivarathinabala, Abirami. Human interaction recognition using improved spatio-temporal features. In: Proceedings of 3rd International Conference on Advanced Computing, Networking and Informatics. Springer, New Delhi. 2016; 191-199.

9. Zulkifley, Samanu, Zulkepeli, Kadim, Woon. Kalman filter-based aggressive behaviour detection for indoor environment. In: Information Science and Applications (ICISA). Springer, Singapore. 2016; 829-837.

10. Cheng, Chen, Huang. A hybrid background subtraction method with background and foreground candidates detection. ACM Transactions on Intelligent Systems and Technology (TIST). 2015 Oct 16; 7(1): 7.

11. Arroyo, Yebes, Bergasa, Daza, Almazán. Expert video-surveillance system for real-time detection of suspicious behaviors in shopping malls. Expert Systems with Applications. 2015 Nov 30; 42(21): 7991-8005.

12. Kumar, Sureshkumar. Abnormal crowd detection and tracking in surveillance video sequences. International Journal of Advanced Research in Computer and Communication Engineering. 2014 Sep; 3(9).

13. Zhang, Yan, Conci, Sebe. You talkin’ to me: Recognizing complex human interactions in unconstrained videos. In: Proceedings of the 22nd ACM International Conference on Multimedia. ACM. 2014 Nov 3; 821-824.

14. Jiang, Chen, Zhao, Cai. A real-time fall detection system based on HMM and RVM. In: 2013 Visual Communications and Image Processing (VCIP). IEEE. 2013 Nov 17; 1-6.

15. Tsai, Chuang, Tseng, Wang. The optical flow-based analysis of human behavior-specific system. In: 2013 1st International Conference on Orange Technologies (ICOT). IEEE. 2013 Mar 12; 214-218.

16. Cristani, Raghavendra, Del Bue, Murino. Human behavior analysis in video surveillance: A social signal processing perspective. Neurocomputing. 2013 Jan 16; 100: 86-97.

17. Suk, Ramadass, Jin, Prabhakaran. Video human motion recognition using knowledge-based hybrid method. In: 2010 IEEE International Symposium on Multimedia. IEEE. 2010 Dec 13; 65-72.

18. Elarbi-Boudihir, Al-Shalfan. Intelligent video surveillance system architecture for abnormal activity detection. In: The International Conference on Informatics and Applications (ICIA2012). 2012 Jun; 102-111.

19. Niu, Jiao, Han, Wang. Real-time multiperson tracking in video surveillance. In: Fourth International Conference on Information, Communications and Signal Processing, 2003 and the Fourth Pacific Rim Conference on Multimedia. Proceedings of the 2003 Joint. IEEE. 2003 Dec 15; 2: 1144-1148.

20. Dedeoglu. Moving object detection, tracking and classification for smart video surveillance. Master’s Thesis, Bilkent University, Ankara. 2004 Aug.

21. Chaquet, Carmona, Fernández-Caballero. A survey of video datasets for human action and activity recognition. Computer Vision and Image Understanding. 2013 Jun 1; 117(6): 633-59.

22. A survey on behavior analysis in video surveillance for homeland security applications. In: 2008 37th IEEE Applied Imagery Pattern Recognition Workshop. IEEE. 2008 Oct 15; 1-8.

23. Mishra, Saroha. A study on classification for static and moving object in video surveillance system. International Journal of Image, Graphics and Signal Processing. 2016 May 1; 8(5): 76.

24. Thuc, Lee, Hwang, Yoo, Choi. A review on video-based human activity recognition. Computers. 2013 Jun; 2(2): 88-131.

25. UCF101 – Action Recognition Data Set [Internet]. Crcv.ucf.edu. 2019 [cited 18 November 2016]. Available from: https://www.crcv.ucf.edu/research/data-sets/human-actions/ucf101/.

26. Computer Vision Datasets [Internet]. Clickdamage.com. 2019 [cited 19 November 2016]. Available from: http://clickdamage.com/sourcecode/cv_datasets.php.

27. Serre Lab HMDB: a large human motion database [Internet]. Serre-lab.clps.brown.edu. 2019 [cited 25 November 2016]. Available from: http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/.

28. SDHA 2010 High-level Human Interaction Recognition Challenge [Internet]. Cvrc.ece.utexas.edu. 2019 [cited 30 November 2016]. Available from: http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html#%20Data.

29. Software & Dataset – Yu Kong @ Northeastern University [Internet]. Sites.google.com. 2019 [cited 20 October 2016]. Available from: https://sites.google.com/site/alexkongy/software.

30. Wang, Kläser, Schmid, Cheng-Lin. Action recognition by dense trajectories. In: CVPR 2011-IEEE Conference on Computer Vision & Pattern Recognition. IEEE. 2011 Jun 20; 3169-3176.

31. Lin, Hsu, Lin. Recognizing human actions using NWFE-based histogram vectors. EURASIP Journal on Advances in Signal Processing. 2010 Feb 1; 2010: 9.

32. Veeraraghavan, Roy-Chowdhury, Chellappa. Matching shape sequences in video with applications in human movement analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2005 Dec; 27(12): 1896-909.

33. Poppe. A survey on vision-based human action recognition. Image and Vision Computing. 2010 Jun 1; 28(6): 976-90.

34. Wang, Schmid. Action recognition with improved trajectories. In: Proceedings of the IEEE International Conference on Computer Vision. 2013; 3551-3558.

35. Wang. Intelligent multi-camera video surveillance: A review. Pattern Recognition Letters. 2013 Jan 1; 34(1): 3-19.

36. Vezzani, Baltieri, Cucchiara. People reidentification in surveillance and forensics: A survey. ACM Computing Surveys (CSUR). 2013 Nov 1; 46(2): 29.

37. Sempena, Maulidevi, Aryan. Human action recognition using dynamic time warping. In: Proceedings of the 2011 International Conference on Electrical Engineering and Informatics. IEEE. 2011 Jul 17; 1-5.

38. Implementation H. HOG (Histogram of Oriented Gradients) with Matlab Implementation [Internet]. Farshbafdoustar.blogspot.com. 2019 [cited 19 December 2016]. Available from: http://farshbafdoustar.blogspot.com/2011/09/hog-with-matlab-implementation.html.

39. Dawn, Shaikh. A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector. The Visual Computer. 2016 Mar 1; 32(3): 289-306.

40. Vijayakumar, Nedunchezhian. A study on video data mining. International Journal of Multimedia Information Retrieval. 2012 Oct 1; 1(3): 153-72.

41. Wang, Kläser, Schmid, Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision. 2013 May 1; 103(1): 60-79.

42. Wang, Kläser, Schmid, Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision. 2013 May 1; 103(1): 60-79.

43. Cheng, Cai. Stratified pooling based deep convolutional neural networks for human action recognition. Multimedia Tools and Applications. 2017 Jun 1; 76(11): 13367-82.

44. Laptev, Marszalek, Schmid, Rozenfeld. Learning realistic human actions from movies.

45. Master control Room: Mistral – Critical Infrastructure Protection, Safe city [Internet]. Mistral Solutions. 2019 [cited 19 December 2016]. Available from: https://www.mistralsolutions.com/homeland-security/solutions/master-control-room/.

46. Shu, Yun, Samaras. Action detection with improved dense trajectories and sliding window. In: Workshop At the European Conference on Computer Vision. Springer, Cham. 2014 Sep 6; 541-551.

47. Murthy, Radwan, Goecke. Dense body part trajectories for human action recognition. In: 2014 IEEE International Conference on Image Processing (ICIP). IEEE. 2014 Oct 27; 1465-1469.

48. Jianbo, Tomasi. Good features to track. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 1994 Jun 21; 593-600.

49. Yang, Chou, Chang, Ssu-Yuan, Guo. A smart surveillance system with multiple people detection, tracking, and behavior analysis. In: 2016 International Symposium on VLSI Design, Automation and Test (VLSI-DAT). IEEE. 2016 Apr 25; 1-4.

50. Recognition of human actions [Internet]. Nada.kth.se. 2019 [cited 30 November 2016]. Available from: http://www.nada.kth.se/cvap/actions/.

51. Shechtman E. Space-Time Behavior Based Correlation [Internet]. Wisdom.weizmann.ac.il. 2019 [cited 2 December 2016]. Available from: http://www.wisdom.weizmann.ac.il/~vision/BehaviorCorrelation.html.

52. Nourel, Slimani, Benezeth, Souami. Human interaction recognition based on the co-occurence of visual words. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2014; 455-460.

53. Soomro, Zamir. Action recognition in realistic sports videos. In: Computer Vision in Sports. Springer, Cham. 2014; 181-208.

54. Tsoi. Motion boundary trajectory for human action recognition. In: Asian Conference on Computer Vision. Springer, Cham. 2014 Nov 1; 85-98.

55. Software Tools for Vision [Internet]. 1995 [cited 20 January 2017]. Available from: http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/CLARK1/cvsoftw.htm.

56. Action Recognition Page [Internet]. Www2.eecs.berkeley.edu. 2019 [cited 20 January 2017]. Available from: https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/action/.

57. Action Recognition Page [Internet]. Www2.eecs.berkeley.edu. 2019 [cited 20 January 2017]. Available from: https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/action/.

58. Computer Vision System Toolbox [Internet]. https://in.mathworks.com. 2019 [cited 19 April 2018]. Available from: https://in.mathworks.com/help/vision/ug/image-classification-with-bagof-visual-words.html.

59. josephmisiti/awesome-machine-learning [Internet]. GitHub. 2019 [cited 21 December 2018]. Available from: https://github.com/josephmisiti/awesome-machine-learning.

60. Viola, Jones. Rapid object detection using a boosted cascade of simple features. In: IEEE. 2001 Dec 8; 511.
