Abstract
An autonomous system that raises an alarm for violent or suspicious incidents could greatly strengthen security systems. Such a system could also be useful in other applications such as patient monitoring, retail shops, and child surveillance. However, current technology cannot yet analyze video effectively, since most video surveillance systems do not understand the events happening in the video. Complex changes in the environment caused by camera motion, dynamic scenes such as crowds, changes in lighting intensity, different viewing angles, and wide spatial variation (e.g. the size of the subject of interest relative to the frame) and temporal variation (e.g. the speed at which subjects perform actions) make video analysis very challenging. Despite these difficulties, methods for improving video analysis are still being actively explored. Some approaches to violent incident detection resemble the methods used for detecting abnormal incidents. Instead of detecting whether an incident has occurred, we attempt to build a model that detects the actions related to violence. In this paper, an online detection model is built to detect specific violence-related actions. The model is built with reference to image object detection (Faster Region-based Convolutional Neural Network, Faster-RCNN) and video action detection (Tube Convolutional Neural Network, T-CNN).
Introduction
Recently, there has been considerable research on video understanding problems, driven by the increasing demand for video analytics. The general tasks include extracting video feature representations [1, 2, 3], detecting actions [4, 5, 6], detecting anomalies [7, 8] and description tagging [9]. One motivation behind this research is to enable surveillance systems to autonomously detect the occurrence of specific events in video [6].
Detection tasks commonly involve video feature extraction, proposal generation, localization and classification [4, 10, 11]. Video action feature extraction is itself a popular research field. Earlier approaches proposed various kinds of hand-crafted features. However, hand-crafted features are considered insufficiently discriminative [14, 15, 16], because they capture low-level features that may look similar across classes as the number of action classes grows. Recent approaches mostly employ deep learning to extract video action features [1, 3]. Deep learning models can be trained to extract both low-level and high-level features [1]. Furthermore, most deep-learning-based classification approaches obtain better results than hand-crafted methods [1, 3, 17]. After the video features are extracted, a set of proposals can be generated to roughly capture the actions within the scene [4, 10] for the purpose of action localization and classification.
Attempts to detect violent or suspicious actions are still rather uncommon. Most of the available methods are implemented as abnormality or anomaly detection [8, 18, 19, 20]. However, not every anomaly is related to violence. In some specific contexts such as ATM surveillance, an anomaly can be directly related to a suspicious event [20]. The approach proposed in [21] is trained on the Hockey Fight dataset, which is directly related to violence detection.
In this paper, the focus is on specific suspicious actions such as kick, punch, hit, shoot and throw. The task is similar to typical classification or detection tasks, but focuses on action classes related to violence. The detection model mainly combines the region proposal network of the Faster Region-based Convolutional Neural Network (Faster-RCNN) [10] with the Tube Convolutional Neural Network (T-CNN) model. Faster-RCNN was proposed to detect objects in images, while T-CNN was proposed for offline action detection, which requires the full-length video as input. To enable the model to work in online mode, we modify the pooling operator to take both the new input clip and the previous pooled result as input. When consecutive clips are fed into the network, the strong activations from different clips are pooled into a fixed-size output. The result of this accumulated pooling should be similar to that of the original pooling proposed in T-CNN when the full-length video is used as input. Even without the full-length video feature, the accumulated pooling feature should capture some features of the full-length action, which could help the model recognize an action before it has seen the whole action video. We describe related work in Section 2 and our proposed network in Section 3. The results and discussion are presented in Section 4. Lastly, we conclude our work in Section 5.
Related work
Many efforts have been made to find a good video action representation [1, 3, 12, 22, 23]. In earlier approaches, most representations were hand-crafted. Spatio-Temporal Interest Points (STIP) [12] detects Harris 3D corners at multiple spatio-temporal scales. Each point is described by a Histogram of Oriented Gradients (HOG) and a Histogram of Optical Flow (HOF). In 2013, the Dense Trajectory feature [13] was proposed. Unlike STIP, Dense Trajectory samples trajectories over the whole video frame at equally spaced x and y intervals. The features are described by HOG, HOF and Motion Boundary Histogram (MBH). The authors found experimentally that dense sampling outperforms point-based features, and that the MBH descriptor outperforms HOF, partly because it is less affected by camera motion. In their later work, Improved Dense Trajectory (IDT) was proposed [24]. IDT combines camera motion cancellation with MBH and a human detector. Other action classification approaches often report that adding IDT improves their classification accuracy [17, 25, 26].
Most hand-crafted features are low-level features that are insufficiently discriminative [14, 15, 16]. Action Bank was proposed in [15] as a more discriminative, high-level hand-crafted feature. It works by matching the input feature against a set of template action features in the action bank. Tested on the KTH and UCF Sports datasets, Action Bank performs better than STIP. However, the mean accuracies of Action Bank and STIP on the HMDB51 dataset are only 26.9% and 20.2% respectively. Motionlets, proposed in [14] as a mid-level feature that measures and ranks salient motion features, performs better than other hand-crafted features, with a score of 33.7% on HMDB51.
Besides hand-crafted features, deep learning action features are also often used in recent video-based problems [1, 9, 27]. Inspired by the success of deep learning on image-based problems, adaptations of image deep learning models to video have been proposed recently [1, 5, 28]. Different deep learning networks are proposed in [28] to extract features from multiple frames at different stages. Multi-stream deep learning features have also been proposed to extract spatial and temporal features separately [3, 29]. The spatial feature extractor is often derived from an existing deep learning model pretrained on ImageNet [30], while the temporal stream is trained to capture temporal features. In [3], the temporal stream takes the optical flow of the video as input. These models also contain fusion layers that integrate the spatial and temporal features before feeding them to the classifier. Both [3] and [29] show that temporal features are necessary for action classification tasks. The Convolution 3D (C3D) deep learning model uses 3D convolution filter kernels and 3D max pooling operators [1]. The C3D model is much simpler than other video-based deep learning models: it uses only 3D convolution filters, 3D max pooling and the ReLU activation function. Similar to the findings for the VGG network [26], C3D also uses small filter kernels such as 3
Performance summary of various video classifiers and detectors
Beyond classifying actions in videos, some approaches localize (detect) actions or generate proposal volumes that contain them. Inspired by the success of the image VGG network [23], the Action Tubelet Detector (ACT) [31] uses a VGG network with two convolutional branches: one scores the action, while the other localizes the tube that contains the complete action. The drawback of this method is the difficulty of regressing the action tubelet when the actor moves away from the initial location of the box.
An attempt to detect and localize actions online using the actor’s pose was proposed in 2016 [6], employing superpixel extraction and a convolutional pose machine [32] estimator. Foreground and background superpixels are estimated to refine the estimate of the actor’s pose. To classify the actions, the authors proposed a Structural SVM whose classification confidence can increase over time as more frames of the action are collected. However, the model’s performance depends largely on the initial action prediction, which affects the overall accuracy of localization and prediction.
The Tube Convolutional Neural Network (T-CNN) was proposed in [5]. Within T-CNN, the Tube Proposal Network (TPN) is built from C3D and a proposal network. Similar to the Region Proposal Network in Faster-RCNN, the C3D features are used to estimate foreground and background class confidence along with proposal bounding boxes. The proposal boxes are either dropped or fed to the next stage based on their action score, and are then processed by a set of fully connected layers to localize the action tubes. These tubes are linked across consecutive clips based on their overlap and actionness scores. The authors also introduced Tube-of-Interest (ToI) pooling as a 3D version of the Region of Interest (RoI) pooling in Fast-RCNN [33]. ToI pooling samples the maximum responses from tube feature maps of varying size and produces fixed-length action features. The reported classification score is around 7% higher than that of the original C3D classifier [1].
A violent scene detection method was proposed in [21]. The features are hand-crafted STIP and Motion Scale-Invariant Feature Transform (motion-SIFT) descriptors, and the extracted features are classified by an SVM. The model is trained and tested on the authors’ Hockey Fight dataset. Histogram of Oriented Gradients (HOG) features are also used in [20] with a Random Forest classifier to detect abnormal events on an ATM dataset. For scene anomaly detection, [8] proposed using AlexNet features with an auto-encoder. The auto-encoder is trained on normal video data. When tested on anomalous video, the auto-encoder produces a distinguishable feature, which is detected using a Gaussian distribution.
The detection network used in the experiment (classification network in Fig. 4).
To address the shortage of video action annotations, [34] proposed a model that localizes actions in videos using images. A large number of action-related images are obtained from Google image search. Noisy images are eliminated with a random walker algorithm that discovers outlier clusters. An image-based object detector [35] is used to localize the actors or the actions performed within the images. The image action features are extracted with an image CNN. Action proposals detected in a video are ranked to obtain localized action videos. The ranking works by comparing the features extracted from the Google images with the image features extracted from the proposals detected in the video.
Earlier datasets such as KTH (6 actions), IXMAS (13 actions) and Weizmann (10 actions) were regularly used as training and testing benchmarks [36]. Most classifiers, including earlier hand-crafted models [12, 14, 15], achieve a mean accuracy higher than 90% on them. The CAVIAR dataset is one of the early datasets, introduced in 2003 and 2004 [37]. It was recorded with a fixed CCTV camera from actors performing different scenarios such as walking alone, meeting with others, window shopping, entering and exiting shops, fighting, passing out and leaving a package in a public place.
More recently introduced action datasets are more challenging and come with more action classes. The datasets commonly used today for video classification and localization are HMDB51 [38], UCF101 [39] and YouTube [40]. These newer datasets contain more complex actions; the complexity comes from moving cameras, multiple actors, changing viewing angles and changing lighting. Some action classes are also similar to each other, such as walking and running [16].
Illustration of the data flow in the Region Proposal Network layers.
HMDB51 consists of 51 classes, mostly taken from movies. Each class contains at least 101 clips. It is one of the most challenging publicly available datasets. As shown in Table 1, testing accuracy on this dataset is usually lower than on other datasets. J-HMDB is a subset of HMDB51 that focuses more on upper-body actions. UCF101 is a large sports video dataset with 101 classes across 13320 videos. The YouTube-8M (8 million videos) dataset contains 450000 hours of video for 4716 classes, drawn from public videos on YouTube. This volume of video makes YouTube-8M the largest video dataset available.
More violence-related video datasets are available from VSD [41] and Visilab [21]. Visilab contributed a fighting dataset of 1000 short clips from hockey game recordings. Visilab also prepared a dataset of 100 fighting scenes from movies, accompanied by non-fighting scenes as negative training samples. The VSD dataset consists of fighting scenes from both Hollywood (31 movies) and YouTube (86 videos).
In summary, different approaches to extracting representative action features have been introduced to recognize different actions. With more developed frameworks, these features have been used to perform more complex tasks, one of which is action detection. This task generally involves feature extraction, regression to capture foreground actions, and classification to identify the actions. Although deep learning frameworks have produced improvements, they require large training datasets. Newer video action datasets are mainly taken from movies, which contain more realistic samples. However, most of them are not well annotated.
The proposed network is mainly derived from T-CNN [5]. T-CNN is chosen as the base framework because of its detection accuracy, speed and simplicity. In T-CNN, proposal generation is performed with a sliding-window method. In our proposed network, we replace the proposal generation network with a Region Proposal Network (RPN) to speed up proposal generation. The RPN was proposed in Faster-RCNN [10] for image object proposal generation. The structure of the action proposal and localization network is shown in Fig. 1. Furthermore, the original T-CNN was proposed as an offline detection model. In this paper, we modify T-CNN’s pooling and tube linking operators to accumulate the convolutional feature responses. The modification is made in the classification network to allow the network to operate as an online detection model. The structure of the classification network is shown in Fig. 4.
Feature extraction
In this paper, we follow T-CNN to extract the relevant features. As in T-CNN, the pretrained C3D network takes an 8-frame clip as input. The C3D extraction network consists of 5 convolution stages, each combining 3D convolution filter kernels, a ReLU activation function and 3D max pooling. The output of the fifth convolution stage (Conv5b) serves as the main feature for proposal generation. In addition, the feature of the second convolution stage (Conv2a) is used for localization and classification.
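To make the relation between the two feature maps concrete, the sketch below computes their shapes for an 8-frame clip. It assumes the standard C3D layout (size-preserving convolutions, a first pooling layer that halves only the spatial dimensions, and later pooling layers that halve time and space); the function name and the 240 x 320 frame size are illustrative and are not taken from the paper.

```python
def c3d_feature_shape(layer, frames, height, width):
    """Return (channels, T, H, W) of a C3D feature map for one input clip.

    Assumption: standard C3D layout, where pool1 is 1x2x2 and pool2..pool4
    are 2x2x2, and convolutions keep the spatial/temporal size.
    """
    # layer -> (channels, temporal halvings, spatial halvings) before the layer output
    layout = {
        "conv2a": (128, 0, 1),  # follows pool1, which pools space only
        "conv5b": (512, 3, 4),  # follows pool1..pool4
    }
    channels, t_halvings, s_halvings = layout[layer]
    return (
        channels,
        frames // 2 ** t_halvings,
        height // 2 ** s_halvings,
        width // 2 ** s_halvings,
    )


if __name__ == "__main__":
    print(c3d_feature_shape("conv2a", 8, 240, 320))
    print(c3d_feature_shape("conv5b", 8, 240, 320))
```

Under these assumptions Conv5b collapses the 8 frames into a single temporal slice at 1/16 of the spatial resolution, while Conv2a still holds all 8 frames, which is why the earlier layer is useful for recovering temporal detail.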
Region proposal network
The Region Proposal Network (RPN) was proposed in Faster-RCNN [10] as a proposal generation network. The generated proposals provide a rough estimate of the foreground regions. As in Faster-RCNN, the RPN is used in this model to generate action proposals. The first layer of RPN in Fig. 2 (RPN-Conv/3
Following the implementations of T-CNN and DarkNet [11] for object detection, a number of bounding box size ratios are selected by applying K-means to the ground truth annotation boxes. This speeds up training by initializing the bounding boxes at a better width and height scale. The bounding box coordinates are represented as a 4-D vector
The “rpn bbox pred” layer estimates the deltas
Where
When training RPN,
Illustration of bounding boxes generated for a cell on the feature map.
A number of bounding boxes with the best IOU are selected as foreground and the rest are treated as background. In the T-CNN approach, the background is selected by comparing the “RPN Cls Score” with the expected class feature. Predictions with high foreground confidence that overlap poorly with the ground truth are chosen as background, and predictions that overlap highly with a background annotation box are selected as hard negatives. Some videos contain multiple actors, some performing the actions to be detected (e.g. kick, punch) while others perform actions of no interest, i.e. background actions (e.g. standing, sitting). Hard negative annotations mainly mark people who are performing background actions. The hard negatives help the RPN differentiate between foreground and background actions.
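The overlap test behind this selection can be sketched as follows. The box format `(x1, y1, x2, y2)`, the function names and the two thresholds (0.7 for foreground, 0.3 for background, the values commonly used with Faster-RCNN) are assumptions for illustration, not the settings of this experiment.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, fg_thresh=0.7, bg_thresh=0.3):
    """Label each anchor 1 (foreground), 0 (background) or -1 (ignored).

    Thresholds are hypothetical defaults, not values from the paper.
    """
    labels = []
    for anchor in anchors:
        best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
        if best >= fg_thresh:
            labels.append(1)
        elif best < bg_thresh:
            labels.append(0)
        else:
            labels.append(-1)  # ambiguous overlap, left out of training
    return labels
```

Anchors that fall between the two thresholds are ignored in this sketch so that ambiguous overlaps do not contribute to either class.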
In addition, the selected background and foreground boxes must be equal in number. Otherwise the network might learn to produce more negative predictions to achieve a low loss instead of learning to predict the correct result, since background appears much more often than foreground. To train the “RPN Bbox Pred” layer, the transformation deltas are computed for each selected foreground box. The transformation deltas can be computed as follows:
where
After training, the predictions of the RPN layers can be used to generate proposal boxes. These proposal boxes capture candidate regions for classification. At each pixel of the RPN feature map, the bounding box estimates are computed with Eqs (1) and (2). The RPN tends to generate a large number of proposals, so non-maximum suppression (NMS) is used to reduce overlapping proposals: when two boxes overlap by more than a threshold, the lower-scoring one is removed.
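A minimal sketch of this proposal step is given below: predicted deltas are applied to an anchor, and greedy NMS then filters the decoded boxes. The delta parameterization is the one commonly used with Faster-RCNN and is an assumption here, since it may differ from Eqs (1) and (2); the function names and the 0.5 overlap threshold are likewise illustrative.

```python
import math


def decode(anchor, deltas):
    """Apply (dx, dy, dw, dh) to an anchor (cx, cy, w, h).

    Assumed Faster-RCNN style parameterization; returns (x1, y1, x2, y2).
    """
    cx, cy, w, h = anchor
    dx, dy, dw, dh = deltas
    pcx, pcy = cx + dx * w, cy + dy * h
    pw, ph = w * math.exp(dw), h * math.exp(dh)
    return (pcx - pw / 2, pcy - ph / 2, pcx + pw / 2, pcy + ph / 2)


def _iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if _iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

The greedy loop always keeps the highest-scoring remaining box, so the output order is by descending score.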
After proposal generation and filtering, the coordinates of the proposal bounding boxes are used to crop the feature maps of Conv2a and Conv5b. This method was proposed in T-CNN as temporal skip pooling. It recovers the temporal information lost in Conv5b by sampling the Conv2a feature. These features are cropped with the ToI pooling layer based on the proposal boxes.
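The crop-and-pool operation can be sketched with NumPy as below. This is an illustrative max-pooling over a cropped tube, not the ToI pooling layer of the T-CNN code; the `(C, T, H, W)` layout, integer box coordinates in feature-map units and the output size are assumptions.

```python
import numpy as np


def _bins(length, n_bins):
    """Split range(length) into n_bins non-empty (start, end) index ranges."""
    edges = np.linspace(0, length, n_bins + 1)
    return [
        (int(np.floor(a)), max(int(np.ceil(b)), int(np.floor(a)) + 1))
        for a, b in zip(edges[:-1], edges[1:])
    ]


def toi_max_pool(features, box, out_size=(1, 4, 4)):
    """Max-pool the tube inside `box` to a fixed (T, H, W) output size.

    features: array of shape (C, T, H, W).
    box: (x1, y1, x2, y2) in feature-map coordinates, assumed non-empty.
    out_size: hypothetical output tube size, not the value used in the paper.
    """
    x1, y1, x2, y2 = box
    tube = features[:, :, y1:y2, x1:x2]
    channels, t_len, h_len, w_len = tube.shape
    out_t, out_h, out_w = out_size
    pooled = np.empty((channels, out_t, out_h, out_w), dtype=features.dtype)
    for ti, (t0, t1) in enumerate(_bins(t_len, out_t)):
        for hi, (h0, h1) in enumerate(_bins(h_len, out_h)):
            for wi, (w0, w1) in enumerate(_bins(w_len, out_w)):
                # Keep the strongest response of each channel in this bin.
                pooled[:, ti, hi, wi] = tube[:, t0:t1, h0:h1, w0:w1].max(axis=(1, 2, 3))
    return pooled
```

Because the output size is fixed, tubes cropped from Conv2a (8 temporal slices) and Conv5b (fewer temporal slices) can be pooled to shapes that are straightforward to combine.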
For each proposal box, the C3D features in Conv2a and Conv5b are cropped and max-pooled into a fixed tube size of 1
Action classification
The block diagram in Fig. 1 summarizes the detection network up to the localization network. After the localized tubes are obtained, the features within these tubes are used to classify the actions. Tubes from consecutive predictions may be generated from the same action. These tubes are linked depending on their overlap score and average foreground score. The overlap score comes from the overlap between the last frame of the first tube and the first frame of the second tube. The linked tubes are selected with the following equations:
Where
In the T-CNN framework, localized tubes are generated by the TPN. Consecutive tubes from
To adapt T-CNN for online detection, the linked action score, overlap score and ToI-pooled features are always saved into a buffer. When a tube is linked with previous clips, the new tube feature is pooled into the buffer. The distinct features of an action are expected to accumulate in this buffer as the action is observed over several clips. This pooled tube buffer feature is used as input to the classifier. A 1
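One way to picture the accumulating buffer is as an element-wise running maximum over the pooled features of the linked clips. The sketch below assumes that reading; the class name and the feature shapes are hypothetical, and the linked action and overlap scores that the buffer also stores are omitted for brevity.

```python
import numpy as np


class AccumulatedPoolBuffer:
    """Running element-wise maximum of ToI-pooled features for one linked tube.

    Assumption: accumulation is a max over clips, so that the buffer after N
    clips equals max-pooling over all N clip features at once.
    """

    def __init__(self):
        self.features = None
        self.num_clips = 0

    def update(self, clip_features):
        """Pool a new clip's fixed-size feature into the buffer and return it."""
        clip_features = np.asarray(clip_features, dtype=np.float32)
        if self.features is None:
            self.features = clip_features.copy()
        else:
            np.maximum(self.features, clip_features, out=self.features)
        self.num_clips += 1
        return self.features
```

With this formulation the classifier can be run after every clip, and the buffer content after the last clip matches what a max-pooling over the whole linked tube would give.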
Number of training clips used in the experiment according to action class
Dataset
The dataset used in this experiment is taken from HMDB51, the VSD movie fighting scenes and the hockey fighting scenes. For this experiment, only 6 classes are manually annotated (kick, punch, hit, shoot, throw and background actions). The annotated videos are sliced into 8-frame clips. Each clip has at least one foreground annotation. A total of 1785 clips are annotated. For each class, 10 random clips are removed from the training dataset for testing. The standard train/test ratio is not used in this experiment because of the lack of training samples. The details of the data are shown in Table 2.
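The data preparation described above can be sketched as follows. Whether the clips overlap and how the random hold-out was drawn are not stated, so non-overlapping clips, a dropped trailing remainder and a seeded shuffle are assumptions; the function names are illustrative.

```python
import random


def slice_into_clips(num_frames, clip_len=8):
    """Return (start, end) frame ranges of consecutive non-overlapping clips.

    Trailing frames that do not fill a whole clip are dropped (assumption).
    """
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, clip_len)]


def split_per_class(clips_by_class, n_test=10, seed=0):
    """Hold out n_test random clips of every class for testing."""
    rng = random.Random(seed)
    train, test = {}, {}
    for label, clips in clips_by_class.items():
        shuffled = list(clips)
        rng.shuffle(shuffled)
        test[label] = shuffled[:n_test]
        train[label] = shuffled[n_test:]
    return train, test
```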
Cross validation True/False Positive/Negative condition table
Additionally, if the IOU is below threshold (0.5), it is considered as False Positive if its score
The video sequence is also randomized during training. The trained action proposal network is used to extract the video features for training the localization network, and the trained action localization network is used to extract the video features for training the classification network. The extracted features for the classification network are shuffled and combined into a dataset of 355 batches. Each batch contains feature samples for the 6 action classes. Some action features are not used in classification training in order to balance the sample size of each action class.
The network is implemented with the C3D branch of Caffe. The three parts (action proposal, localization and classification networks) are trained separately to ease debugging. The C3D weights pretrained on the Sports-1M dataset are fine-tuned on our dataset while training the action proposal network. The proposal features are then extracted to train the localization network, and the features extracted from the localization network are used to train the classification network.
The action proposal and localization networks are trained with SGD using a batch size of one. The momentum is 0.9 and the weight decay is 0.0005, following the standard setting [30]. The learning rate proposed in [30] is 0.001, but it is changed to 0.0001 in this experiment to prevent an unstable training loss. The weights are updated in every backward iteration (1775 weight update opportunities).
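For reference, a single SGD step with momentum and weight decay can be written as below. The update rule follows the form used by Caffe's SGD solver, with the weight decay term added to the gradient; the function name is illustrative, and the defaults are the hyperparameters quoted above.

```python
def sgd_momentum_step(weight, grad, velocity, lr=0.0001, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and L2 weight decay.

    velocity <- momentum * velocity - lr * (grad + weight_decay * weight)
    weight   <- weight + velocity
    Works on floats or NumPy arrays of matching shape.
    """
    velocity = momentum * velocity - lr * (grad + weight_decay * weight)
    return weight + velocity, velocity
```

With a batch size of one, this step runs once per training clip, so a noisy clip moves the weights directly; the smaller learning rate limits the size of each such move.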
For the classification network, the training batch size is set to 6. Batched training produces a more stable loss curve, but it reduces the number of back-propagations and weight updates. Adaptive Moment Estimation (Adam) and Adaptive Gradient (AdaGrad) optimization are used in training because the training loss drops faster. The weight parameters are updated in every iteration.
Result evaluation metrics
Precision-Recall curve (PR curve)
Detection accuracy is commonly measured with the Precision-Recall method used in the PASCAL Visual Object Classes (VOC) Challenge [44]. Recall,
Precision-Recall (
Block diagram of the structure of the action classifier.
The testing Recall-Precision curve of the action proposal network.
The action proposal testing ROC curve.
Example output of the RPN during testing. Left column shows the foreground confidence (enlarged 16 times to match the video size).
Some examples of testing output. Images on the left are proposal bounding boxes. Images on the right are the localized proposals after non-maximum suppression. The localization network manages to adjust some proposals to better capture the actors.
Failing cases of the RPN and localization.
Training loss of the classification network.
ROC plot and Recall-Precision curve of the overall detection accuracy.
The average precision (AP) for a class prediction is calculated as the mean of the precision at equally spaced recall levels. This metric is mainly used to measure accuracy for a two-class prediction (foreground and background). For a multiclass task, the AP is evaluated separately for each class, and the mean AP (mAP) is computed by averaging the per-class APs.
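As an illustration of this metric, the sketch below computes an interpolated AP over equally spaced recall levels, in the style of the 11-point PASCAL VOC measure, and averages it over classes. The choice of 11 levels and the interpolation (taking the best precision at any recall at or above each level) are assumptions about the exact variant; the function names are illustrative.

```python
def precision_recall(scores, labels):
    """Precision and recall after each detection, in descending score order.

    labels: 1 for a true positive detection, 0 for a false positive.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / total_pos if total_pos else 0.0)
    return precisions, recalls


def average_precision(scores, labels, levels=11):
    """Mean interpolated precision at `levels` equally spaced recall levels."""
    precisions, recalls = precision_recall(scores, labels)
    total = 0.0
    for k in range(levels):
        level = k / (levels - 1)
        # Best precision achievable at this recall level or beyond.
        reachable = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(reachable, default=0.0)
    return total / levels


def mean_average_precision(per_class):
    """Average the AP over classes; per_class maps name -> (scores, labels)."""
    aps = [average_precision(s, l) for s, l in per_class.values()]
    return sum(aps) / len(aps)
```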
The ROC accuracy metric is also used in some papers in similar fields [1, 4, 5]. Using the same confusion matrix (Table 3), the ROC curve is plotted as sensitivity against 1 - specificity.
The ROC is often reported together with another accuracy metric, the AUC, which is the area under the ROC curve.
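A minimal sketch of both quantities is shown below: the ROC points are obtained by sweeping the decision threshold over the sorted scores, and the AUC is the trapezoidal area under those points. Tied scores are not grouped in this simplified version, both classes are assumed to be present, and the function names are illustrative.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points of the ROC curve.

    labels: 1 for positive samples, 0 for negative samples.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        # x = 1 - specificity, y = sensitivity
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a curve given as (x, y) points, by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

An AUC of 0.5 corresponds to a random classifier and 1.0 to a perfect ranking, which gives a scale for reading reported AUC values.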
Action proposal
The Recall-Precision plot in Fig. 5 shows the accuracy of the foreground/background detection by varying threshold value,
The ROC curve plotted in Fig. 6 shows the relationship between true positive (
The brighter regions in the foreground confidence image (Fig. 7) are formed by averaging the confidence scores of 12 boxes. This confidence map shows where the proposal boxes spawn. Ideally, they should spawn near an actor so that the RPN proposal localization and the TPN localization are much easier to predict. The trained RPN tends to capture humans or motion within the scene. The green boxes are proposal boxes spawned from boxes with a confidence higher than or equal to 0.7. The RPN still captures background actors as foreground, as shown in the fourth row of Fig. 7. This is most probably because the dataset contains too few samples with background actors.
Localization and classification
The proposal network estimates the rough position of the foreground action (left column in Fig. 8). These proposals are filtered by non-maximum suppression to reduce overlapping proposals. The filtered proposal boxes are used to crop the C3D features for the localization network. Example outputs of the localization network are shown in the right column of Fig. 8. The localized tubes are filtered again by another non-maximum suppression to remove highly overlapping tubes. The localization performance depends heavily on the proposal estimates. For example, the failure of the proposal network to detect the shooting action in Fig. 9 causes the failure to localize the action.
The best testing accuracy is achieved by the classifier trained with AdaGrad optimization. The Adam-trained classifier has better accuracy at low recall rate (recall
For both trained classifiers, most predictions have low confidence scores because training samples are still lacking. The mean area under the ROC curve (AUC) achieved by the AdaGrad-trained and Adam-trained classifiers is 0.74 and 0.66 respectively, while the mAP is 0.39 and 0.34 respectively.
Discussion
The trained action proposal network can estimate the foreground actions, as shown in Fig. 7. The localization network manages to transform the proposal boxes to obtain a better IOU overlap (Fig. 8). The annotations were made with the intention of capturing specific foreground actions. However, the network still detects humans in the video rather than the actions, as shown in Fig. 7. This might be caused by the rectangular annotations, which always contain the human body regardless of the action performed, and by the lack of hard negative training samples.
The accuracy of the classification network is rather low, most probably due to the lack of training samples. In Fig. 11, both the AP and ROC curves show that class prediction is only slightly better than a random classifier, except for the background class where the ROC false positive
The training loss in Fig. 10 shows that dropout helps reduce the training loss at a faster rate. The dropout disconnections might have emphasized the losses and updates of the connected weights. When training without dropout, most of the weight updates hold a non-zero value, so after normalization each weight update becomes smaller. Dropout also introduces more spikes in the training loss, because a different sub-network is trained at every iteration. The effect of dropout in reducing overfitting is not observable in this case because of the small number of training samples.
Adam is slightly slower than Stochastic Gradient Descent (SGD) in computation speed, but its loss converges the fastest. Regardless of the optimizer used, the loss plots in Fig. 10 show that the loss could still be reduced with more training samples. When comparing the Adam and AdaGrad training losses, some obvious spikes appear at the same training iterations. This suggests that the training data at those iterations is rather noisy or different from the other samples.
Classification and detection of actions have been studied in other works. However, annotations and benchmark results for specific violence-related actions could not be found or are not publicly available. Furthermore, this experiment is conducted at a small scale and is not comparable with general action classification benchmarks such as HMDB51.
Conclusion and future work
In this experiment, minor changes were made to the tube linking and ToI pooling operators of T-CNN [5] to accumulate the responses from consecutive tubes. The proposal and localization networks capture some of the actors performing the actions (kicking, punching, hitting, shooting and throwing). The testing results show that the proposal network can detect and differentiate foreground and background actions with a mean AP of 0.6625. However, the training loss of the classifier network could not converge sufficiently because of the lack of training samples. The best detection result obtained has a mean AUC (area under the ROC curve) of 0.74 and an mAP (mean of the per-class average precision from the Recall-Precision curves) of 0.39.
Localization with rectangular boxes and tubes is simple. However, it may include large areas of background in the foreground annotations. Annotating only the part of the actor’s body that reflects an action may help the proposal network capture action-related parts instead of the whole human body.
To counter the issue of insufficient data, data augmentation is often used in machine learning. Augmenting video with frame striding was proposed in [29]. In addition, image augmentation approaches can be applied to video, e.g. the random horizontal flip and random patch cropping proposed in [30].
The whole network is separated into 3 parts (proposal, localization and classification). Training of the later parts does not propagate back to the earlier parts of the network. Training the network end-to-end may yield more accurate results.
Acknowledgments
This work is supported by the Malaysia Ministry of Higher Education (MOHE) under the Fundamental Research Grant Scheme (FRGS) (ref: FRGS/1/2016/TK04/TARUC/02/1).
