Abstract
Violence detection is a challenging task in the computer vision domain. Violence detection frameworks depend on detecting changes in crowd behaviour. Violence erupts from disagreement over an idea, injustice or severe discord. Every country aims to maintain law, order and peace, so violence detection is an important task for the authorities. Traditional violence detection methods depend heavily on hand-crafted features. The field is now transitioning to Artificial Intelligence based techniques, and automatic feature extraction and classification from images and videos is the new norm in the surveillance domain. Deep learning provides a platform on which non-linear features can be extracted, self-learnt and classified with an appropriate tool. One such tool is the Convolutional Neural Network, also known as ConvNet, which can automatically extract features and classify them into their respective classes. To date there is no survey of techniques for deciphering violent behaviour using ConvNets. We hope that this survey becomes a baseline for future violence detection and analysis in the deep learning domain.
Introduction
Human behaviour is complex. A group of people united by a common cause can behave in different ways to make its agenda visible to the world. One such behaviour is violence, which has become more common for various reasons. Public places are the most affected, because violence there can create panic. Roads, malls, railway stations, busy city intersections and many other places become targets of violent crowd behaviour. Violence in public places is unacceptable and threatens the lives of everyone present, and public property is lost to arson, riots and so on. Computer vision provides a platform on which we can decipher crowd behaviour and take cues from developing situations to maintain law and order. Real-time detection of violent activities can immensely benefit the city administration. Two approaches have become the norm for deciphering these violent activities. One is to use shallow, handcrafted features for detecting violent behaviour. The other is to use deep learning methods and tools. Handcrafted approaches require discriminant features to be devised manually, whereas deep learning systems pass the data through multiple layers and learn these features by themselves. Thus, the field has evolved from the shallow approach to automatic classification using deep learning methods and tools. Deep learning is about finding and exploiting the discriminant feature space, which must be handled in a structured way to reach an optimal classification solution. These discriminant features undergo multiple layered operations and classification algorithms [1] [2] to solve challenging computer vision problems.
The deep learning concepts [3–7], supported by machine learning classifiers, have provided tremendous impetus in solving challenging problems. Machine learning provides the means to traverse the feature space and perform classification in the respective domains. Hybrid architectures [8] are a new addition to deep learning-based solutions.
One deep learning tool that has become immensely popular is the Convolutional Neural Network (CNN). Different analysis techniques such as principal component analysis, clustering of image patches, dictionary-based approaches and sparse representations have been used alongside CNN methods for classification [2]. A preliminary introduction to CNN applications is given in [9]. A medical application was reported as early as 1995 [10]. LeNet [11], a popular application to handwritten digit recognition, became a path breaker in the field of CNNs.
An excellent contribution to computer vision was made by [12] in the ImageNet challenge, where AlexNet was proposed for the first time. This paved the way for wider use of Convolutional Neural Networks in deep learning approaches. Thereafter, deeper architectures were proposed [13] and deployed in various application areas. A range of practical and research-oriented applications of deep learning techniques has been reported in the literature [14–21], covering pattern recognition, mathematical model development, data mining, artificial intelligence and so on. Having established this premise, we use Convolutional Neural Networks (CNNs) as the tool through which we conduct our survey on violence analysis.
Violence is defined as the intentional use of physical force or power, threatened or actual, against oneself, another person, or against a group or community, that either results in or has a high likelihood of resulting in injury, death, psychological harm, maldevelopment or deprivation [22]. This survey makes the following contributions. It dissects violence analysis to give a comprehensive overview of ConvNets from different angles. It builds a taxonomy that can motivate researchers to consider ConvNets as a viable option for their research. Since our focus is entirely on ConvNet-based methods for violence analysis, it discusses their performance and the challenges that prompt researchers to design more effective algorithms. It briefly covers the popular datasets for violence analysis. Finally, it opens a discussion on the future of violence analysis.
The structure of the paper is as follows: Section 2 gives a very brief introduction to Convolutional Neural Networks. Section 3 focusses on the motivation for violence analysis. Section 4 discusses the latest Convolutional Neural Network (CNN) based violence analysis methods, backed by the popular datasets. Section 5 provides a brief overview of the quantitative and qualitative results of various violence analysis methods. Section 6 discusses future research directions in violence analysis. We conclude in Section 7.
Convolutional Neural Network: A Brief Overview
In this section we discuss ConvNets in enough detail to appreciate their use for vision problems. In the last decade, deep learning methods have made positive inroads into human lives [23–25]. Novel pattern recognition methods that use supervised learning have been reported in the literature [1, 26]. Auto-encoders [28] and Deep Belief Networks (DBNs) [25, 26] are popular unsupervised learning methods. Hubel et al. discussed the use of a layered neural architecture [29]. LeCun et al. proposed a neural architecture called LeNet-5 [11], which recognized handwritten digits and words with 99.2% accuracy on the popular MNIST dataset [30]. The convolution operation is also referred to in the literature as cross-correlation [31]. A series of deeper architectures followed, such as AlexNet [12], ZFNet [32], VGGNet [33], GoogLeNet [34] and ResNet [35]. Several variations of ResNet [35] have been reported in [36–40]. The latest addition to these networks is the Residual Attention Network [41]. Convolutional Neural Network architectures are further supported by other deep learning streams, namely Recurrent Neural Networks [42] and LSTM [43]. A newer addition is the use of attention-based network architectures for analysing and predicting discriminant features. The vital building blocks of a CNN are the convolutional, pooling, and fully-connected layers. Their types and applications are described in Table 1.
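To make the three building blocks concrete, the following is a minimal pure-Python sketch of one convolution, ReLU, pooling and fully-connected pass over a tiny image. The function names, the toy image, the kernel and the weights are illustrative assumptions and are not taken from any of the cited architectures. The convolution is written without flipping the kernel, i.e. as a cross-correlation, which is how CNN layers are commonly implemented.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(k_h) for b in range(k_w))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses and zero out the rest."""
    return [[max(0, value) for value in row] for row in feature_map]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def fully_connected(vector, weights, biases):
    """One output per weight row: dot(weight_row, vector) + bias."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]


# Toy 5x5 "image" and a 2x2 diagonal-difference kernel (both hypothetical).
image = [
    [1, 2, 3, 0, 1],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 0],
    [2, 1, 0, 1, 1],
    [0, 2, 1, 0, 2],
]
kernel = [[1, 0], [0, -1]]

pooled = max_pool2d(relu(conv2d_valid(image, kernel)))  # 5x5 -> 4x4 -> 2x2
flat = [value for row in pooled for value in row]
scores = fully_connected(flat, weights=[[1, 1, 1, 1]], biases=[0])
```

In a trained network the kernel and the fully-connected weights are learnt from data rather than fixed by hand as they are here.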
Attributes of ConvNets [44]
Convolutional neural networks (CNNs) are usually deployed at a uniform cost. Modern hardware now includes GPUs (Graphics Processing Units), which offer great computation power. Some of the popular deep learning software frameworks on which CNNs can be executed are Apache Singa [84], Caffe [85], Deeplearning4j [86], Dlib [87], Keras, Microsoft Cognitive Toolkit [88], MXNet [89], OpenNN [90], TensorFlow [91], Theano [92] and Torch [93].
Violence is described as a “behaviour which is intended to hurt, injure, or kill people” [94]. Violent behaviour can be exhibited by an individual as well as by a crowd. A crowd can be described as a large number of people displaying some sort of coherent behaviour [95]. Crowd behaviour analysis [96] is the basis for deciphering violent activities in crowded scenes. A diagrammatic analysis of crowd attributes is shown in Fig. 2. Most applications in the crowd domain tend to converge on people detection and tracking [97]. Analysing the crowd attributes, we conclude that crowd violence is most closely related to anomaly detection [98] and behavioural pattern recognition [99]. We also note that crowd counting applications [100] can hint at increasing crowd density, which is a potential trigger of violence. People feel secure when they have breathing space in a crowded area. As soon as they realise that their breathing space is being encroached upon, they start taking evasive action. Overall, a proper modelling and inference system [101] should be ready to respond to any hints of triggers of crowd violence. The figure below indicates that the majority of crowd analysis involves overall crowd tracking with its dense trajectories and motion patterns. Another component is the analysis of abnormal crowd behaviour. Crowd behaviour generally exhibits different patterns of crowd movement, and the data extracted from videos is used to understand them. Rabiee et al. offered an elegant explanation by attaching crowd emotions as attributes for crowd behaviour understanding [102]. They explored emotion-based classifiers that can be used for representing crowd motion. Crowd behaviour is also addressed by Sultani et al. [103].

Illustration of Traditional Convolutional Neural Network (Inspired by [11]).

Crowd analysis percentage courtesy [96].

The above discussion makes it clear that violent behaviour that emanates from the crowd falls under the broader category of abnormality detection. Violence detection systems specifically look for this behavioural anomaly in crowds.
Non-violent crowd management is the key to successfully managing any large event. Large crowds gather in malls, clubs, stadiums, auditoriums, political rallies and so on, and all these places require efficient, real-time crowd management. In a developing nation like India, crowds gathering on roads and in malls, railway stations, temples, funerals, stadiums and so on present a huge challenge for emergency response teams trying to prevent violent activities. This survey does not intend to examine the ethical reasons behind crowd violence. Every country may have burning issues based on caste, creed, race, colour, ethnicity, religion, belief system and so on. Each of these can incite a crowd if it harbours a perception of injustice. It has been observed, in general, that panic in crowds can trigger violent activities. Violence does not erupt in isolation. Individual violence involving two people also lies within the ambit of crowd violence. Violence usually starts from a peaceful protest and sloganeering and then escalates into violence in which the masses participate. Violence detection modelling is an absolute necessity to prevent loss of lives and property and to keep the law and order authorities in control of the situation, so that an effective emergency response can be initiated in the affected area. The aim of choosing violence detection as the theme is to minimise the loss of human lives. A few of the reasons which trigger panic reactions in public are as follows:
A violence detection strategy is the need of the hour. Crowd-based violence must be simulated to help keep crowded places peaceful. The images in Fig. 2 show why violence detection in crowds is so important in places where large political rallies, important religious functions, and joyful festival celebrations take place. Automatic detection of violence in such places is critical. To handle crowds smoothly, we need a real-time system that ensures the safety of large crowds.
Detection of violence
Automated detection is one of the most important aspects of violence analysis. Modern researchers are interested in detecting abnormal behaviours [101], especially in crowds. Abnormal behaviour detection is difficult [108], especially in a crowded place. With the increasing use of smartphones and high-tech cameras, video has become one of the most common means of surveillance [109]. Whenever there is a security issue, the normal procedure is to install Closed-Circuit Television (CCTV). The current emphasis is on detecting violence in an area, especially a crowded one. Video surveillance is an important tool for detecting anomalies [110]. Usually, crowd analysis focuses on crowd density, crowd count and crowd motion [101, 111], compared against the historical behaviour of crowds gathering at that particular place.
Previous surveys for violence analysis
Violence detection, fight detection and abnormality detection all fall under crowd behaviour analysis. Table 2 summarizes the surveys on crowd-related analysis. Since 2008, these surveys have covered crowd analysis, crowd counting, crowd density, crowd behaviour and scene analysis. All of these surveys were detailed.
Previous Crowd Based Surveys
As Table 2 shows, computer vision-based approaches using CNNs for violence detection are missing from all the previous surveys. Violence analysis is a fine-grained subset of crowd behaviour analysis that needs to be studied precisely for violence outbreaks. The amount of violence witnessed in the world is huge. Violence harms human lives, destroys public property and imposes a huge economic cost on governments. The table clearly lays the premise that violence-laced crowd behaviour needs to be studied using the deep learning tool of ConvNets. Automatic feature extraction, data augmentation, layered deep architectures and the ability to classify images and videos in real time are the best bets for future researchers. To the best of our knowledge, violence analysis using CNNs has been covered only in small portions of individual papers. There is a pending need to examine violence analysis using CNNs, as they are among the most accurate deep learning methods. This survey therefore focuses in depth on CNN-based violence detection methods.
This section examines whether violence detection is possible in previously presented works. As already stated, this survey builds on the analysis of crowd behaviour as the basis of violence detection. We therefore also consider the possibility of violence analysis in existing papers. Mass behaviour in various situations has been approached in different ways, ranging from physics and biological phenomena to computer vision. Table 3 indicates whether a CNN-based violence detection strategy can be explored in the listed papers, which were inspired by the domains of physics, biology and computer vision.
Violence exploration possibility in crowd dimensions. (P = Physics Domain, NB = Natural Biology Domain, CV = Computer Vision Domain, VA = Violence Analysis)
From the above table it becomes clear that cross-domain research using ConvNets can boost crowd behaviour understanding tasks. Each of these papers highlights physics, biology or computer vision techniques, and violence analysis cannot be left out of these domains. Therefore, tracing violence analysis using the methods and discussion in the above papers can be a new beginning in violence research methodology. The table supports the idea of a cross-domain search strategy for deciphering crowd behaviour. Let us now visit some of the latest papers that present viable solutions for violence detection. In this survey, CNN-based methods are discussed and analysed in the sub-sections below.
Perez et al. [128] proposed a pipeline of ConvNets split into three specific steps. It relies mainly on two-stream architectures, 3D CNN architectures and local interest points to extract discriminative features for classification. The respective streams consist of a 2D CNN and a 3D CNN. Feature extraction for the two-stream solution uses two different models, one on RGB frames and the other on optical flow, and the two streams are aggregated using average pooling. The 3D CNN was pretrained on Sports-1M [129], and the features were extracted from a previous layer. Longer CCTV recordings are analysed to detect actual fights. The paper showed that optical flow is important and has a major impact on performance. The paper used the CCTV-Fights dataset, and the evaluation metrics were mAP and F-measure.
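The average-pooling aggregation mentioned for the two-stream pipeline can be illustrated with a small sketch. It assumes that each stream has already produced one feature vector per frame, that pooling is done across frames, and that the two pooled vectors are concatenated afterwards; these details and all names are assumptions for illustration and are not taken from [128].

```python
def average_pool_over_time(frame_features):
    """Average a list of per-frame feature vectors into one clip-level vector."""
    if not frame_features:
        raise ValueError("need at least one frame feature vector")
    count = len(frame_features)
    return [sum(column) / count for column in zip(*frame_features)]


def two_stream_descriptor(rgb_features, flow_features):
    """Pool each stream over time, then concatenate (concatenation is an assumption)."""
    return average_pool_over_time(rgb_features) + average_pool_over_time(flow_features)


# Hypothetical features: two frames, 2-D RGB features and 1-D optical-flow features.
rgb = [[1.0, 2.0], [3.0, 4.0]]
flow = [[0.0], [1.0]]
descriptor = two_stream_descriptor(rgb, flow)
```

The resulting clip-level descriptor has a fixed length regardless of how many frames the clip contains, which is what makes it usable as classifier input.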

CCTV-Fights dataset [128].

Hockey Fights dataset [140].

YouTube dataset [131].
Ding et al. [130] applied a 3D CNN to violence detection in videos. Action recognition is one of the most researched topics in computer vision, and its subset of fight detection deserves due importance for detecting violence. Because 2D convolution is good at extracting spatial information but loses motion information, the authors switch to a 3D convolutional network that also convolves over the temporal sequence. The paper presents a novel 3D ConvNet model for violence detection in videos. It used the Hockey dataset, and the evaluation metric was accuracy. Below is a sample image from the Hockey dataset. The architecture used in this paper is as follows:

Movies dataset [132].

Violent interaction dataset [135].

Multiplayer Video Dataset [135].
Sumon et al. presented a deep learning-based solution for detecting violent crowd flows [131], explicitly citing crowd conditions in Bangladesh. The authors collected a dataset from YouTube consisting of violent and non-violent crowd flows. Convolutional neural network (CNN) and long short-term memory (LSTM) architectures were applied to this dataset separately and in combination. Transfer learning is used for a practical solution, starting from a model pre-trained on a movie violence dataset. It achieved considerable accuracy on the presented dataset. The dataset is drawn from YouTube within the context of Bangladesh, and the evaluation metric is F-measure.

UCF 101 Video Dataset [136].
Song et al. [132] proposed an approach to violence detection using a modified 3D Convolutional Neural Network. The paper focuses on frame sampling methods, in particular key frame selection. Random sampling is also adopted to produce input frame sequences. Standard violence detection datasets such as Hockey Fight, Movies and Crowd Violence are used. The length of the clips on which the 3D ConvNet is built is varied. The 3D CNN performs both temporal and spatial convolution on the frames. The paper used the Hockey and Movies datasets, and the evaluation metric was accuracy. Below are sample images from the Movies dataset.

BEHAVE Dataset [137].

MediaEval 2015 Dataset [142].

Multi-Task Crowd Dataset [144].

Violent interaction dataset (VID) [144].
Xu et al. [133] presented a P3D-LSTM strategy for deep learning-based violent video classification assisted by spatial-temporal cues. Multi-feature fusion is used for violence detection in videos. Frame differences are computed between consecutive static video frames. Similarly, optical flow features are computed using the frame difference approach, which yields discriminant features. After the discriminant features are computed, late fusion combines the multiple features into decision scores for the classification labels. The experiments are conducted on two standard public databases and a self-built violence database prepared by Xu et al. The evaluation metric is accuracy.
Zhou et al. discussed violent interaction detection in video surveillance for places such as railway stations, prisons or psychiatric centres [134]. The paper presented the FightNet architecture to represent complicated visual violent interactions. An additional modality, image acceleration, is introduced for extracting motion attributes. Multimodal inputs, namely RGB images for the spatial network and optical flow and acceleration images for the temporal networks, are fused to give the desired classification. The dataset used is the violent interaction dataset (VID), and the metric is accuracy. The following figure shows a sample of the dataset.
Li et al. [135] observed that numerous behaviour recognition solutions focus on the UCF101 video dataset, with activities such as sports, cooking and other simple routines that are less useful in real-life surveillance scenarios. They developed a multiplayer violence detection procedure based on a three-dimensional convolutional neural network (3D CNN). With the 3D CNN, spatiotemporal features are readily extracted for violence detection. The dataset used is the Multiplayer Video Dataset, and the evaluation metric was accuracy.
Tran et al. [136] proposed a procedure for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets). This is a seminal paper that established that 3D ConvNets give better classification results than 2D ConvNets. The dataset used was UCF101, and the evaluation metric was accuracy.
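The difference between 2D and 3D convolution discussed in the papers above can be seen in a small sketch. The function below convolves over time as well as space, so a kernel that spans two frames responds to change between frames, which a per-frame 2D kernel cannot see. The function name, the toy clip and the temporal-difference kernel are hypothetical and are not drawn from any cited model; real 3D ConvNets learn their kernels.

```python
def conv3d_valid(clip, kernel):
    """3D cross-correlation over clip[t][y][x] with kernel[dt][dy][dx] (stride 1, no padding)."""
    k_t, k_h, k_w = len(kernel), len(kernel[0]), len(kernel[0][0])
    out_t = len(clip) - k_t + 1
    out_h = len(clip[0]) - k_h + 1
    out_w = len(clip[0][0]) - k_w + 1
    return [
        [
            [
                sum(clip[t + a][y + b][x + c] * kernel[a][b][c]
                    for a in range(k_t) for b in range(k_h) for c in range(k_w))
                for x in range(out_w)
            ]
            for y in range(out_h)
        ]
        for t in range(out_t)
    ]


# Three 2x2 grey-level frames (hypothetical): the scene brightens, then one pixel changes.
clip = [
    [[0, 0], [0, 0]],
    [[1, 1], [1, 1]],
    [[1, 3], [1, 1]],
]
# A 2x1x1 kernel that subtracts the earlier frame from the later one.
temporal_kernel = [[[-1]], [[1]]]
motion_response = conv3d_valid(clip, temporal_kernel)
```

The second output slice is non-zero only at the pixel that changed between the last two frames, which is the kind of motion cue a 3D kernel can capture.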
Baba et al. [137] proposed an efficient procedure for automatically detecting violent behaviour in video sensor networks. The approach uses a Raspberry Pi embedded architecture for a real-time solution. Motion feature vectors are extracted directly from the video and passed into a deep neural network, which is followed by a time-domain classifier. The paper claims low computational resource usage. The metric was AUC, and the datasets were BEHAVE and ARENA.
Ullah et al. [138] proposed a deep learning-based violence detection method for videos from surveillance cameras in smart cities. A process is designed to extract important frames and skip the processing of useless frames. A light-weight convolutional neural network (CNN) model is used in the paper. At the pre-processing stage, sequences of frames (in multiples of 16) are passed into a 3D CNN and fed to a Softmax classifier. The 3D CNN model is optimized with a toolkit developed by Intel, which converts the trained model into an intermediate representation and adjusts it for optimal execution on the end platform for the final prediction of violent activity. An IoT element transmits the detected violent activity to the nearest control centre for preventive action. The metric was accuracy. The standard datasets used were Violent Crowd, Violence in Movies and Hockey Fight.
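The two ends of such a pipeline, grouping frames into 16-frame sequences and turning network outputs into class probabilities with Softmax, can be sketched as follows. The 3D CNN in between is omitted. Dropping an incomplete trailing clip and the two-class logits are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def split_into_clips(frames, clip_len=16):
    """Group frames into consecutive, non-overlapping clips of clip_len frames.

    An incomplete trailing clip is dropped (an assumption of this sketch).
    """
    usable = len(frames) - len(frames) % clip_len
    return [frames[start:start + clip_len] for start in range(0, usable, clip_len)]


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# 40 placeholder frame indices stand in for decoded video frames.
clips = split_into_clips(list(range(40)))
# Hypothetical logits for the classes (non-violent, violent) from a 3D CNN.
probabilities = softmax([0.3, 2.1])
```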
Dai et al. [139] proposed techniques for violent scene detection. The paper trains a Convolutional Neural Network (CNN) model on a subset of ImageNet classes suited to violence detection. A two-stream CNN architecture extracts features from static frames and from optical flow using motion vectors. The architecture is supported by Long Short-Term Memory (LSTM) models, which capture longer-term temporal dynamics. Conventional motion features and audio feature vectors are added to the deep learning framework as a further input, and all the features are then fused. The metrics were mean average precision (mAP) and accuracy. The datasets were Violent Crowd, Violence in Movies and Hockey Fight.
Dong et al. presented a multi-stream deep convolutional neural network framework for person-to-person violence detection in videos [140]. Besides the conventional spatial and temporal streams, an acceleration stream is proposed for capturing violent actions. The multiple streams serve as inputs for information fusion. The results were quite effective on standard violence datasets. The metric was accuracy, and the dataset was Hockey.
Sudhakaran et al. [141] presented a framework for the automatic analysis of surveillance videos to detect violence. A convolutional neural network extracts discriminant features from the video. The features are then aggregated using a variant of the long short-term memory that uses convolutional gates. The CNN together with the LSTM captures localized spatio-temporal features, which help to locate the motion feature vectors in the video. The paper also uses frame differences in pre-processing: the difference between frames is fed into the model so that it focuses on the changes in the video. Standard datasets are used to evaluate the recognition of violent videos. The metric was accuracy. The datasets were the Hockey Fight, Movies, Violent-Flows and Crowd Violence datasets.
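The frame-difference pre-processing step can be sketched in a few lines. The version below subtracts each frame from its successor pixel by pixel, assuming single-channel frames stored as nested lists. Whether the difference is signed or absolute, and how frames are normalised, is not stated above, so the signed difference used here is an assumption.

```python
def frame_differences(frames):
    """Return the pixel-wise signed difference between each pair of consecutive frames."""
    return [
        [
            [cur - prev for cur, prev in zip(cur_row, prev_row)]
            for cur_row, prev_row in zip(cur_frame, prev_frame)
        ]
        for prev_frame, cur_frame in zip(frames, frames[1:])
    ]


# Three hypothetical 2x2 grey-level frames with a small change in each step.
frames = [
    [[10, 10], [10, 10]],
    [[10, 12], [10, 10]],
    [[10, 12], [9, 10]],
]
diffs = frame_differences(frames)
```

Static background cancels to zero, so a model fed with these differences sees only the regions where something moved.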
Mu et al. presented a deep learning framework for violent scene detection (VSD) in videos [142]. Visual cues are an important attribute in detecting violent scenes. The paper investigates the use of CNNs for violent scene detection assisted by the acoustic information embedded in the video. The CNN works as both a feature extractor and a classifier for this acoustic information. Two separate streams, one acoustic and one visual, are fused to improve violence detection. The metric was Average Precision, and the dataset was MediaEval 2015.
Mohammadi et al. [143] proposed improvements to the Social Force Model (SFM). SFM-based methods cannot adequately explain complex crowd behaviours. A new hybrid scheme is proposed for detecting violent events in crowd videos. A set of behavioural heuristics is assembled and transformed into physical equations, which are then used to model and describe behaviours in the video. The heuristic approach paves the way for classifying violent events. The CNN-based approach plays a limited role and is used only for comparison. The metric was accuracy. The datasets were Violence in Crowds (VIC), Violence in Movies (VIM) and BEHAVE.
Marsden et al. presented the ResnetCrowd architecture for violence detection, crowd counting and crowd density level classification [144]. An entirely new dataset of 100 images, known as Multi-Task Crowd, is introduced, fully annotated for crowd counting, violent behaviour detection and density level classification. The framework shows that the multi-task approach boosts individual task performance for all tasks, especially violent behaviour detection, which receives a 9% boost in ROC curve AUC (Area Under the Curve). ResnetCrowd produced promising results in the multi-task domain. The metric was AUC, and the dataset was Multi-Task Crowd.
Fenil et al. [145] proposed a violence detection solution for football matches, working on the streaming data of the matches. Video streams are fed as input to a Spark framework, which extracts features using the HOG (Histogram of Oriented Gradients) function. Three modelling streams are constructed: a violence model stream, a human part model stream and a negative model stream. The video frames are divided among these models and then used to train a Bidirectional Long Short-Term Memory (BDLSTM) network for recognizing violent scenes. As the name suggests, the bidirectional LSTM processes the sequence in both the forward and backward directions, so the output takes into account past as well as future information. The metric was accuracy, and the dataset was the violent interaction dataset (VID).
Serrano et al. [146] proposed a hybrid framework combining shallow features with deep learning features, which provided better classification accuracy. The hybrid method is evaluated on three standard datasets. The metric was accuracy. The datasets were Hockey, Movies and Behave.
Zhou et al. presented a 3D-CNN based violence detection method [147]. The three-dimensional deep neural network operates directly on the input and extracts the spatial and temporal characteristics of the video frames. The paper shows that this scheme identifies violent behaviour better than hand-crafted features. The metric was accuracy. The datasets were Hockey, Movies and Behave.
Mukherjee et al. discussed anomalies in videos such as fights [148]. The paper focuses on finding fight scenes in hockey videos using blur information, the Radon transform and convolutional neural networks (CNNs). The local motion within the video frames is extracted using blur information, after which the Fast Fourier and Radon transforms are applied to the local motion. Transfer learning is applied using the pre-trained VGG-Net model. The metric was accuracy, and the dataset was Hockey.
Nova et al. presented an approach to violence detection based on the Support Vector Machine (SVM) [149]. The inputs to the SVM are generated by a novel pose estimation algorithm. A set of popular hand-crafted features based on angles, velocity and contact detection is also fed into the SVM to detect violent behaviour in a video. The metric was F1-score, and the dataset was the Social Activity Dataset [150].
Mandal et al. presented a fine-tuned deep convolutional residual network for deciphering crowd behaviour [151]. Subclasses of feature maps are constructed, and these include violent behaviour. Subclass modelling introduces discriminative features. A distinct time-warping technique based on the cosine distance measure is used to approximate the similarity between videos. A standard nearest neighbour (NN) classifier is used to classify crowd behaviour attributes. The metric utilized in the paper was AUC. The dataset used was the Social Activity Dataset.
Ammar et al. presented an approach that combines LSTM with CNN features [152]. The convolutional gates in the LSTM are trained to encode temporal changes in local regions. This enables the entire network to encode localized spatiotemporal characteristics. Hierarchical characteristics are extracted from the video frames, and the convolution layers are then trained and aggregated using the LSTM layer. The metric utilized in the paper was accuracy. The datasets used were the Hockey and Violent-Flows datasets and the VSD benchmark.
Meng et al. presented a framework for violent behaviour detection that integrates trajectories with deep convolutional neural networks [153]. A CNN variant known as the ConvNet model is tested on the UCF101 dataset. The ConvNet model is inspired by the VGG-19 net, which consists of 17 convolution-pool-norm layers and two fully connected layers. The metric utilized in the paper was accuracy. The datasets used were the Hockey Fights and Violent-Flows datasets.
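The combination of a cosine-distance time warping and a nearest neighbour classifier, as described for Mandal et al. [151], can be illustrated with a minimal sketch. The exact warping variant of [151] is not specified in this survey, so classic dynamic time warping (DTW) is used here as a stand-in. The function names, the toy feature vectors and the convention for zero vectors are assumptions made for illustration only.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumed convention: a zero vector is maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(seq_a, seq_b):
    """Classic DTW cost between two sequences of per-frame feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i frames of seq_a
    # with the first j frames of seq_b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame of seq_b
                                 cost[i][j - 1],      # skip a frame of seq_a
                                 cost[i - 1][j - 1])  # match both frames
    return cost[n][m]


def nearest_neighbour_label(query, labelled_videos):
    """Return the label of the training video closest to `query` under DTW."""
    best_label, best_dist = None, float("inf")
    for features, label in labelled_videos:
        dist = dtw_distance(query, features)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Toy example: two labelled "videos" of 2-D frame features (hypothetical data).
train = [
    ([[1.0, 0.0], [1.0, 0.1]], "calm"),
    ([[0.0, 1.0], [0.1, 1.0], [0.0, 1.0]], "violent"),
]
query = [[0.0, 1.0], [0.0, 0.9]]
predicted = nearest_neighbour_label(query, train)
```

Because DTW aligns sequences of different lengths, the query and the training videos do not need the same number of frames. In practice the per-frame vectors would be deep features rather than hand-written numbers.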
Zhuang et al. presented a Convolutional DLSTM (ConvDLSTM) based deep architecture for crowd scene understanding [154]. ConvDLSTM combines the GoogleNet Inception V3 convolutional neural network (CNN) with stacked differential long short-term memory (DLSTM) networks. ConvDLSTM optimizes the parameters of both the CNN and the RNN. A raw image, devoid of trajectory information, is taken as input. The semantic information of the CNN and the memory states of the LSTM make the analysis of crowd scene and motion information easier. The metric utilized in the paper was accuracy. The datasets used were the Violent-Flows and CUHK Crowd datasets.
Blunsden et al. presented techniques for detecting aggressive activities and crowd violence [155]. The work uses a transfer learning-based violence detector. An optical flow feature vector is calculated with the Lucas–Kanade method. Thereafter, various 2D templates are derived by overlapping optical flow magnitudes and orientations. These templates are termed template feature vectors (TFV). The TFVs are fed into a pre-trained convolutional neural network, and deep features are extracted from different layers. Classic support vector machine (SVM) and k-nearest neighbour classifiers are then trained on the features obtained from the TFVs to classify the input as violent or non-violent. The metric utilized in the paper was accuracy. The datasets used were the Hockey, Movies and Violent-Flows (ViF) datasets.
Elesawy et al. presented a technique for detecting aggressive activities based on the Bidirectional Convolutional LSTM (BiConvLSTM) architecture [156]. The introduction of a Spatiotemporal Encoder, with added bidirectional temporal encodings and elementwise max-pooling, helps in violence detection. The Spatial Encoder shows considerable performance on the Hockey Fights and Movies datasets. However, on the Violent-Flows dataset, the Spatiotemporal Encoder outperforms the Spatial Encoder. The metric utilized in the paper was accuracy. The datasets used were the Hockey, Movies and Violent-Flows (ViF) datasets.
Table 4 highlights the datasets and the CNN-specific violence analysis methods, along with the respective deep learning platforms, for crowd analysis.
Datasets for Violence Analysis
Many CNN-based approaches to violence detection have been presented on different datasets. The quantitative results of various methods that address violence detection using CNN techniques are discussed here. Various metrics are used in the literature to assess the approaches. Here we delve into some of the popular, widely accepted measures for evaluating violence detection models. The table presents commonly used metrics for comparing the results of various papers. Different methods have used different metrics for comparing results. Table 6 summarizes the quantitative results of various CNN-based violence detection methods with respect to the datasets used in the respective approaches. The majority of the methods use accuracy as the evaluation metric for classification. Accuracy is defined as the proportion of samples that have been correctly classified among all samples. Accuracy is a very intuitive metric and is computed using Equation (1). In the equation below, the following need to be considered:
Annotation Metric [15]
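As a minimal sketch of the metrics reported in the surveyed papers, accuracy and the F1-score can be computed from the four confusion-matrix counts of a binary violent/non-violent classifier. The sketch assumes the standard definitions: accuracy = (TP + TN) / (TP + TN + FP + FN) and F1 as the harmonic mean of precision and recall. The function name and the example counts are hypothetical and are not taken from any surveyed paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts.

    tp: violent clips correctly flagged as violent
    tn: non-violent clips correctly left unflagged
    fp: non-violent clips wrongly flagged as violent
    fn: violent clips that were missed
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for 100 test clips.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

Accuracy can look high on datasets where violent clips are rare, which is one reason some papers report the F1-score or AUC instead.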
Results comparison tables for CNN based approaches for Violence Detection

Modelling proposed for Deployment of Violence Detection in Real Time Systems.
Various approaches to violence detection that use CNN as the primary component have been discussed. These include two-stream networks, 2D CNN, 3D CNN, spatiotemporal features, combinations of handcrafted and deep features, classical machine learning approaches, binary classification, multi-class action recognition and so on. We are confident that current research on CNN has prompted the research community to look seriously into CNN-based solutions for real-time violence detection. Such solutions can help society in general and act as a safety valve for public security. We observe that the CNN-based methods surveyed here for violence detection fall into one of the following categories:
Varying the depth of the ConvNet and the pre-processed input fed into the deep architecture, which can change the output.
Combining different CNN variants and architectural variations and fusing their classification decisions, which can yield a better classification result.
Extracting features using CNN and novel pre-processing methods, and then applying different classifiers to classify the result.
Assisting new deep architectures with LSTM, BDLSTM, DNN, autoencoders, etc., which can help to enhance performance.
Using transfer learning in violence detection.
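Fusing the classification decisions of several models, one of the categories listed above, can be sketched as a weighted average of per-model violence probabilities. This is only one simple late-fusion rule, chosen here for illustration. The function name, the weights and the 0.5 threshold are assumptions, and the surveyed papers may fuse their streams differently.

```python
def fuse_decisions(stream_scores, weights=None, threshold=0.5):
    """Late fusion: weighted mean of per-model violence probabilities.

    stream_scores: one probability in [0, 1] per model or stream
    weights: optional non-negative weight per model (defaults to equal)
    Returns (fused_score, is_violent).
    """
    if not stream_scores:
        raise ValueError("at least one score is required")
    if weights is None:
        weights = [1.0] * len(stream_scores)
    if len(weights) != len(stream_scores):
        raise ValueError("weights and scores must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    fused = sum(w * s for w, s in zip(weights, stream_scores)) / total_weight
    return fused, fused >= threshold


# Hypothetical scores from a spatial stream and a temporal stream.
fused_score, is_violent = fuse_decisions([0.9, 0.4])
```

Unequal weights allow a more reliable stream, for example one validated on the target dataset, to count for more in the final decision.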
One of the major challenges for violence detection is the lack of real-time datasets. Most violent activities fall under the domain of abnormality, and abnormal situations form only a small part of a long real-time video. A large pool of multimodal activities needs to be annotated so that violent activities can be detected together with their scene context. Transfer learning is a paradigm shift on which many new classification cases are being built. Applying transfer learning to various datasets can help to increase the performance of CNN-based models. Subcategories of violence are a natural subject for further research. Also, if a method could be devised in which a few training samples carrying a large amount of labelled information suffice to train the system, the need to build huge training datasets could be done away with. The main aim of the survey is to provide insight into the usefulness of CNN-based violence detection methods, which can generate triggers and alarms for community and policing services for better handling of violence-related tasks. The idea of surveying the violence theme in the crowd domain is to encourage the deployment of deep learning frameworks in actual scenarios. Most of these results remain confined to the research domain. Our aim is to make a working model for the web as well as mobile platforms. Below is a framework of the system that needs to be deployed in order to leverage violence detection. The input can be any image or video, and the models need to be deployed in a standard deep learning framework of the kind discussed above in the survey. Keras, TensorFlow and PyTorch are popular for saving the model file, retraining pretrained models and deploying them in an actual violence detection module.
The above model can be deployed on a cloud server if the data is large and arrives in real time. Suitable hardware with GPUs needs to be deployed to detect violence. After violence is detected, the system can help security agencies to identify its cause, gather strong evidence to identify the culprits, and build a database and dataset for further learning of the system. The model should be retrained, with its hyperparameters fine-tuned, as new data arrives. With a deployed violence detection framework, emergency response services can be rushed to the affected area. We must remember that the whole point of deployment is to stop violence from spreading. If the violence does spread, the system should help to contain the affected area through automatic classification results, alerts and alarms. A classic surveillance system for monitoring the affected areas will always be helpful in deciding the future course of action. One more reason to research violence detection using ConvNets is that visual detection of human actions has been shown to play an important role in affecting public sentiment [166] [167]. A deployed violence detection model supports the point that visual media, visual altercations and violence have a deep impact on the public and on the social layers of human society.
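The alerting step of such a deployment can be sketched as a sliding-window check over per-frame violence scores. The sketch assumes that a trained classifier already produces one score in [0, 1] per frame; a plain list of numbers stands in for that model here. The function name, window length and threshold are hypothetical values chosen for illustration, not settings from any surveyed system.

```python
from collections import deque


def monitor_stream(frame_scores, window=5, threshold=0.7):
    """Return (frame_index, mean_score) for every frame at which the mean
    of the last `window` scores reaches `threshold`.

    Averaging over a window suppresses isolated high scores, so a single
    misclassified frame does not trigger an alarm.
    """
    recent = deque(maxlen=window)  # keeps only the latest `window` scores
    alerts = []
    for index, score in enumerate(frame_scores):
        recent.append(score)
        if len(recent) == window:
            mean_score = sum(recent) / window
            if mean_score >= threshold:
                alerts.append((index, mean_score))
    return alerts


# Hypothetical per-frame scores: a burst of high scores in the middle.
scores = [0.1, 0.2, 0.9, 0.9, 0.9, 0.9, 0.1]
alerts = monitor_stream(scores, window=3, threshold=0.8)
alert_frames = [index for index, _ in alerts]
```

In a real deployment each alert would be forwarded to the responsible emergency service, and the flagged clips could be stored for later retraining of the model.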
Conclusion
This survey presents papers that use CNN for violence detection. One fact that emerges is that violence detection becomes more meaningful when the scene context in which it is evaluated is taken into account. A brief discussion of handcrafted features is also included. We also conclude that combining handcrafted and deep features can sometimes increase the classification accuracy. Multiple-feature fusion methods also improve the accuracy of CNN-based methods. We have touched upon the need to analyse violence, its motivations and the automatic classification of images. We have tried our best to discuss violence, methods and architectures, and to close with a discussion with respect to the crowd domain. Isolated violence between a few people and mob violence are entirely different perspectives on how violent situations build up. We have extensively reviewed the CNN-based methods for violence detection and compiled a performance comparison. We have also explored the most widely used datasets for violence detection. We have identified the challenges of the datasets and of the CNN approaches, and have highlighted the importance of pre-processing the data fed into the CNN. Major research is needed on how crowds are defined in different countries and at different locations. This will help us to understand CNN-based violence detection even better. Complex scene analysis combined with violence detection algorithms can be a boost for security forces. We hope this survey brings appreciation of the increased usage of deep learning processes and the efficient use of Convolutional Neural Networks in detecting violent activities. It is important for researchers to understand the inception and processes of violent activities in different parts of the world.
