Violence recognition using convolutional neural network: A survey

Abstract

Violence detection is a challenging task in the computer vision domain. Violence detection frameworks depend upon detecting changes in crowd behaviour. Violence erupts from disagreement over an idea, injustice or severe dispute. Every country aims to maintain law, order and peace, so violence detection is an important task for the authorities. Traditional violence detection methods depend heavily on hand-crafted features. The world is now transitioning to techniques based on Artificial Intelligence, and automatic feature extraction and classification from images and videos is the new norm in the surveillance domain. Deep learning provides a platform on which non-linear features can be extracted, self-learnt and classified with an appropriate tool. One such tool is the Convolutional Neural Network, also known as ConvNet, which can automatically extract features and classify them into their respective classes. To date, there is no survey of techniques for deciphering violent behaviour using ConvNets. We hope that this survey becomes a baseline for future violence detection and analysis in the deep learning domain.

Keywords

Violence detection, crowd behaviour, ConvNets, convolutional neural networks, deep learning, survey

1 Introduction

Human behaviour is complex. A group of humans united by a common cause can behave in different ways to make its agenda visible to the world. One such behaviour is violence, which has become more common in the world for various reasons. The places where violence most readily creates panic are public places: roads, malls, railway stations, busy city intersections and many more become targets of violent crowd behaviour. Violence in public places is unacceptable and is a threat to the lives of all the people present. Public property is lost to arson, riots and so on. Computer vision provides a platform on which we can decipher crowd behaviour and take cues from developing situations in order to maintain law and order. Real-time detection of violent activities can immensely benefit city administrations. Two approaches have become the norm for deciphering these violent activities. One is to use shallow, handcrafted features for the detection of violent behaviour. The other is to use deep learning methods and tools. With handcrafted features, discriminant features have to be devised manually, whereas deep learning systems pass the data through multiple layers of processing and learn such features by themselves. Thus, the field has evolved from the shallow approach to automatic classification using deep learning methods and tools. Deep learning is all about finding and exploiting a discriminant feature space, which needs to be handled in a structured way to reach an optimal classification solution. These discriminant features undergo multiple layered operations and classification algorithms [1] [2] for challenging computer vision problems.
Deep learning concepts [3–7], supported by machine learning classifiers, have provided tremendous impetus for solving challenging problems. Machine learning provides ways to traverse the feature space and perform classification in the respective domains. Hybrid architectures [8] are a new addition to deep learning-based solutions.

One deep learning tool that has become immensely popular is the Convolutional Neural Network (CNN). Analysis techniques such as principal component analysis, clustering of image patches, dictionary-based approaches and sparse representations have been used together with CNN methods for classification [2]. A preliminary introduction to CNN applications is given in [9]. A medical application was reported as early as 1995 [10]. LeNet [11], a popular application to handwritten digit recognition, was a path-breaker in the field of CNNs.

An excellent contribution to computer vision was made by [12] in the ImageNet challenge, where AlexNet was proposed for the first time. This paved the way for wider use of Convolutional Neural Networks in deep learning approaches. Thereafter, deeper architectures were proposed [13] and deployed in various application areas. A range of practical and research-oriented applications of deep learning techniques has been reported in the literature [14–21], covering pattern recognition, mathematical model development, data mining, artificial intelligence and so on. Having established this premise, we use Convolutional Neural Networks (CNNs) as the tool through which we conduct our survey on violence analysis.

Violence is defined as the intentional use of physical force or power, threatened or actual, against oneself, another person, or against a group or community, that either results in or has a high likelihood of resulting in injury, death, psychological harm, maldevelopment or deprivation [22]. This survey paper focuses on CNN-based techniques for violence analysis. The main objectives of the work are as follows:

A comprehensive overview of ConvNets from different angles, organised around violence analysis. We try to build a taxonomy that can motivate researchers to consider ConvNets a viable option for their research.

Since our focus is entirely on ConvNet-based methods for violence analysis, we discuss their performance and the various challenges that motivate researchers to design more effective algorithms.

A brief review of the popular datasets used for violence analysis.

An open discussion of the future directions of violence analysis.

The structure of the paper is as follows: Section 2 gives a very brief introduction to Convolutional Neural Networks. Section 3 focuses on the motivation for violence analysis. Section 4 discusses recent CNN-based violence analysis methods and the popular datasets that back them. Section 5 provides a brief overview of the quantitative and qualitative results of various violence analysis methods. Section 6 discusses future research directions in violence analysis. Section 7 concludes the paper.

2 Convolutional Neural Network: A Brief Overview

In this section we discuss ConvNets in enough detail to appreciate their use for vision problems. In the last decade, deep learning methods have made a positive impact on human lives [23–25]. Novel pattern recognition tools that utilize supervised learning methods have been reported in the literature [1, 26]. Auto-encoders [28] and Deep Belief Networks (DBNs) [25, 26] are popular unsupervised learning methods. Hubel et al. discussed the use of a layered neural architecture [29]. LeCun et al. proposed a neural architecture called LeNet-5 [11], which was used for the recognition of handwritten digits and words with 99.2% accuracy on the popular MNIST dataset [30]. Convolution has also been referred to in the literature as cross-correlation [31]. A series of deeper architectures was then proposed, such as AlexNet [12], ZFNet [32], VGGNet [33], GoogLeNet [34] and ResNet [35]. Several variations of ResNet [35] have been reported in [36–40]. A more recent addition to these networks is the Residual Attention Network [41]. Convolutional Neural Network architectures have been further supported by other deep learning streams, namely Recurrent Neural Networks [42] and LSTM [43]. A newer addition to the above is the use of attention-based network architectures for analysing and predicting discriminant features. The vital attributes of a CNN are its convolutional, pooling and fully-connected layers. Their types and applications are described in Table 1.

Table 1
Attributes of ConvNets [44]

Key attributes (convolution, pooling, activation functions and optimization)	Applications
Tiled Convolution [45]	[46]
Dilated Convolution [47]	[47–50]
Network in Network Convolution [51]	–
Inception Convolution [34]	[35, 52]
Transposed Convolution	[32, 53–55]
L_p Pooling [56, 57]	[58, 59]
Mixed Pooling [60, 61]	[47–50]
Stochastic Pooling [62]	–
Spectral Pooling [63]	[35, 52]
Multi-scale Orderless Pooling [64]	[65]
AlphaMEX Pooling [66]	–
Rectified Linear Unit (ReLU) Activation Functions [67]	–
Leaky ReLU Activation Functions [68]	–
Maxout Activation Functions [69]	–
Probout Activation Functions [70]	–
Parametric ReLU Activation Functions [71]	–
Randomized ReLU Activation Functions [72]	–
Exponential Linear Unit (ELU) Activation Functions [73]	–
Multiactivation Pooling (MAP) Activation Functions [74]	–
SGD Optimization [75, 76]	–
Adagrad Optimization [77]	–
Parallelized Stochastic Gradient Descent Optimization [78–80]	–
Downpour Stochastic Gradient Descent Optimization [81]
Asynchronously Stochastic Gradient Descent Optimization [82]
Quickprop Optimization [83]
Nesterov Accelerated Momentum Optimization [83]
Conjugate Gradient method Optimization [83]
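
To make the attributes in Table 1 concrete, the following is a minimal sketch of a single convolution, activation and pooling stage in plain Python. It is an illustration only and is not taken from any of the cited works: the function names, the toy 5 x 5 image and the 2 x 2 kernel are assumptions, and the convolution is written without kernel flipping, as is usual in deep learning frameworks.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(x: Matrix) -> Matrix:
    """Rectified Linear Unit: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in x]


def leaky_relu(x: Matrix, slope: float = 0.01) -> Matrix:
    """Leaky ReLU: negative responses are scaled by a small slope."""
    return [[v if v > 0 else slope * v for v in row] for row in x]


def max_pool2d(x: Matrix, size: int = 2) -> Matrix:
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    out_h, out_w = len(x) // size, len(x[0]) // size
    return [
        [
            max(x[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy input (assumed values) and a hand-written diagonal-difference kernel.
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [0, 2, 1, 0, 1],
]
kernel = [[1, 0], [0, -1]]

response = conv2d_valid(image, kernel)     # 4 x 4 feature map
features = max_pool2d(relu(response))      # 2 x 2 pooled feature map
```

In a trained network the kernel weights are learnt rather than written by hand, and many such stages are stacked before the fully-connected layers.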

Convolutional neural networks (CNNs) are usually deployed at a uniform cost. Modern hardware now includes graphics processing units (GPUs), which offer great computational power. Popular software frameworks in the deep learning domain on which CNNs can be executed include Apache Singa [84], Caffe [85], Deeplearning4j [86], Dlib [87], Keras, Microsoft Cognitive Toolkit [88], MXNet [89], OpenNN [90], TensorFlow [91], Theano [92] and Torch [93].

3 The motivation for violence analysis

Violence is described as a “behaviour which is intended to hurt, injure, or kill people” [94]. Violent behaviour can be exhibited by individuals as well as by crowds. A crowd can be described as a large number of people displaying some sort of coherent behaviour [95]. Crowd behaviour analysis [96] is the basis for deciphering violent activities in crowded settings. A breakdown of crowd analysis attributes is shown in Fig. 2. Most applications in the crowded domain tend to converge on people detection and tracking [97]. From the crowd attributes, we conclude that crowd violence is most relevant to anomaly detection [98] and behavioural pattern recognition [99]. Crowd counting applications [100] can also hint at increasing crowd density, which is a potential trigger of violence. People feel secure when they have breathing space in a crowded area. As soon as they realise that this space is being encroached upon, they start taking evasive action. Overall, a proper modelling and inference system [101] should be alert to any hint of a trigger of crowd violence. Fig. 2 indicates that the majority of crowd analysis involves overall crowd tracking, with its dense trajectories and motion patterns. Another component is the analysis of abnormal crowd behaviour. Crowd behaviour generally exhibits different patterns of crowd movement, and the data extracted from videos is used to understand it. Rabiee et al. proposed attaching crowd emotions as attributes for crowd behaviour understanding [102]. They explored emotion-based classifiers that can be used to represent crowd motion. Crowd behaviour is also addressed by Sultani et al. [103].

Fig. 1

Illustration of Traditional Convolutional Neural Network (Inspired by [11]).

Fig. 2

Crowd analysis percentages (courtesy of [96]).

Fig. 3

Violence reported: (a) [104], (b) [105], (c) [106], (d) [107].

The above discussion makes it clear that violent behaviour that emanates from a crowd falls under the broader category of abnormality detection. Violence detection systems specifically look for this behavioural anomaly in crowds.

3.1 Violence analysis

Non-violent crowd management is the key to the success of any large event. Large crowds gather in malls, clubs, stadiums, auditoriums, political rallies and so on, and all of these places require efficient, real-time crowd management. In a developing nation like India, crowds gathering on roads and in malls, railway stations, temples, funerals, stadiums and so on present a huge challenge for emergency response teams trying to prevent violent activities. This survey does not intend to examine the ethical reasons behind crowd violence. Every country may have burning issues based on caste, creed, race, colour, ethnicity, religion, belief system and so on. Each of these can incite a crowd if an injustice is perceived. It has been observed, in general, that panic in crowds can trigger violent activities. Violence does not erupt in isolation. Individual violence involving two people also lies within the ambit of crowd violence. Usually violence starts from a peaceful protest or sloganeering and then escalates into violence in which the mass participates. Violence detection modelling is an absolute necessity in order to prevent loss of lives and property and to keep the law and order authorities in control of the situation, so that an effective emergency response can be initiated in the affected area. The aim of choosing violence detection as a theme is to minimise the loss of human lives. A few of the reasons that trigger panic reactions in public are as follows:

Human Conflicts: The human mind is a complex architecture of many emotions, ranging from love and hate to anxiety, anger and so on. Human-to-human conflicts can turn into fights, which can result in violent conflict. This violent conflict can trigger panic reactions among the other people who form the crowd.

Fire Trigger: The display of firearms and the flaring of fires can trigger crowd panic. If the exits for the crowd are not properly defined, violent reactions may follow.

Genuine Violence Trigger: Sometimes an injustice done to people can trigger massive waves of rage or anger that engulf the crowd as a whole. The crowd then starts behaving erratically and can cause massive violence.

Fear Trigger: If there is a rumour of a bomb or volatile substance within part of a crowd, fear may grip the crowd. This fear can trigger emotions that turn violent and can at times cause death.

Encroachment of human breathing space: Humans believe that there should be enough free space around them within which they feel safe. If this imaginary personal boundary shrinks, they revert to panic mode, which can cause trampling and in some cases death.

A violence detection strategy is the need of the hour. Crowd-based violence must be simulated in order to achieve peace in crowded places. The images in Fig. 3 show why violence detection in crowds is so important in places where large political rallies, important religious functions and joyful festival celebrations take place. Automatic detection of violence in such places is extremely critical. To handle crowds smoothly, we need a real-time system that ensures the safety of large crowds.

3.2 Detection of violence

Automated detection is one of the most important aspects of violence analysis. Researchers are interested in detecting abnormal behaviours [101], especially in crowds. Abnormal behaviour detection is difficult [108], especially in a crowded place. With the increase in the use of smartphones and high-tech cameras, video has become one of the most common means of surveillance [109]. Whenever there is a security issue, the normal procedure is to install Closed-Circuit Television (CCTV). The current emphasis is on detecting violence in an area, especially a crowded one. Video surveillance is an important tool for detecting anomalies [110]. Usually, crowd analysis is done by focusing on crowd density, keeping track of the crowd count and detecting crowd motion [101, 111], relative to the historical behaviour of crowds gathering at that particular place.

3.3 Previous surveys for violence analysis

Violence detection, fight detection and abnormality detection all fall within the scope of crowd behaviour analysis. Table 2 presents a summary of surveys on crowd-related analysis published since 2008. These surveys cover crowd analysis, crowd counting, crowd density, crowd behaviour and scene analysis, and all of them are detailed.

Table 2
Previous Crowd Based Surveys

Year	Title	Brief description
2008	Crowd analysis: a survey [101]	This survey discusses crowd analysis approaches in the computer vision field and compares them with approaches from other domains.
2010	Crowd analysis using computer vision techniques [112]	This survey discusses various aspects of computer vision such as crowd tracking, people counting and their validation approaches.
2010	A Survey of Human-Sensing: Methods for Detecting Presence, Count, Location, Track, and Identity [113]	This survey dives into multi-disciplinary approaches to human sensing and relies heavily on the spatio-temporal features of human presence, count, location, track and identity.
2013	Crowd counting and profiling: Methodology and evaluation [114]	This survey compares video-based crowd counting approaches and evaluates them using a single protocol.
2014	Performance evaluation of crowd image analysis using the PETS2009 dataset [115]	This survey presented a novel crowd-based dataset, PETS2009. It discusses approaches to the detection and tracking of crowds.
2015	Crowded scene analysis: A survey [116]	The paper discusses scene analysis and the challenges of visual occlusion and ambiguity in crowded scenes. Different aspects of crowds are covered, including the learning of crowd motion patterns, behaviour and anomaly detection.
2015	An evaluation of crowd counting methods, features and regression models [117]	This survey analyses histogram-based approaches on crowd images, compares various image features and reports results using various regression models.
2015	Recent survey on crowd density estimation and counting for visual surveillance [118]	This survey is based on crowd density estimation and counting methods and discusses such approaches for visual cues.
2016	Advances and trends in visual crowd analysis: A systematic survey and evaluation of crowd modelling techniques [96]	This survey discusses visual crowd analysis and focuses specifically on the generic aspects of the techniques used for it.
2017	Crowd scene understanding from video: a survey [119]	This survey deciphers crowd analysis on the basis of crowd statistics and crowd behaviour understanding.
2018	A survey of recent advances in CNN-based single image crowd counting and density estimation [120]	This is a review of crowd counting and density estimation methods based on ConvNets.
2019	Convolutional neural networks for crowd behaviour analysis: a survey [44]	This recent paper covers an exhaustive range of applications of convolutional neural network (CNN)-based methods for crowd behaviour analysis. It discusses in detail the types and qualities of CNNs for deciphering crowd behaviour. A very small portion is dedicated to the violence patterns exhibited by crowds.
2019	Intelligent video surveillance: a review through deep learning techniques for crowd analysis	The paper presents a survey covering object recognition, action recognition, crowd analysis and finally violence detection in a crowded environment.

As Table 2 shows, computer vision-based approaches using CNNs for violence detection are missing from all the previous surveys. Violence analysis is a fine-grained subset of crowd behaviour analysis that needs to be studied precisely for violent outbreaks. The amount of violence witnessed in the world is huge. Violence harms human lives, destroys public property and imposes a huge economic cost on governments. The table clearly establishes the premise that violence-laced crowd behaviour needs to be studied using the deep learning tool of ConvNets. Automatic feature extraction, data augmentation, layered deep architectures and the ability to classify images and videos in real time are the best bets for future researchers. To the best of our knowledge, violence analysis using CNNs has been covered only in small portions of individual papers. There is a pending need to examine violence analysis using CNNs, as they are among the most accurate methods of deep learning. This survey focuses in depth on violence detection methods that use Convolutional Neural Networks.

4 Convolutional neural network based violence analysis

This section examines the possibility of violence detection in previously presented works. As already stated, we build this survey on the analysis of crowd behaviour as the basis of violence detection, and we therefore also consider whether violence analysis is possible with the approaches of existing papers. Mass behaviour in various situations has been approached in various ways, and researchers have taken different routes to behaviour analysis, ranging from physical and biological phenomena to the computer vision field. Table 3 indicates whether a CNN-based violence detection strategy can be explored in the listed papers, which were inspired by the domains of physics, biology and computer vision.

Table 3
Violence exploration possibility in crowd dimensions. (P = Physics Domain, NB = Natural Biology Domain, CV = Computer Vision Domain, VA = Violence Analysis)

Crowd analysis paper	Paper name	Domain	Positive cross-domain analysis possibility
[4]	Collective motion	P, NB, CV	√
[121]	The flow of human crowds	P	√
[122]	Real-time crowd simulation: A review	P, NB, CV	√
[101]	Crowd analysis: a survey, Machine Vision and Applications	P, NB, CV	√
[123]	The perfect swarm: The science of complexity in everyday life	P, NB, CV	√
[112]	Crowd analysis using computer vision techniques	–	√
[124]	Visual crowd surveillance through a hydrodynamics lens	P	√
[125]	Detection of abnormal behaviours in crowd scene: a review	–	√
[126]	A literature review on video analytics of crowded scenes	–	√
[127]	A review of physics-based methods for group and crowd analysis in computer vision	P	√
[114]	Crowd counting and profiling: Methodology and evaluation	–	√
[116]	Crowded scene analysis: A survey	P	√

From the above table it is clear that cross-domain research using ConvNets can boost crowd behaviour understanding tasks. Each of these papers has highlighted physics or biology alongside computer vision techniques, and violence analysis cannot be left out of these domains. Tracing violence analysis with the methods and discussions of the above papers could therefore be a new beginning for violence research methodology. The table thus supports a cross-domain strategy for deciphering crowd behaviour.

Let us now visit some recent papers that have presented viable solutions for the detection of violence. CNN-based methods are discussed and analysed below. Perez et al. [128] proposed a pipeline of ConvNets split into three specific steps. It mainly relied on two-stream architectures, 3D CNN architectures and local interest points to extract discriminative features for classification. The respective streams consist of a 2D CNN and a 3D CNN. Feature extraction for the two-stream solution is performed by two different models, one operating on RGB frames and the other on optical flow, and the two streams are aggregated using average pooling. The 3D CNN used a model pretrained on Sports-1M [129], with features extracted from a previous layer. Longer CCTV recordings are analysed to detect actual fights. The paper showed the importance of optical flow, which had a major impact on performance. The paper used the CCTV-Fights dataset, and the evaluation metrics were mAP and F-measure.
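
The average-pooling aggregation of a two-stream model can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [128]: per-frame feature vectors are assumed to be already available from an RGB model and an optical-flow model, the helper names are hypothetical, and concatenating the two pooled vectors is one plausible way to combine the streams.

```python
from typing import List, Sequence


def average_pool(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame feature vectors over time into one clip-level vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


def two_stream_descriptor(
    rgb_features: Sequence[Sequence[float]],
    flow_features: Sequence[Sequence[float]],
) -> List[float]:
    """Pool each stream over time, then concatenate (concatenation is an assumption)."""
    return average_pool(rgb_features) + average_pool(flow_features)


# Toy per-frame features (assumed values): 2 RGB frames, 3 optical-flow fields.
rgb = [[1.0, 3.0], [3.0, 5.0]]
flow = [[0.0, 2.0], [2.0, 0.0], [4.0, 1.0]]

descriptor = two_stream_descriptor(rgb, flow)  # clip-level input to a classifier
```

The pooled descriptor has a fixed length regardless of how many frames the clip contains, which is what allows recordings of different durations to be fed to the same classifier.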

Fig. 4

CCTV-Fights dataset [128].

Fig. 5

Hockey Fights dataset [140].

Fig. 6

YouTube dataset [131].

Ding et al. [130] applied 3D CNNs to violence detection in videos. Action recognition is one of the most researched topics in the computer vision field, and its subset of fight detection deserves due importance for detecting violence. A 2D convolution is good at extracting spatial information but loses information once motion comes into play, so the authors switch to a 3D convolutional network that also applies convolution across the temporal sequence. The paper thus presents a novel 3D ConvNet model for violence detection in videos. It used the Hockey Fights dataset, with accuracy as the evaluation metric. Fig. 5 shows sample images from the Hockey Fights dataset. The architecture used in this paper is as follows:

Fig. 7

Movies dataset [132].

Fig. 8

Violent interaction dataset [135].

Fig. 9

Multiplayer Video Dataset [135].

Sumon et al. presented a deep learning-based solution for detecting violent crowd flow [131], with explicit reference to crowd conditions in Bangladesh. The authors collected a dataset of violent and non-violent crowd flows from YouTube, within the context of Bangladesh. Convolutional neural network (CNN) and long short-term memory (LSTM) architectures were applied to this dataset both separately and in combination. Transfer learning was used for a practical solution, starting from a model pre-trained on a movie violence dataset. The approach achieved considerable accuracy on the presented dataset. The metric used for evaluation is F-measure.
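
Accuracy and F-measure are the metrics most often quoted by the methods surveyed in this section. The sketch below shows how both can be computed for a binary violent / non-violent classifier. The labels are toy values chosen for illustration and do not come from any of the cited papers, and the helper names are hypothetical.

```python
from typing import Sequence, Tuple


def confusion_counts(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating `positive` as the violent class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of clips whose predicted label matches the ground truth."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f_measure(
    y_true: Sequence[int], y_pred: Sequence[int], beta: float = 1.0
) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy labels: 1 = violent clip, 0 = non-violent clip (assumed values).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

acc = accuracy(y_true, y_pred)
f1 = f_measure(y_true, y_pred)
```

F-measure is generally the more informative of the two when violent clips are rare, because a classifier that always predicts "non-violent" can still reach high accuracy.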

Fig. 10

UCF 101 Video Dataset [136].

Song et al. [132] proposed a modified 3D convolutional neural network for violence detection. The paper focuses on frame sampling, in particular key frame selection; random sampling is also adopted to produce input frame sequences. Standard violence detection datasets such as Hockey Fight, Movies and Crowd Violence are used. The length of the clips on which the 3D ConvNet is built is varied, and the 3D CNN performs both temporal and spatial convolution. The paper reported results on the Hockey and Movies datasets, and the evaluation metric was accuracy. Fig. 7 shows sample images from the Movies dataset.

Fig. 11

BEHAVE Dataset [137].

Fig. 12

MediaEval 2015 Dataset [142].

Fig. 13

Multi-Task Crowd Dataset [144].

Fig. 14

Violent interaction dataset (VID) [144].

Xu et al. [133] presented a P3D-LSTM strategy for deep learning-based violent video classification assisted by spatial-temporal cues. Multi-feature fusion is used for violence detection in videos. Frame differences are calculated between consecutive static video frames, and optical flow features are likewise calculated using a frame difference approach, yielding discriminant features. Late fusion is then applied to fuse the multiple features and obtain decision scores for the classification labels. The experiments were conducted on two standard public databases and a self-built violence database prepared by Xu et al. The evaluation metric was accuracy.

Zhou et al. [134] discussed violent interaction detection in video surveillance for places such as railway stations, prisons and psychiatric centres. The paper presented the FightNet architecture to represent complicated visual violent interactions. An additional modality, image acceleration, is introduced to extract motion attributes. Multimodal inputs, namely RGB images for the spatial network and optical flow and acceleration images for the temporal networks, are fused to produce the classification. The dataset used is the violent interaction dataset (VID), and the evaluation metric was accuracy. Fig. 14 shows a sample of the VID dataset.

Li et al. [135] observed that many behaviour recognition solutions focus on the UCF101 video dataset, whose classes such as sports, cooking and other simple routines are less useful in real-life surveillance scenarios. They developed a multiplayer violence detection procedure based on a three-dimensional convolutional neural network (3D CNN), which extracts spatiotemporal features for violence detection. The dataset used is the Multiplayer Video Dataset, and the evaluation metric was accuracy.

Tran et al. [136] proposed spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets). This is a seminal paper establishing that 3D ConvNets give better classification results than 2D ConvNets. The dataset used was UCF101, and the evaluation metric was accuracy.
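To make the difference between 2D and 3D convolution concrete, the following is a minimal pure-Python sketch of a 3D convolution over a clip stored as a T x H x W nested list. It is implemented as cross-correlation without padding and with stride 1, which is an assumption made for brevity; the function name `conv3d_valid` and the toy clips are illustrative and are not taken from any of the surveyed papers.

```python
def conv3d_valid(volume, kernel):
    """Cross-correlate a T x H x W volume with a kt x kh x kw kernel.

    No padding, stride 1, so each output dimension shrinks by (k - 1).
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The extra loop over dt is what distinguishes 3D from 2D convolution.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# A 2 x 1 x 1 kernel that subtracts the previous frame from the current one.
temporal_kernel = [[[-1.0]], [[1.0]]]

static_clip = [[[1, 1], [1, 1]] for _ in range(3)]  # nothing changes over time
changing_clip = [[[0, 0], [0, 0]], [[1, 1], [1, 1]], [[3, 3], [3, 3]]]

print(conv3d_valid(static_clip, temporal_kernel))
print(conv3d_valid(changing_clip, temporal_kernel))
```

The static clip produces an all-zero response, while the changing clip produces a non-zero response along the time axis. A kernel that spans a single frame cannot make this distinction, which is the intuition behind preferring 3D ConvNets for video.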

Baba et al. [137] proposed an efficient procedure for automatically detecting violent behaviour in video sensor networks. The approach uses a Raspberry Pi embedded architecture for a real-time solution. Motion feature vectors are extracted directly from the video and passed to a deep neural network followed by a time-domain classifier. The paper claims low computational cost. The evaluation metric was AUC, and the datasets used were BEHAVE and ARENA.

Ullah et al. [138] proposed a deep learning-based violence detection method for videos from surveillance cameras in smart cities. Important frames are extracted and useless frames are discarded before further processing. A lightweight convolutional neural network (CNN) model is used. At the pre-processing stage, sequences of frames (in multiples of 16) are passed into a 3D CNN whose output is fed to a Softmax classifier. The 3D CNN model is optimized using a toolkit developed by Intel, which converts the trained model into an intermediate representation and adjusts it for optimal execution on the end platform for the final prediction of violent activity. An IoT element transmits detected violent activity to a nearby control centre for preventive action. The evaluation metric was accuracy, and the datasets used were Violent Crowd, Violence in Movies and Hockey Fight.
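The two generic steps mentioned above, grouping frames into 16-frame sequences and turning network outputs into class probabilities with Softmax, can be sketched as follows. This is not the implementation of Ullah et al. [138]: the helper names are hypothetical, the clips are assumed to be non-overlapping, and dropping an incomplete final clip is an assumption made here for simplicity.

```python
import math


def split_into_clips(frames, clip_len=16):
    """Group frames into consecutive, non-overlapping clips of clip_len frames.

    Trailing frames that do not fill a whole clip are dropped (an assumption).
    """
    n_full = len(frames) // clip_len
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(n_full)]


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# 40 placeholder "frames" (just indices here) give two full 16-frame clips.
clips = split_into_clips(list(range(40)))
print(len(clips), [len(c) for c in clips])

# Hypothetical raw scores for the classes (non-violent, violent).
print(softmax([0.3, 2.1]))
```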

Dai et al. [139] proposed techniques for violent scene detection. The paper trains a Convolutional Neural Network (CNN) model on a subset of ImageNet classes suited to violence detection. A two-stream CNN architecture extracts features from static frames as well as from optical flow computed using motion vectors. The architecture is supported by Long Short-Term Memory (LSTM) models, which capture longer-term temporal dynamics. Conventional motion features and audio feature vectors are supplied as additional inputs to the deep learning framework, and all features are fused. The evaluation metrics were mean average precision (mAP) and accuracy. The datasets used were Violent Crowd, Violence in Movies and Hockey Fight.

Dong et al. [140] presented a framework of multi-stream deep convolutional neural networks for person-to-person violence detection in videos. Besides the conventional spatial and temporal streams, an acceleration stream is proposed for capturing violent actions. The streams serve as inputs to information fusion. The results were effective on standard violence datasets. The evaluation metric was accuracy, and the dataset used was Hockey.

Sudhakaran et al. [141] presented a framework for automatic analysis of surveillance videos to detect violence. A convolutional neural network extracts discriminant features from the video. The features are then aggregated using a variant of the long short-term memory that uses convolutional gates. The CNN together with the LSTM captures localized spatio-temporal features, which helps to locate motion in the video. The paper also uses frame differences in pre-processing: the difference between frames is fed into the model so that it focuses on changes in the video. Standard datasets are used to evaluate the recognition of violent videos. The evaluation metric was accuracy, and the datasets used were the Hockey Fight dataset, the Movies dataset and the Violent-Flows crowd violence dataset.
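The frame-difference pre-processing mentioned above can be sketched in a few lines. The sketch assumes grayscale frames stored as nested lists and uses the signed difference between consecutive frames; whether the original work uses signed or absolute differences is not stated in this survey, so this choice and the function name are assumptions.

```python
def frame_differences(frames):
    """Return the signed pixel-wise difference between consecutive frames.

    frames: list of H x W grayscale frames (nested lists of numbers).
    The result has one frame fewer than the input.
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        diffs.append([
            [c - p for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)
        ])
    return diffs


# Three tiny 1 x 2 frames: one pixel brightens once and then stays constant.
frames = [[[0, 0]], [[0, 5]], [[0, 5]]]
print(frame_differences(frames))
```

Static regions map to zero, so a model fed with these differences sees mainly the parts of the scene that change between frames.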

Mu et al. [142] presented a deep learning framework for violent scene detection (VSD) in videos. Visual cues are an important attribute in detecting violent scenes; this paper additionally investigates CNNs for violent scene detection assisted by the acoustic information in the video. The CNN works both as a feature extractor and as a classifier for the acoustic information. Two separate streams of acoustic and visual information are fused to improve violence detection. The evaluation metric was average precision, and the dataset used was MediaEval 2015.

Mohammadi et al. [143] proposed improvements to the Social Force Model (SFM). SFM-based methods lack explanations for complex crowd behaviours. A new hybrid scheme is proposed for detecting violent events in crowd videos. A set of behavioural heuristics is assembled and transformed into physical equations, which are then used to describe behaviours in the video and to classify violent events. CNNs are used in this paper only for comparison. The evaluation metric was accuracy, and the datasets used were Violence in Crowds (VIC), Violence in Movies (VIM) and BEHAVE.

Marsden et al. [144] presented the ResnetCrowd architecture for violence detection, crowd counting and crowd density level classification. A new dataset of 100 images, known as Multi-Task Crowd, is introduced; it is fully annotated for crowd counting, violent behaviour detection and density level classification. The paper shows that the multi-task approach boosts performance on all the individual tasks, especially violent behaviour detection, which receives a 9% boost in the area under the ROC curve (AUC). The evaluation metric was AUC, and the dataset used was Multi-Task Crowd.

Fenil et al. [145] proposed a violence detection solution for football matches that operates on the streaming data of the matches. Video streams are fed as input to a Spark framework, where features are extracted using the HOG (Histogram of Oriented Gradients) function. Three model streams are constructed: a violence model stream, a human part model stream and a negative model stream. The video frames are divided among these models and then used to train a Bidirectional Long Short-Term Memory (BDLSTM) network for recognizing violent scenes. As the name suggests, the bidirectional LSTM processes the sequence in both the forward and the backward direction, so its output takes into account past as well as future information. The evaluation metric was accuracy, and the dataset used was the violent interaction dataset (VID).

Serrano et al. [146] proposed a hybrid framework combining shallow features with deep learning features, which provided better classification accuracy. The hybrid method is evaluated on three standard datasets. The evaluation metric was accuracy, and the datasets used were the Hockey, Movies and Behave datasets.

Zhou et al. [147] presented a 3D-CNN based violence detection method. The three-dimensional deep neural network operates directly on the input and extracts the spatial and temporal characteristics of the video frames. The paper shows that this scheme identifies violent behaviour better than hand-crafted features. The evaluation metric was accuracy, and the datasets used were the Hockey, Movies and Behave datasets.

Mukherjee et al. [148] discussed anomalies in videos, such as fights. The paper focuses on finding fight scenes in hockey videos using blur information, the Radon transform and convolutional neural networks (CNNs). Local motion within the video frames is extracted using blur information, after which the fast Fourier transform and the Radon transform are applied to the local motion. Transfer learning is performed with the pre-trained VGG-Net model. The evaluation metric was accuracy, and the dataset used was the Hockey dataset.

Nova et al. [149] presented an approach based on the Support Vector Machine (SVM) for violence detection. The inputs to the SVM are generated by a novel pose estimation algorithm. A set of popular hand-crafted features based on angles, velocity and contact detection is also fed into the SVM to detect violent behaviour in a video. The evaluation metric was F1-score, and the dataset used was the Social Activity Dataset [150].

Mandal et al. [151] presented a fine-tuned deep convolutional residual network for deciphering crowd behaviour. Subclasses of feature maps, which include violent behaviour, are constructed; this subclass modelling introduces discriminative features. A time-warping technique based on the cosine distance measure approximates the similarity between videos, and a nearest neighbour (NN) classifier classifies the crowd behaviour attributes. The evaluation metric was AUC, and the dataset used was the Social Activity Dataset.

Ammar et al. [152] presented an approach combining LSTM with CNN features. The convolutional gates in the LSTM are trained to encode temporal changes in local regions, which enables the network to encode localized spatiotemporal characteristics. Hierarchical features are extracted from the video frames by the convolution layers and then aggregated by the LSTM layer. The evaluation metric was accuracy, and the datasets used were Hockey, Violent-Flows and the VSD benchmark.

Meng et al. [153] presented a framework for violent behaviour detection that integrates trajectories with deep convolutional neural networks. A ConvNet model is experimented with on the UCF101 dataset. The model is inspired by the VGG-19 net, which consists of 17 convolution pool-norm layers and two fully connected layers. The evaluation metric was accuracy, and the datasets used were Hockey Fights and Violent-Flows.
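The combination of time warping with a cosine distance and a nearest neighbour classifier, described above for Mandal et al. [151], can be illustrated with classical dynamic time warping (DTW). This is only a sketch under stated assumptions: the paper's own time-warping technique may differ from standard DTW, and the function names, the zero-vector convention and the toy feature sequences are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(seq_a, seq_b):
    """Classical DTW between two sequences of per-frame feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def nearest_neighbour_label(query, gallery):
    """gallery: list of (sequence, label); return the label of the closest sequence."""
    return min(gallery, key=lambda item: dtw_distance(query, item[0]))[1]


# Toy 2-D "features" per frame; the sequences have different lengths on purpose.
gallery = [
    ([[1, 0], [1, 0], [1, 0]], "non-violent"),
    ([[0, 1], [0, 1]], "violent"),
]
query = [[0.1, 1], [0, 1], [0, 1], [0, 1]]
print(nearest_neighbour_label(query, gallery))
```

DTW allows videos of different lengths to be compared, which a frame-by-frame distance could not do directly.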

Zhuang et al. [154] presented a Convolutional DLSTM (ConvDLSTM) deep architecture for crowd scene understanding. ConvDLSTM combines GoogleNet Inception V3 convolutional neural networks (CNN) with stacked differential long short-term memory (DLSTM) networks and optimizes the parameters of both the CNN and the RNN. A raw image without trajectory information is taken as input. The semantic information of the CNN and the memory states of the LSTM make analysis of the crowd scene and its motion easier. The evaluation metric was accuracy, and the datasets used were Violent-Flows and CUHK Crowd.

Blunsden et al. [155] presented techniques for detecting aggressive activities and crowd violence using a transfer learning-based violence detector. An optical flow feature vector is calculated with the Lucas–Kanade method. Various 2D templates are then derived by overlapping optical flow magnitudes and orientations; these can be termed template feature vectors (TFV). The TFV are fed into a pre-trained convolutional neural network, and deep features are extracted from different layers. Classic support vector machine (SVM) and k-nearest neighbour classifiers are then trained on the above TFV to classify videos as violent or non-violent. The evaluation metric was accuracy, and the datasets used were the Hockey, Movies and Violent-Flows (ViF) datasets.

Elesawy et al. [156] presented a technique for detecting aggressive activities based on the Bidirectional Convolutional LSTM (BiConvLSTM) architecture. A Spatiotemporal Encoder with bidirectional temporal encodings and elementwise max-pooling helps in violence detection. The Spatial Encoder performs well on the Hockey Fights and Movies datasets, whereas on the Violent Flows dataset the Spatiotemporal Encoder outperforms the Spatial Encoder. The evaluation metric was accuracy, and the datasets used were the Hockey, Movies and Violent-Flows (ViF) datasets.

Table 4 highlights the datasets used for CNN based violence analysis.

Table 4

Datasets for Violence Analysis

Dataset Name	Year	Brief description	Format	Creator
BEHAVE	2010	It consists of fight and violence activities generated from surveillance cameras.	4 videos	[155]
Hockey Fight	2011	Hockey players indulging in games and fights.	1,000 clips	[157]
Movies Fight Detection Dataset	2011	Consist of trimmed action movies.	200 clips	[157]
Violence in Movies	2011	200 samples which consist of 100 samples of violent and nonviolent task each.		[157]
UCF101	2012	A dataset consisting of 101 human actions classes from videos in the wild. The span of the dataset is large with over 27 hours of video.	Video, images, text	[158]
Violent Crowd	2012	246 samples which consist of 123 samples of violent and nonviolent task each.	It includes 75 violent videos and 65 non-violent videos	[159]
Violent-Flows	2012	Consist of crowd violence.	246 clips	[159]
MediaEval 2015	2014	Extension of Discrete LIRIS-ACCEDE including annotations for violence levels of the films.	Video	[160]
CUHK	2014	It consists of crowd videos with various densities and perspective scales, collected from many different environments, e.g. streets, shopping malls, airports, and parks.	It consists of 474 video clips from 215 scenes, among which 419 clips.	[161]
VSD	2015	It is derived from complete Hollywood movies	25 movies	[162]
RE-DiD	2015	It consists of urban fights with addition of Cars/Mobiles.	30 videos	[163]
ARENA	2016	Publicly available annotated databases generated with static surveillance cameras containing fight or violence activities.		[164]
Social Activity Dataset	2016	A new dataset of social interaction between two subjects. It includes RGB and depth images and tracked skeleton data acquired by an RGBD sensor. The dataset has eight social activities: handshake, greeting hug, help walk, help stand-up, fight, push, conversation and call attention.	Video and Images	[150]
Violence-Cross	2016	A dataset derived from the VIC and CUHK datasets. It includes 300 videos, equally divided into three classes (100 videos per class): Class 1 contains videos of violent behaviours; Class 2 contains videos of people walking in opposite directions (cross walk); Class 3 contains videos showing actions other than violent and crowd crossing behaviours (e.g., marathon, crowd walking in the same direction).	Video	[142]
Violent interaction dataset	2017	Violent interaction dataset (VID) consists of videos consisting of violent activities.	2314 videos with 1077 fight ones and 1237 no-fight ones.	[134]
Self-built dataset	2018	It consists of 50 violent movies and 20 short videos from YouTube having a video database with resolution of 1024*576 and length of 24 s.	There are 680 videos for the training set, 160 for validation set, and 160 for test set.	[133]
Multiplayer Video Dataset	2018	Derived from Hockey dataset.	500 multiplayer violence videos and 500 multiplayer non-violent videos.	[135]
CCTV-Fights	2019	It consists of urban fights recorded from CCTVs/Mobiles.	1,000 videos	[128]
Real Life Violence Situations Dataset	2019	Real Life Violence Situations Dataset	1000 Violence videos	[156]

5 Comparison of violence analysis approaches

Many conceptual approaches based on Convolutional Neural Networks have been presented for violence detection on different datasets. This section discusses the quantitative results of methods that claim to solve violence detection using CNN techniques. Various metrics are used in the literature to assess these approaches, and different papers report different metrics, so here we focus on some of the widely agreed measures for evaluating violence detection models. Table 5 lists the terms used in these metrics. Table 6 summarizes the quantitative results of the CNN based violence detection methods with respect to the datasets used in the respective approaches. The majority of the methods use accuracy as the evaluation metric. Accuracy is defined as the proportion of samples that have been correctly classified among all samples. It is an intuitive metric and is computed using Equation (1), where TP, TN, FP and FN are as defined in Table 5:

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \times 100$ (1)

Table 5

Annotation Metric [15]

TP	FN	TN	FP
True Positive	False Negative	True Negative	False Positive
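Equation (1) and the F-measure reported by some of the surveyed papers can be computed directly from the four counts in Table 5. The sketch below uses the standard definitions of precision, recall and F1; the assumption that the F-Score column of Table 6 refers to F1, the function names and the example counts are ours and are not taken from any surveyed paper.

```python
def accuracy(tp, tn, fp, fn):
    """Equation (1): percentage of correctly classified samples."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return 100 * (tp + tn) / total


def f_measure(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a violent / non-violent classifier on 100 clips.
tp, tn, fp, fn = 40, 45, 5, 10
print(accuracy(tp, tn, fp, fn))
print(f_measure(tp, fp, fn))
```

Unlike accuracy, the F-measure ignores true negatives, so the two metrics can rank methods differently when violent clips are rare.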

Table 6

Comparison of results for CNN based approaches to violence detection

Reference	Datasets	AUC	Accuracy	mAP	F-Score
[128]	CCTV-Fights	–	–	79.5	75
[130]	Hockey Datasets	–	89.5	–	–
[131]	YouTube	–	–	–	0.95
[132]	Hockey Dataset	–	99.62	–	–
	Movies	–	99.97	–	–
[133]	Hockey Dataset	–	95.40	–	–
	Violent Flow	–	97.97	–	–
	Self-built dataset	–	96.25	–	–
[134]	Violent interaction dataset	–	97.06	–	–
[135]	Multiplayer Video Dataset	–	92.4	–	–
[136]	UCF101	–	90.4	–	–
[137]	BEHAVE	0.9543	–	–	–
[138]	Violent Crowd	–	98.0	–	–
	Violence in Movies	–	99.9	–	–
	Hockey Fight	–	96	–	–
[139]	Violent Crowd	–	–	0.296	–
[140]	Hockey Fight	–	93.9	–	–
[141]	Hockey Fight	–	97.1	–	–
	Movies	–	100	–	–
	Violent-Flows	–	94.57	–	–
[142]	MediaEval 2015	–	–	0.253	–
[143]	Violence in Crowds	–	86.61	–	–
	Violence in Movies	–	96.91	–	–
	BEHAVE	–	95.73	–	–
[144]	Multi Task Crowd	0.78	–	–	–
[145]	Violent interaction dataset	–	94.50	–	–
[146]	Hockey	–	94.6	–	–
	Movies	–	99.00	–	–
	Behave	–	91.42	–	–
[148]	Hockey	–	75	–	–
[149]	Social Activity Dataset	–	–	–	0.89
[151]	Activity	0.96	–	–	–
[152]	Hockey	–	98.00	–	–
	Violent-Flows	–	92.19	–	–
	VSD	–	94.57	–	–
[153]	Hockey Fights	–	98.6	–	–
	Crowd Violence	–	92.5	–	–
[154]	Violent-Flows	–	93.59	–	–
	CUHK	–	80.33	–	–
[165]	Hockey Fight	–	94.40	–	–
	Movies	–	96.50	–	–
	Violent-flows	–	80.90	–	–
[166]	Hockey Fight	–	96.96	–	–
	Movies	–	100	–	–
	Violent-flows	–	90.63	–	–
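Table 6 also reports AUC and mAP for some methods. As a hedged illustration of how these ranking-based metrics are obtained from classifier scores, the sketch below computes the ROC AUC as the fraction of positive/negative pairs that are ranked correctly (ties count as one half) and the average precision (AP) of a single ranked list; mAP is the mean of AP over several classes or queries. The function names and the example scores are hypothetical.

```python
def roc_auc(scores, labels):
    """ROC AUC as the probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive sample")
    return precision_sum / hits


# Hypothetical violence scores for four clips; label 1 marks a violent clip.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels))
print(average_precision(scores, labels))
```

Both functions return values in the range 0 to 1, whereas Equation (1) gives a percentage; this difference in scale should be kept in mind when comparing figures from different papers.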

Fig. 15

Modelling proposed for Deployment of Violence Detection in Real Time Systems.

6 Discussion and future research directions

Various approaches to violence detection that use CNNs as their primary component have been discussed. These include two-stream networks, 2D CNN, 3D CNN, spatiotemporal features, combinations of handcrafted and deep features, classical machine learning approaches, binary classification, multi-class action recognition and so on. We are confident that current research on CNNs has prompted the research community to look seriously into CNN based solutions for real-time violence detection, which can help society in general and act as a safety valve for public security. We observe that the CNN based methods surveyed here fall into one of the following categories:

The depth of ConvNets and the pre-processed input fed into the deep architecture can change the output.

Different variations of CNN combined with its architectural variations and fusing the classification decisions can also result in a better classification result.

Feature extraction using CNNs and novel pre-processing methods, followed by different classifiers to produce the final result.

New deep architectures assisted with LSTM, BDLSTM, DNN, Auto Encoders etc. can help to enhance the performance.

Use of transfer learning in violence detection.

One of the major challenges for violence detection is the lack of real-time datasets. Most violent activities fall under the domain of abnormality, and abnormal situations form only a small part of a long real-time video. A large pool of multimodal activities needs to be annotated so that violent activities can be detected together with their scene context. Transfer learning is a paradigm shift under which many new classification cases are being handled, and applying it across datasets can help to increase the performance of CNN based models. Multiple subcategories of violence are a natural subject for further research. Also, if a method could learn from a few training samples carrying a large amount of labelled information, the need to build huge training datasets could be done away with. The main aim of this survey is to provide insight into the usefulness of CNN based violence detection methods, which can generate triggers and alarms for community and policing services for better handling of violence-related tasks. Surveying violence in the crowd domain is intended to encourage the deployment of deep learning frameworks in actual scenarios; most of these results so far remain within the research domain. Our aim is to make a working model for the web as well as mobile platforms. Fig. 15 shows a framework of the system that needs to be deployed to put violence detection into practice. The input can be any image or video, and the models need to be deployed in a standard deep learning framework as discussed above in the survey. Keras, TensorFlow and PyTorch are popular for saving model files, retraining pretrained models and deploying them in an actual violence detection module.

The above model can be deployed on a cloud server if the data is large and arrives in real time. Suitable GPU hardware needs to be deployed to detect the violence. After detection, the system can help security agencies to identify the cause of the violence, assemble evidence to apprehend the culprits, and build a database and dataset for further learning of the system. The model should be retrained, with fine-tuning of hyperparameters, as new data arrives. With a deployed violence detection framework, emergency response services can be rushed to the affected area. We must remember that the whole point of deployment is to stop violence from spreading. If the violence does spread, the system should help contain the affected area with automatic classification results, alerts and alarms. A classic surveillance system for monitoring the affected areas will always be helpful for the future course of action. One more reason to research violence detection using ConvNets is that visual detection of human actions has been shown to play an important role in affecting public sentiment [166] [167]. Visual media, visual altercations and violence have a deep impact on the public and on the social layers of human society.
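The alert and alarm step of such a deployment can be sketched independently of the underlying model. The sketch below assumes that some trained classifier (not shown) outputs one violence probability per clip, and it raises an alarm only when several recent clips exceed a threshold, which reduces false alarms from a single noisy prediction. The class name, the threshold, the window size and the example probabilities are all hypothetical and are not part of the framework in Fig. 15.

```python
from collections import deque


class ViolenceAlarm:
    """Raise an alarm when enough recent clips are scored as violent."""

    def __init__(self, threshold=0.5, window=5, min_hits=3):
        if not 0 < min_hits <= window:
            raise ValueError("min_hits must be between 1 and window")
        self.threshold = threshold
        self.min_hits = min_hits
        # Only the most recent `window` decisions are kept.
        self.recent = deque(maxlen=window)

    def update(self, violence_probability):
        """Record the score of one clip and return True if an alarm should fire."""
        self.recent.append(violence_probability >= self.threshold)
        return sum(self.recent) >= self.min_hits


# Hypothetical per-clip probabilities coming from a deployed model.
alarm = ViolenceAlarm(threshold=0.5, window=3, min_hits=2)
for p in [0.1, 0.9, 0.2, 0.8, 0.7]:
    print(p, alarm.update(p))
```

In a real system the returned flag would trigger the notification to the control centre; how that notification is transmitted is outside the scope of this sketch.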

7 Conclusion

This survey reviews papers that use CNNs for violence detection. It emerges that violence detection becomes more meaningful when the scene context in which it occurs is taken into account. Handcrafted features are also briefly discussed. We conclude that combining handcrafted and deep features can sometimes increase classification accuracy, and that multiple feature fusion methods also improve the accuracy of CNN based methods. We have touched upon the need to analyse violence, its motivations and the automatic classification of images. We have discussed violence, methods and architectures, with a final discussion with respect to the crowd domain. Isolated violence between a few people and mob violence are entirely different kinds of violence build-up. We have extensively reviewed CNN based methods for violence detection and compiled a performance comparison. We have also explored the most widely used datasets for violence detection, identified the challenges of the datasets and CNN approaches, and highlighted the pre-processing of data that is fed into the CNN. Major research is still needed on how crowds are defined in different countries and at different locations, which will help us understand CNN based violence detection better. Complex scene analysis combined with violence detection algorithms can be a boost for security forces. We hope this survey fosters appreciation of the increasing use of deep learning and the efficient use of Convolutional Neural Networks in detecting violent activities. It is important for researchers to understand the inception and processes of violent activities in different parts of the world.

References

1. Hinton G.E., Osindero and Teh Y.W., A fast learning algorithm for deep belief nets, Neural Computation 18(7) (2006), 1527–1554.

2. Bengio, Courville and Vincent, Representation Learning: A Review and New Perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8) (2013), 1798–1828.

3. Deng, An overview of deep-structured learning for information processing, in Asian-Pacific Signal & Information Processing Annu. Summit and Conf. (APSIPAASC), October 2011.

4. Vicsek and Zafeiris, Collective motion, Physics Reports 517(3) (2012), 71–140.

5. Hinton, Deep neural networks for acoustic modelling in speech recognition, IEEE Signal Process Mag 29(6) (2012), 82–97.

6. and Deng, Deep learning and its applications to signal and information processing, IEEE Signal Process Mag 28(1) (2011), 145–154.

7. Arel, Rose and Karnowski, Deep machine learning – a new frontier in artificial intelligence, IEEE Computational Intelligence Mag 5(4) (2010), 13–18.

8. Deng, A tutorial survey of architectures, algorithms, and applications for deep learning, APSIPA Transactions on Signal and Information Processing 3, 2014.

9. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics 36(4) (1980), 193–202.

10. S.-C., Lou S.-L., Lin J.-S., Freedman M.T., Chien M.V. and Mun S.K., Artificial convolution neural network techniques and applications for lung nodule detection, IEEE Transactions on Medical Imaging 14(4) (1995), 711–718.

11. Lecun Y.B.L., Bengio and Haffner, Gradient-based learning applied to document recognition, in Proceedings of the IEEE, 1998.

12. Krizhevsky, Sutskever and Geoffrey E.H., Imagenet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems 25 (NIPS 2012), 2012.

13. Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg A.C. and Fei-Fei, ImageNet large scale visual recognition challenge, International Journal of Computer Vision 115(3) (2014), 1–42.

14. Moeslund T.B. and Granum, A survey of computer vision-based human motion capture, Computer Vision and Image Understanding 81(3) (2001), 231–268.

15. Bishop C.M., Pattern recognition & Machine Learning, 1st ed. 128, New York: Springer-Verlag, 2006, pp. 1–58.

16. Kephart J.O. and Chess D.M., The vision of autonomic computing, Computer 36(1) (2003), 41–50.

17. Lemley, Bazrafkan and Corcoran, Deep Learning for Consumer Devices and Services: Pushing the limits for machine learning, artificial intelligence, and computer vision, IEEE Consumer Electronics Magazine 6(2) (2017), 48–56.

18. Leo, Medioni, Trivedi, Kanade and Farinella, Computer vision for assistive technologies, Computer Vision and Image Understanding 15 (2017), 1–15.

19. Liu, Wang, Nasrabadi and Huang, Learning a Mixture of Deep Networks for Single Image Super-Resolution, in Asian Conference on Computer Vision, 2017.

20. Wing J.M., Computational thinking, Commun ACM 49(3) (2006), 33–35.

21. Sun and Fisher, Object-based visual attention for computer vision, Artificial Intelligence 146(1) (2003), 77–123.

22. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2652990/. [Accessed 11 07 2019].

23.

Schmidhuber

, Deep learning in neural networks: an overview, Neural Networks 61 (2015), 85–117.

24.

, Wang

, Kuen

, Ma

, Shahroudy

, Shuai

, Liu

, Wang

and Wang

, Recent advances in convolutional neural networks, eprint arXiv:1512.07108, Dec 2015.

25.

LeCun

, Bengio

and Hinton

, Deep learning, Nature 521 (2015), 436–444.

26.

Bengio

, Lamblin

, Popovici

and Larochelle

, Greedy layer-wise training of deep networks, in International Conference on Neural Information Processing Systems, 2007.

27.

Hinton

G.E.

and Salakhutdinov

R.R.

, Reducing the dimensionality of data with neural networks, Science 313(5786) (2006), 504–507.

28.

Vincent

, Larochelle

, Lajoie

, Bengio

and Manzagol

P.-A.

, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, Journal of Machine Learning Research 11 (2010), 3371–3408.

29.

Hubel

D.H.

and Wiesel

T.N.

, Receptive fields and functional architecture of monkey striate cortex, The Journal of Physiology 195(1) (1968), 215–243.

30.

LeCun

, Cortes

and Burges

C.J.

, MNIST handwritten digit database, 2010.

31.

Weisstein

E.W.

, Convolution. From MathWorld–A Wolfram Web Resource, 2009.

32.

Zeiler

M.D.

and Fergus

, Visualizing and understanding convolutional networks, in ECCV, 2014.

33.

Simonyan

and Zisserman

, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556, 2014.

34.

Szegedy

, et al., Going deeper with convolutions, in IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015.

35.

, Zhang

, Ren

and Sun

, Deep residual learning for image recognition, eprint arXiv:1512.03385, 2015.

36.

, Zhang

, Ren

and Sun

, Identity mappings in deep residual networks, in European Conference on Computer Vision, Amsterdam, 2016.

37.

Zagoruyko

and Komodakis

, Wide residual networks, arXiv preprint arXiv:1605.07146, 2016.

38.

Singh

, Hoiem

and Forsyth

, Swapout: Learning an ensemble of deep architectures, arXiv preprint arXiv:1605.06465, 2016.

39.

Targ

, Almeida

and Lyman

, Resnet in resnet: Generalizing residual architectures, arXiv preprint arXiv:1603.08029, 2016.

40.

Zhang

, Sun

, Han

T.X.

, Yuan

, Guo

and Liu

, Residual networks of residual networks: Multilevel residual networks, IEEE Transactions on Circuits and Systems for Video Technology PP(99), 2016.

41.

Wang

, Jiang

, Qian

, Yang

, Li

, Zhang

, Wang

and Tang

, Residual attention network for image classification, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

42.

Mikolov

, Karafiát

, Burget

, Černockỳ

and Khudanpur

, Recurrent neural network based language model, in Eleventh annual conference of the international speech communication association, 2010.

43.

Sundermeyer

, Schl“uter

and Ney

, LSTM neural networks for language modeling, in Thirteenth annual conference of the international speech communication association, 2012.

44.

Tripathi

, Singh

and Vishwakarma

D.K.

, Convolutional neural networks for crowd behaviour analysis: a survey, The Visual Computer 35(5) (2019), 753–776.

45.

Ngiam

, Chen

, Chia

, Koh

P.W.

, Le

Q.V.

and Ng

A.Y.

, Tiled convolutional neural networks, in NIPS, 2010.

46.

Wang

and Oates

, Encoding time series as images for visual inspection and classification using tiled convolutional neural networks, in AAAI Workshop, 2015.

47.

and Koltun

, Multi-scale context aggregation by dilated convolutions, in ICLR, 2016.

48.

Kalchbrenner

, Espeholt

, Simonyan

, Oord

, Graves

and Kavukcuoglu

, Neural machine translation in linear time, arXiv preprint arXiv:1610.10099, 2016.

49.

Sercu

and Goel

, Dense prediction on sequences with time-dilated convolutions for speech recognition, in NIPS Workshop, 2016.

50.

Oord

V.D.

, Dieleman

, Zen

, Simonyan

, Vinyals

, Graves

, Kalchbrenner

, Senior

and Kavukcuoglu

, Wavenet:Agenerative model for rawaudio, arXiv preprint arXiv:1609.03499, 2016.

51.

Lin

, Chen

and Yan

, Network in network, arXiv:1312.4400, 2013.

52.

Szegedy

, Ioe

, Vanhoucke

and Alemi

, Inceptionv4, Inception-ResNet and the impact of residual connections on learning, arXiv:1602.07261, 2016.

53.

Long

, Shelhamer

and Darrell

, Fully convolutional networks for semantic segmentation, arXiv:1411.4038, 2015.

54.

Zeiler

M.D.

, Krishnan

, Taylor

G.W.

and Fergus

, Deconvolutional networks, in CVPR, 2010.

55.

Zeiler

M.D.

, Taylor

G.W.

and Fergus

, Adaptive deconvolutional networks for mid and high level feature learning, in ICCV, 2011.

56.

Bruna

, Szlam

and LeCun

, Signal recovery from pooling representations, eprint arXiv:1311.4025, 2014.

57.

Gulcehre

, Cho

, Pascanu

and Bengio

, Learned-norm pooling for deep feedforward and recurrent neural networks, in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2014.

58.

Simoncelli

E.P.

and Heeger

D.J.

, A model of neuronal responses in visual area MT, Vision Research 38(5) (1998), 743–761.

59.

Hyvärinen

and Köster

, Complex cell pooling and the statistics of natural images, Network: Computation in Neural Systems 18(2) (2007), 81–100.

60.

Hinton

G.E.

, Srivastava

, Krizhevsky

, Sutskever

and Salakhutdinov

R.R.

, Improving neural networks by preventing co-adaptation of feature detectors, eprint arXiv:1207.0580, 2012.

61.

Wan

, Zeiler

, Zhang

, Cun

Y.L.

and Fergus

, Regularization of neural networks using dropconnect, in PMLR, 2013.

62.

Zeiler

M.D.

and Fergus

, Stochastic pooling for regularization of deep convolutional neural networks, eprint arXiv:1301.3557, 2013.

63.

Rippel

, Snoek

and Adams

R.P.

, Spectral representations for convolutional neural networks, in NIPS, Montreal, 2015.

64.

Gong

, Ke

, Isard

and Lazebnik

, A multi-view embedding space for modeling internet images, tags, and their semantics, Int J Comput Vision 106(2) (2014), 210–233.

65.

Jégou

, Perronnin

, Douze

, Sanchez

, Perez

and Schmid

, Aggregating local image descriptors into compact codes, IEEE Transactions on Pattern Analysis and Machine Intelligence 34(9) (2012), 1704–1716.

66.

Zhang

, Zhao

, Feng

and Lyu

, AlphaMEX: A smarter global pooling method for convolutional neural networks, Neurocomputing 321 (2018), 36–48.

67.

Nair

and Hinton

G.E.

, Rectified linear units improve restricted boltzmann machines, in International Conference on International Conference on Machine Learning, Haifa, 2010.

68.

Maas

A.L.

, Hannun

and Ng

A.Y.

, Rectifier nonlinearities improve neural network acoustic models, in ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.

69.

Goodfellow

I.J.

, Warde-Farley

, Mirza

, Courville

and Bengio

, Maxout networks, in International Conference on Machine Learning, Atlanta, 2013.

70.

Springenberg

J.T.

and Riedmiller

, Improving deep neural networks with probabilistic maxout units, arXiv preprint arXiv:1312.6116, 2013.

71.

, Zhang

, Ren

and Sun

, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in IEEE International Conference on Computer Vision, 2015.

72.

, Wang

, Chen

and Li

, Empirical evaluation of rectified activations in convolutional network, arXiv preprint arXiv:1505.00853, 2015.

73.

Clevert

D.-A.

, Unterthiner

and Hochreiter

, Fast and accurate deep network learning by exponential linear units (elus), arXiv preprint arXiv:1511.07289, 2015.

74.

Zhao

, Lyu

, Zhang

and Feng

, Multiactivation Pooling Method in Convolutional Neural Networks for Image Recognition, Wireless Communications and Mobile Computing, 2018, 2018.

75.

Bottou

, Large-scale machine learning with stochastic gradient descent, in International Conference on Computational Statistics (COMPSTAT’2010), 2010.

76.

Wijnhoven

R.G.

, and P.H.N.d.With, Fast training of object detection using stochastic gradient descent, in Pattern Recognition (ICPR), 2010 20th International Conference on. IEEE, 2010, 2010.

77.

Duchi

, Hazan

and Singer

, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research 12 (2011), 2121–2159.

78.

Zinkevich

M.A.

, Weimer

, Smola

and Li

, Parallelized stochastic gradient descent, in NIPS, Vancouver, 2010.

79.

Recht

, Re

, Wright

and Niu

, Hogwild:A lock-free approach to parallelizing stochastic gradient descent, in NIPS, 2011.

80.

Bengio

, Deep learning of representations: Looking forward, in International Conference on Statistical Language and Speech Processing, 2013.

81.

Dean

, Corrado

G.S.

, Monga

, Chen

, Devin

, Le

Q.V.

, Mao

M.Z.

, Ranzato

, Senior

, Tucker

, Yang

and Ng

A.Y.

, Large scale distributed deep networks, in NIPS, Lake Tahoe, Nevada, 2012.

82.

Zhuang

, Chin

W.-S.

, Juan

Y.-C.

and Lin

C.-J.

, A fast parallel sgd formatrix factorization in sharedmemory systems, in ACM conference on Recommender systems, Hong Kong, 2013.

83.

Thoma

, Analysis and Optimization of Convolutional Neural Network Architectures, arXiv preprint arXiv:1707.09725, 2017.

84.

Ooi

B.C.

, et al., SINGA: A distributed deep learning platform, in ACM international conference on Multimedia, Brisbane, 2015.

85.

Jia

, et al., Caffe: Convolutional architecture for fast feature embedding, in ACM international conference on Multimedia, Orlando, 2014.

86.

http://deeplearning4j.org/ Last visited 27.05.2017., [Online].

87.

King

D.E.

, Dlib-ml: A machine learning toolkit, Journal of Machine Learning Research 10 (2009), 1755–1758.

88.

Seide

, Keynote: The computer science behind the Microsoft Cognitive Toolkit: An open source large-scale deep learning toolkit for Windows and Linux, in IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2017.

89.

Chen

, et al., Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems, arXiv preprint arXiv:1512.01274, 2015.

90.

Lopez

, Open NN: An Open Source Neural Networks C++ Library [software], 2014.

91.

Abadi

, Agarwal

, Barham

, Brevdo

, Chen

, Citro

and Ghemawat

, TensorFlow: LargeScale Machine Learning on Heterogeneous Distributed Systems, arXiv preprint arXiv:1603.04467, 2016.

92.

Bastien

, Lamblin

, Pascanu

, Bergstra

, Goodfellow

, Bergeron

, Bouchard

, Warde-Farley

and Bengio

, Theano: new features and speed improvements., in Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

93.

Collobert

K.K.C.F.R.

, Torch7: A matlab-like environment for machine learning, in BigLearn, NIPS Workshop (No. EPFL-CONF- 192376), 2011.

94.

https://www.collinsdictionary.com/dictionary/english/violence, [Online]. [Accessed 12 6 2019].

95.

, Moore

B.E.

and Shah

, Chaotic invariants of lagrangian particle trajectories for anomaly detection in crowded scenes, in IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, 2010.

96.

Zitouni

M.S.

, Bhaskar

, Dias

and Al-Mualla

, Advances and trends in visual crowd analysis: A systematic survey and evaluation of crowd modelling techniques, Neurocomputing 186 (2016), 139–159.

97.

Rodriguez

, Laptev

, Sivic

and Audibert

J.Y.

, Density-aware person detection and tracking in crowds, in IEEE International Conference on Computer Vision, 2011.

98.

, Song

, Wu

, Li

, Feng

and Qian

, Video anomaly detection based on a hierarchical activity discovery within spatio-temporal context, Neurocomputing 143 (2014), 144–152.

99.

Cheng

, Qin

, Huang

, Yan

and Tian

, Recognizing human group action by layered model with multiple cues, Neurocomputing 136 (2014), 124–135.

100.

Liang

, Zhu

and Wang

, Counting crowd flow based on feature points, Neurocomputing 133 (2014), 377–384.

101.

Zhan

, Monekosso

D.N.

, Remagnino

, Velastin

S.A.

and Xu

L.-Q.

, Crowd analysis: a survey, Machine Vision and Applications, Machine Vision and Applications 19(5-6) (2008), 345–357.

102.

Rabiee

, Haddadnia

, Mousavi

, Nabi

, Murino

and Sebe

, Emotion-based crowd representation for abnormality detection, arXiv preprint arXiv:1607.07646, 2016.

103.

Sultani

, Chen

and Shah

, Real-World Anomaly Detection in Surveillance Videos, in he IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

104.

http://www.desibrandstrategy.com/why-tirupati-tirumala-needs-smarter-analytics/, [Online]. Available: http://www.desibrandstrategy.com/why-tirupati-tirumala-needs-smarter-analytics/. [Accessed 17 June 2017].

105.

https://www.indiatoday.in/sports/fifa-world-cup-2018/story/2018-fifa-world-cup-racism-crowd-violence-major-hurdles-for-russia-1248643-2018-06-02, [Online]. Available: https://www.indiatoday.in/sports/fifa-world-cup-2018/story/2018-fifa-world-cup-racism-crowd-violence-major-hurdles-for-russia-1248643-2018-06-02.

106.

https://p.motionelements.com/stock-video/people/me4881502-violent-riots-car-fire-pan-to-crowd-hd-a0005.jpg, [Online]. Available: https://p.motionelements.com/stock-video/people/me4881502-violent-riots-car-fire-pan-to-crowd-hd-a0005.jpg. [Accessed 1 July 2019].

107.

https://www.channelnewsasia.com/news/asia/indonesia-jakarta-riot-how-protests-turned-violent-11554038, [Online]. Available: https://www.channelnewsasia.com/news/asia/indonesia-jakarta-riot-how-protests-turned-violent-11554038. [Accessed 1 July 2019].

108.

Dimokranitou

and Tsechpenakis

, Adversarial Autoencoders for Anomalous Event Detection in Images, 2017.

109.

Saxena

, Crowd behavior recognition for video surveillance, in International Conference on Advanced Concepts for Intelligent Vision Systems, 2008.

110.

Husni

and Suryana

, Crowd event detection in computer vision, in International Conference on Signal Processing Systems (ICSPS), 2010.

111.

Mehran

, Oyama

and Shah

, Abnormal crowd behavior detection using social force model, in IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2009.

112.

Junior

J.C.S.J.

, Musse

S.R.

and Jung

C.R.

, Crowd analysis using computer vision techniques, IEEE Signal Processing Magazine 27(5) (2010), 66–77.

113.

Teixeira

, Dublon

and Savvides

, A survey of human-sensing: Methods for detecting presence, count, location, track, and identity, ACM Computing Surveys 5(1) (2010), 59–69.

114.

Loy

C.C.

, Chen

, Gong

and Xiang

, Crowd counting and profiling: Methodology and evaluation, in Modeling, Simulation and Visual Analysis of Crowds: A Multidisciplinary Perspective, New York, Springer New York, 2013, pp. 347–382.

115.

Ferryman

and Ellis

A.-L.

, Performance evaluation of crowd image analysis using the PETS2009 dataset, Pattern Recognition Letters 44 (2014), 3–15.

116.

, Chang

, Wang

, Ni

, Hong

and Yan

, Crowded scene analysis: A survey, IEEE Transactions on Circuits and Systems for Video Technology 25(3) (2015), 367–386.

117.

Ryan

, Denman

, Sridharan

and Fookes

, An evaluation of crowd counting methods, features and regression models, Computer Vision and Image Understanding 130 (2015), 1–17.

118.

Saleh

S.A.M.

, Suandi

S.A.

and Ibrahim

, Recent survey on crowd density estimation and counting for visual surveillance, Engineering Applications of Artificial Intelligence 41 (2015), 103–114.

119.

Grant

J.M.

and Flynn

P.J.

, Crowd scene understanding from video: a survey, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 13(2) (2017), 1–23.

120.

Sindagi

V.A.

and Patel

V.M.

, A survey of recent advances in cnn-based single image crowd counting and density estimation, Pattern Recognition Letters 107 (2018), 3–16.

121.

Hughes

R.L.

, The flow of human crowds, Annual Review of Fluid Mechanics 35(1) (2003), 169–182.

122.

Leggett

, Real-time crowd simulation: A review, 2004.

123.

Fisher

, The perfect swarm: The science of complexity in everyday life, Basic Books, 2009.

124.

Moore

B.E.

, Ali

, Mehran

and Shah

, Visual crowd surveillance through a hydrodynamics lens, Commun ACM 54(12) (2011), 64–73.

125.

Sjarif

N.N.A.

, Shamsuddin

S.M.

and Hashim

S.Z.

, Detection of abnormal behaviors in crowd scene: a review, Int J Advance Soft Comput Appl 4(1) (2012), 1–33.

126.

Thida

, Yong

, Climent-Pérez

, Eng

and Remagnino

, A literature review on video analytics of crowded scenes, in Intelligent Multimedia Surveillance, Springer Berlin Heidelberg, 2013, pp. 17–36.

127.

, Chug

and Sethi

, A review of physics-based methods for group and crowd analysis in computer vision, Journal of Postdoctoral Research 1(1) (2013), 4–7.

128.

Perez

, Kot

A.C.

and Rocha

, Detection of Real-world Fights in SurveillanceVideos, in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

129.

Karpathy

, Toderici

, Shetty

, Leung

, Sukthankar

and Fei-Fei

, Large-scale video classification with convolutional neural networks, in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014.

130.

Ding

, Fan

, Zhu

, Feng

and Jia

, Violence detection in video by using 3D convolutional neural networks, in International Symposium on Visual Computing, 2014.

131.

Sumon

S.A.

, Shahria

M.T.

, Goni

M.R.

, Hasan

, Almarufuzzaman

and Rahman

R.M.

, Violent Crowd Flow Detection Using Deep Learning, in Asian Conference on Intelligent Information and Database Systems, 2019.

132.

Song

, Zhang

, Zhao

, Yu

, Zheng

and Wang

, A Novel Violent Video Detection Scheme Based on Modified 3D Convolutional Neural Networks, IEEE Access 7 (2019), 39172–39179.

133.

, Wu

, Wang

and Wang

, Violent Video Classification Based on Spatial-Temporal Cues Using Deep Learning, in 2018 11th International Symposium on Computational Intelligence and Design (ISCID), 2018.

134.

Zhou

, Ding

, Luo

and Hou

, Violent interaction detection in video based on deep learning, Journal of Physics: Conference Series 844 (2017), 012044.

135.

, Zhu

, Chen

, Pan

, Li

and Wang

, End-to-end Multiplayer Violence Detection based on Deep 3D CNN, in Proceedings of the 2018 VII International Conference on Network, Communication and Computing, 2018.

136.

Tran

, Bourdev

, Fergus

, Torresani

and Paluri

, Learning spatiotemporal features with 3d convolutional networks, in Proceedings of the IEEE International Conference on Computer Vision, 2015.

137.

Baba

, Gui

, Cernazanu

and Pescaru

, A sensor network approach for violence detection in smart cities using deep learning, Sensors 19 (2019), 1676.

138.

Ullah

F.U.M.

, Ullah

, Muhammad

, Haq

I.U.

and Baik

S.W.

, Violence detection using spatiotemporal features with 3D convolutional neural network, Sensors 19(11) (2019), 2472.

139.

Dai

, Zhao

R.-W.

, Wu

, Wang

, Gu

, Wu

and Jiang

Y.-G.

, Fudan-Huawei at MediaEval 2015: Detecting Violent Scenes and Affective Impact in Movies with Deep Learning, in MediaEval, 2015.

140.

Dong

, Qin

and Wang

, Multi-stream deep networks for person to person violence detection in videos, in Chinese Conference on Pattern Recognition, 2016.

141.

Sudhakaran

and Lanz

, Learning to detect violent videos using convolutional long short-term memory, in 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2017.

142.

, Cao

and Jin

, Violent scene detection using convolutional neural networks and deep audio features, in Chinese Conference on Pattern Recognition, 2016.

143.

Mohammadi

, Perina

, Kiani

and Murino

, Angry crowds: Detecting violent events in videos, in European Conference on Computer Vision, 2016.

144.

Marsden

, McGuinness

, Little

and O’Connor

N.E.

, Resnetcrowd: A residual deep learning architecture for crowd counting, violent behaviour detection and crowd density level classification, in 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2017.

145.

Fenil

, Manogaran

, Vivekananda

, Thanjaivadivel

, Jeeva

, Ahilan

, and others, Real time violence detection framework for football stadium comprising of big data analysis and deep learning through bidirectional LSTM, Computer Networks 151 (2019), 191–200.

146.

Serrano

, Deniz

, Espinosa-Aranda

J.L.

and Bueno

, Fight recognition in video using hough forests and 2D convolutional neural network, IEEE Transactions on Image Processing 27(10) (2018), 4787–4797.

147.

Zhou

, Zhu

and Yahya

, Violence Behavior Detection Based on 3D-CNN, Computer Systems & Applications 12 (2017), 34.

148.

Mukherjee

, Saini

, Kumar

, Roy

P.P.

, Dogra

D.P.

, Kim

B.-G.

, and others, Fight detection in hockey videos using deep network, Journal of Multimedia Information System 4(4) (2017), 225–232.

149.

Nova

, Ferreira

and Cortez

, A Machine Learning Approach to Detect Violent Behaviour from Video, in International Conference on Intelligent Technologies for Interactive Entertainment, 2018.

150.

Coppola

, Faria

D.R.

, Nunes

and Bellotto

, Social activity recognition based on probabilistic merging of skeleton features with proximity priors from rgb-d data, in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016.

151.

Mandal

, Fajtl

, Argyriou

, Monekosso

and Remagnino

, Deep residual networkwith subclass discriminant analysis for crowd behavior recognition, in 2018 25th IEEE International Conference on Image Processing (ICIP), 2018.

152.

Ammar

, Anjum

, Rounak

and Islam

, Touhidul and others, Using deep learning algorithms to detect violent activities, 2019.

153.

Meng

, Yuan

and Li

, Trajectory-Pooled Deep Convolutional Networks for Violence Detection in Videos, in International Conference on Computer Vision Systems, 2017.

154.

Zhuang

, Ye

and Hua

K.A.

, Convolutional DLSTM for crowd scene understanding, in 2017 IEEE International Symposium on Multimedia (ISM), 2017.

155.

Blunsden

and Fisher

R.B.

, The BEHAVE video dataset: ground truthed video for multi-person behavior classification, Annals of the BMVA 4 (2010), 1–12.

156.

Elesawy

, Hussein

and MIna

A.E.M.

, https://www.kaggle.com/mohamedmustafa/real-life-violence-situations-dataset, [Online].

157.

Nievas

E.B.

, Suarez

O.D.

, Garcí a

G.B.

and Sukthankar

, Violence detection in video using computer vision techniques, in International conference on Computer analysis of images and patterns, 2011.

158.

Soomro

, Zamir

A.R.

and Shah

, UCF101: A dataset of 101 human actions classes from videos in the wild, arXiv preprint arXiv:1212.0402, 2012.

159.

Hassner

, Itcher

and Kliper-Gross

, Violent flows: Real-time detection of violent crowd behavior, in 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2012.

160.

Demarty

C.-H.

, Ionescu

, Jiang

Y.-G.

, Quang

V.L.

, Schedl

and Penet

, Benchmarking violent scenes detection in movies, in 2014 12th International Workshop on Content-Based Multimedia Indexing (CBMI), 2014.

161.

Shao

, Change

L.C.

and Wang

, Scene-independent group profiling in crowd, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.

162.

Demarty

C.-H.

, Penet

, Soleymani

and Gravier

, VSD, a public dataset for the detection of violent scenes in movies: design, annotation, analysis and evaluation, Multimedia Tools and Applications 74(17) (2015), 7379–7404.

163.

Rota

, Conci

, Sebe

and Rehg

J.M.

, Real-life violent social interaction detection, in 2015 IEEE International Conference on Image Processing (ICIP), 2015, pp. 3456–3460.

164.

Patino

, Cane

, Vallee

and Ferryman

, Pets 2016: Dataset and challenge, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016.

165.

Keçeli

and Kaya

, Violent activity detection with transfer learning method, Electronics Letters 53(15) (2017), 1047–1048.

166.

Hanson

, Pnvr

, Krishnagopal

and Davis

, Bidirectional Convolutional LSTM for the Detection of Violence in Videos, in Proceedings of the European Conference on Computer Vision (ECCV), 2018.

167.

Sreenu

and Durai

M.S.

, Intelligent video surveillance: a review through deep learning techniques for crowd analysis, Journal of Big Data 6(1) (2019), 48.