Deep Vigilante: A deep learning network for real-world crime detection

Abstract

This paper explores the recognition of assault, fighting, shooting, and vandalism in video sequences using deep 2D and 3D convolutional neural networks (CNNs). The recent wave of extensive, unrestricted urbanization has raised the standard of living, but it has also threatened public safety and contributed to an extraordinary rise in the crime rate. Although closed-circuit television (CCTV) footage provides a monitoring framework, it is of little use without an automatic volume-crime detection system. The system proposed in this work is an effort to curb volume crimes through accurate detection in real time. First, a fine-grained annotated dataset including instance and activity information has been developed for real-world volume crimes. Second, 3D CNN and 2D CNN networks are compared on the task of identifying the malicious event in a video sequence, in order to explore the significance of the spatial and temporal information present in the video for event recognition. It has been observed that the 2D CNN, even with fewer parameters, achieved a promising classification accuracy of 91.2% and an area under the curve (AUC) of 95.2% on four classes. The system also reduces the false alarm rate in comparison to state-of-the-art approaches.

Keywords

Convolutional neural network; Spatio-temporal features; Malicious activity detection; Deep learning

1 Introduction

Global urbanization is advancing at a rapid pace due to the steady migration of people from rural to urban areas. The urban population was one third of the global population in 1964, had increased to 54% by 2014, and is expected to reach two thirds of the global population by 2050. Even though urbanization has brought significant economic and social transformation, it has come at the expense of certain unanticipated consequences, e.g. inadequate housing, water and sanitation, transport, health care services, and safety and security [36].

Among all these repercussions, the security of citizens is considered especially significant in shaping society. Safer countries attract considerably more progressive economies. To ensure citizen security, various initiatives have been taken to mitigate crime, usually through the establishment of efficient security measures. Recently, the safe-city concept has been introduced to maintain law and order throughout urban areas. To cover the entire locality, CCTV cameras are installed at key locations across the city. To reduce the impact of crimes recorded by CCTV cameras, timely detection of the activity is needed so that the concerned authorities can act promptly. The network of CCTV cameras is monitored through a central control room operated round the clock by human observers. However, a large expert task force is required to monitor these hundreds of video streams. Moreover, the probability of detecting anomalous activities decreases as the number of video streams and the duration of attention increase. According to [1], an operator can efficiently monitor a video stream for about 12 minutes continuously, after which he or she may miss up to 45% of screen activity. After 22 minutes, this miss rate may rise to 95%. Thus, to improve the efficiency of surveillance systems, an autonomous technological solution is required to continuously analyze the CCTV recordings.

Researchers have recently begun to explore abnormal behavior detection extensively. Initially, various definitions of abnormality were proposed through expert systems to address the solution-oriented technicalities of the problem. Amraee et al. identified abnormal events as the presence of prohibited entities (car, bicycle, skateboard) in public walkways [2, 8]. Liu et al. considered throwing objects and walking in the wrong direction as abnormal behavior [7, 24]. Furthermore, [38–40] recognized running, chasing, and loitering as abnormal events, while [30, 46] focused on crowd detection and considered the presence of a crowd in a scene as an abnormality. These actions are normal daily activities; they are considered anomalous only when performed in prohibited premises. In other words, they are identified as abnormal by the context in which they are performed. For instance, bicycling is a normal action unless it is performed on a public walkway. Similarly, while running in a playground is normal, it is a serious offense on a parade ground. To make a city surveillance system more effective, it is desirable to detect real-world abnormalities.

Considering the importance of detecting real-world anomalies, Nieto et al. treated violence in a crowd and purse snatching as abnormalities [33]. This study targeted crowd behavior analysis only. Another group of researchers focused on detecting person-on-person attacks [46] as anomalous behavior. That approach is suitable only for detecting fights between two persons and cannot be generalized to different types of fights among a group of individuals. Similarly, [47] identified abuse, accident, and fighting as abnormalities. Although the authors tried to cover multiple abnormalities, they overlooked the fact that an accident detection system needs to be deployed on highways, whereas fighting and abuse are the activities least expected on highways. A unified system for detecting such abnormalities therefore has limited utility in terms of deployment. The largest number of real-world anomalies is reported in [21, 44], which list 13 real-world anomalies: arrest, arson, assault, fighting, shooting, vandalism, burglary, accident, explosion, abuse, robbery, stealing, and shoplifting. This setting contains a very diverse set of classes and failed to achieve a promising detection accuracy. In this paper, we address a subset of volume crimes (assault, fighting, shooting, vandalism) that pose an obvious threat to the life and property of the population. Some examples are illustrated in Fig. 1. These are common abnormalities that occur in the streets of an urban area. A unified system is desired to address the selected subset of volume crimes.

Fig. 1

Sequences of frames taken from video streams showing different malicious events, e.g. assault, fighting, shooting, and vandalism.

Various approaches have been used to develop systems for automatic detection of abnormal behavior in CCTV recordings. The initial studies in abnormal event detection focused on object tracking [4, 50], where a moving object is considered abnormal if its trajectory does not follow the model fitted during the training period. Trajectory analysis can perform well for an individual moving object in a scene but is less effective for complex and crowded scenes, and for tracking the motion of abnormal shapes. Handcrafted feature extraction techniques have also been exploited for anomaly detection [49, 51]. The fundamental problem with this approach is the selection of effective features, which was resolved through deep features by Gong et al. [15], who used unsupervised deep learning-based features for anomaly detection. In unsupervised learning, deviation from the normal is usually considered an anomaly; however, because only a fine line separates normal from abnormal behavior, this assumption results in a large number of false alarms. The strongest approach used so far is supervised deep learning, in which various labeled datasets are used for detection of a particular group of activities [47]. The latest approach, used by Sultani et al. [44], is Multiple Instance Learning (MIL) for real-world anomaly detection. They introduced an assorted dataset of 13 real-world malicious activities and achieved a classification accuracy of 28%. The dataset contains a very diverse set of classes. However, class labels are assigned to whole videos, while only a part of each video contains the actual event. This causes MIL to perform poorly, leading to low classification accuracy.

The work presented here provides a properly labeled dataset in which each video instance is labeled according to the type of activity present in it, making it suitable for supervised learning. In addition to properly arranging the training dataset, the concepts of intermediate frame fusion and early frame fusion for spatio-temporal analysis of videos have been adopted and deployed using basic 2D and 3D CNNs.

The contributions of this research are threefold:

Development of dataset for malicious instance recognition problem.

3D-CNN and 2D-CNN based models developed for recognition of anomalous instances in real-time CCTV recording.

Ablative study to compare the efficiency of proposed models for optimal performance.

The best-performing model achieved a promising accuracy of 91.2%. The models are also capable of processing more than 1000 frames per second and could therefore be readily employed in real-time applications.

The remainder of this paper is organized as follows: Section 2 reviews the related work in anomalous activity detection. Section 3 presents the architectures of the proposed models, including the problem formulation and implementation details. Section 4 explains the experimental setup, including the dataset, while Section 5 provides the results and analysis. Section 6 concludes the paper.

2 Related work

In this section, we discuss various approaches used in the past for malicious activity detection using spatio-temporal CNNs. Generally, malicious activities are anomalous parts of the video sequence; however, the literature shows that the maliciousness of an activity in a given video sequence has been specified contextually, according to the target application. According to [50], the definition of malicious activity changes with the scenario. Differentiating the malicious event from the rest of the video recording at a given instant is a complicated task. Some of the malicious events identified in CCTV footage are shown in Fig. 1. The following subsections present an overview of the approaches used for the detection of context-sensitive anomalies and established anomalies.

2.1 Context sensitive anomalies

Object tracking as abnormal motion. Motion in restricted areas, running, and loitering are often considered anomalies in a sensitive environment. Such movements in a video sequence are detected by various trajectory analysis techniques [19, 50]. Calderara et al. treated inter-node transition patterns in a graph as trajectories for abnormal motion detection [4], while Morris et al. found the interesting nodes using a Gaussian Mixture Model followed by a Hidden Markov Model for the same purpose [29]. Moreover, techniques based on low-level local features have been used for detection of abnormal motion [3, 51]. Ermis et al. constructed a probabilistic model for abnormal motion detection by generating behaviour clusters derived from behaviour profiles [12]. Reddy et al. exploited ground truth segmentation in combination with motion and size features modeled by kernel density estimation [41]; this approach is reported to be effective in detecting abnormal objects in crowded scenes. Xiao et al. used a hybrid combination of sparse semi-nonnegative matrix factorization (SSMF) and a histogram of nonnegative coefficients (HNC) for anomaly detection in surveillance videos [51]. In their work, only normal data is used for parameter tuning, and deviation from normal motion is considered an anomaly. Li et al. [22] proposed a spatio-temporal model for anomaly detection in complex and crowded scenes, in which a dynamic texture model accounts for both dynamics and appearance information. In this model, the spatial saliency score is computed using a center-surround discriminant, whereas the temporal saliency score is produced using a model of normal behaviour learned from data. Although all of these techniques performed well for abnormal motion detection, such as running in a scene and walking in the wrong direction, they are specifically designed for tracking objects in image sequences.

Falling object as abnormal motion. A more complicated scenario, concerning the safety of elderly citizens, is tracking human falls. Autonomous systems have been developed for timely detection of human falls in hospitals, homes, and old-age facilities. Based on the type of hardware used, fall detection methods can be broadly divided into two classes: wearable-sensor-based systems and ambient-sensor (camera) based systems. Wearable-sensor-based systems are not scalable, as they need sensors (accelerometer, barometric pressure sensor, tilt switch, and gyroscope) to be installed on the body of each individual [18, 35]. Given the scalability of ambient-sensor-based systems, various approaches have been used for human fall detection from videos, including handcrafted-feature-based techniques and deep neural frameworks [26, 53]. Zhang et al. proposed using the deformation of the joints and the subject height as features, classified by k-nearest neighbours (KNN) in combination with a support vector machine (SVM) [53]. Similarly, Ma et al. used a bag of curvature scale space features of the human silhouette followed by an extreme learning machine (ELM) [27]. Stone et al. proposed a technique based on prominent features (vertical velocity, vertical acceleration, frame-to-frame vertical velocity) and an ensemble of decision trees [42]. Furthermore, to address the choice of handcrafted features, Lu et al. [26] proposed the use of spatio-temporal features learned in a training process, followed by an LSTM. Similarly, Chen et al. [6] proposed a framework for fall detection in complex environments, which uses Mask R-CNN to extract moving objects and a bidirectional LSTM for fall event detection. All these approaches have been specifically designed for fall detection, which has important uses in indoor situations, e.g. smart homes, old-age homes, and patient monitoring. However, they contribute little towards detecting outdoor activities considered anomalous.

Violence detection. Violence, whether detected indoors or outdoors, is generally regarded as malicious and is mostly harmful. Violence may be verbal or physical; it can therefore be detected through audio features, visual features, or both. Jeho et al. proposed the use of visual features for detection of flame, blood, and explosion to characterize violent and non-violent scenes [31]. Others have used low-level visual and auditory features and high-level audio effects for detecting violence in movies [13, 52]. Karan et al. considered person-on-person attacks and crazy crowds as violence [14, 43]. Various approaches based on low-level features, trajectory tracking, and deep neural networks have been used for violence detection. Nguyen et al. used motion and limb orientation for trajectory tracking [32]. A hierarchical hidden Markov model (HMM) and a violent flow descriptor in combination with a support vector machine (SVM) have also been used for violence detection [17, 28]. Furthermore, [5, 34] proposed various spatio-temporal descriptors (space-time interest points, scale-invariant feature transform (SIFT), and motion SIFT) for effective detection of violent activities. Football hooliganism has been detected using deep learning through a bidirectional LSTM [14]. Ullah et al. [46] proposed a spatio-temporal convolutional neural network for detecting violence in crowded scenes as well as person-on-person attacks. These approaches are limited to detecting violence in crowds and person-on-person attacks. To generalize the detection process, researchers turned to abnormal human behaviour detection.

2.2 Established anomalies

Anomaly detection has remained in focus over the last decade as a means of detecting abnormal human behaviour in surveillance videos. A group of researchers working on the safety of pedestrian walkways focused on the detection of non-pedestrian entities in public walkways [22, 45]. Li et al. [22] proposed the use of spatio-temporal information, with a center-surround discriminant saliency detector and a normal behaviour model extracting the spatial and temporal saliency scores, respectively, for the categorization of pedestrian abnormalities. Tahboub et al. [45] used local binary patterns in combination with a random forest for detecting pedestrian anomalies. Ravan et al. [39] used a generative adversarial network to learn the normal pattern of public walkways; deviation from the learned pattern is considered an abnormality. Ameer et al. [2] proposed a combination of connected component analysis, histogram of oriented gradients, and a Gaussian mixture model for non-pedestrian object detection. Khan et al. [21] used a Gaussian discriminant model in combination with k-means clustering for classification of events recorded in surveillance videos of pedestrian walkways.

Various applications of anomaly detection have previously been validated with the benchmark datasets listed in Table 1. These include the prominent UCSD (PED1, PED2), Avenue, Subway Entrance, Subway Exit, and UMN datasets. Most of these datasets were developed for identifying a particular set of anomalies in specific environments: UCSD is used to identify non-pedestrian entities and walking across walkways; the Subway datasets are used to identify walking in the wrong direction, non-payment, and loitering; and the Avenue, UMN, Hockey Fight, and The Movies datasets have been used for identifying a single activity, mostly fighting. Moreover, these datasets were developed with the help of actors and do not contain real-life situations in any of their videos [44].

Table 1
Existing Datasets for Malicious Activity Detection

Name of the Dataset	Total Videos	Environmental Conditions	Identified Anomalous Activity
UCSD PED1	170	Outdoor	Non-pedestrian entities and walking across walkways
UCSD PED2	28	Outdoor	Non-pedestrian entities and walking across walkways
Subway Entrance	1	Indoor	Wrong direction, no payment, and loitering
Subway Exit	1	Indoor	Wrong direction, no payment, and loitering
Avenue	37	Outdoor	Running and throwing objects
UMN	5	Indoor, Outdoor	Running
Hockey Fight	1000	Playground	Fighting
The Movies	200	Indoor, Outdoor	Fighting
UCF Crimes	1900	Indoor, Outdoor	Abuse, Accident, Arrest, Burglary, Explosion, Fighting, etc.

We can conclude that the approaches discussed in the previous sections mainly targeted contextual anomalies and were tested on staged datasets in which anomalous scenes are performed by actors.

Considering the importance of detecting real-world anomalies recorded in real-time CCTV footage to assist security agencies, Sultani [44] introduced the UCF Crimes dataset incorporating 13 real-world anomalies and proposed multiple instance learning (MIL) for the classification of abnormal activities. The dataset was developed by downloading videos of CCTV recordings from LiveLeak and YouTube. Each video is labeled according to the type of anomaly recorded in it. According to Sultani, the dataset is weakly annotated. The proposed algorithm achieved a classification accuracy of 28%; thus, a state-of-the-art solution for detecting real-world abnormalities in CCTV recordings remains an open problem.

2.3 Spatio-temporal analysis

Spatio-temporal analysis is generally used to identify a relationship between spatial and temporal data that affects the performance of a process. It defines a relation between GPS coordinates and their time instances for activity recognition in [48], and a relation between location and time of day for prediction of criminal activity in [11], while in [54] it links the spatial information of each frame of a video sequence to its temporal distribution. Extracting useful information from a video sequence relies not only on the visual information spread spatially in each static frame but also on the complex motion information distributed along the continuous sequence of frames. Previously, hand-crafted features were used for obtaining appearance and motion information from video streams [20, 22]. However, these hand-crafted features contain less discriminative information, and deep supervised and unsupervised features have more recently been employed for different applications [44]. Going further, spatio-temporal convolutional neural networks (CNNs) were introduced to learn appearance as well as motion information in video streams [54]; these had previously gained popularity in action recognition [23] and hand gesture recognition [16].

Inspired by the popularity of spatio-temporal CNNs, we propose in this research the use of a similar CNN modified for malicious activity detection.

In this research, we have developed a dataset specifically to identify malicious events in a video stream. For this purpose, videos are taken from the UCF Crimes dataset and annotated for the instance recognition task. A convolutional neural network is then used to classify the respective abnormalities. Since colour information plays the least significant role in instance recognition, we have employed gray-scale versions of the videos with 2D-CNN networks and achieved better results.

3 Proposed framework

Instance recognition in videos demands analysis of the information spread across the spatial and temporal domains of the video sequence. We can acquire semantic information about the scene from the spatially distributed objects in a single frame, while a sequence of consecutive frames provides the positional changes of objects, enabling us to understand the overall activity in the video stream. To perform this task with CNNs, we have designed a framework that takes in samples of the video stream and outputs the activity performed in each sample. The architecture is divided into the modular phases of video pre-processing, feature extraction, and instance recognition, and is presented in Fig. 2.

Fig. 2

Overall architectural diagram of the framework developed for malicious instance recognition.

3.1 Video pre-processing

Videos consist of a sequence of still image frames. To let a convolutional neural network interpret the useful information in these videos, we divide them into blocks of l images. These blocks usually consist of frames that are semantically similar in the context of a single scene in the video. We have used l = 64 in this work and further down-sampled each block by a factor of 2 to avoid redundant information. Moreover, for an input video with a spatial resolution of 320 × 240 × 3, we have resized the frames of each block to 112 × 112 × 3, where the third dimension denotes the number of channels in each frame. The whole process of block formation can be expressed mathematically by Equations 1 and 2. Considering the sequence of discrete frames from the video expressed as X [n], we can present the kth block of frames as:

$S_{k} = X [n] (u (n - 64 k) - u (n - 64 (k + 1)))$ (1)

Similarly, the kth block after down-sampling is shown as S [n] below:

$S [n] = \sum_{k = 0}^{31} X [k] \delta [n - 2 k]$ (2)

where δ (.) and u (.) in the above equations denote the unit impulse and unit step functions, respectively. This procedure is illustrated graphically in Fig. 3.

Fig. 3

Video Preprocessor Block.
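As an illustration of the pre-processing described in Section 3.1, the following NumPy sketch forms blocks of 64 frames, keeps every second frame, and resizes each frame. It is a minimal sketch rather than the authors' implementation: the function names, the dropping of incomplete trailing blocks, the reading of down-sampling as frame decimation, and the nearest-neighbour resize are all assumptions made for illustration.

```python
import numpy as np

BLOCK_LEN = 64  # l = 64 frames per block (from the text)
STEP = 2        # temporal down-sampling factor (from the text)


def split_into_blocks(frames: np.ndarray, block_len: int = BLOCK_LEN) -> np.ndarray:
    """Cut a video of shape (T, H, W, C) into non-overlapping blocks.

    Trailing frames that do not fill a whole block are dropped (an assumption).
    """
    n_blocks = frames.shape[0] // block_len
    usable = frames[: n_blocks * block_len]
    return usable.reshape(n_blocks, block_len, *frames.shape[1:])


def downsample_block(block: np.ndarray, step: int = STEP) -> np.ndarray:
    """Keep every `step`-th frame of a block, e.g. 64 frames -> 32 frames."""
    return block[::step]


def resize_nearest(frame: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbour resize of an (H, W, C) frame.

    This is a stand-in for whatever library resize was actually used.
    """
    rows = np.arange(out_h) * frame.shape[0] // out_h
    cols = np.arange(out_w) * frame.shape[1] // out_w
    return frame[rows][:, cols]


def preprocess(frames: np.ndarray, size: int) -> np.ndarray:
    """Return blocks of shape (n_blocks, BLOCK_LEN // STEP, size, size, C)."""
    blocks = split_into_blocks(frames)
    resized = [
        [resize_nearest(f, size, size) for f in downsample_block(b)]
        for b in blocks
    ]
    return np.asarray(resized)
```

Calling `preprocess(video, 112)` on a 240 × 320 RGB video yields blocks of 32 frames at 112 × 112 × 3, which matches the input dimensions quoted in Section 3.2.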

3.2 Feature extraction

Deep CNNs have been used extensively for feature extraction in various image domains. They have been shown to extract much more representative features from images compared to previous hand-crafted approaches, which relied mostly on local features. We have explored two types of deep learning architectures for extracting features from a given block of the video stream. Both networks take a block of video and extract a one-dimensional feature vector.

3D CNN based Feature Extractor. We believe that the motion information of objects is as important for instance recognition as the spatial distribution of objects in a given frame. For this purpose, we developed a CNN comprising 3D convolution layers that can learn spatial as well as temporal features from a given block of the video sequence. The proposed model is obtained by removing a few convolution layers from the standard C3D network to reduce network complexity. Our 3D CNN feature extractor uses five tiers of 3D convolution layers followed by two fully connected layers to learn a one-dimensional feature vector. Each 3D convolution layer is followed by a max-pooling layer with stride 2 × 2 × 2 to transform the object and motion information from the spatial and temporal dimensions to depth. This transformation leaves us with a feature map of size 4 × 4 × 2 × 256. 3D CNNs have recently gained popularity in action recognition [9, 25]; inspired by their performance in that field, we developed the model shown in Fig. 5.

Fig. 5

3D Convolutional Neural Network.

For a block S and a 3D filter h of dimension L × M × N, the 3D convolution is computed as:

$Y [x, y, z] = \sum_{i = 0}^{L - 1} \sum_{j = 0}^{M - 1} \sum_{k = 0}^{N - 1} h [i, j, k] S [x - i, y - j, z - k]$ (3)
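Equation 3 can be evaluated directly with nested loops. The sketch below does so on the region where the filter fully overlaps the block; the function name `conv3d_valid` and the restriction to that "valid" region are assumptions made for illustration, and the loop form is chosen for readability rather than speed.

```python
import numpy as np


def conv3d_valid(S: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Evaluate Equation 3 wherever the filter h fully overlaps the block S."""
    L, M, N = h.shape
    X, Y, Z = S.shape
    out = np.zeros((X - L + 1, Y - M + 1, Z - N + 1))
    # Convolution indexes S at (x - i, y - j, z - k), which is equivalent to
    # sliding the filter flipped along every axis.
    h_flipped = h[::-1, ::-1, ::-1]
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                patch = S[x:x + L, y:y + M, z:z + N]
                out[x, y, z] = np.sum(patch * h_flipped)
    return out
```

Convolving a unit impulse with a filter returns the filter itself, which gives a quick sanity check. Note that convolution layers in deep learning frameworks typically compute cross-correlation (no flip); since the filter weights are learned, the two conventions are interchangeable in practice.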

2D CNN based Feature Extractor. The 3D-CNN-based system has 24,109,437 network parameters. To reduce this number, the dimensions of the input block are reduced from (112 × 112 × 3 × 32) to (112 × 112 × 32) by converting each frame to gray-scale. Conversion to gray-scale does not affect system performance for activity detection because the detection procedure is not sensitive to the color tone of the video frames. The gray-scale block is then fed to a 2D CNN with the same number of convolution layers and dense layers, which reduces the network parameters to 13,722,437. A block diagram of the proposed 2D-CNN-based system is given in Fig. 4. In the figure, the last layer represents the feature vector. The feature vector is then fed to two fully connected layers and an output layer. The number of neurons in the fully connected layers and the dense layer is the same as in the 3D-CNN. Let S be the block of 32 gray-scale frames of a video, S_k the kth frame of block S, and h the 2D filter of dimension L × M. The mathematical process for considering temporal as well as spatial information is shown in Equation 4.

$Y [x, y] = \sum_{k = 0}^{31} \sum_{i = 0}^{L - 1} \sum_{j = 0}^{M - 1} h [i, j] S_{k} [x - i, y - j]$ (4)

Fig. 4

2D Convolutional Neural Network.
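The input reduction from (112 × 112 × 3 × 32) to (112 × 112 × 32) can be sketched as follows. The paper does not state which gray-scale conversion was used, so the ITU-R BT.601 luma weights, the frames-first input layout, and the function name `to_gray_block` are assumptions made for illustration.

```python
import numpy as np

# ITU-R BT.601 luma weights (an assumed choice; the paper does not specify one).
LUMA = np.array([0.299, 0.587, 0.114], dtype=np.float32)


def to_gray_block(rgb_block: np.ndarray) -> np.ndarray:
    """Convert an RGB block (frames, H, W, 3) to a gray block (H, W, frames).

    Each gray-scale frame becomes one input channel of the 2D CNN, so the
    temporal axis takes the place of the color axis.
    """
    gray = rgb_block.astype(np.float32) @ LUMA  # shape (frames, H, W)
    return np.moveaxis(gray, 0, -1)             # shape (H, W, frames)
```

For a block of 32 RGB frames at 112 × 112, the result has shape (112, 112, 32) and holds one third as many values as the original block.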

3.3 Instance recognition

The objective of this research is to assign a combination of frames in a video stream to one of the categories of volume crime mentioned earlier. For this purpose, the block features acquired in Section 3.2 are classified using various classification algorithms, including Naive Bayes, Decision Tree, Support Vector Machine (SVM), k-Nearest Neighbour, and Softmax. All these classifiers are developed to address both instance detection (binary classification) and instance classification (multi-class classification).
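To make the feature-then-classifier arrangement concrete, the sketch below applies a k-nearest-neighbour rule to feature vectors. It is only an illustration: the value of k, the Euclidean distance, and the function name `knn_predict` are assumptions, since the paper does not report these settings, and the toy features stand in for the CNN feature vectors.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training features."""
    # Rank training samples by Euclidean distance to the query feature vector.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D features standing in for CNN feature vectors (hypothetical data).
features = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = [0, 0, 0, 1, 1, 1]  # 0 = normal, 1 = anomalous
prediction = knn_predict(features, labels, (0.2, 0.1))
```

The same function serves both tasks: with labels in {0, 1} it performs detection, and with labels in {0, ..., 4} it performs classification.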

Among these, the Softmax classifier has been designed with softmax activation, optimizing the binary cross-entropy loss for detection and the categorical cross-entropy loss for classification. These losses are expressed mathematically as:

$L_{b} = - \frac{1}{N} \sum_{i = 1}^{N} \left[ t_{i} \log (p_{i}) + (1 - t_{i}) \log (1 - p_{i}) \right]$ (5)

$L_{c} = - \frac{1}{N} \sum_{i = 1}^{N} \sum_{j = 1}^{C} t_{ij} \log (p_{ij})$ (6)

where N represents the total number of training examples (blocks), t is the target value, p is the predicted value, and C denotes the number of classes for multi-class classification.
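The two losses can be written out in a few lines using their standard definitions. This is a minimal sketch: the function names and the small clipping constant `eps`, which guards against taking the logarithm of zero, are assumptions and are not taken from the paper.

```python
import math


def binary_cross_entropy(targets, probs, eps=1e-12):
    """Mean binary cross-entropy over N examples (detection task)."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(targets)


def categorical_cross_entropy(targets, probs, eps=1e-12):
    """Mean categorical cross-entropy over N examples and C classes.

    `targets` holds one-hot rows and `probs` holds predicted class
    probabilities, both of shape N x C.
    """
    total = 0.0
    for t_row, p_row in zip(targets, probs):
        total += sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))
    return -total / len(targets)
```

For a maximally uncertain prediction the losses reduce to log 2 in the binary case and log C in the C-class case, which is a convenient check on an implementation.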

4 Experimental setup

We have developed a unified framework for the tasks of detection and classification. In instance detection, a block is identified as either a normal or an anomalous event. This is similar to the concept introduced in [15], which considered everything that does not look normal as an anomaly. In classification, on the other hand, the specific type of volume crime is identified. Technically, detection is a binary classification task (0, 1), while classification is a multi-class task (0, 1, 2, 3, 4), performed with the same framework given in Fig. 2. Both tasks have been validated on the dataset developed specifically for malicious instance recognition.
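The relation between the two label sets can be sketched as a simple mapping. The paper gives only the integer ranges, so the assignment of integers to class names below (with 0 as normal) is an assumption made for illustration.

```python
# Assumed index-to-name mapping; the paper lists the labels 0..4 but does not
# state which integer corresponds to which class.
CLASS_NAMES = ["normal", "assault", "fighting", "shooting", "vandalism"]


def to_detection_label(class_id: int) -> int:
    """Collapse a multi-class label (0..4) into a detection label (0 or 1)."""
    if not 0 <= class_id < len(CLASS_NAMES):
        raise ValueError(f"unknown class id: {class_id}")
    return 0 if class_id == 0 else 1


detection_labels = [to_detection_label(c) for c in range(len(CLASS_NAMES))]
```

Under this mapping, every block labeled with one of the four volume crimes becomes a positive example for the detection task.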

4.1 Dataset

This research focuses on anomaly detection in a safe-city environment, which is why a subset of a very recent dataset (UCF Crimes), developed for real-world anomaly detection in surveillance videos, has been considered. The dataset consists of CCTV footage of real-world anomalies from LiveLeak and YouTube, covering 13 real-world anomalies in 1900 videos spanning 128 hours. For this research, a subset of the four most crucial anomalies (shooting, assault, fighting, and vandalism) is annotated for in-video event recognition. This is carried out by separating normal frames from those that contain anomalous activity. The process is demonstrated in Fig. 6. Previously, it was very difficult to use the videos labeled as assault directly: for example, the video labeled Assault 039 contains 43.8% frames belonging to normal activity, with the rest belonging to the actual assault. Each video in our dataset has frame-level class labels, annotated by three skilled annotators through visual inspection of each video stream.

Fig. 6

Annotation of video sample labeled as Assault in original dataset.
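Frame-level annotation of the kind described in Section 4.1 can be summarized programmatically. The sketch below computes the share of normal frames in a video and collapses the per-frame labels into contiguous segments; the label strings, the function names, and the toy label sequence are hypothetical and do not reproduce the actual annotation files.

```python
from itertools import groupby


def normal_fraction(frame_labels):
    """Fraction of frames in a video that are labeled as normal activity."""
    return sum(1 for label in frame_labels if label == "normal") / len(frame_labels)


def label_segments(frame_labels):
    """Collapse per-frame labels into (label, start, end) runs, end exclusive."""
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


# Toy annotation: 7 normal frames followed by 9 assault frames.
toy_labels = ["normal"] * 7 + ["assault"] * 9
toy_fraction = normal_fraction(toy_labels)   # 7 / 16
toy_segments = label_segments(toy_labels)
```

A video-level label would assign "assault" to all 16 frames in this toy example, whereas the frame-level segments keep the 7 normal frames separate.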

4.2 Experimentation

We have conducted experiments for detection and classification using 3D-CNN and 2D-CNN features with the various classifiers explained in Section 3.3. The hyperparameter settings for the 2D and 3D CNNs are listed in Table 2. We performed our experiments on an Intel Core i5 with 8 GB RAM and an NVIDIA GTX 1050 Ti graphics processing unit. In each experiment, the Stochastic Gradient Descent optimizer is used for learning the weights.

Table 2
Hyperparameter choices for 2D-CNN and 3D-CNN

Parameters	2D-CNN	3D-CNN
No. of Epochs	140	70
Initial Learning rate	0.001	0.001
Momentum	0.9	0.9
Kernel Size	(3,3)	(3,3,3)
Pooling window	(2,2)	(2,2,2)
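The optimizer settings in Table 2 correspond to stochastic gradient descent with momentum. The sketch below shows one common formulation of the update, using the learning rate 0.001 and momentum 0.9 from the table as defaults; the paper does not name the framework or the exact variant used, so the update form and the function name are assumptions.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.001, momentum=0.9):
    """Apply one SGD-with-momentum update and return (weights, velocity).

    velocity <- momentum * velocity - lr * grad
    weights  <- weights + velocity
    """
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Two steps on a single weight with a constant gradient of 2.0.
w, v = [1.0], [0.0]
w, v = sgd_momentum_step(w, [2.0], v)  # velocity -0.002, weight 0.998
w, v = sgd_momentum_step(w, [2.0], v)  # velocity -0.0038, weight 0.9942
```

With a constant gradient the step size grows from one update to the next, which is the accumulation effect that the momentum term provides.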

5 Results and analysis


We have conducted numerous experiments for detection and classification using features extracted from the 2D and 3D CNNs. The performance of the models is evaluated using metrics such as AUC, accuracy, and false-positive rate. This section provides a performance comparison of our proposed system. Table 3 summarizes the results.

Table 3
Comparison of the features from 2D CNN and 3D CNN with various classifiers (accuracy, %)

Classifier	2D CNN Detection	3D CNN Detection	2D CNN Classification	3D CNN Classification
Naive Bayes	89.61	89.32	86.47	87.29
Decision Tree	89.49	89.13	75.24	75.96
Support Vector Machine	88.28	88.73	75.24	75.96
k-Nearest Neighbour	89.49	90.72	84.27	84.53
Softmax Classifier	91.24	90.57	91.73	91.51
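The three metrics used in this section (accuracy, false-positive rate, and AUC) can be sketched as follows for the binary detection setting. The labels, scores, and the 0.5 decision threshold are hypothetical, and the pairwise AUC formulation is one common definition rather than necessarily the exact procedure used to produce the reported numbers.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the ground truth."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def false_positive_rate(y_true, y_pred, positive=1):
    """FP / (FP + TN): share of true negatives that were flagged as positive."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return fp / (fp + tn) if (fp + tn) else 0.0


def auc(y_true, scores, positive=1):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical detector outputs: 1 = anomalous frame, 0 = normal frame.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
fpr = false_positive_rate(y_true, y_pred)
area = auc(y_true, scores)
print(acc, fpr, area)
```

Accuracy and false-positive rate depend on the chosen threshold, whereas AUC is computed from the raw scores and is threshold-independent.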

5.1 Anomalous event detection

From the event detection results using different features, it was observed that the 2D CNN outperforms the 3D CNN, achieving an overall detection accuracy of 91.24%. Although all the classifiers performed comparably well for both types of features, the Softmax classifier outperforms the rest in detecting malicious video events, as shown in Table 3. Similar patterns have been observed in the confusion matrices and t-SNE plots of the detection process, as shown in Figs. 7 and 8.

Fig. 7

Confusion Matrix for 2D features and 3D features using Softmax classifier.

Fig. 8

t-SNE plots for 2D features and 3D features using Softmax classifier.
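For readers less familiar with the Softmax classifier referred to above, a minimal sketch of the softmax mapping from raw scores to class probabilities is given here for the two-class detection case. The class names and the example scores are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1.

    The maximum is subtracted first so that large scores do not overflow exp().
    """
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


CLASSES = ["normal", "anomalous"]

# Hypothetical raw scores produced by the last layer for one input.
logits = [0.3, 2.1]
probs = softmax(logits)
predicted = CLASSES[probs.index(max(probs))]
print(probs, predicted)
```

The predicted label is simply the class with the highest probability, and the same function applies unchanged to the five-class classification task.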

Table 6

Extended analysis on Anomalous Event Classification

Technique	Activity	Precision	Recall	F1-Score
3D CNN	Normal	0.949	0.949	0.949
3D CNN	Assault	0.857	0.929	0.892
3D CNN	Fighting	0.939	0.8549	0.895
3D CNN	Shooting	0.839	0.908	0.872
3D CNN	Vandalism	0.904	0.911	0.907
2D CNN	Normal	0.940	0.956	0.948
2D CNN	Assault	0.868	0.886	0.877
2D CNN	Fighting	0.938	0.894	0.915
2D CNN	Shooting	0.816	0.889	0.851
2D CNN	Vandalism	0.896	0.846	0.870

5.2 Anomalous event classification

For classification of an event as one of assault, fighting, shooting, vandalism, or normal, the features from the 2D and 3D CNNs were extracted in a similar manner and evaluated with the various classifiers described in Section 3.3. The Softmax classifier combined with 2D features performs much better than the rest of the classifiers, achieving an overall classification accuracy of 91.73% for this task. Although the 3D features also performed well in classification, the number of network parameters in the 2D CNN is much smaller than in the 3D CNN.
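The difference in parameter count between 2D and 3D convolutions can be illustrated with a short calculation. The kernel sizes follow Table 2, while the channel counts (3 input channels, 64 filters) are assumptions chosen for illustration and do not describe the actual layers of the proposed networks.

```python
from math import prod


def conv_params(kernel_size, in_channels, out_channels, bias=True):
    """Number of learnable parameters in one convolutional layer.

    Each filter holds prod(kernel_size) * in_channels weights, plus one
    optional bias term per filter.
    """
    weights = prod(kernel_size) * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Kernel sizes from Table 2; channel counts are hypothetical.
params_2d = conv_params((3, 3), in_channels=3, out_channels=64)
params_3d = conv_params((3, 3, 3), in_channels=3, out_channels=64)
print(params_2d, params_3d)  # 1792 5248
```

For the same channel configuration, a (3,3,3) kernel carries three times as many weights as a (3,3) kernel, which is the source of the gap mentioned above.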

5.2.1 Extended analysis on anomalous event classification

We also present the results in terms of detailed performance metrics, including Precision, Recall, and F1-Score, for each class of anomalous events. Upon observation of Table 6, we came to the conclusion that 2D-CNN performs better for each class in comparison to the 3D-CNN. A similar phenomenon could be observed in the recall score for shooting, reported as 0.846 and 0.916 for 3D-CNN and 2D-CNN, respectively.
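The per-class metrics of Table 6 follow from confusion counts in the standard way, as sketched below. The two-class count matrix is hypothetical; the final check only re-derives the F1-Score of the 3D CNN Assault row from its reported precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def per_class_metrics(confusion, labels):
    """Compute (precision, recall, F1) per class.

    confusion[i][j] is the number of samples of actual class i that were
    predicted as class j.
    """
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        predicted_k = sum(row[k] for row in confusion)  # column total
        actual_k = sum(confusion[k])                    # row total
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        metrics[label] = (precision, recall, f1_score(precision, recall))
    return metrics


# Hypothetical counts for a two-class example.
labels = ["normal", "assault"]
confusion = [[90, 10],
             [5, 45]]
metrics = per_class_metrics(confusion, labels)

# Precision 0.857 and recall 0.929 (3D CNN, Assault) give the tabulated F1.
assault_f1 = round(f1_score(0.857, 0.929), 3)
print(metrics, assault_f1)  # assault_f1 is 0.892
```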

Table 5 contains the confusion matrices, whereas Fig. 9 shows the t-SNE plots for event classification using 2D and 3D features with the Softmax classifier. Visibly separable clusters can be seen in the t-SNE plots, which supports the accuracy achieved for the classification task.

Table 5
Confusion Matrix for 2D features and 3D features using Softmax classifier

Technique	Actual Label	Predicted Normal	Predicted Assault	Predicted Fighting	Predicted Shooting	Predicted Vandalism
3D CNN	Normal	0.95	0.01	0	0.02	0.02
3D CNN	Assault	0.01	0.93	0.04	0.01	0.01
3D CNN	Fighting	0.05	0.06	0.85	0.03	0.01
3D CNN	Shooting	0.03	0.02	0.04	0.90	0.01
3D CNN	Vandalism	0.05	0.01	0.01	0.02	0.91
2D CNN	Normal	0.96	0.01	0	0.01	0.02
2D CNN	Assault	0.03	0.89	0.04	0.02	0.02
2D CNN	Fighting	0.03	0.04	0.90	0.02	0.01
2D CNN	Shooting	0.04	0.03	0.03	0.89	0.01
2D CNN	Vandalism	0.08	0.01	0.02	0.04	0.85

Fig. 9

t-SNE plots for Classification Task for 2D features and 3D features using Softmax classifier.
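The entries of Table 5 are proportions per actual class, that is, each row sums to approximately 1. A minimal sketch of how raw confusion counts are turned into such a row-normalized matrix is shown below; the counts are hypothetical and are not the counts behind Table 5.

```python
def row_normalize(counts):
    """Divide every row of a count matrix by its row total.

    Row i then holds the distribution of predictions for actual class i,
    and the diagonal entry is the recall of that class.
    """
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Hypothetical counts: rows are actual classes, columns are predicted classes.
counts = [[48, 2],
          [5, 45]]
normalized = row_normalize(counts)
recalls = [normalized[i][i] for i in range(len(normalized))]
print(normalized, recalls)
```

Because the diagonal of a row-normalized matrix is the per-class recall, the diagonal values of Table 5 can be read against the Recall column of Table 6.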

5.3 Comparison with the state-of-the-art

We have also compared our approaches with state-of-the-art techniques on the UCF crimes dataset. It is observed from Table 4 that our approach performs far better on 5 classes than the MIL and GDA techniques. Moreover, our 3D CNN achieves the lowest false positive rate of 7%.

Table 4
Comparison with state-of-the-art techniques

Technique	Accuracy	AUC	False Positive
MIL[44]	53.4%	79.2%	–
GDA[21]	–	.30%	–
VGG16	89.1%	92.4%	14%
Resnet3D	90.1%	94.4%	12%
2D CNN	91.2%	95.2%	8%
3D CNN	90.7%	94.7%	7%

It is observed that the proposed 2D model outperforms the 3D model in activity detection and achieves a false positive rate of 8% (Table 4). Furthermore, the model is suitable for real-time application due to its low false positive rate and its high frame processing rate of 1000 frames/sec.
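As a rough illustration of what the reported processing rate implies, the following back-of-the-envelope sketch converts it into a per-frame latency and a number of camera streams that could be served. The 30 frames/sec camera rate is an assumption, and video decoding and data transfer overheads are ignored.

```python
def per_frame_latency_ms(frames_per_second):
    """Average time spent on one frame, in milliseconds."""
    return 1000.0 / frames_per_second


def max_realtime_streams(model_fps, camera_fps):
    """Whole number of camera streams one model instance can keep up with."""
    return int(model_fps // camera_fps)


MODEL_FPS = 1000   # processing rate reported above
CAMERA_FPS = 30    # assumed CCTV frame rate, not stated in the paper

latency = per_frame_latency_ms(MODEL_FPS)
streams = max_realtime_streams(MODEL_FPS, CAMERA_FPS)
print(latency, streams)  # 1.0 33
```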

6 Conclusion and future work

The lack of implementable software solutions for identifying real-world malicious activities from video streams in a safe-city environment calls for a blend of computer vision and machine learning algorithms. In this regard, an optimal solution for the analysis of temporal frames extracted from CCTV recordings is proposed. Our proposed models achieve high accuracy not only for the identification of malicious events but also for the classification of real-world volume crimes, including assault, fighting, shooting, and vandalism, in a video sequence. Furthermore, our models are suitable for real-time applications due to their high frame processing rate and low false alarm rate, with a high classification accuracy of 91.2% and an AUC of 95.2% on five classes.

The system can be further extended to other classes of crime, including but not limited to burglary, riots, attempted murder, arson, explosion, robbery, theft, and arrest. To obtain a unified framework for detecting multiple malicious activities recorded by a CCTV camera, the same system needs to be trained with data for the above-mentioned events.

References

1. Ainsworth, Buyer beware, Security Oz 19 (2002), 18–26.

2. Amraee, Vafaei, Jamshidi and Adibi, Anomaly detection and localization in crowded scenes using connected component analysis, Multimedia Tools and Applications 77(12) (2018), 14767–14782.

3. Boiman and Irani, Detecting irregularities in images and in video, International Journal of Computer Vision 74(1) (2007), 17–31.

4. Calderara, Heinemann, Prati, Cucchiara and Tishby, Detecting anomalies in people’s trajectories using spectral graph analysis, Computer Vision and Image Understanding 115(8) (2011), 1099–1111.

5. Chen, Wactlar, Chen M.-y., Gao, Bharucha and Hauptmann, Recognition of aggressive human behavior using binary local motion descriptors, In 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 5238–5241. IEEE, 2008.

6. Chen, Li, Wang, Hu and Ye, Vision-based fall event detection in complex background using attention guided bi-directional lstm, IEEE Access 8 (2020), 161337–161348.

7. Cong, Yuan and Liu, Sparse reconstruction cost for abnormal event detection, In CVPR 2011, pages 3449–3456. IEEE, 2011.

8. Cong, Yuan and Tang, Video anomaly search in crowded scenes via spatio-temporal motion context, IEEE Transactions on Information Forensics and Security 8(10) (2013), 1590–1599.

9. Cronje, Human action recognition with 3D convolutional neural networks, PhD thesis, University of Cape Town, 2015.

10. De Souza F.D.M., Chavez G.C., do Valle E.A. Jr and de Araújo A.A., Violence detection in video using spatio-temporal features, In 2010 23rd SIBGRAPI Conference on Graphics, Patterns and Images, pages 224–230. IEEE, 2010.

11. Duan, Hu, Cheng, Zhu and Gao, Deep convolutional neural networks for spatiotemporal crime prediction, In Proceedings of the International Conference on Information and Knowledge Engineering (IKE), pages 61–67. The Steering Committee of The World Congress in Computer Science, Computer..., 2017.

12. Ermis E.B., Saligrama, Jodoin P.-M. and Konrad, Motion segmentation and abnormal behavior detection via behavior clustering, In 2008 15th IEEE International Conference on Image Processing, pages 769–772. IEEE, 2008.

13. Eyben, Weninger, Lehment, Schuller and Rigoll, Affective video retrieval: Violence detection in hollywood movies by large-scale segmental feature extraction, PloS One 8(12) (2013), e78506.

14. Fenil, Manogaran, Vivekananda G.N., Thanjaivadivel, Jeeva, Ahilan, et al., Real time violence detection framework for football stadium comprising of big data analysis and deep learning through bidirectional lstm, Computer Networks 151 (2019), 191–200.

15. Gong, Liu, Le, Saha Moussa, Mansour, Venkatesh and van den Hengel, Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection, arXiv preprint arXiv:1904.02639, 2019.

16. Hakim N.L., Shih T.K., Arachchi, Priyanwada, Aditya, Chen Y.-C. and Lin C.-Y., Dynamic hand gesture recognition using 3dcnn and lstm with fsm context-aware model, Sensors 19(24) (2019), 5429.

17. Hassner, Itcher and Kliper-Gross, Violent flows: Real-time detection of violent crowd behavior, In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1–6. IEEE, 2012.

18. Jahanjoo, Naderan and Rashti M.J., Detection and multi-class classification of falling in elderly people by deep belief network algorithms, Journal of Ambient Intelligence and Humanized Computing, pages 1–21, 2020.

19. Jiang, Yuan, Tsaftaris S.A. and Katsaggelos A.K., Anomalous video event detection using spatiotemporal context, Computer Vision and Image Understanding 115(3) (2011), 323–333.

20. Kaltsa, Briassouli, Kompatsiaris, Hadjileontiadis L.J. and Strintzis M.G., Swarm intelligence for detecting interesting events in crowded environments, IEEE Transactions on Image Processing 24(7) (2015), 2153–2166.

21. Khan M.U.K., Park H.-S. and Kyung C.-M., Rejecting motion outliers for efficient crowd anomaly detection, IEEE Transactions on Information Forensics and Security 14(2) (2018), 541–556.

22. , Mahadevan and Vasconcelos, Anomaly detection and localization in crowded scenes, IEEE Transactions on Pattern Analysis and Machine Intelligence 36(1) (2013), 18–32.

23. Lima, Fernandes and Barros, Human action recognition with 3d convolutional neural network, In 2017 IEEE Latin American Conference on Computational Intelligence (LA-CCI), pages 1–6. IEEE, 2017.

24. Liu, Luo, Li, Zhao and Gao, Margin learning embedded prediction for video anomaly detection with a few anomalies, In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 3023–3030. AAAI Press, 2019.

25. Liu, Zhang and Tian, 3d-based deep convolutional neural network for action recognition with depth sequences, Image and Vision Computing 55 (2016), 93–100.

26. , Wu, Feng and Song, Deep learning for fall detection: Three-dimensional cnn combined with lstm on video kinematic data, IEEE Journal of Biomedical and Health Informatics 23(1) (2018), 314–323.

27. , Wang, Xue, Zhou, Ji and Li, Depth-based human fall detection via shape features and improved extreme learning machine, IEEE Journal of Biomedical and Health Informatics 18(6) (2014), 1915–1922.

28. Mahadevan, Li, Bhalodia and Vasconcelos, Anomaly detection in crowded scenes, In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1975–1981. IEEE, 2010.

29. Morris B.T. and Trivedi M.M., Trajectory learning for activity understanding: Unsupervised, multilevel, and long-term adaptive approach, IEEE Transactions on Pattern Analysis and Machine Intelligence 33(11) (2011), 2287–2301.

30. , Cao and Jin, Violent scene detection using convolutional neural networks and deep audio features, In Chinese Conference on Pattern Recognition, pages 451–463. Springer, 2016.

31. Nam, Alghoniemy and Tewfik A.H., Audiovisual content-based violent scene characterization, In Proceedings 1998 International Conference on Image Processing, ICIP98 (Cat. No. 98CB36269), volume 1, pages 353–357. IEEE, 1998.

32. Nguyen N.T., Phung D.Q., Venkatesh and Bui, Learning and detecting activities from movement trajectories using the hierarchical hidden markov model, In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pages 955–960. IEEE, 2005.

33. Nieto, Varona, Senderos, Leskovsky and Garcia, Real-time video analytics for petty crime detection, 2016.

34. Nievas E.B., Suarez O.D., García G.B. and Sukthankar, Violence detection in video using computer vision techniques, In International conference on Computer analysis of images and patterns, pages 332–339. Springer, 2011.

35. Pannurat, Thiemjarus and Nantajeewarawat, Automatic fall monitoring: a review, Sensors 14(7) (2014), 12900–12936.

36. Pawan, Urbanization and its causes and effects: A review, International Journal of Research and Scientific Innovation 31 (2016), 110–112.

37. Penet, Demarty C.-H., Gravier and Gros, Technicolor and inria/irisa at mediaeval 2011: learning temporal modality integration with bayesian networks, 2011.

38. Rabiee, Haddadnia, Mousavi, Kalantarzadeh, Nabi and Murino, Novel dataset for fine-grained abnormal behavior understanding in crowd, In 2016 13th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 95–101. IEEE, 2016.

39. Ravanbakhsh, Nabi, Sangineto, Marcenaro, Regazzoni and Sebe, Abnormal event detection in videos using generative adversarial nets, In 2017 IEEE International Conference on Image Processing (ICIP), pages 1577–1581. IEEE, 2017.

40. Ravanbakhsh, Sangineto, Nabi and Sebe, Training adversarial discriminators for cross-channel abnormal event detection in crowds, In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1896–1904. IEEE, 2019.

41. Reddy, Sanderson and Lovell B.C., Improved anomaly detection in crowded scenes via cell-based analysis of foreground speed, size and texture, In CVPR 2011 WORKSHOPS, pages 55–61. IEEE, 2011.

42. Stone E.E. and Skubic, Fall detection in homes of older adults using the microsoft kinect, IEEE Journal of Biomedical and Health Informatics 19(1) (2014), 290–301.

43. Sudhakaran and Lanz, Learning to detect violent videos using convolutional long short-term memory, In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–6. IEEE, 2017.

44. Sultani, Chen and Shah, Real-world anomaly detection in surveillance videos, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6479–6488, 2018.

45. Tahboub, Reibman A.R. and Delp E.J., Accuracy prediction for pedestrian detection, In 2017 IEEE International Conference on Image Processing (ICIP), pages 4192–4196. IEEE, 2017.

46. Ullah F.U.M., Ullah, Muhammad, Haq I.U. and Baik S.W., Violence detection using spatiotemporal features with 3d convolutional neural network, Sensors 19(11) (2019), 2472.

47. Vilamala M.R., Hiley, Hicks, Preece and Cerutti, A pilot study on detecting violence in videos fusing proxy models, 2019.

48. Wang, Jiao, Bao, He, Liu and Liu, Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4006–4015, 2019.

49. Wang and Snoussi, Detection of abnormal visual events via global optical flow orientation histogram, IEEE Transactions on Information Forensics and Security 9(6) (2014), 988–998.

50. , Moore B.E. and Shah, Chaotic invariants of lagrangian particle trajectories for anomaly detection in crowded scenes, In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2054–2060. IEEE, 2010.

51. Xiao, Zhang and Zha, Learning to detect anomalies in surveillance video, IEEE Signal Processing Letters 22(9) (2015), 1477–1481.

52. Zhang, Yi, Wang and Yu, Mic-tju at mediaeval violent scenes detection (vsd) 2014, In MediaEval, 2014.

53. Zhang, Tian and Capezuti, Privacy preserving automatic fall detection for elderly using rgbd cameras, In International Conference on Computers for Handicapped Persons, pages 625–633. Springer, 2012.

54. Zhou, Shen, Zeng, Fang, Wei and Zhang, Spatial–temporal convolutional neural networks for anomaly detection and localization in crowded scenes, Signal Processing: Image Communication 47 (2016), 358–368.
