Abstract
The importance of surveillance is increasing every day. Surveillance is the monitoring of activities, behavior and other changing information. An intelligent automatic system that detects human behavior is therefore very important in public places. To meet this need, a framework is proposed that detects suspicious human behavior, tracks people performing unusual activities such as fighting and threatening actions, and distinguishes normal human activities from suspicious behavior. Human activity is recognized by extracting features with a convolution neural network (CNN) from the extracted optical flow slices, with the network pre-trained on real-time activities. The learned features produce a score for each input, which is used to predict the type of activity, and the activity is classified using a multi-class support vector machine (MSVM). This improved design provides a better surveillance system than existing ones. Such a system can be used in public places like shopping malls and railway stations, or in closed environments such as ATMs, where security is the prime concern. The performance of the system is evaluated on different standard datasets containing different objects, and it achieves 95% performance, as explained in the experimental analysis.
Keywords
Introduction
In recent years, the automatic detection of suspicious activity in long video streams has remained a challenging task in video surveillance. Visual surveillance systems in both the private and public sectors predominantly face problems in detecting suspicious activity. A long-duration video surveillance system has to handle the complexity of scenes and the deceptive nature of unusual activity. Recognizing such activity from video is challenging and has received increasing attention in computer vision. Analyzing human activity is not easy, because it is not just a matter of recognizing a pattern but of recognizing the motion of different parts of the body. It is nevertheless crucial for understanding human behavior, which is important in various fields such as health care and surveillance. Video processing has advanced considerably with deep learning, which makes it easier to recognize human activity from learned features.
Deep learning has become popular in the last few years for human activity detection because it can learn the features of a human action automatically and classify its type through a training process [1]. One of the popular methods in deep learning is the convolution neural network (CNN), a feed-forward artificial neural network for analyzing visual imagery. Compared with traditional feed-forward networks, a CNN has fewer connections and parameters, so it is easier to train and test with a large set of images [2]. A CNN has an input layer, an output layer and multiple hidden layers. The hidden layers include normalization, convolution and fully connected layers. Given training data, a CNN can be used for many applications. A convolution neural network usually requires a large amount of training data to avoid over-fitting. The main advantages of the CNN approach are that it recognizes high-level activities with complex structures and performs well at extracting features and classifying them. The main objective of the system is to detect and warn about abnormal or unusual activities at different locations, from small to large areas, under varying situations.
The proposed work aims to detect abnormal activities by creating an automatic system for human activity recognition from surveillance video captured in both indoor and outdoor environments. A convolution neural network is trained on the images to obtain the predicted value used in tracking suspicious activity. In preprocessing, the input videos are converted into sequences of frames, and the motion of the luminance pattern, extracted by tracking, yields optical flow slices; from each slice the region of interest (ROI) is extracted from consecutive frames. The extracted images are stored in an image datastore and collected into the training and testing data sets. The extracted region is resized and given as input to the CNN, which learns the features and predicts the labelled output. The resulting feature map is used to map the feature class based on the generated score, and the predicted values are then used to train the MSVM, which gives more accurate values than other existing classifiers. Training of the proposed model is accelerated by using a GPU setup.
Related work
In the last decade, much research has addressed human activity in computer vision. Many algorithms that detect moving objects in video do not show robust performance, which means that detection and tracking algorithms need improvement. Peipei Zhou et al. proposed violence detection for video surveillance in places such as railway stations, prisons and psychiatric centres. Their work uses a new input modality, the acceleration field, to extract motion attributes. First, RGB frames are extracted from the video [3]. Second, the optical flow field is computed from consecutive frames and the acceleration field is obtained. This method brings little improvement in performance and also reports some false activities. Tao Zhang et al. proposed the motion Weber local descriptor (WLD) with two key improvements [4]. It uses the WLD for low-level image appearance details and extends it to contiguous descriptors. It is efficient for violence detection, but the error rate is not minimized and some false alarms occur on waving flags and clapping hands. The approach of Kai-Wen Cheng et al. detects local and global anomalies: it extracts normal connections from training videos to find the geometric associations of the closest sparse interest points over spatial and temporal factors, and the high-level features for analysis are extracted by dense sampling [5]. Haidar Sharif et al. presented an entropy approach for abnormal activity detection in video. The entropy-based approach applies a statistical treatment to the spatio-temporal information of a set of interest points and computes the degree of uncertainty of both their directions and displacements, which indicates the state of abnormality [6]. An advantage of the method is that it quantifies the degree of randomness of the displacement of the interest points.
The basic drawback is that, because the experiment is based on a single fixed camera, it cannot provide two different viewpoints. Wei Niu et al. proposed human activity detection and tracking in video using a low-level motion detection algorithm [7]. The advantage of this work is that several cameras are used and the two-dimensional position estimates from the several sensors are combined. The disadvantage is that when the object is far away from the camera, the activity is not detected accurately. The work of Oluwatoyin P. Popoola et al. focuses on contextual abnormal behavior detection and proposes an anomaly detection algorithm [8]. Abnormal behavior is defined using the density of moving targets in a scene together with the characteristics of the scene. They proposed a binary classification of normal versus abnormal activities. This work fails in cluttered environments with many moving objects. Sugla Vinaayagan Rajenderana et al. proposed algorithms for detecting suspicious human movement in real-time video [9]. Human motion is extracted with the Gaussian model algorithm because it is effective against the illumination problem. Suspicious activity is recognized with a grammar-based approach. This approach achieves high precision and high accuracy in real-time performance, although the results show some inconsistency in detection due to illumination changes. The work of M. Sabokou et al. deals with the detection of abnormal crowd behavior [10]. It presents a method for anomaly identification and localization in video frames using temporal data and neural networks. This work is not tested on a real-time dataset and has been trained only on the collected data set. The work of Henri Bouma et al. focused on pickpocket detection in a crowded shopping mall [11].
The classifier is trained and evaluated with cross-validation; however, the accuracy level decreases as the cross-validation is improved. Lin Wu et al. presented a deep recurrent convolutional video-based identification system that combines convolution layers, a recurrent layer and temporal pooling to produce an improved video representation [12]. Features are summarized by temporal pooling to produce an overall representation. The spatio-temporal appearance model treats the video as a 3D volume and extracts local spatio-temporal features, which are used for person identification under cluttered backgrounds and occlusion. The system runs on a CPU, which takes more time and gives lower accuracy. P. A. Dhulekar et al. proposed motion estimation for human activity in surveillance [13]. The system contains a video acquisition device (webcam) for capturing the video, from which a sequence of frames is generated with a fixed frame value of 240. Once motion is identified, a suspicious object is detected by matching against a template and a buzzer is turned on. The drawback of this method is that if the activity is not present in the template, it returns nothing or an unmatched activity. Jitendra Musale et al. analyzed suspicious motion detection and the tracking of human and object behavior, together with fire detection, in video captured by CCTV cameras [14]. This work mainly targets outdoor locations such as building entries and exits and corridors; the method is used for both off-line and on-line detection of suspicious actions, including mischievous walking, group fighting and terrorist attacks. Aniket Bhondave et al. proposed suspicious object detection using a back-tracking technique to detect abandoned luggage left in public places such as public roads, railway stations and airports [15].
This technique can be used in public places to keep the environment secure, since an abandoned object may contain weapons, bombs or other harmful equipment. Yaxiang Fan et al. proposed real-time capture of fall events that are likely to occur, made possible by the effectiveness of deep neural networks [16]. Falling activities are categorized into four phases: standing, falling, fallen and not moving. In this work the videos are manually trimmed and converted into dynamic images, which gives lower detection accuracy. Medev Ravanbaksh et al. proposed action recognition with image-based CNN features [17]. The CNN detects actions after training on a large number of samples, and it also eliminates the need to design handcrafted features. To capture temporal features, a hierarchical method is implemented that captures complex actions. Shiliang Pu et al. analyse a novel concept for estimating crowd density in camera observations based on a convolution neural network [18]. It first performs the crowd density estimation and then annotates the images to better evaluate the accuracy level. The approach of F. M. Rueda et al. addresses human activity recognition; it clearly explains the usage of CNNs [21] and also uses sensors for detection, whereas our approach concentrates on the usage of the CNN. The existing models above show good performance in their own experiments but have some limitations, which the proposed model is designed to overcome. The challenges lie in tracking efficiency, training the CNN and the classifiers, and the detection rate.
This paper proposes an intelligent system for detecting suspicious human activity in public places. Optical flow is used to obtain the motion vector pattern, and a deep convolution neural network is trained on the images to obtain the predicted value. The input videos are converted into sequences of frames and the ROI is extracted from consecutive frames; the extracted images are stored in the datastore for training and testing.
Proposed system
Many machine learning techniques can learn features directly from data such as text, images and sound, and deep learning models sometimes exceed human-level performance in object classification. The proposed model is trained with a large set of labelled data in varying categories. The convolutional neural network architecture contains many layers, including convolutional layers, activation layers and pooling layers. Training these models is computationally intensive and can be accelerated by using a high-performance GPU, but the proposed model is trained using CPUs with various parameters to match the GPU accuracy level. The architecture diagram of the proposed model, with the CNN feature extraction layers, illustrates how features are learned automatically by the CNN layers. The proposed framework architecture is shown in Fig. 1, and the flow of work is given in the sections below.

Fig. 1. Suspicious activity detection.
The framework starts by taking a video file as input. The input is then split into frames, which are processed successively. To remove noise, the input video is pre-processed by converting each RGB frame to a grayscale frame. The background is obtained by subtracting successive frames.
Tracking
Although there are different approaches to the tracking problem, their applications are restricted to scenes with few and easily detectable constituents. In general, applying conventional tracking algorithms to videos of high-density crowds is challenging and runs into many issues. The proposed model tracks the motion vectors of the activities using the optical flow technique. Optical flow is the apparent velocity distribution of the motion of the luminance pattern in the image. It is the apparent pattern of movement of objects between successive frames resulting from the motion of the body or the camera. It is a two-dimensional field in which every vector is a displacement vector showing how a point moves from one frame to the next. It rests on two assumptions: (a) the pixel intensity of an object does not change between successive frames, and (b) adjacent pixels have similar motion.
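The basic derivation of the optical flow equation starts from the first assumption:
\[ I_i(m, n, t) = I_i(m + dm,\; n + dn,\; t + dt) \]
where I_i(m,n,t) is a pixel in the first frame (i = 1 to n) that shifts by a distance (dm, dn) in the next successive frame after time dt. Taking the Taylor series approximation of the right-hand side, removing the common terms and dividing by dt gives the optical flow equation:
\[ \frac{\partial I_i}{\partial m}\,u + \frac{\partial I_i}{\partial n}\,v + \frac{\partial I_i}{\partial t} = 0, \qquad u = \frac{dm}{dt},\quad v = \frac{dn}{dt} \]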
This work uses optical flow to calculate the magnitude and direction of the moving pixels in all the frames, and a bounding box is drawn around the tracked region.
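The magnitude, direction and bounding-box computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' MATLAB implementation: the flow field is assumed to be already available as a 2-D grid of (u, v) displacement pairs, and the function names and the threshold value are hypothetical.

```python
import math

def flow_magnitude_direction(u, v):
    """Return (magnitude, direction in degrees in [0, 360)) of a flow vector (u, v)."""
    magnitude = math.hypot(u, v)
    direction = math.degrees(math.atan2(v, u)) % 360.0
    return magnitude, direction

def bounding_box(flow, threshold):
    """Bounding box (row_min, col_min, row_max, col_max) of the pixels whose
    flow magnitude exceeds `threshold`; None when no pixel is moving.
    `flow` is a 2-D list of (u, v) tuples (an assumed input format)."""
    moving = [
        (r, c)
        for r, row in enumerate(flow)
        for c, (u, v) in enumerate(row)
        if math.hypot(u, v) > threshold
    ]
    if not moving:
        return None
    rows = [r for r, _ in moving]
    cols = [c for _, c in moving]
    return min(rows), min(cols), max(rows), max(cols)

# Toy 3x3 flow field with two pixels moving faster than the threshold.
flow = [
    [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)],
    [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.0, 2.0), (0.0, 0.0)],
]
print(flow_magnitude_direction(3.0, 4.0))
print(bounding_box(flow, threshold=1.5))
```

In a real pipeline the flow field would come from an optical flow estimator applied to consecutive grayscale frames; only the post-processing step is shown here.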
Initially, the input video from the dataset is loaded into the program and the optical flow is extracted for block-wise computation. From the extracted flow, the magnitude vector is computed for each block in a frame, and the magnitude of the block is set as the threshold for that block. The distance between the centroids of the blocks is then calculated. The network is trained with the trainNetwork function for deep learning. Called with the parameters imds, layers and options, trainNetwork trains a network for image classification problems: imds (the image datastore) stores the input image data in a cache for easy retrieval, layers defines the network architecture, and options defines the training options. The network is trained by stochastic gradient descent using this set of options, and the initial learning rate is specified as a value. A learning rate of 0.001 is used to minimize the generalization error; it is given by the Initial Learn Rate option, a positive scalar. If the learning rate is low, training takes a long time, whereas a higher rate completes training in a shorter time. Training runs for 20 epochs with 125 observations in each mini-batch. At each iteration the loss is evaluated on a mini-batch and the weights are updated. The training progress can be plotted during training by using the plot function. If a GPU is used, the training time is reduced; for this the execution environment must be set to GPU, and otherwise the CPU is assigned automatically by default.
Feature extraction
The convolution neural network is an efficient recognition algorithm that is widely used in image and pattern recognition. It is used because of its simple structure and because it needs fewer training parameters. Like other neural networks, a CNN consists of multiple neurons with learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The network as a whole maps the input to output class scores, and it still has a loss function (e.g., SoftMax) on the last, fully connected layer. Generally, a CNN has two phases. The first is the feature extraction layer, which extracts the local and global features of the given input; the local features are extracted from the local receptive field to which each neuron's input is connected. The second is the feature map layer, which is used to map the class based on the generated score. Each feature map plane uses a sigmoid function as the activation function of the convolution neural network. A CNN uses different layers for training and testing; these layers filter the image into small patches and process it to map the image to the specified output. Feature extraction by the CNN thus involves convolution, sampling and feature mapping with the fully connected layer.
The feature extraction is carried out by passing the motion vectors of the extracted region through the layers of the CNN. The procedure is as follows:
i) Conv Layer:
The Conv layer (convolution layer) computes the output volume by performing a dot product between the filters and all image patches. The output depends on the filter size. The number of neurons in the output volume is controlled by three hyperparameters: depth, stride and zero-padding. Figure 2 shows how this layer filters the image.
Accepts a volume of size Wi2×Hi1×Di1
Wi2 - input width
Hi1 - input height
Di1 - input depth (number of channels)
Requires four hyperparameters:
‘K’ number of filters,
‘F’ spatial extent,
‘S’ stride,
‘P’ zero padding.
Generates a volume of size Wi3×Hi2×Di2, where
Wi3 = (Wi2 - F + 2P)/S + 1
Hi2 = (Hi1 - F + 2P)/S + 1 (width and height are computed symmetrically)
Di2 = K
Wi3 – output width
Hi2 – output height
Di2 – output channels
With parameter sharing, the layer introduces F·F·Di1 weights per filter, for a total of (F·F·Di1)·K weights and K biases.
In the output volume, the d-th depth slice (of size Wi3×Hi2) is the result of convolving the d-th filter over the input.

Fig. 2. Convolution layer.
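The relation between the input volume, the four hyperparameters and the output volume of a convolution layer can be checked with a short Python sketch. It assumes the standard relation output = (input - F + 2P)/S + 1 for width and height and an output depth of K; the function name is hypothetical, and the example values of F, S and P are chosen for illustration only.

```python
def conv_output_size(w_in, h_in, d_in, k, f, s, p):
    """Output volume (width, height, depth) of a convolution layer.

    w_in, h_in, d_in: input width, height and depth
    k: number of filters, f: spatial extent, s: stride, p: zero padding
    """
    w_span = w_in - f + 2 * p
    h_span = h_in - f + 2 * p
    if w_span < 0 or h_span < 0:
        raise ValueError("filter is larger than the padded input")
    if w_span % s or h_span % s:
        raise ValueError("hyperparameters do not tile the input evenly")
    # d_in affects the number of weights per filter, not the output size.
    return w_span // s + 1, h_span // s + 1, k

# A 32x32x3 input with 12 filters of extent 3, stride 1 and padding 1
# keeps the spatial size and yields 12 output channels.
print(conv_output_size(32, 32, 3, k=12, f=3, s=1, p=1))  # (32, 32, 12)
```

The explicit error for settings that do not divide evenly is a design choice of this sketch; frameworks differ in whether they reject, floor or pad such configurations.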
ii) ReLU (rectified linear unit) layer:
The ReLU layer applies the activation function to the outputs of the CNN neurons. It is described mathematically as:
\[ f(x) = \max(0, x) \]
Because the ReLU function is not differentiable at zero, it can be awkward to use with back-propagation training. A smooth approximation, the softplus function, can be used instead in practice:
\[ f(x) = \ln\left(1 + e^{x}\right) \]
The derivative of this function is the sigmoid function, f'(x) = 1/(1 + e^{-x}).
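The two activation functions, ReLU max(0, x) and softplus ln(1 + e^x), and the claim that the derivative of softplus is the sigmoid can be verified numerically. The following Python sketch uses only the standard library; the helper names and the finite-difference step are illustrative choices.

```python
import math

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

def softplus(x):
    """Smooth approximation of ReLU: ln(1 + e^x), written in an overflow-safe form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """Logistic function 1 / (1 + e^-x), for moderate values of x."""
    return 1.0 / (1.0 + math.exp(-x))

def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 0.5, 3.0):
    gap = abs(numeric_derivative(softplus, x) - sigmoid(x))
    print(f"x={x:5.1f}  relu={relu(x):.3f}  softplus={softplus(x):.3f}  |d softplus - sigmoid|={gap:.1e}")
```

The printed gap stays near zero for every sample point, which is consistent with the sigmoid being the derivative of softplus.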
iii) Pooling layer:
A pooling layer is inserted periodically into the network. It reduces the size of the volume, which speeds up processing and also helps prevent over-fitting. There are two common types of pooling function, max pooling and average pooling; here max pooling is used, which takes the maximum value from each window as it moves with the given stride and produces the pooled image patch shown in Fig. 3.

Fig. 3. Pooling layer.
Accepts a volume of size Wi2×Hi1×Di1
Wi2 – convolved input width
Hi1 – convolved input height
Di1 – depth (number of filters of the convolution layer)
Requires two hyperparameters:
‘F’ spatial extent,
‘S’ stride.
Produces a volume of size Wi3×Hi2×Di2, where
Wi3 = (Wi2 - F)/S + 1
Hi2 = (Hi1 - F)/S + 1
Di2 = Di1
Wi3 – pooled width
Hi2 – pooled height
Di2 – depth (unchanged)
The layer introduces zero parameters because it computes a fixed function of the input.
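Max pooling on a single depth slice can be illustrated with a small Python sketch. It assumes a 2-D list as the input slice and uses F = 2 and S = 2 as example hyperparameters; the function name is hypothetical. The output size follows (W - F)/S + 1 in each spatial direction.

```python
def max_pool2d(image, f=2, s=2):
    """Max pooling over one depth slice given as a 2-D list.

    f: spatial extent of the pooling window, s: stride.
    Output size is (H - f) // s + 1 by (W - f) // s + 1.
    """
    height, width = len(image), len(image[0])
    out_h = (height - f) // s + 1
    out_w = (width - f) // s + 1
    return [
        [
            max(image[r * s + i][c * s + j] for i in range(f) for j in range(f))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

slice_4x4 = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(max_pool2d(slice_4x4))  # [[6, 8], [3, 4]]
```

The function has no learnable weights, which matches the statement that pooling computes a fixed function of its input.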
iv) Fully connected layer
The final layer of the CNN is a fully connected layer; it classifies the significant image features and predicts the output. The output can also be represented as class probabilities by applying the SoftMax function. Consider, for example, predicting a dog image with deep learning: the network relies on high-level dog features such as paws and legs. In the same way, the program predicts the action for a given sample image after training on the images. A fully connected layer takes the high-level features, maps them using the activation function and matches them to a particular class based on the weights.
Algorithm 1 for the proposed model to detect suspicious human activity is given below:
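How a fully connected layer turns high-level features into class scores, and how SoftMax turns the scores into probabilities, can be sketched as follows. The feature vector, weights and biases below are made-up toy values, and the function names are hypothetical; this is an illustration of the general mechanism rather than the trained network of the paper.

```python
import math

def fully_connected(features, weights, biases):
    """One score per class: dot product of each weight row with the features, plus a bias."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]

def softmax(scores):
    """Convert scores to probabilities; the maximum is subtracted for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(features, weights, biases):
    """Return (index of the most probable class, list of class probabilities)."""
    probs = softmax(fully_connected(features, weights, biases))
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs

# Toy example: 2 features, 3 classes (all values are illustrative).
features = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]
label, probs = predict(features, weights, biases)
print(label, [round(p, 3) for p in probs])
```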
Input Data: Input images with observations {x1, x2, …, xT} and the corresponding estimated responses {Y1, Y2, …, YT}.
Output: Trained CNN weight matrices and scores
Step 1: Initialization and pre-processing
  // Input video
  For each video frame f = 1 to N
  {
    Read frame and extract the ROI
    // Optical flow slice
    do – rand(f)
  }
Step 2: Processing the image slice in all layers
  While // r epochs
    While // t iterations
      // the input data is loaded into the neural network toolbox
      While (input)
      {
        Input ← resize of image to 32 × 32
      }
      Input size: 32 × 32 × 1
      Specify number of categories
      for m = 1 to M do
        Propagate through the network with layers
        for K = 1, the first layer convolves the image pattern
        find error
        for layers L-1 to 1 do
          find error factor for the layers
        end
        resolution is reduced to half of the size using max pooling
        find Δw
        update weights and biases
        w(new) = w(old) + Δw
      end
    end
  for (map index = 1 to number of categories)
  {
    Layer(map index) = I * k(map index)
  }
Step 3: Training Phase
  // training options
  Update network parameter θ using gradient function // Eq. (6)
  Max_epochs = 20
  Mini batch size = 125
  Specify the execution environment as GPU or CPU
  Train network (data, training layers, training options)
Step 4: To classify the class
  Classify the predicted image and generate score
  Order the training samples by the value
  Test the trained image by classifier
  If k1 value is same from same class c1 then // k1: first dataset, c1: class 1
    Return class 1
  End if
  Repeat for all the samples in the list
  End
End
The proposed algorithm detects suspicious activity by recognizing human activity in a crowd. The experimental setup uses the datasets described below, on which data are collected and the convolution neural network is trained. Training proceeds through all the CNN layers described above: the Conv layer computes the output volume based on the filter size and number of channels, and the activation function is applied to the convolved image to minimize the error rate. The output is then given to the pooling layer, which reduces the size of the image by taking the maximum value in each window. Reducing the size allows a large number of images to be trained with less processing time. The final, fully connected layers receive input from the pooling layer and classify the images based on the predicted score.
The process of the algorithm can be explained in detail as follows:
Input: Consider an image frame of sample size (32*32*3), which holds the image pixels with the image width, the image height and the three color channels red, green and blue.
Conv-layer: It computes the outputs of neurons that are connected to local regions of the input, each as the product of its weights with the local region, resulting in a volume of (32*32*12) when 12 filters are used.
ReLU layer: It applies the function max(0, x), which thresholds at zero. The size of the volume is unchanged (32*32*12).
Pooling layer: It reduces the size of the volume to (16*16*12).
Fully connected layer: It computes the scores of the different classes, so the size of the volume is (1*1*8), which corresponds to 8 categories of data.
Consider another example in which the input image volume has size 32*32*3. If the filter size is 5*5, then every neuron in the Conv-layer has weights over a 5*5*3 region, a total of 75 weights plus 1 bias parameter. As a second sample, take an input volume of width 5, height 5 and depth 3 with the parameters K = 2, F = 3, S = 2, P = 1. The output width and height are then (5 - 3 + 2·1)/2 + 1 = 3, and the output depth is K = 2.
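The parameter counts in this example can be reproduced with a few lines of Python. The sketch assumes parameter sharing, so that each filter has F·F·D weights plus one bias; the function name is hypothetical.

```python
def conv_layer_parameters(f, d_in, k):
    """Return (weights per filter, total weights, total biases) for a
    convolution layer with k filters of spatial extent f over depth d_in."""
    weights_per_filter = f * f * d_in
    return weights_per_filter, weights_per_filter * k, k

# One 5x5 filter over a depth-3 input: 75 weights and 1 bias.
print(conv_layer_parameters(f=5, d_in=3, k=1))  # (75, 75, 1)

# K = 2 filters with F = 3 over a depth-3 input: 27 weights per filter,
# 54 weights in total and 2 biases.
print(conv_layer_parameters(f=3, d_in=3, k=2))  # (27, 54, 2)
```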
After feature extraction is complete, the predicted output is passed to MSVM training, which classifies the recognized actions. The advantage of the proposed method for detecting suspicious activity is that it combines a CNN with a multi-class SVM (MSVM) that uses the one-vs-all method for classification. If a problem has kn classes, the one-vs-all method creates kn hyperplanes, each separating one class from the other (kn-1) classes; if the one-vs-one method is considered instead, kn(kn-1)/2 hyperplanes are created, one separating each pair of classes. The class labels are of different types: type 1 is fight scenes in a crowd, classified as violent activity; types 2 and 3 are non-violent activities such as hand clapping, hand shaking and normal human behavior; types 4 and 5 are suspicious activities such as robbery, pickpocketing and running. Classification is based on the predicted label, which falls into one of three categories: violent activity, suspicious activity and non-suspicious activity. The obtained results are shown in Table 1. The performance of the proposed method with CNN and MSVM is shown in Table 3.
Table 1. The sample results using different datasets
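The difference between the two multi-class strategies, and the mapping from the five label types to the three reported categories, can be sketched in Python. The decision values below are invented for illustration and would in practice come from the trained binary SVMs; the function names and the category strings are assumptions of this sketch.

```python
def one_vs_all_count(kn):
    """Number of hyperplanes when each class is separated from all the others."""
    return kn

def one_vs_one_count(kn):
    """Number of hyperplanes when every pair of classes gets its own separator."""
    return kn * (kn - 1) // 2

# Label types as described in the text, grouped into the three categories.
CATEGORY_OF_TYPE = {
    1: "violence",
    2: "non-suspicious",
    3: "non-suspicious",
    4: "suspicious",
    5: "suspicious",
}

def one_vs_all_predict(decision_values):
    """Pick the label type (1-based) whose binary classifier gives the largest
    decision value, then map it to its category."""
    best = max(range(len(decision_values)), key=decision_values.__getitem__)
    label_type = best + 1
    return label_type, CATEGORY_OF_TYPE[label_type]

print(one_vs_all_count(5), one_vs_one_count(5))  # 5 10
# Hypothetical decision values from five one-vs-all classifiers.
print(one_vs_all_predict([-1.2, -0.4, -2.0, 0.9, 0.1]))  # (4, 'suspicious')
```

With five label types, one-vs-all needs half as many binary classifiers as one-vs-one, and the gap widens as the number of classes grows.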
The system is tested in a MATLAB environment on a computer with an Intel Core i9-7900X 4.3 GHz CPU and 128 GB of memory. Different public datasets are used for the experimental analysis.
The KTH database contains the human actions jogging, walking, boxing, running, hand waving and hand clapping, recorded in different indoor and outdoor circumstances [22].
The BEHAVE database consists of various scenarios of people acting and interacting. The video is captured at 25 frames per second, and each frame has a resolution of 640×480 in .avi format. The database includes the activities Walk Together, Split, Ignore, Following, Chase, Fight, Run Together, Meet and Group Fighting in crowded areas [23, 24].
The FALL database contains 70 sequences (30 falls + 40 activities of daily living), captured in indoor environments [25].
The Pickpocket database contains chain robbery, snatching and fire incidents in public places. The videos are captured from real-world events in open areas of cities, such as bus stands, railway stations and other public places [26, 27].
The training images are loaded into the neural network toolbox to manipulate the data size, and the data are also stored in an image datastore, which acts like a cache for easy retrieval and processing. Before processing, each input image is resized and given as input to the convolution network, then processed through the other layers to produce the predicted output.
Performance visualization
The options for training the deep neural network are stochastic gradient descent with momentum, the learning rate, the number of training epochs, the mini-batch size for each iteration and the execution environment. The resulting metrics show the training progress and help determine the training accuracy. Each iteration estimates the gradient and updates the network parameters. An epoch is a full pass through the entire dataset. A mini-batch is a subset of the training set that is used to evaluate the gradient of the loss function:
\[ \Delta w = -\alpha \nabla Q_i(w) \]
where
Δw is the update at each iteration,
α is the learning rate,
∇Qi(w) is the gradient estimate.
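A single stochastic gradient descent update, w(new) = w(old) + Δw with Δw = -α∇Qi(w), can be sketched in Python as follows. The momentum term mentioned among the training options is deliberately left out to keep the sketch minimal, and the quadratic loss used in the demonstration is an assumption chosen only because its gradient is easy to write down.

```python
def sgd_step(weights, gradient, alpha):
    """Apply one plain SGD update and return (new weights, delta).

    weights, gradient: lists of floats of equal length
    alpha: learning rate
    """
    delta = [-alpha * g for g in gradient]
    new_weights = [w + d for w, d in zip(weights, delta)]
    return new_weights, delta

def quadratic_gradient(weights, target):
    """Gradient of the toy loss Q(w) = sum((w - target)^2)."""
    return [2.0 * (w - t) for w, t in zip(weights, target)]

# Minimise the toy loss; the weights move towards the target.
w = [1.0, -2.0]
target = [0.0, 0.0]
for _ in range(100):
    w, _ = sgd_step(w, quadratic_gradient(w, target), alpha=0.1)
print(w)
```

In mini-batch training the gradient passed to the update would be the average gradient over the observations of the current mini-batch rather than an exact gradient.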
The loss value indicates how poorly the model behaves after each iteration of optimization. The loss function is the cross-entropy loss, which measures the classification error.
The training process of the CNN for some epochs is shown in Table 2, and the performance comparison on different datasets is shown in Table 3. The training loss is shown in Fig. 4.
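For a single observation, the cross-entropy loss is the negative logarithm of the probability that the network assigns to the true class, and the mini-batch loss is the mean over the observations in the batch. A minimal Python sketch, with made-up probabilities and hypothetical function names, is:

```python
import math

def cross_entropy(probs, true_label, eps=1e-12):
    """Cross-entropy loss of one observation: -ln(probability of the true class).
    `eps` guards against taking the logarithm of zero."""
    return -math.log(max(probs[true_label], eps))

def mini_batch_loss(batch_probs, labels):
    """Mean cross-entropy over a mini-batch."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probs, labels)]
    return sum(losses) / len(losses)

# Two observations with three classes each (probabilities are illustrative).
batch_probs = [[0.25, 0.5, 0.25], [0.9, 0.05, 0.05]]
labels = [1, 0]
print(mini_batch_loss(batch_probs, labels))
```

A confident correct prediction contributes a loss close to zero, while a confident wrong prediction contributes a large loss, which is why the value falls as training improves the classifier.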

Fig. 4. Training loss curve of CNN.
Table 2. Average values for training CNN process
Table 3. Average values obtained by classifiers for different actions
The loss percentage decreases when more than one set of CNN layers is used, and the accuracy of predicting the input also improves. The mini-batch loss with multiple layers is better than with a single layer. Fig. 5 shows the average values of the proposed method obtained by testing with different classifiers on different datasets.

Fig. 5. Performance comparison of the proposed method with existing classifiers for different actions.
In the proposed model, human activity is recognized effectively and classified using the multi-class SVM approach. Classification is based on the score, and the actions are classified by recognizing the activity. The experiment is carried out on different sets of activities occurring in a crowd, and this training approach achieves high performance by using a GPU setup for training. A CNN processes large data quickly compared with other networks. The dataset for training and testing is collected from various experiments and from real videos of past events. The activities recognized are fight scenes captured in crowded areas, and robbery, running and pickpocketing in a crowd. During training, the images are passed through four different layers, which finally produce the predicted output. Accuracy and time vary with the learning rate and the number of iterations: if the learning rate is high, the time to process the images increases and processing speed drops, so the initial learning rate must be kept low. On each pass through the layers, the image patches are filtered so that a large number of samples can be processed quickly. In our method the trained images are classified as actions for activity recognition. These are further classified using the MSVM to specify whether the activity is suspicious, and a warning message is given when an activity is detected as suspicious. Processing speed and accuracy are higher than with the other classifiers, KNN and Random Forest. The method also reduces false alarms, in which a normal action is detected as suspicious and a warning is sent to the surveillance system.
