Abstract
Rare event detections are performed using spatial domain and frequency domain-based procedures. Omnipresent surveillance camera footages are increasing exponentially due course the time. Monitoring all the events manually is an insignificant and more time-consuming process. Therefore, an automated rare event detection contrivance is required to make this process manageable. In this work, a Context-Free Grammar (CFG) is developed for detecting rare events from a video stream and Artificial Neural Network (ANN) is used to train CFG. A set of dedicated algorithms are used to perform frame split process, edge detection, background subtraction and convert the processed data into CFG. The developed CFG is converted into nodes and edges to form a graph. The graph is given to the input layer of an ANN to classify normal and rare event classes. Graph derived from CFG using input video stream is used to train ANN Further the performance of developed Artificial Neural Network Based Context-Free Grammar – Rare Event Detection (ACFG-RED) is compared with other existing techniques and performance metrics such as accuracy, precision, sensitivity, recall, average processing time and average processing power are used for performance estimation and analyzed. Better performance metrics values have been observed for the ANN-CFG model compared with other techniques. The developed model will provide a better solution in detecting rare events using video streams.
Introduction
The modern world is full of surveillance cameras everywhere and they produce an enormous volume of data constantly. Monitoring the activities that are happening in front of numerous cameras and triggering an alarm for unusual events are merely a complicated task for the human. Therefore, some rare event detection procedures are used to find out the unusual events from a regular video stream. These procedures are majorly adopted based on the spatial domain or in the frequency domain. In this work, the spatial and frequency domains are enclosed by the CFG representation. Further, the input data are converted into graphs to be processed in ANN.
In general, grammar is used to define the programming structure of a computer language. In general, CFG [13] is the best mechanism for describing a structure mathematically in a simple and cherished way and is used widely in computer applications because of its recursive-definition nature. There are some predefined types in CFG such as, one-to-one, and one-to-many and the rules in CFG are free to use with any context. The ability to check the grammar with reversed rules makes it possible to use computer applications naturally. Languages generated by CFG are called Context-Free Languages, which are not entailed for handling adjectives and adverbs which are important for natural languages, instead, context-free language can handle nouns and verbs, which are important for computer programming languages. A CFG is defined by 4-tuple G = (V, Σ, R, S) where V is a finite set of elements, Σ is the finite set of terminals, R is the finite relation from V, and S is the initial variable. Presently, CFG is an unavoidable concept in computer science because the widely used Extensible Markup Language (XML) [7, 12] is completely dependent on CFG. Automating web-related tasks and web searching in these days are performed by CFG based XML. In this work, the concept of CFG is used to find rare event occurred in a video stream.
The aspire of developing ACFG-RED is to increase the quality of the video mining process by improving the parameters such as accuracy, precision, and recall. The algorithmic design of ACFG-RED is also constructed to improving the speed with reasonable power consumption. Accuracy is one of the important parameters used in mining and classification procedures. Accuracy is defined as the closeness of a metered value to the actual value. The value of accuracy is directly proportional to the quality of a predicting model or procedure. The prediction model provides the highest accuracy means, this procedure is said to be the best in its class. Precision is defined as the closeness between more than one measured values with each other. Precision is also directly proportional to the quality of a classification or prediction model. A recall is another important parameter to determine the quality of a classification procedure.
The rest of the paper is structured as follows. Section 2 discusses the overview of some existing related work in this field. Section 3 discusses the system model used in the proposed scheme. Section 4 presents the experimental setup of the proposed scheme in detail. Section 5 gives a comparative analysis of the proposed protocol regarding the accuracy, precision, recall, average processing time, and average processing power with related existing schemes. Finally, Section 6 concludes the paper.
Existing methods
Rare event detection is the process of finding abnormal events in a video stream or frame. To do that, in splitting, the key frames, and subsequent frames, nullifying constant background from active movement contents and analyzing the flow of the video by monitoring the frames. Some notable existing methods are used to find rare event detection from the videos. A clear study of the existing methods is performed and presented here to discuss their advantages and limitations. The existing methods taken here are a unified framework for event summarization and rare event detection from Multiple Views (ESRED-MV) [14], Recognition of composite human activities through CFG representation (RCHA-CFG), Towards Abnormal Trajectory and Event Detection in Video Surveillance (TAT-EDVS) [8] and abnormal event detection in video using Spatiotemporal Auto encoder (AED-VSA)
A unified framework for event summarization and rare event detection from multiple views
ESRED-MV [14] handles event summarization and rare event detection by converting the data as a graph editing problem in a single framework. The input video is represented as a graph in which each node represents different events. The events are detected by the spatial and temporal segmentation of the input video. The functional blocks of ESRED-MV are Graph-Structure Learning, Graph-Structure Editing, and Graph-Structure Matching. Spatial Decomposition is the first process in ESRED-MV; the events are preliminarily identified in different spatial locations. The input video is segmented as 10x10 pixel blocks, and then they are clustered based on the similarities of spatiotemporal activity patterns. The temporal decomposition is performed using the optical flow calculation. In this method, the optical flow calculation is performed by the Lucas-Kanade Method. After the temporal decomposition phase, learning edges are constructed and the energy minimization process is performed using Markov Chain Monte Carlo (MCMC) method. Then the even summarization process takes place which is used to detect the rare event detection achieved by matching the video-structure graph, sub-graph matching, and relation matching.
Recognition of composite human activities through context-free grammar representation
RCHA-CFG [19] operates based on Body-part layer and poses layer properties. Pixel-level, blob level, and object-level processing are done to extract information from raw data. A hierarchical mechanism proposed by Park and Aggarwal [20] is used to construct quantitative image features from the input frame. A gesture layer is constructed from body-part layer and pose layer. The predicates of RCHA-CFG are divided into three categories which include Temporal Predicates, Spatial Predicates, and Logical Predicates. These predicates are used to estimate the atomic and composite actions from the input frame. Since human activities have more probability to be recursive, the CFG representation is used here as a wise choice is used. The delegacies of these actions are constructed through the CFG which contains all sub-events and their correlations. Based on the time intervals and predicates, the composite actions are identified from the gesture layer.
Towards abnormal trajectory and event detection in video surveillance
TAT-EDVS [8] introduces fuzzing object trajectory analysis and pixel-based features for abnormal behavior detection. The pixel-based features are selected by the authors of TAT-EDVS stating that it is state-of-the-art technology in action rejection [16]. TAT-EDVS used an object tracking algorithm [17] and a group tracking algorithm [18] to extract complete movements of all objects and groups from a frame of the video. After this phase, TAT-EDVS uses Grid-based Analysis which consists of -tasks like Trajectory Snapping, Zone Discovery, and Trajectory based Anomaly Detection. Then abnormal trajectories are classified by the Trajectory Filtering process of TAT-EDVS. It provides a unified framework for comparison and also offers good results sub in finding rare events from videos.
Abnormal event detection in video using spatiotemporal auto encoder
AED-VSA [9] is acknowledged that the work is developed based on the inspiration of learning temporal regularity in video sequences [10]. It is a complete model that has a spatial feature extractor and temporal encoder/decoder to learn the temporal patterns of a set of frames. The main functional blocks of AED-VSA are Preprocessing, Feature Learning, Regularity score, and Anomaly Detection. The Feature Learning block performs essential tasks such as Auto-encoder, Spatial Convolution, Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Convolution LSTM. The Anomaly Detection Block includes processes threshold and event count. The application of Spatial Feature Extractor and Temporal Sequencer easily handles the problem of Anomaly Detection from a set of video frames. This work is analyzed here due to its efficiency in finding rare events from a set of video frames. The summary of the experiments carried out in existing methods and their merits/limitations are presented in the following Table 2.
Notations used throughout the paper
Notations used throughout the paper
Summary of existing methods-merits and limitations
In this work, Artificial Neural Network-based Context-Free Grammar for Rare Event Detection (ACFG-RED) is designed with the functional modules such as Background Subtraction [5], Edge detection, Raster edge movement to Vector conversion, Edge movement vector representation in CFG and Knuth–Morris–Pratt pattern searching based Artificial Neural Network (KMP-ANN) [4]. ACFG-RED framework is designed to analyze and also find rare events from recorded video files as well as from real-time video streams.
Background subtraction
It is one of the important tasks in action recognition and video mining processes. Handling entire data from the video for the classification and recognition process is a huge job that consumes a lot of computational resources such as processing time and energy. The background subtraction process is a key task to separate constant or rarely changing data in a video. To be specific, a lot of video compression algorithms are designed based on the background subtraction technique to avoid storing and processing redundant data.
The proposed background subtraction process takes place in the spatial domain itself. This saves the conversion time between spatial and frequency domains. The input frames are monitored for considerable changes and the frames are treated as key frames for background subtraction. This process takes some time for initialization but after that, this procedure operates on the fly in real-time. Since ACFG-RED is intended to handle full HD videos, the frame resolution is fixed to 1280 x 1080 pixels and 24-bit color depth. One pixel refers to the compound of 8-bits for each Red, Green, and Blue shades. Therefore, a single full video frame contains 1382400 pixels and a frame is considered as a key frame when 40% (i.e. 552960 pixels) of the pixel values are changed from its predecessor frame. Then the common part between the successive key frames is recorded as the background and the portions are maintained as fixed instead of having recurring data. Whenever there is a huge change in the background, the change is registered as the new background, and the existing one is discarded. This process enables to handle different video daylight and night scenario backgrounds.
The background subtraction process performed in the frames and result image is given in Fig 1. It is also considered as a rare event in ACFG- RED. The background is illustrated dual-color representations of extreme white RGB (255,255,255) and Dark Black RGB (0,0,0) for ease of operation. The background data update takes place for every key frame, thus it is dynamic.

Initial Frame, Subsequent Key Frame, Key Frames Overlapped, and Background Image.
The edge detection process is performed only in the foreground area of a picture frame. Detecting edges for video background is eliminated with the help of the background subtraction process. The purpose of edge detection is to collect the movement data from the input video. For edge detection, ACFG-RED uses horizontal and vertical fast scanning procedure. The values of two subsequent pixels are used to calculate the difference and to identify the edges. The pixel values are noted as an RGB matrix and differences in pixel values are calculated by Equation 1.
Where α, β are two successive pixels and ∂ is the difference matrix. The edge is referred if the value of any of the element r, g, b in ∂ is greater than the threshold value which is assigned as 7 for toll traffic videos. While scanning horizontally all vertical edges are marked and horizontal edges are marked during the vertical scanning. The Horizontal Scanning procedure starts from the left-top portion of the frame and scan through the image until the pointer reaches the right bottom. The edges are identified using the following equation in horizontal scanning.
Where e ij edge represent a bit of (i, j) th pixel, l, t, r, and b are left, top, right, and bottom respectively.
The equation to find the edges in vertical scanning is given as Equation (3)
The detected edges for the video frame are highlighted and represented in Fig. 2. The edge detection process is performed in real-time during the video playing or streaming. Therefore, the continuous track record is maintained in memory to detect the movements in the video.

Detected edges of video frame.
The movements are captured by the edges and the direction of movements is stored as the vectors rather than raster special Image. The vector conversion process reduces the memory requirement and improves the processing speed. The dimensions of the edges are used to determine the action in the video. If f1, f2 ⋯ f n are the frames in which e1, e2 ⋯ e m are the edges that exist in the frames, then the total number of edges in a video sequence is e tot = m × n and they are represented as e(1,1), e(1,2) … e(2,1), e(2,2) ⋯ e(m,n). The sizes of edges are represented individuallyas s (1,1) , s (1,2) , ... s (2,1) , s (2,2) , ... s (m,n) .Then the condition ∀i = b → c : ifs(x,i) < s(x,(i+1)) where 1 ⩽ b < c and b < c ⩽ n show that the object which is detected by the edge e x is approaching the camera. Here s x refers to the size of the edge e x . In the toll traffic dataset is used in ACFG-RED, this condition is used to detect the incoming vehicles.
Similarly, the condition ∀i = b → c : ifs(x,i) > s(x,(i+1)) refers to outgoing vehicles. There will be no specific change in the size of the edges of vehicles while they are not moving. There is a criterion here in the conversion that the process should be rapid to achieve on the fly operation of ACFG-RED. Therefore, the frame is split into smaller grids as given in Fig 3. This splitting process helps to improve computational speed because the pixel-level edge movements are grouped into block level. It is admitted that the frame splitting process of the proposed procedure compromises some accuracy to improve the processing speed. It is also learned that the difference in accuracy is very negligible for detecting rare events from a video. The 1920x1080 frame is split into 144 blocks in such a way that both the height and width of the frame are divided with 12 grid lines. Each block will be of resolution 160x90 pixels. The blocks are represented as Γ(0,0) → Γ(11,11). The partitions of blocks and alignment are given in Fig 3. To explain the construction of the movement vector, the edges of the frame f560 i.e. a video frame of t + 28 second is given in Fig 4, where t is the initial time of input video. The captured edge movements from f480, f500, f520, f540 and f560 are given in Fig 5.

Frame blocks alignment.

Edges e(34,560) and e(40,560).

Movements of Edges e(34,560) & e(40,560).
As per the observations the movement of the edge e34 from block Γ(4,10) to Γ(6,5) is the storable entity as well as the movement of the edge e40 from block Γ(7,11) to Γ(8,9) is also recordable. The edge vector of the edge e34 is represented as
Therefore, the positions of the edges in different video frames based on time can be accessed through the edge vectors. The generalized equation of an edge vector is as
Where
Memory allocation for edge vector representation is based on the typical and the maximum number of frame tracks taken for and edge. In this case, for toll traffic data, the typical vector element size is 5 + 5 which is 5 numbers of blocks and 5 numbers of corresponding frames have been chosen. The maximum permitted edge movement vector size in ACFG-RED is 20 + 20. The size of a typical edge vector
Γ(h1,v1) consumes 1 bytes in memory, bx1 is the horizontal partition of the block Γ and by1 is the vertical partition. Since the maximum number of partitions is set to 12 for a 1920x1080 resolution image in ACFG-RED, 4-bits are used to represent each horizontal and vertical partition. Therefore, one block represents consumes 1 byte in memory and a typical edge vector representation takes 5 bytes of memory for block storage. To store a single frame number, 8-bytes are used. To represent a typical 5 frame count edge vector, it requires 40 bytes of memory. A 4-byte memory space is used to store the edge number in the vector. Therefore, a typical edge vector will be of 5 + 40 + 4 = 49 bytes of memory. Since the maximum permitted element count is limited to 20 for an edge vector representation in ACFG, it will take 1 × 20 + 8 ×20 + 4 =184 bytes to store a big edge vector.
The size of an active edge movement region in a 1920×1080 pixel video frame will be around a 1540x480 pixel – highlighted in Fig 6. To hold such a raster image in memory, it will take 1540 × 480 × 3 =2217600 bytes in memory; here 1540 is the width, 480 is the height and 3 is to store the pixel color information in RGB. Thus, the vector representation of edge movements reduces a huge size of memory requirement in ACFG-RED.

Active edge movement region.
This part is used to find interrelated edges in a video to identify the objects. As per the grammar theory, a set of strings over a set of finite symbols is a language. A context-free grammar is defined as G ={ N, T, P, S }, where N is the finite set of non-terminal characters, T is the finite set of terminals where N∩ T = Ø, P is the set of rules such as P : N → (N ∪ T) * that is the production rule P which can have any right or left context and S is the start symbol. In ACFG-RED, the edge movement vectors are treated as non-terminal characters. The movement relationships between the edges are correlated as the production rules. The starting symbol is the first occurrence edge movement vector. As per the ACFG-RED definitions
The video frame center point is calculated as
Using this grammar, following direction strings are derived
The movements categories set is defined as d m ={ d0, d1, d2, d3 } where d0 = (+ , +), d1 = (+ , -), d2 = (- , +) and d3 = (- , -) as in Fig. 7.

Movement Categories Set.
The Chomsky Normal form of the developed CFG is 0 ↦ A|B|C, A → AA, B → A + , C → A-. The movement calculation is performed using the following equation.
Since the video has no direct z-axis representations, the x and y coordinates are used to establish the z-axis visionary by the Euclidean range calculation of Equation 10. The edges with similar block shifts based on the time durations are considered as the member edges of an object in the input video frame. For example, as per Fig 5, the movements of edges e34 and e40 are following the same movements d0 → d3, it is easy to understand that the edges belong to the same object. The edge movement vectors are converted into strings using Equation 8 and 10 to apply Knuth-Morris-Pratt pattern matching.
This part is used to find out the rare events in a video which is converted as edge movement vectors strings. In ACFG-RED, the KMP algorithm is added with ANN to update the knowledge base for rare event detection. The standard KMP [4] algorithm performs a sequential search to find the string matching.
ACFG-RED string search algorithm: Let Ψ
ω
is the string of the input video constructed using equation 8 and 10 Let Ψ
ω
1
, Ψ
ω
2
⋯ Ψ
ω
n
substrings derived from Ψ
ω
based on the edge scopes by equation 5
Let Ψ
n
={ Ψn1, Ψn2 ⋯ Ψn
m
} is the set of normal event strings Let Ψ
r
={ Ψr1, Ψr2 ⋯ Ψr
n
} is the set of rare event strings For i = 0 to n Set REF = 0 (REF: Rare Event Flag) Search Ψ
ω
i
for any member string of Ψ
r
, set REF = 1 if so If REF = 1, then report Ψ
ω
i
as a rare event string Else For j = 0 to m set NEF = 0 (NEF: Normal Event Flag) Search Ψ
ω
i
for any member string of Ψ
n
, set NEF = 1 if so If NEF = 1, then report Ψ
ω
i
as normal event string Else call ANN to update Ψ
n
and/or Ψ
r
End For [j] End For [i] End
The ANN is designed with the input layer in which the nodes receive the inputs such as edge movement direction, the time duration of edge movements. There are two hidden layers are assigned to operate with the input layers. The first hidden layer is indented to give weightage to the edge movement direction and the second hidden layer is indented to process the edge movement direction with the time duration. The output layer has two nodes for normal and rare event classifications. 18 nodesare placed in the input layer of ANN for edge movement directions { (+ , +) → (+ , +) }, { (+ , +) → (+ , -) }, { (+ , +) → (- , +) }, { (+ , +) → (- , -) }... .{ (- , -)→ (- , -) }, {hordedirections} and {movement duration}.The ANN architecture, ANN training phase, and overall flow diagram have been illustrated in Figs. 8, 9 & 10 respectively.

KMP-ANN architecture.

ANN training flow diagram.

Overall flow diagram.
Accuracy
Precision
Recall
Average Processing Time
ACFG-RED is implemented using Visual Studio IDE [11]. Since the ANN training phase requires non-stop execution, it is performed in a leased HP ProLiant DL160 Gen9 Intel Xeon E5-2620v4 12 core cloud server. Some CCTV footage videos are used in the ANN training phase. The trained knowledgebase is used to test the ACFG-RED procedure with a video recorded in a Toll collection booth. The video is of 12 Minutes 29 seconds length with 1920x1080 FHD resolutions at 20 frames per second. The data rate of this video is 11611kbps with a total bit rate of 11785kbps. The storage size of the video is 1105413017 Bytes (1.02 GB). A dedicated User interface is designed to test the functionalities of ACFG-RED. The user interface is given in Fig. 11.

ACFG_RED user interface.
Accuracy, Precision, and Recall are the standard parameters to measure the quality of a data mining and classification procedure. Average Processing Time and Average Processing Power are measured along with the standard parameters for existing ESRED-MV, RCHA-CFG, TAT-EDVS, AED-VSA, and for proposed ACFG-RED. The performances of these procedures are measured for the entire video with 10 different timestamps and the results are described in Tables 3 to 7. As per the experiment results, the method ACFG-RED scored the highest accuracy average of 89.299 %. Proposed ACFG-RED consumes relatively lesser power than other methods for rare event detection process that shows its efficiency.
Average Processing Power
Average Processing Power
Accuracy is the prime metric of any classification procedure. Accuracy is calculated using the formula
Based on the observations, the Accuracy averages of ESRED-MV, RCHA-CFG, TAT-EDVS, AED-VSA, and ACFG-RED are 81.006, 78.947, 83.971, 79.844, and 89.098 respectively. The proposed ACFG-RED method has secured the highest average Accuracy in finding rare events from the input video also described in Fig 12.

Accuracy Graph.
Precision is one of the standard metrics of classification and mining procedures. Precision is calculated as

Precision Graph.
Recall otherwise known as true positive rate refers to the successful retrieval of relevant events among the total number of events. A recall is calculated by the formula

Recall Graph.
The average processing time of ACFG-RED for processing 24 frames is 838.7mS which is less than a second. The measured Average Processing Time values are given in Table 6 and the comparison graph is given in Fig 15.

Average Processing Time Graph.
The processing power consumption refers to the electrical power consumed by a process that is measured in Watts. The processing powers consumed by existing and proposed methods are given in Table 7. The comparison graph is given in Fig 16.

Average processing power graph.
This work presents a framework for detecting rare events from video using CFG and ANN. The developed model ACFG-RED has been implemented and tested using real-time video recorded at Toll Collection Centre. The performance of the proposed model was computed and compared with other similar existing models. The proposed framework can handle both stored video files as well as real-time streaming videos for rare event detection with better efficiency. The lesser power consumption and higher processing speed make ACFG-RED more suitable for detecting rare events in real-time CCTV surveillance systems. Further, the proposed rare event detection module is also used in unmanned CCTV monitoring units.
Footnotes
Acknowledgments
The authors are grateful to the Department of Science & Technology, New Delhi, India (SR/FST/ETI-371/2014) for their financial support. Also express their sincere thanks to SASTRA Deemed University, Thanjavur for extending infrastructural support to carry out the work.
