Crime prediction and mapping based on real time video analysis

Abstract

This paper presents a three-phase approach to crime prediction based on video analysis, neuro-fuzzy inference and density mapping. In the first phase, crime indicator concepts are modeled and used to build classifiers for crime indicator events. Both indicator concept modeling and indicator event classification are performed using the Generalized Maximum Clique Problem (GMCP) method. In the second phase, a neuro-fuzzy inference system modeled from training data is used to make predictions about the classified crime indicator events obtained from the first phase. Finally, in the third phase, kernel density estimation (KDE) is used to fit a spatial probability density function to the predicted crime indicator events across the study area. The major advantages of this method include the potential to predict crime in real time due to the use of video-based events, the ability to generate fuzzy rules from data, the ability to optimize the fuzzy rule base by learning and the ability to weight different crime variables. The proposed framework has prospects for developing a police field decision support system. The feasibility of the framework has been tested in a simulated experiment using sampled clips from the violent scene detection (VSD) 2014, Hollywood Human Action (HOHA) and HMDB datasets, and the results are promising for real-life implementation.

Keywords

Video analysis, smart surveillance, crime prediction, crime mapping, neuro-fuzzy inference

1. Introduction

Crime prediction is a continuously relevant area of research given the demands of today’s society and the threats of global terrorism. Crime prediction research is often supported by criminology theories. Prominent among these are the routine activity theory, rational choice theory and crime pattern theory [5,18]. Taken together, these theories suggest that criminals and victims follow certain life patterns, that crime patterns are influenced by geographic and temporal factors, and that criminals make rational decisions based on risks and benefits. These patterns can be extracted using image processing and event detection methods.

Crime mapping is a significant component of crime prediction since it allows the visualization of crime patterns. There are several methods of crime mapping, including point mapping, spatial ellipses, thematic mapping and kernel density estimation (KDE). KDE is viewed as the most suitable method for visualizing crime data as a continuous surface [9,22,30]. As such, in this study we adopted KDE for visualizing our predicted crime indicator events.

Conventional crime prediction methods are mostly based on data-mining and spatial statistical techniques. These methods often rely on a combination of crime incident data, socio-economic, demographic (population related statistics) and spatial factors for predicting crime [6,8,18,30]. More recently researchers have expanded these methods to include social media information in order to achieve better performance [1,3,25,26]. Even though these methods have demonstrated good performance especially in forecasting future crimes, the major drawback of these methods is the failure to predict crime in real time.

Another direction of research with relevance to crime prevention is smart surveillance. Smart surveillance methods make use of automatic image understanding and video based event detection methods for purposes of security alertness or situation awareness. In our opinion, smart surveillance offers the best prospects for predicting crime in real time. Nonetheless, existing works often focused on detecting abnormal events or enhancing security alerting systems in special areas of interest like airports, parking lots, shopping malls and so on [4,27,29]. Besides that, there are some open research issues regarding the performance of event detection methods. Such issues include false alarm rate, identifying best key-point descriptors, tracking partly occluded entities, optimizing learning parameters and other camera related issues (like viewing angle, viewing distance).

In light of the above, this study intends to explore the viability of using wide area video surveillance for the purpose of crime prediction and crime mapping. In short, the main contributions of this paper include: (1) Defining and modeling crime indicator events in videos. (2) Identifying the best feature descriptors for video based crime indicator concepts. (3) Evaluating existing video based concept detection methods. (4) Establishing GMCP as an optimal classifier choice for video based crime indicator event detection. (5) Building a neuro-fuzzy crime prediction system using training data. (6) Combining video event detection, neuro-fuzzy inference and kernel density techniques for wide area crime pattern visualization. The rest of the paper is organized as follows. Section 2 presents an overview of relevant literature. Section 3 presents the proposed solution. Section 4 describes the datasets and experiments. Section 5 contains the discussion. Section 6 concludes the paper with a summary of our contributions and future research directions.

2. Relevant literature

The study seeks to tackle crime prediction from a multidisciplinary and integrated viewpoint by bringing together techniques of video based event detection, hybrid intelligent systems and mapping. As such, the literature is reviewed under three subsections for the sake of clarity.

2.1. Video based event detection

This is one of the well-researched areas of computer vision, and there are several approaches to video based event detection. Notable among these are concept based event detection (CBED) methods, which use the detection of semantic concepts for modeling and classifying events in videos. For example, Izadinia and Shah [14] used low-level event concept models for recognizing complex events under a latent support vector machine (LSVM) framework. Jingen et al. [17] proposed a video based event recognition method using concept attributes, in which events are modeled in a semantic concept space using a variety of complementary semantic features. Ehsan et al. [10] used a combination of trained concept detectors with a latent temporal model for event detection. Their argument is that concepts in an event tend to articulate over a discernible temporal structure; hence they exploited the temporal model using the scores of concept detectors as measurements. Shayan et al. [24] proposed a contextual approach to video classification, in which events are modeled based on the co-occurrence of concepts and new events are classified by matching semantic co-occurrence patterns with reference event models.

2.2. Hybrid intelligent systems

Hybrid intelligent systems are intelligent expert systems that combine at least two intelligent technologies such as fuzzy logic, neural networks, evolutionary computation and probabilistic reasoning. These combinations have led to the emergence of systems that are capable of reasoning and learning under uncertain and imprecise circumstances. In the last few decades, the integration of fuzzy logic and neural networks in particular has attracted a lot of attention across various fields of science and engineering research. One notable work is ANFIS, proposed by Jang [15]. ANFIS is a Sugeno-type fuzzy inference system with five layers. This method relies on backpropagation learning for determining the premise parameters and least mean square estimation for determining the consequent parameters. Nauck and Kruse [7] proposed NEFCON, which is a Mamdani-type fuzzy inference system. In this model, the learning process is based on a mixture of reinforcement and backpropagation learning. NEFCON can be used to learn an initial rule base if no prior knowledge about the system is available, or to optimize a manually defined rule base. NEFCON has two variants: NEFPROX (for function approximation) and NEFCLASS (for classification tasks).

2.3. Mapping techniques

Kernel density estimation (KDE) is one of the effective ways of visualizing the distribution of crime across space and time. Traditional KDE methods often depend on historic crime incident data. More recently, Matthew [12] successfully integrated social media (Twitter) information into a KDE framework in order to enhance performance. Dawei et al. [6] presented a different approach to hotspot mapping. Their framework uses spatial statistics and data mining concepts to map crime hotspots and investigate the relationship between socio-economic factors and crime variables. They created a Hotspot Optimization Tool (HOT), which identifies crime hotspots through related variables.

3. Proposed solution

The proposed framework is made up of three phases: video analysis, crime prediction and crime mapping. The first phase adopts a contextual approach to video classification, in which context models are built using the co-occurrence of concepts. Event classes are modeled based on concept co-occurrence within the classes, and new videos are classified by matching semantic co-occurrence patterns with event representation models. The matching is performed by finding the strongest clique of co-occurring concepts in a video based on the Generalized Maximum Clique Problem (GMCP), which is solved using Mixed Binary Integer Programming.

In the second phase, crime predictive analysis is performed using a neuro-fuzzy approach. The adoption of this approach allows the easy integration of other crime predictive factors (such as time of event and crime rate in the area) and the weighting of these factors. In the third phase, kernel density estimation (KDE) is used to fit a spatial probability density function to the crime imminence data across the study area. This allows the visualization of crime imminence levels in spatial space. The assertion held by this study is that real time video data provides a more pragmatic source for identifying and understanding crime imminent regions as compared to historic crime incident data. This assertion is supported by a widely held psychological view that before a crime occurs there are usually indicators or leading events [5], hence this study attempts to capture crime indicators through analysis of surveillance videos. Figure 1 is the proposed framework diagram.

Fig. 1.

Proposed framework diagram.

3.1. Overview of concept detection

To build the concept detection model, the probabilities of concept co-occurrence are first computed from annotated training videos and saved as reference co-occurrence matrices. The query videos (the surveillance footage) are then divided into clips of equal duration. Let k be the number of trained concepts and h the number of clips in the query video. The k trained concept detectors are applied to each clip, and the resulting $k \times h$ confidence values, together with the reference co-occurrence matrices (computed from the training video clips), are used to form a graph G. Each clip in the graph is represented by a cluster of k nodes, one per event concept. The edge weights are the probabilities of co-occurrence of the corresponding event concepts in the query video, which depend on both the SVM confidence values and the reference co-occurrence matrices. To find the set of concepts with maximal contextual agreement, GMCP is solved for the graph [24,28].

3.1.1. Context model

To capture context within videos, a pairwise concept co-occurrence strategy is used. The reference co-occurrence matrices are computed using the conditional probability of concept coincidence [24,29]: $\begin{matrix} (1) & Φ (a, b) = p (a \mid b) = \frac{\# (a, b)}{\# (b)}, \end{matrix}$ where $\# (b)$ is the number of training videos containing concept b and $\# (a, b)$ is the number of videos containing both a and b. For the self-co-occurrence element $Φ (a, a)$ , the numerator $\# (a, a)$ is the number of videos in which concept a occurs more than once. The element $Φ (a, b)$ is equal to the conditional probability that concept a occurs given that concept b occurs. Defining the co-occurrence matrix through conditional probability has two advantages. First, it avoids penalizing concepts that tend to co-occur less often. Second, it makes the resulting co-occurrence matrices asymmetric: the chance of concept a occurring given that concept b occurs is not necessarily the same as the chance of b occurring given a.
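The computation in equation (1) can be illustrated with a minimal sketch. It assumes that each training video is given as a list of concept labels, one per annotated occurrence; the function name `cooccurrence_matrix` and the concept labels are hypothetical and serve only as toy data.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_matrix(videos):
    """Return phi[(a, b)] = #(a, b) / #(b), following Eq. (1).

    videos: list of lists of concept labels, one label per annotated
    occurrence (a hypothetical input format).
    """
    count = Counter()       # #(b): number of videos containing b
    pair_count = Counter()  # #(a, b): number of videos containing a and b
    for labels in videos:
        occurrences = Counter(labels)
        for a in occurrences:
            count[a] += 1
            # Self-co-occurrence: the video counts only if a occurs more than once.
            if occurrences[a] > 1:
                pair_count[(a, a)] += 1
        for a, b in permutations(occurrences, 2):
            pair_count[(a, b)] += 1
    concepts = sorted(count)
    return {(a, b): pair_count[(a, b)] / count[b]
            for a in concepts for b in concepts}


# Toy annotations (hypothetical concept labels).
videos = [
    ["loitering", "loitering", "arguing"],
    ["arguing", "pushing"],
    ["loitering"],
    ["arguing", "pushing", "pushing"],
]
phi = cooccurrence_matrix(videos)
```

In this toy data, "arguing" appears in every video that contains "pushing", but not the other way round, so the two off-diagonal entries differ, which is the asymmetry described above.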

3.1.2. Using GMCP for concept detection

Let the graph $G = \{V, L, w\}$ be the input to our concept detection model, where V denotes the set of nodes, L the set of edges and w the edge weights. The nodes in V are divided into disjoint clusters, where each cluster C represents one clip of the test video and the nodes within a cluster represent the concepts of that clip. This can be expressed as $C_{j} = \{α_{1}^{j}, α_{2}^{j}, α_{3}^{j}, \dots, α_{k}^{j}\}$ , where $α_{i}^{j}$ is the ith concept of the jth clip. L contains the edges connecting all pairs of nodes in V that belong to different clusters. The edge weight w is expressed as: $\begin{matrix} (2) & w (α_{i}^{j}, α_{l}^{m}) = Φ (α_{i}^{j}, α_{l}^{m}) \cdot ψ (α_{l}^{m}), \end{matrix}$ where $Φ (α_{i}^{j}, α_{l}^{m})$ computes the contextual agreement between concepts $α_{i}^{j}$ and $α_{l}^{m}$ coming from two separate clips. The k trained concept detectors are applied to each clip, and the component $ψ (α_{l}^{m})$ represents the confidence value of the lth concept detector applied to the mth clip. It should be noted that the edge weight $w (α_{i}^{j}, α_{l}^{m})$ is equivalent to the probability of both $α_{i}^{j}$ and $α_{l}^{m}$ occurring in the query video, and it is expressed as: $\begin{matrix} (3) & p (α_{i}^{j} \cap α_{l}^{m}) = p (α_{i}^{j} \mid α_{l}^{m}) \cdot p (α_{l}^{m}) . \end{matrix}$

The first term is obtained from the reference co-occurrence matrix and the second is the SVM confidence value. Thus the larger the edge weights the higher the probability for the parent concepts of $α_{i}^{j}$ and $α_{l}^{m}$ to co-occur in a test video.

To perform the concept detection, one concept is assigned to each clip of the test video. A subgraph of G, defined as $G_{s} = \{V_{s}, L_{s}, w_{s}\}$ , represents a feasible solution to the problem. The node set of the subgraph must contain exactly one node from each clip. $V_{s}$ , $L_{s}$ and $w_{s}$ are subsets of V, L and w respectively. A utility function assigns a score to the feasible solution $G_{s}$ . The function is expressed as follows: $\begin{matrix} (4) & U (G_{s}) = \frac{1}{h \cdot (h - 1)} \sum_{p = 1}^{h} \sum_{q = 1, q \neq p}^{h} w (V_{s} (p), V_{s} (q)), \end{matrix}$ where $V_{s} (p)$ is the node of $V_{s}$ selected from the pth clip.

Equation (4) aggregates all the possible pairwise relationships between different concepts in different clips, and the set of contextually consistent concepts is found using the following expression: $\begin{matrix} (5) & G_{s}^{*} = \underset{G_{s}}{\arg\max} \, U (G_{s}) = \underset{V_{s}}{\arg\max} \sum_{p = 1}^{h} \sum_{q = 1, q \neq p}^{h} w (V_{s} (p), V_{s} (q)) . \end{matrix}$

GMCP is used in solving the above combinatorial optimization problem. The objective of GMCP is to find a subgraph within a complete graph in such a way that the sum of the edge weights is optimized [2,11,24].
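To make equations (2), (4) and (5) concrete, the following minimal sketch solves a tiny instance by exhaustive enumeration. This is not the Mixed Binary Integer Programming solver of [24]: enumeration visits all $k^{h}$ selections and is only practical for toy sizes. The function names and all numeric values are hypothetical; `phi[a][b]` stands for $Φ (a, b)$ and `psi[q][b]` for the detector confidence of concept b on clip q.

```python
from itertools import product


def utility(selection, phi, psi):
    """Eq. (4): mean edge weight over all ordered pairs of distinct clips.

    selection[p] is the index of the concept chosen for clip p.
    """
    h = len(selection)
    total = 0.0
    for p in range(h):
        for q in range(h):
            if p != q:
                a, b = selection[p], selection[q]
                # Eq. (2): w = Phi(a, b) * psi(b evaluated on clip q)
                total += phi[a][b] * psi[q][b]
    return total / (h * (h - 1))


def solve_gmcp_bruteforce(phi, psi):
    """Eq. (5) by enumeration: one concept per clip, maximal utility."""
    k, h = len(phi), len(psi)
    best = max(product(range(k), repeat=h),
               key=lambda s: utility(s, phi, psi))
    return best, utility(best, phi, psi)


# Toy instance: k = 2 concepts, h = 3 clips (hypothetical values).
phi = [[0.9, 0.1],
       [0.2, 0.8]]
psi = [[0.60, 0.40],
       [0.45, 0.55],
       [0.70, 0.30]]

best, score = solve_gmcp_bruteforce(phi, psi)
# Baseline for comparison: pick the most confident detector per clip.
independent = tuple(max(range(2), key=lambda c: psi[q][c]) for q in range(3))
```

In this instance the per-clip baseline labels the middle clip with concept 1, while the clique solution assigns concept 0 to all three clips because the co-occurrence prior outweighs the small confidence margin on that clip.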

3.1.3. Using GMCP for video classification

By representing a video class based on the co-occurrence of its concepts, GMCP can also be used to classify the video. As stated in [24], a class-specific co-occurrence matrix $Φ^{'}$ can be expressed as: $\begin{matrix} (6) & Φ^{'} (a, b, ε) = p (a \mid b, ε) = \frac{\#_{ε} (a, b)}{\#_{ε} (b)}, \end{matrix}$ where $\#_{ε} (b)$ denotes the number of training videos of class ε that contain concept b, and $\#_{ε} (a, b)$ denotes the number of training videos of class ε that contain both concepts a and b. As such, $Φ^{'} (\cdot, \cdot, ε)$ contains the pattern of concept co-occurrences for class ε. To perform the classification using GMCP, the input graph $G^{'}$ representing the test video can be expressed as $G^{'} = \{V, L, w^{'}, ε\}$ . The set of nodes V and edges L are the same as in G, and the edge weight $w^{'}$ is computed as follows: $\begin{matrix} (7) & w^{'} (α_{i}^{j}, α_{l}^{m}, ε) = Φ^{'} (α_{i}^{j}, α_{l}^{m}, ε) \cdot ψ (α_{l}^{m}), \end{matrix}$ where ε is the class whose co-occurrence matrix is used to compute the edge weights. Assuming the dataset has E classes, E different input graphs $G^{'}$ can be formed for a test video. Similar to the concept detection method, a feasible solution to this classification problem can be obtained by defining a subgraph of $G^{'}$ given as $G_{s}^{'} = \{V_{s}, L_{s}, w_{s}^{'}, ε\}$ . The utility function, which assigns one score per class (E scores in total) to the feasible solution $G_{s}^{'}$ , can be expressed as: $\begin{matrix} (8) & U^{'} (G_{s}^{'}, ε) = \frac{1}{h \cdot (h - 1)} \sum_{p = 1}^{h} \sum_{q = 1, q \neq p}^{h} w^{'} (V_{s} (p), V_{s} (q), ε) . \end{matrix}$

By solving the optimization problem, the class to which the test video belongs can be found as follows: $\begin{matrix} (9) & \{G_{s}^{*}, ε^{*}\} = \underset{G_{s}^{'}, ε}{\arg\max} \, U^{'} (G_{s}^{'}, ε) . \end{matrix}$ Here $ε^{*}$ represents the class with the highest score and $G_{s}^{*}$ represents the optimal subgraph found. In short, a test video is represented E times using E different co-occurrence matrices, and GMCP is solved for each. Due to space constraints, readers are referred to [24], where Mixed Binary Integer Programming (MBIP) is proposed for solving GMCP.
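The classification rule in equations (8) and (9) can be sketched in the same spirit: the clique problem is solved once per class and the class with the highest score is returned. The sketch again uses exhaustive enumeration in place of MBIP; the function names, the class label "fighting" and all numbers are hypothetical.

```python
from itertools import product


def best_clique_score(phi, psi):
    """Maximum of Eq. (8) over all one-concept-per-clip selections."""
    k, h = len(phi), len(psi)

    def score(sel):
        return sum(phi[sel[p]][sel[q]] * psi[q][sel[q]]
                   for p in range(h) for q in range(h) if p != q) / (h * (h - 1))

    return max(score(sel) for sel in product(range(k), repeat=h))


def classify(class_phis, psi):
    """Eq. (9): evaluate every class-specific matrix and keep the best class."""
    scores = {label: best_clique_score(phi, psi)
              for label, phi in class_phis.items()}
    return max(scores, key=scores.get), scores


# Two classes with class-specific co-occurrence matrices (toy values).
class_phis = {
    "loitering": [[0.9, 0.1], [0.2, 0.3]],
    "fighting": [[0.2, 0.3], [0.1, 0.9]],
}
# Detector confidences: 3 clips x 2 concepts (toy values).
psi = [[0.3, 0.7], [0.4, 0.6], [0.2, 0.8]]

label, scores = classify(class_phis, psi)
```

Here concept 1 is confidently detected in every clip and co-occurs with itself mainly in the "fighting" matrix, so that class obtains the highest clique score.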

3.2. Crime predictive analysis

As indicated earlier, this study adopts a neuro-fuzzy approach for predicting crime imminence levels across the study area. Specifically, an adaptive neuro-fuzzy inference system (ANFIS) [15] is used. As the name implies, ANFIS exploits the strengths of both Artificial Neural Networks (ANN) and Fuzzy Inference Systems (FIS). In such a combination, the learning capability is an advantage from the viewpoint of the FIS, and the formation of a linguistic rule base is an advantage from the viewpoint of the ANN.

3.2.1. Neuro-fuzzy model

ANFIS is a Sugeno-type fuzzy inference system. A typical Sugeno fuzzy rule is expressed as follows [15]: $\begin{matrix} (10) & \text{IF } x_{1} \text{ is } A_{1} \text{ AND } x_{2} \text{ is } A_{2} \text{ AND } \dots \text{ AND } x_{m} \text{ is } A_{m} \text{ THEN } y = f (x_{1}, x_{2}, \dots, x_{m}) . \end{matrix}$ Here $x_{1}, x_{2}, \dots, x_{m}$ are input variables and $A_{1}, A_{2}, \dots, A_{m}$ are fuzzy sets. When y is a constant, a zero-order Sugeno fuzzy model is obtained and the rule consequent is specified by a singleton. When y is a first-order polynomial, a first-order Sugeno fuzzy model is obtained: $\begin{matrix} (11) & y = k_{0} + k_{1} x_{1} + k_{2} x_{2} + \dots + k_{m} x_{m} . \end{matrix}$

Generally, ANFIS can be explained as a six-layer architecture. The first layer is the input layer, where the neurons simply pass external crisp signals to the next layer. The second layer is the fuzzification layer, and neurons in this layer perform fuzzification. The third layer is used to compute the rule antecedents. Each neuron in this layer corresponds to a single Sugeno-type fuzzy rule. The rule neurons receive inputs from the respective fuzzification neurons, which are then used to calculate the firing strength of each rule. The conjunction of the rule antecedents is evaluated by the product operator. Thus, the output of neuron i in the third layer is obtained as follows: $\begin{matrix} (12) & y_{i}^{(3)} = \prod_{j = 1}^{k} x_{i j}^{(3)}, \qquad y_{Π 1}^{(3)} = μ_{A 1} \times μ_{B 1} = μ_{1}, \end{matrix}$ where the second expression is the case of rule 1 with two antecedents, and the value of $μ_{1}$ represents the firing strength, or the truth value, of rule 1. The fourth layer normalizes the rule strengths. Each neuron in this layer receives inputs from all neurons in the rule layer, and the normalized firing strength of the given rule is calculated. The normalized firing strength is the ratio of the firing strength of a given rule to the sum of the firing strengths of all rules. It represents the contribution of a given rule to the final result. Hence, the output of a neuron in the fourth layer can be obtained as follows, shown here for the first of four rules: $\begin{matrix} (13) & y_{N 1}^{(4)} = \frac{μ_{1}}{μ_{1} + μ_{2} + μ_{3} + μ_{4}} = {\bar{μ}}_{1} . \end{matrix}$

The fifth layer is the defuzzification layer, where the consequent parameters of the rules are determined. A defuzzification neuron calculates the weighted consequent value of a given rule using the expression: $\begin{matrix} (14) & y_{i}^{(5)} = x_{i}^{(5)} [k_{i 0} + k_{i 1} x_{1} + k_{i 2} x_{2}] = {\bar{μ}}_{i} [k_{i 0} + k_{i 1} x_{1} + k_{i 2} x_{2}] . \end{matrix}$ Here $x_{i}^{(5)}$ is the input and $y_{i}^{(5)}$ is the output of defuzzification neuron i in the fifth layer, and $k_{i 0}$ , $k_{i 1}$ and $k_{i 2}$ are the consequent parameters of rule i. Finally, the sixth layer computes the overall output as the summation of all incoming signals. The overall ANFIS output y is computed as follows: $\begin{matrix} (15) & y = \sum_{i = 1}^{n} x_{i}^{(6)} = \sum_{i = 1}^{n} {\bar{μ}}_{i} [k_{i 0} + k_{i 1} x_{1} + k_{i 2} x_{2}] . \end{matrix}$
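The layer-by-layer computation in equations (12) to (15) can be traced with a minimal forward-pass sketch for two inputs and four rules. Several choices here are assumptions that the text does not fix: Gaussian membership functions, two fuzzy sets per input, and the specific parameter values. The function names are hypothetical.

```python
import math
from itertools import product


def gaussian(x, center, sigma):
    """Gaussian membership function (an assumed shape)."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def anfis_forward(x1, x2, mfs1, mfs2, consequents):
    """Forward pass of a first-order Sugeno ANFIS with two inputs.

    mfs1, mfs2: lists of (center, sigma) pairs, one per fuzzy set.
    consequents: one (k0, k1, k2) triple per rule, ordered like
    itertools.product(mfs1, mfs2).
    """
    # Layer 2: fuzzification of the crisp inputs.
    mu_a = [gaussian(x1, c, s) for c, s in mfs1]
    mu_b = [gaussian(x2, c, s) for c, s in mfs2]
    # Layer 3: rule firing strengths via the product operator, Eq. (12).
    firing = [a * b for a, b in product(mu_a, mu_b)]
    # Layer 4: normalized firing strengths, Eq. (13).
    total = sum(firing)
    normalized = [f / total for f in firing]
    # Layer 5: weighted first-order consequents, Eq. (14).
    rule_out = [w * (k0 + k1 * x1 + k2 * x2)
                for w, (k0, k1, k2) in zip(normalized, consequents)]
    # Layer 6: overall output as the sum of rule outputs, Eq. (15).
    return sum(rule_out), normalized


# Two fuzzy sets ("low", "high") per input on a [0, 1] scale (toy values).
mfs1 = [(0.0, 0.4), (1.0, 0.4)]
mfs2 = [(0.0, 0.4), (1.0, 0.4)]
consequents = [(0.0, 0.1, 0.1), (0.1, 0.2, 0.4),
               (0.1, 0.4, 0.2), (0.2, 0.4, 0.4)]
y, weights = anfis_forward(0.5, 0.2, mfs1, mfs2, consequents)
```

A quick sanity check on such a sketch is that the normalized firing strengths sum to one, so when every rule shares the same consequent the output reduces to that single polynomial.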

3.2.2. Basic learning rule and definitions

Let us assume that a given adaptive network has L layers, with the kth layer having $\# (k)$ nodes. We denote the node in the ith position of the kth layer by $(k, i)$ and its node function (or node output) by $O_{i}^{k}$ . Since a node output depends on its incoming signals and its parameter set $(a, b, c)$ , we can express the node output as: $\begin{matrix} (16) & O_{i}^{k} = O_{i}^{k} (O_{1}^{k - 1}, \dots, O_{\# (k - 1)}^{k - 1}, a, b, c, \dots) . \end{matrix}$ Given a training data set with J entries, we can define the error measure for the jth entry as the sum of squared errors [15]: $\begin{matrix} (17) & E_{j} = \sum_{m = 1}^{\# (L)} {(T_{m, j} - O_{m, j}^{L})}^{2} . \end{matrix}$ Here $T_{m, j}$ is the mth component of the jth target output vector and $O_{m, j}^{L}$ is the mth component of the actual output vector. Thus the overall error measure can be computed as follows: $\begin{matrix} (18) & E = \sum_{j = 1}^{J} E_{j} . \end{matrix}$ To develop a learning procedure using gradient descent in E over the parameter space, we first compute the error rate $\frac{\partial E_{j}}{\partial O}$ for the jth training data and for each node output O. The error rate for the output node at $(L, i)$ can be computed as: $\begin{matrix} (19) & \frac{\partial E_{j}}{\partial O_{i, j}^{L}} = - 2 (T_{i, j} - O_{i, j}^{L}) . \end{matrix}$ For the internal node at $(k, i)$ , the error rate can be derived using the chain rule as follows: $\begin{matrix} (20) & \frac{\partial E_{j}}{\partial O_{i, j}^{k}} = \sum_{m = 1}^{\# (k + 1)} \frac{\partial E_{j}}{\partial O_{m, j}^{k + 1}} \frac{\partial O_{m, j}^{k + 1}}{\partial O_{i, j}^{k}}, \end{matrix}$ where $1 ⩽ k ⩽ L - 1$ . Thus the error rate of an internal node is a linear combination of the error rates of the nodes in the next layer. If α is a parameter of the given adaptive network, the derivative of the error measure with respect to α is: $\begin{matrix} (21) & \frac{\partial E_{j}}{\partial α} = \sum_{O^{*} \in S} \frac{\partial E_{j}}{\partial O^{*}} \frac{\partial O^{*}}{\partial α} . \end{matrix}$ Here S is the set of nodes whose outputs depend on α. The derivative of the overall error measure E with respect to α is then $\begin{matrix} (22) & \frac{\partial E}{\partial α} = \sum_{j = 1}^{J} \frac{\partial E_{j}}{\partial α} . \end{matrix}$ Hence, given a learning rate η, α can be updated using equation (23): $\begin{array}{ll} (23) & Δ α = - η \frac{\partial E}{\partial α}, \\ (24) & η = \frac{k}{\sqrt{\sum_{α} {(\frac{\partial E}{\partial α})}^{2}}}, \end{array}$ where k is the step size, the length of each gradient transition in the parameter space.
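The update in equations (23) and (24) amounts to moving a fixed distance k against the gradient direction. The following minimal sketch shows this on a toy one-parameter model; the function name `gradient_step`, the model $O = a x$ and the data are hypothetical.

```python
import math


def gradient_step(params, grads, k=0.1):
    """One update following Eqs. (23)-(24): delta = -eta * grad,
    with eta = k / ||grad||, so every step has Euclidean length k."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0:
        return list(params)  # stationary point: nothing to update
    eta = k / norm
    return [p - eta * g for p, g in zip(params, grads)]


# Toy model O = a * x with a single parameter a (hypothetical data).
data = [(1.0, 2.0), (2.0, 4.0)]  # (input x, target T)
a = 1.0
# dE/da = sum_j -2 (T_j - O_j) * dO_j/da, combining Eqs. (19), (21), (22).
grad_a = sum(-2.0 * (t - a * x) * x for x, t in data)
(a_new,) = gradient_step([a], [grad_a], k=0.1)
```

Because the step length is fixed by k rather than by the gradient magnitude, the parameter moves by exactly 0.1 towards the value a = 2 that fits the toy data.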

Fig. 2.

Study area with grid plan, camera network processing points, road network and zonal demarcations.

3.2.3. Hybrid learning rule

In adaptive network learning, a combination of the gradient method and the least squares estimate is used to update the network parameters. Each epoch of this hybrid learning procedure is composed of a forward pass and a backward pass [15]. Consider an adaptive network with only one output: $\begin{matrix} (25) & output = F (I, S), \end{matrix}$ where I is the vector of input variables, S is the set of parameters and F is the function implemented by the ANFIS. If there exists a function H such that the composite function $H \circ F$ is linear in some elements of S, then these elements can be identified by the least squares method. In other words, suppose the parameter set S can be decomposed into two sets $S = S_{1} \oplus S_{2}$ (⊕ means direct sum) such that $H \circ F$ is linear in the elements of $S_{2}$ . Applying H to equation (25) then gives: $\begin{matrix} (26) & H (output) = H \circ F (I, S), \end{matrix}$ which is linear in the elements of $S_{2}$ . Given values of the elements of $S_{1}$ , it is possible to plug the J training data into (26) and obtain a matrix equation: $\begin{matrix} (27) & A X = B, \end{matrix}$ where X is an unknown vector whose elements are the parameters in $S_{2}$ . Given that $| S_{2} | = M$ , where M is the number of linear parameters, the dimensions of A, X and B are $J \times M$ , $M \times 1$ and $J \times 1$ respectively. Since J is usually greater than M, there is generally no exact solution to equation (27). Instead, a least squares estimate (LSE) of X, denoted $X^{*}$ , is sought to minimize the squared error $‖ A X - B ‖^{2}$ . $X^{*}$ is computed using the pseudo-inverse of A: $\begin{matrix} (28) & X^{*} = {(A^{T} A)}^{- 1} A^{T} B, \end{matrix}$ where $A^{T}$ is the transpose of A and ${(A^{T} A)}^{- 1} A^{T}$ is the pseudo-inverse of A, provided that $A^{T} A$ is non-singular.

If we denote the ith row vector of the matrix A in equation (27) by $a_{i}^{T}$ and the ith element of B by $b_{i}^{T}$ , then X can be computed iteratively using the following expression: $\begin{matrix} (29) & \left. \begin{matrix} X_{i + 1} = X_{i} + S_{i + 1} a_{i + 1} (b_{i + 1}^{T} - a_{i + 1}^{T} X_{i}), \\ S_{i + 1} = S_{i} - \frac{S_{i} a_{i + 1} a_{i + 1}^{T} S_{i}}{1 + a_{i + 1}^{T} S_{i} a_{i + 1}}, \\ i = 0, 1, \dots, J - 1, \end{matrix} \right\} \end{matrix}$ where $S_{i}$ is called the covariance matrix (not to be confused with the parameter sets $S_{1}$ and $S_{2}$ ). The least squares estimate $X^{*}$ is equal to $X_{J}$ . The initial conditions to bootstrap equation (29) are $X_{0} = 0$ and $S_{0} = γ I$ , where γ is a large positive number and I is the identity matrix of dimension $M \times M$ . For a multi-output adaptive network, equation (29) still applies, but the output in (25) becomes a column vector [15].

Each epoch consists of a forward pass and a backward pass. In the forward pass, the input data and functional signals are propagated forward to compute the output of each node until the matrices A and B in equation (27) are obtained, and the parameters in $S_{2}$ are identified using the sequential least squares formulae given in (29). After the parameters in $S_{2}$ have been identified, the functional signals continue forward until the error measure is computed. In the backward pass, the error rates propagate from the output layer to the input layer, and the parameters in $S_{1}$ are updated using the gradient method given in (23). It should be noted that the above procedure is for offline learning. For online learning, the squared error measure is formulated using a weighting strategy in order to give higher priority to more recent data pairs. Thus a forgetting factor λ is added to equation (29): $\begin{matrix} (30) & \begin{matrix} S_{i + 1} = \frac{1}{λ} [S_{i} - \frac{S_{i} a_{i + 1} a_{i + 1}^{T} S_{i}}{λ + a_{i + 1}^{T} S_{i} a_{i + 1}}], \\ i = 0, 1, \dots, J - 1 . \end{matrix} \end{matrix}$ The value of λ is between 0 and 1. The smaller λ is, the faster the effects of old data decay.
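The sequential formulae (29) and (30) can be checked numerically with a small sketch. It uses plain Python lists rather than a matrix library, and it relies on the covariance matrix staying symmetric so that $a^{T} S = (S a)^{T}$. The function name `recursive_least_squares` and the toy data, generated from $y = 2 + 3 x$, are hypothetical.

```python
def recursive_least_squares(rows, targets, gamma=1e6, lam=1.0):
    """Sequential least squares following Eq. (29), with the optional
    forgetting factor lam of Eq. (30); lam = 1.0 recovers Eq. (29).

    rows: the row vectors a_i^T of A; targets: the elements b_i of B.
    """
    m = len(rows[0])
    x = [0.0] * m                                    # X_0 = 0
    s = [[gamma if i == j else 0.0 for j in range(m)]
         for i in range(m)]                          # S_0 = gamma * I
    for a, b in zip(rows, targets):
        sa = [sum(s[i][j] * a[j] for j in range(m)) for i in range(m)]  # S_i a
        denom = lam + sum(a[i] * sa[i] for i in range(m))  # lam + a^T S_i a
        # S_{i+1} = (S_i - (S_i a)(S_i a)^T / denom) / lam, using S_i symmetric.
        s = [[(s[i][j] - sa[i] * sa[j] / denom) / lam for j in range(m)]
             for i in range(m)]
        sa_next = [sum(s[i][j] * a[j] for j in range(m)) for i in range(m)]
        err = b - sum(a[i] * x[i] for i in range(m))       # b - a^T X_i
        x = [x[i] + sa_next[i] * err for i in range(m)]    # X_{i+1}
    return x


# Toy linear consequent y = 2 + 3 * x1, observed without noise.
rows = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
targets = [2.0, 5.0, 8.0, 11.0]
estimate = recursive_least_squares(rows, targets)
estimate_forgetting = recursive_least_squares(rows, targets, lam=0.9)
```

With a large γ the recursion reproduces the batch least squares solution of equation (28) up to a small bias introduced by the finite initial covariance.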

3.3. Crime mapping

To visualize crime imminence levels and detect areas of high crime imminence, KDE is used. For this purpose, a simulated study area of 130 km² is created as shown in Fig. 2. The area is divided into 5 zones and includes terrain features such as roads. To deploy a surveillance camera network in the study area, a grid network with blocks of 1000 m² is overlaid on the area. Each grid block is then subdivided into $4 (n \times n)$ mini-blocks, and one video camera is assigned to each mini-block as its real-time video source. A local processing point (LPP) is created at the center of each grid block of interest. The cameras of each grid block transmit surveillance footage to their respective LPPs in real time, in a star topology. Video analysis and crime prediction are performed at the LPP level. The LPP output is finally transmitted to a central processing station (CPS), where crime mapping is performed.
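The mapping step can be sketched as a weighted kernel density estimate evaluated on a regular grid. The sketch assumes an isotropic Gaussian kernel with a single bandwidth and uses the predicted imminence level of each event as its weight; neither choice is specified in the text. The function name `kde_surface`, the coordinates (read as kilometres) and the weights are hypothetical.

```python
import math


def kde_surface(events, grid_x, grid_y, bandwidth):
    """Weighted 2-D Gaussian KDE evaluated on a regular grid.

    events: list of (x, y, weight) tuples, where weight is the predicted
    crime imminence level of an event (assumed weighting scheme).
    Returns surface[row][col] with rows following grid_y.
    """
    total_weight = sum(w for _, _, w in events)
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * total_weight)
    surface = []
    for gy in grid_y:
        row = []
        for gx in grid_x:
            acc = 0.0
            for ex, ey, w in events:
                d2 = (gx - ex) ** 2 + (gy - ey) ** 2
                acc += w * math.exp(-0.5 * d2 / bandwidth ** 2)
            row.append(norm * acc)
        surface.append(row)
    return surface


# Three predicted events: two strong ones close together, one weak one.
events = [(2.0, 2.0, 0.9), (2.2, 1.9, 0.8), (7.0, 7.0, 0.2)]
grid = [float(i) for i in range(10)]
surface = kde_surface(events, grid, grid, bandwidth=1.0)
# Grid cell (row, col) with the highest estimated density.
peak = max(((r, c) for r in range(len(grid)) for c in range(len(grid))),
           key=lambda rc: surface[rc[0]][rc[1]])
```

The bandwidth controls how far each event spreads over the surface: a larger value merges nearby events into one broad hotspot, while a smaller value keeps them separate.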

4. Experiments and results

4.1. Data collection

For the experiments, the violent scene detection (VSD) 2014, Hollywood Human Action (HOHA) and HMDB datasets were used. The VSD2014 dataset features diverse outdoor scenes and events that bear close alignment to real-world scenarios [25]. The dataset is collected from various sources, including user-generated videos shared on the web and popular Hollywood movies ranging from very violent to nonviolent. HOHA contains short video clips from 32 movies [21]. HMDB is collected from various sources, mostly movies. The dataset contains 6849 clips divided into 51 action categories, each containing a minimum of 101 clips [20].

4.2. Video analysis experiments

To begin, 10 crime indicator events are defined, and 1200 short clips from the VSD dataset, 1500 short clips from HOHA and 1000 short clips from HMDB are sampled based on the event definitions. The clips are of different lengths, ranging from 3 minutes to 5 minutes, and they are sampled at a frame rate of 20 Hz with a pixel resolution of 640 × 480. The clips are annotated and used for training and validation. Annotation is done at the video frame level, and all moving objects are marked and tracked by bounding boxes.

Table 1
Comparison of average accuracy (%) results for GMCP and SVM concept detectors using different descriptors

Descriptor	Individual SVM detectors	GMCP
HOGHOF	60.74%	58.05%
MBH	72.24%	72.58%
MBH + HOG + HOF	74.92%	78.41%
Cuboids	81.10%	87.50%

Fig. 3.

Comparison of different concept detectors using different descriptors on our sampled dataset.

To model the concept detectors, cuboid features [8] were used, based on the experimental results shown in Table 1. A feature codebook was constructed by applying K-means clustering to cuboid features extracted from the annotated clips. K-means was initialized 10 times and the result with the lowest error was kept. The number of visual words was fixed at 1500, which gave good results. Finally, 112 binary SVMs with a $χ^{2}$ RBF kernel were trained.
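The codebook representation and kernel described above can be sketched as follows. Each descriptor is assigned to its nearest visual word to form a normalized histogram, and two histograms are compared with one common form of the exponential $χ^{2}$ kernel, $\exp (- γ \sum_{i} (x_{i} - y_{i})^{2} / (x_{i} + y_{i}))$. The exact kernel definition and the value of γ are assumptions, and the tiny two-dimensional codebook and descriptors are hypothetical stand-ins for the 1500 visual words and cuboid features.

```python
import math


def bow_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword (squared Euclidean
    distance) and return the normalized bag-of-words histogram."""
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(d, codebook[i])))
        counts[nearest] += 1
    n = len(descriptors)
    return [c / n for c in counts]


def chi2_rbf_kernel(h1, h2, gamma=1.0):
    """Exponential chi-square kernel between two histograms
    (an assumed definition of the chi-square RBF kernel)."""
    dist = sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)
    return math.exp(-gamma * dist)


# Toy codebook with three visual words and two clips of four descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
clip_a = [(0.1, 0.0), (0.9, 1.1), (1.2, 0.8), (4.8, 5.1)]
clip_b = [(0.0, 0.2), (5.1, 4.9), (5.0, 5.2), (4.7, 5.0)]

hist_a = bow_histogram(clip_a, codebook)
hist_b = bow_histogram(clip_b, codebook)
similarity = chi2_rbf_kernel(hist_a, hist_b)
```

A kernel matrix built this way could be passed to any SVM implementation that accepts precomputed kernels; that step is omitted here.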

4.2.1. Evaluation of concept detection methods

To evaluate the proposed GMCP-based concept detection method, a 10-fold cross validation approach was used. The reference co-occurrence matrices were extracted from 9 folds of the 3700 sampled annotated clips, and the remaining fold was used for validation. As a baseline, 112 individual SVM concept detectors are applied to each annotated clip, and the class with the highest confidence is picked as the detected concept. For comparison with other methods, Linear Chain Conditional Random Fields (CRF) [16] and Discriminative Model Fusion (DMF) [13] were also implemented. Figure 3 shows the average accuracy of the various concept detectors using different descriptors.

From Fig. 3 it can be seen that GMCP performs better than the other concept detectors in nearly all cases (in Table 1, HOGHOF is the one descriptor for which the individual SVM detectors score higher), and the best performance was achieved using cuboids descriptors. It should be noted that in all cases the parameters were tuned to obtain the best performance.

4.2.2. Evaluation of the classification methods

To compare the GMCP-based classification method with other well-known classification methods, a multiclass SVM [23,24] and k-nearest neighbor classifiers (k-NNs) [19,29] were also implemented. Figure 4 is a chart showing the average classification accuracy of the GMCP classifier against the other classifiers. Figures 5, 6 and 7 are the confusion matrices for the SVM, K-NN and GMCP classifiers respectively.

From the bar chart in Fig. 4 it can be seen that the GMCP classifier outperforms the multiclass SVM and K-NN classifiers. The average accuracy values are 76.6%, 77.2% and 87.4% for the SVM, K-NN and GMCP classifiers respectively. It can also be observed that the accuracy of the GMCP classifier improved significantly as the number of clips per class increased.

The confusion matrix in Fig. 5 shows percentages per true class for the SVM classifier after performing 10-fold cross validation. The blue diagonal contains both the percentage and the number of correctly classified clips in each class. The red cells outside the diagonal contain the percentage of misclassified clips in each class. From the figure it can be observed that the SVM classifier recorded its best performance in the Verbal Threat class with a score of 95.8% and its worst performance in the Loitering class with a score of 13.3%.

Figure 6 shows percentages per true class for the K-NN classifier after performing 10-fold cross validation. Unlike the SVM classifier, K-NN records its best performance in the Camera Tampering class (with a score of 99.2%) and its worst performance in the Forceful Seizure class (with a score of 46.7%). More importantly, with K-NN some improvements can be observed in the classes where SVM performed worst.

Figure 7 shows the percentages per true class for the GMCP classifier after 10-fold cross validation. It can be seen that there are significant improvements over both the SVM and K-NN classifiers. For instance, in the class where K-NN had its worst misclassification, GMCP had one of its best performances (with a score of 96.7%). However, there are some observable similarities in the performance of all three classifiers, in the sense that they all recorded low values in the Forceful Seizure and Suspicious Exits classes.
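The "percentages per true class" reported in the confusion matrices correspond to row-normalized counts. A minimal sketch of that computation is given below; the class names are taken from the text, but the labels and counts are hypothetical toy data and do not reproduce Figs 5 to 7.

```python
def confusion_counts(true_labels, predicted_labels, classes):
    """Count clips per (true class, predicted class) pair."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def percent_per_true_class(matrix):
    """Normalize each row so that it sums to 100% (rows are true classes)."""
    rows = []
    for row in matrix:
        total = sum(row)
        rows.append([100.0 * v / total if total else 0.0 for v in row])
    return rows

# Hypothetical toy labels, for illustration only.
classes = ["Loitering", "Verbal Threat", "Forceful Seizure"]
true_labels = ["Loitering"] * 4 + ["Verbal Threat"] * 4 + ["Forceful Seizure"] * 2
predicted = (["Loitering"] * 3 + ["Verbal Threat"]
             + ["Verbal Threat"] * 4
             + ["Forceful Seizure", "Loitering"])
counts = confusion_counts(true_labels, predicted, classes)
percents = percent_per_true_class(counts)
```

The diagonal of `percents` holds the per-class accuracy, which is the quantity quoted for each classifier's best and worst classes.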

4.3. Crime prediction experiments

The experiments in this section are performed to model the crime predictive fuzzy inference system. This allows the extraction of the initial fuzzy rules from the training data and the tuning of the membership function parameters through adaptive learning. Four types of crime are considered, namely robbery, burglary, larceny and battery, but due to space constraints only the experiments for the robbery category are shown, since the procedure is similar for the rest. To start, the fuzzy inputs and output are extracted from the annotated training videos; 60% of the data is used for training and 40% for model checking and validation. Figure 8 is a plot of the fuzzy training inputs and output.

Fig. 4.

Comparing the average accuracy of classifiers.

Fig. 5.

Confusion matrix for SVM.

Fig. 6.

Confusion matrix for K-NN.

Fig. 7.

Confusion matrix for GMCP.

Fig. 8.

Plot of fuzzy inputs and fuzzy output for robbery class.

To construct the fuzzy inference system (FIS) model from the data, subtractive clustering is first performed using different clustering radii for different input variables (from 0.5 to 0.25). After obtaining the initial FIS model, the root mean square error is computed and used as a baseline for comparisons during adaptive learning. The system is then tested using the checking data and the results are shown in Fig. 9(a) and (b).
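For readers unfamiliar with subtractive clustering, the following is a simplified sketch of the idea: every data point is scored by a density "potential", the highest-scoring point becomes a cluster center, and the potential around it is subtracted before the next center is chosen. The per-input `radii` mirror the use of different radii for different input variables; the `squash` and `reject_ratio` values, the omission of the usual accept-ratio test and the toy data are assumptions made for brevity, so this is not the exact procedure used in the experiments.

```python
import math

def subtractive_clustering(points, radii, squash=1.25, reject_ratio=0.15):
    """Simplified subtractive clustering on data scaled to the unit hypercube."""
    if not points:
        return []

    def scaled_sq_dist(p, q):
        # Each input dimension is scaled by its own clustering radius.
        return sum(((a - b) / r) ** 2 for a, b, r in zip(p, q, radii))

    # Initial potential: points with many close neighbors score highest.
    potential = [
        sum(math.exp(-4.0 * scaled_sq_dist(p, q)) for q in points) for p in points
    ]
    first_peak = max(potential)
    centers = []
    while True:
        best = max(range(len(points)), key=potential.__getitem__)
        peak = potential[best]
        if peak < reject_ratio * first_peak:
            break  # remaining points are too sparse to form a new cluster
        centers.append(points[best])
        # Suppress the potential near the new center (wider radius via squash).
        potential = [
            pot - peak * math.exp(-4.0 * scaled_sq_dist(p, points[best]) / squash ** 2)
            for p, pot in zip(points, potential)
        ]
    return centers

# Hypothetical two-input data with two dense groups.
data = [(0.1, 0.1), (0.12, 0.1), (0.1, 0.12),
        (0.9, 0.9), (0.88, 0.9), (0.9, 0.88), (0.92, 0.9)]
centers = subtractive_clustering(data, radii=[0.5, 0.25])
```

In a data-driven FIS, each cluster center found this way typically seeds one fuzzy rule, so smaller radii tend to produce more rules.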

Fig. 9.

(a) Model before training. (b) Model after 40 epochs of training.

Figure 9(a) is a plot of the untrained model output against the checking data. After observing the model performance in Fig. 9(a), where the checking error is above 200, the first ANFIS optimization experiment is performed using 40 training epochs with an error target of 0 and an initial step size of 0.1. Figure 9(b) is the plot of the initially improved model output against the checking data. As can be seen in Fig. 9(b), the checking error is now below 100. To further optimize the model and also test for overfitting, the model is trained again with 200 epochs and the results are shown in Figs 10(a) and (b).

Fig. 10.

(a) Testing for overfitting. (b) Comparison of initial and final Improved Models.

Figure 10(a) is the plot of the improved model output against the checking data. From the figure, it can be seen that there is no sign of overfitting, because the model output follows the trend of the checking data without being fitted tightly to individual points. This suggests that the fuzzy system is able to generalize and can therefore be used to make predictions with different data. Figure 10(b) is the comparison of the initial trained model and the final model against the checking data.

Fig. 11.

Error plots for (a) Training, (b) checking.

Figure 11(a) is the plot of the training error. The lowest training error is 0.6968, which occurs around the 23rd epoch, after which it remains steady. Figure 11(b) is the plot of the checking error; the lowest checking error, 2.8751, occurs at the 170th epoch, after which it remains steady even as ANFIS continues trying to minimize the error up to the 200th epoch. The checking error has improved greatly: the initial checking error of 241.6922 has been reduced to 2.8751. Hence the plot indicates that the model has the ability to generalize over the checking data. Figure 12 is the generated ANFIS diagram and Fig. 13 is the modeled FIS structure.
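The error figures above can be related through a short sketch: the root mean square error used as the baseline measure, the selection of the epoch with the lowest checking error, and the relative reduction in checking error. The two error values are taken from the text; the function names are illustrative.

```python
import math

def rmse(predicted, target):
    """Root mean square error between model outputs and reference values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target))

def best_epoch(checking_errors):
    """Return the 1-based epoch at which the checking error is lowest."""
    return min(range(len(checking_errors)), key=checking_errors.__getitem__) + 1

# Relative reduction of the checking error reported in the text.
initial_error, final_error = 241.6922, 2.8751
reduction_percent = 100.0 * (initial_error - final_error) / initial_error
```

With these values the checking error falls by roughly 98.8%. Keeping the parameters from the epoch returned by `best_epoch` is a common safeguard against overfitting when the checking error starts to rise again.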

Fig. 12.

Crime prediction ANFIS diagram.

Fig. 13.

Robbery Imminence Prediction Fuzzy Inference System (RIPFIS) Diagram. RIPFIS is a Sugeno-type inference system with 7 inputs, 33 rules and 1 output as can be seen from the figure.

The generated diagram has five layers, with the first layer being the input layer where neurons simply pass external crisp signals to the next layer. The second layer contains the input membership functions and the neurons in this layer perform fuzzification. The third layer is used to compute the rule antecedent. The rule neurons receive inputs from their respective fuzzification neurons which are then used to calculate the firing strength of each rule. The conjunction of the rule antecedents is evaluated by the “AND” operator indicated in blue in Fig. 12. The fourth layer contains the output membership functions. This layer performs defuzzification and determines the consequent parameters of the rules. The fifth layer computes the overall output using the summation of all incoming signals. After adaptive learning, a Robbery Imminence Prediction Fuzzy Inference System (RIPFIS) is created as shown in Fig. 13. RIPFIS is a Sugeno-type inference system with 7 inputs, 33 rules and 1 output.
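The layer-by-layer description can be condensed into a minimal sketch of first-order Sugeno inference. It assumes Gaussian input membership functions (as stated for RIPFIS in the caption of Fig. 14), a product "AND" operator and a weighted-average output; the two toy rules and all parameter values are hypothetical, whereas RIPFIS itself has 7 inputs and 33 rules.

```python
import math

def gaussian_mf(x, center, sigma):
    """Gaussian membership degree of x (fuzzification layer)."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

def sugeno_predict(inputs, rules):
    """First-order Sugeno inference with product AND and weighted-average output."""
    weighted_sum = 0.0
    total_strength = 0.0
    for rule in rules:
        # Rule layer: firing strength is the product of the antecedent memberships.
        strength = 1.0
        for x, (center, sigma) in zip(inputs, rule["antecedents"]):
            strength *= gaussian_mf(x, center, sigma)
        # Consequent layer: linear function of the crisp inputs.
        consequent = sum(c * x for c, x in zip(rule["coeffs"], inputs)) + rule["bias"]
        weighted_sum += strength * consequent
        total_strength += strength
    # Output layer: normalized sum of the rule outputs.
    return weighted_sum / total_strength if total_strength > 0 else 0.0

# Two hypothetical rules over (robbery rate, time of incident).
rules = [
    {"antecedents": [(2.0, 1.0), (10.0, 3.0)], "coeffs": [0.0, 0.0], "bias": 1.0},
    {"antecedents": [(6.0, 1.0), (14.0, 3.0)], "coeffs": [0.0, 0.0], "bias": 9.0},
]
midway = sugeno_predict([4.0, 12.0], rules)
near_first_rule = sugeno_predict([2.0, 10.0], rules)
```

An input lying midway between the two rule centers fires both rules equally, so the output is the average of the two consequents; an input at the first rule's center yields an output close to that rule's consequent.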

Figure 14 is the plot of the input membership functions used to model the universe of discourse for the Robbery category. Figure 15(a) to (f) are surface plots of RIPFIS showing the functional relationship between Robbery Imminence (on the y-axis) and each of the other input variables, with Robbery Rates as the constant x-axis variable. The plots are created on a $15 \times 15$ grid (x-axis by y-axis) with 101 plot points.

Fig. 14.

Plot of membership functions for RIPFIS input variables. The system uses a Gaussian membership function for all the input variables and a linear membership function for the output variable.

Fig. 15.

Plots of output surface.

Figure 15(a) is the output surface of robbery rates and time of incident. From the figure it can be seen that robbery imminence is high between the hours of 10 and 15 (GMT) with a rate of 2 to 6 robberies. From Fig. 15(b) we can see that robbery imminence tends to increase as violent movements increase from 10 to 30 counts. From Fig. 15(c) it can be seen that, with 5 to 10 item seizures, there is an imminence level of 2 robberies if the robbery rate in the area is below 4. From Fig. 15(d) it can be seen that robbery imminence follows a fairly steady increasing trend as sneaking activities increase. From Fig. 15(e) it can be observed that the presence of lethal objects has more influence on robbery imminence than robbery rates. From Fig. 15(f) we can deduce that robbery rates have more influence on robbery imminence than traffic violations.

Fig. 16.

KDE map per square kilometer using VSD1 sampled data.

Fig. 17.

KDE map per square kilometer using VSD2 sampled data.

4.4. Crime mapping experiments

The experiments in this section are performed to visualize crime imminence across the study area. To do this, 2 separate query datasets named VSD1 and VSD2 are sampled from the VSD dataset. Each query dataset contains 800 clips and each clip has at least 100 tracks. The clips are assigned to designated grid blocks and the tracks of each clip are geo-referenced to mini-grids (within the designated grid blocks) in a sequential manner. On the whole, each grid has 8 clips and each mini-grid has at least 200 tracks, since there are 100 designated grids of interest and 4 mini-grids within each designated grid. Hence the mini-grids represent the primary surveillance sources hosting the input videos. For the “time of incident” fuzzy inputs, the time stamp on the video frame at the spatio-temporal interest point is taken; for simplicity, the time is normalized on an hourly basis. For the “crime rates” fuzzy inputs, predetermined values are assigned based on expert discretion.
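The clip and track counts quoted above can be checked with a small sketch. It assumes that clips are assigned to grid blocks in consecutive groups of 8 and that the tracks of each clip are distributed over the 4 mini-grids in round-robin order; this is one possible reading of "in a sequential manner", and the function name is illustrative.

```python
from collections import Counter

def assign_tracks(n_clips=800, tracks_per_clip=100, n_grids=100, minis_per_grid=4):
    """Count tracks per (grid, mini-grid) under a sequential assignment."""
    clips_per_grid = n_clips // n_grids  # 8 clips per designated grid
    tracks_in_mini = Counter()
    for clip in range(n_clips):
        grid = clip // clips_per_grid
        for track in range(tracks_per_clip):
            mini = track % minis_per_grid  # assumed round-robin geo-referencing
            tracks_in_mini[(grid, mini)] += 1
    return tracks_in_mini

tracks_in_mini = assign_tracks()
```

With the minimum of 100 tracks per clip, every one of the 400 mini-grids receives 8 × 25 = 200 tracks, which matches the lower bound stated in the text.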

To perform the mapping experiments, the query datasets are deployed separately on the study area in two consecutive sessions. In each session, video analysis is first performed and then four concurrent fuzzy inference systems are used to compute the crime imminence for each grid cell. It should be noted that the four fuzzy inference systems represent the crime categories; thus at each location four different crime types are computed. A KDE map and a prediction statistics map are generated at the end of each experimental session, as shown in Figs 16, 17 and 18. Before executing KDE, the distribution of the crime imminence data across the study area is first examined, and a standard deviation classification approach (grouping classes with similar values) is adopted, since the interest here is to identify clusters with very high values.

Figure 16 is the KDE map using the first sampled dataset (VSD1) and Fig. 17 is the KDE map using the second sampled dataset (VSD2). Standard deviation classification and a search radius of 1000 m were used in constructing both figures. The darkest spots in the figures represent predicted regions of high crime imminence.
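To make the role of the search radius concrete, the following sketch evaluates a weighted kernel density at a query location. The kernel used for the maps is not stated; a quartic (biweight) kernel, which is common in GIS density tools, is assumed here, and the event coordinates and imminence weights are hypothetical.

```python
import math

def quartic_kde(query, events, radius=1000.0):
    """Weighted density at `query` (x, y in meters) from events [(x, y, weight), ...].

    The result is per square meter; multiply by 1e6 for a per-square-kilometer value.
    """
    norm = 3.0 / (math.pi * radius ** 2)  # makes the kernel integrate to 1 over the disc
    density = 0.0
    for x, y, weight in events:
        d2 = (query[0] - x) ** 2 + (query[1] - y) ** 2
        if d2 < radius ** 2:  # events outside the search radius contribute nothing
            density += weight * norm * (1.0 - d2 / radius ** 2) ** 2
    return density

# Hypothetical predicted events: (x, y, crime imminence used as weight).
events = [(0.0, 0.0, 6.0), (500.0, 0.0, 4.0), (3000.0, 3000.0, 8.0)]
per_km2_at_origin = quartic_kde((0.0, 0.0), events) * 1e6
per_km2_far_away = quartic_kde((10000.0, 0.0), events) * 1e6
```

A larger radius smooths the surface and merges nearby clusters, while a smaller radius produces sharper, more localized peaks.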

Fig. 18.

Descriptive statistics of predicted crimes.

Figure 18 is a descriptive map showing histograms of predicted crime types at various locations. To make prediction at a particular location, a cut-off threshold of average crime imminence is set at 5. Hence any crime type with average imminence levels greater than 5 is predicted as an imminent crime in that location (it should be noted that the locations are defined by grid blocks). From Fig. 18, three types of crimes have been predicted for various locations with histogram bars indicating the strength of the predictions.
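The cut-off rule can be expressed as a short sketch. The threshold of 5 and the four crime categories come from the text; the imminence values for the example grid block are hypothetical.

```python
def predict_imminent_crimes(avg_imminence, threshold=5.0):
    """Return the crime types whose average imminence strictly exceeds the threshold."""
    return {crime: level for crime, level in avg_imminence.items() if level > threshold}

# Hypothetical average imminence levels for one grid block.
block = {"robbery": 7.2, "burglary": 5.0, "larceny": 6.1, "battery": 2.4}
predicted = predict_imminent_crimes(block)
```

Because the comparison is strict, a crime type whose average imminence is exactly 5 is not reported; the retained values can be used directly as the histogram bar heights.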

5. Discussion

The experiments were mainly focused on depicting regions of high crime imminence and identifying crime types with average imminence levels greater than a set threshold. In real world deployment, however, further exploration can be done to determine the behavior of crime-concentrated regions (hotspots), which could be acute or chronic. Determining the behavior and underlying factors in a crime imminence region is of great importance because it informs the kind of policing intervention that should be taken. For example, there can be situations where visible police presence provokes or triggers crime rather than preventing it. Such situations may involve gangsters and psychopaths who may simply take pleasure in committing crimes just to defy the presence of law enforcement officers. So, depending on the dynamics and the underlying factors, police may decide whether or not to go undercover.

More importantly the framework proposed in this study can be used to develop a real time field decision support system for police patrols. The system can be hosted on a secured network so that police on patrol duty can interactively access real time crime imminence statistics using tablets or smart phones. Besides that, such a system can also help in giving a visual impression of how policing actions are impacting the crime atmosphere in an area.

6. Conclusion

The paper presented a comprehensive approach to crime prediction. The approach exploited the prospects of video surveillance in a dimension that could be useful in developing wide area crime early warning systems, crime hotspot monitoring systems and police patrol decision support systems. Principally, the framework is grounded on theories of criminal behavior, so a system developed based on this framework will work best in predicting crimes that fall under the rational choice, routine activity and crime pattern theories.

For feasibility testing, the framework was implemented in a simulated study area. A wide area surveillance camera network was created and used to host sampled VSD datasets on the study area. Neuro-fuzzy inference systems were built based on the dataset hosted on the network. For the visualization of regions of high crime imminence, KDE was used. On the whole, the experiments were demonstrative, and going forward the intention is to implement the framework in a real world case study.

Acknowledgements

This research is supported by Natural Science Foundation of China (61173122).

References

1. Aghababaei and Makrehchi, Mining social media content for crime prediction, in: 2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI), 2016, pp. 526–531. doi:10.1109/WI.2016.0089.

2. Althaus, Kohlbacher, Lenhof and Muller, A combinatorial approach to protein docking with flexible side chains, in: RECOMB, 2008.

3. Chen, Cho and S.Y. Jang, Crime prediction using Twitter sentiment and weather, in: 2015 Systems and Information Engineering Design Symposium, 2015, pp. 63–68. doi:10.1109/SIEDS.2015.7117012.

4. Cheng, Yang, Tang, Mao, Luo, Li and Wang, Distributed indexes design to accelerate similarity based images retrieval in airport video monitoring systems, in: 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2015, pp. 1908–1912. doi:10.1109/FSKD.2015.7382239.

5. L.E. Cohen and Felson, Social change and crime rate trends: A routine activity approach, American Sociological Review 44 (1979), 588–608, https://www-jstor-org.web.bisu.edu.cn/stable/2094589 . doi:10.2307/2094589.

6. Dawei, Wei, Henry, Melissa, Ping, Josue and Tomasz, Understanding the spatial distribution of crime based on its related variables using geospatial discriminative patterns, Computers, Environment and Urban Systems 39 (2013), 93–106. doi:10.1016/j.compenvurbsys.2013.01.008.

7. Detlef and Kruse, Neuro-fuzzy systems for function approximation, Fuzzy Sets and Systems 101 (1999), 261–271, http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.6740 . doi:10.1016/S0165-0114(98)00169-9.

8. Dollar, Rabaud, Cottrell and Belongie, Behavior recognition via sparse spatio-temporal features, in: IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005, pp. 65–72. doi:10.1109/VSPETS.2005.1570899.

9. Eckert-Gallup and Martin, Kernel density estimation (KDE) with adaptive bandwidth selection for environmental contours of extreme sea states, in: OCEANS 2016 MTS/IEEE Monterey, 2016, pp. 1–5. doi:10.1109/OCEANS.2016.7761150.

10. Z.B. Ehsan, Afshin, Massimo and Mubarak, Complex event recognition by latent temporal models of concepts, in: IEEE International Conference on Image Processing (ICIP), Paris, 2014, pp. 2373–2377. doi:10.1109/ICIP.2014.7025481.

11. Feremans, Labbe and Laporte, Generalized network design problems, European Journal of Operational Research 148(1) (2003), 1–13. doi:10.1016/S0377-2217(02)00404-6.

12. M.S. Gerber, Predicting crime using Twitter and kernel density estimation, Decision Support Systems 61 (2014), 115–125. doi:10.1016/j.dss.2014.02.003.

13. Iyengar and H.J. Nock, Discriminative model fusion for semantic concept detection and annotation in video, in: Proceedings of the Eleventh ACM International Conference on Multimedia, 2003, pp. 255–258. doi:10.1145/957013.957065.

14. Izadinia and Shah, Recognizing complex events using large margin joint low-level event model, in: Proceedings of the 12th European Conference on Computer Vision (ECCV), Part IV, 2012, pp. 430–444. doi:10.1007/978-3-642-33765-9_31.

15. J.-S.R. Jang, ANFIS: Adaptive-network-based fuzzy inference systems, IEEE Transactions on Systems, Man, and Cybernetics 23 (1993), 665–685. doi:10.1109/21.256541.

16. A.L.W. Jiang and Chang, Context-based concept fusion with boosted conditional random fields, in: International Conference on Acoustics, Speech and Signal Processing, Vol. 1, 2007. doi:10.1109/ICASSP.2007.366066.

17. Jingen, Qian, Omar, Saad, Amir, Ajay, Hui and Harpreet, Video event recognition using concept attributes, in: 2013 IEEE Workshop on Applications of Computer Vision (WACV), 2013, pp. 339–346. doi:10.1109/WACV.2013.6475038.

18. Johansson, Gåhlin and Borg, Crime hotspots: An evaluation of the KDE spatial mapping technique, in: 2015 European Intelligence and Security Informatics Conference, 2015, pp. 69–74. doi:10.1109/EISIC.2015.22.

19. Kaghyan and Sarukhanyan, Activity recognition using k-nearest neighbor algorithm on smartphone with tri-axial accelerometer, International Journal on Information Models and Analyses 1 (2012), 146–156.

20. Kuehne, Jhuang, Garrote, Poggio and Serre, HMDB: A large video database for human motion recognition, in: ICCV, 2011.

21. Laptev, Marszałek, Schmid and Rozenfeld, Learning realistic human actions from movies, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.

22. Nurudeen, Zou, Zhu and Zhao, Crime hotspot detection and monitoring using video based event modeling and mapping techniques, International Journal of Computational Intelligence Systems 10 (2017), 962–969. doi:10.2991/ijcis.2017.10.1.64.

23. Schuldt, Laptev and Caputo, Recognizing human actions: A local SVM approach, IEEE International Conference on Pattern Recognition 3 (2004), 32–36.

24. M.A. Shayan, R.Z. Amir and Mubarak, Video classification using semantic concept co-occurrences, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 2529–2536. doi:10.1109/CVPR.2014.324.

25. Sjöberg, Ionescu, Y.G. Jiang, V.L. Quang, Schedl and C.H. Demarty, The MediaEval 2014 affect task: Violent scenes detection, in: Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Spain, 2014.

26. Wang and M.S. Gerber, Using Twitter for next-place prediction, with an application to crime prediction, in: 2015 IEEE Symposium Series on Computational Intelligence, 2015, pp. 941–948. doi:10.1109/SSCI.2015.138.

27. K.-R. Wu, J.-M. Liang, Zhang, K.-Y. Li, Y.-T. Lin and Y.-C. Tseng, Smart surveillance with context and location sensitivity and quality control, in: 2016 IEEE 13th International Conference on Mobile Ad Hoc and Sensor Systems (MASS), 2016, pp. 363–364. doi:10.1109/MASS.2016.055.

28. A.R. Zamir, Dehghan and Shah, GMCP-Tracker: Global multi-object tracking using generalized minimum clique graphs, in: Computer Vision – ECCV 2012, 2012, pp. 343–356.

29. Zhang and Chen, Design of a monitoring system of airport boarding bridge based on ZigBee wireless network, in: 26th Chinese Control and Decision Conference (2014 CCDC), 2014, pp. 2486–2491. doi:10.1109/CCDC.2014.6852591.

30.