Abstract
Crowd panic detection (CPD) is crucial for preventing crowd disasters. Recent CPD approaches fail to address the change in crowd shape caused by perspective distortion within a frame and across frames. To this end, we design a simple but effective model, the multiscale spatial-temporal atrous-net and principal component analysis (PCA) guided one-class support vector machine (OC-SVM), i.e., MuST-POS, for CPD. The proposed model uses two multiscale atrous-nets to extract multiscale spatial and multiscale temporal features that model crowd scenes. PCA then reduces the dimension of the extracted multiscale features, which are fed into an OC-SVM that models normal crowd scenes. The outliers of the OC-SVM are treated as crowd panic behavior. Three publicly available datasets, the UMN, the MED, and the Pets-2009, are used to show the effectiveness of the proposed MuST-POS. MuST-POS achieves detection accuracies of 99.40%, 97.61%, and 98.37% on the UMN, the MED, and the Pets-2009 datasets, respectively, and performs better than recent state-of-the-art approaches.
Introduction
The rapid growth of the world population makes crowd monitoring a tedious task and leaves crowds frequently exposed to hazards. This creates significant demand for automatic surveillance systems that monitor crowd behavior. Many aspects of crowd behavior, such as crowd counting and density estimation [1, 2], crowd congestion-level analysis [3], and crowd anomaly detection [4–25], have been explored in the literature. Among these applications, crowd anomaly detection has been extensively explored by researchers, yielding a range of models for controlling crowd hazards. For crowd anomaly detection, state-of-the-art methods use handcrafted [4, 20] or deep features [6–15, 21–25]. Crowd anomaly detection is usually posed as a one-class classification (OCC) problem, in which the model learns features from normal crowd scenes and the outliers are treated as anomalies or panic. One-class classifiers such as the one-class Extreme Learning Machine (OC-ELM), the one-class support vector machine (OC-SVM), and the one-class Gaussian classifier have been widely used in the literature. Most of the existing crowd anomaly datasets, namely the UMN [26], the MED [27], and the Pets-2009 [28], have crowd panic situations in common. The UCSD [29] crowd anomaly dataset contains other kinds of anomalies, such as vehicles or cyclists entering a no-entry zone. However, examples of such anomalies are scarce, and state-of-the-art approaches handle them all with a single one-class classification approach, thereby neglecting the dissimilarity among different crowd anomalies. The available panic-type anomaly datasets (the UMN, the MED, the Pets-2009) share a similar crowd panic pattern (i.e., crowd escaping) and are captured in real-world scenarios. Based on these facts, this paper proposes a solution for crowd panic detection.
The most recent crowd panic detection (CPD) approaches [30–35] focus on modeling handcrafted or deep motion features from the crowd scene. Singh et al. [30] exploit probabilistic weighted optical flow (OF) magnitude and orientation histograms computed from motion attributes of the frames to model crowd dynamics, and adopt a one-class extreme learning machine (OC-ELM) for crowd panic detection. Ammar et al. [31] developed a hybrid model that uses holistic gradients of motion attributes followed by an LSTM to model crowd scenes, and proposed a prediction-error-based crowd panic detection approach. Aldissi et al. [32] analyzed the interaction of moving edges of crowd scenes in the frequency domain, using transforms such as the FFT and DCT, and proposed a clustering-based approach to distinguish between normal and panic behavior. Shehab et al. [33] examined the crowd dynamics from spatial gradients of the motion fields obtained with the Horn and Schunck optical flow algorithm, and used a minimum covariance determinant (MCD) estimator for crowd panic detection. Fradi et al. [34] combine crowd density information with motion information to recognize panic behavior. Wu et al. [35] proposed a Bayesian framework for crowd motion modeling, including motion magnitude and direction, to detect crowd panic behavior. Although the state-of-the-art crowd panic detection models [30–35] use motion attributes of the crowd scene in different ways, they share the following shortcomings. First, they do not address human shape variations due to perspective distortion in the crowd scene. Figure 1 shows an example of the scale issue caused by perspective distortion within a frame and across frames of the Pets-2009 crowd panic dataset. Second, the state-of-the-art approaches [30–35] rely on temporal or motion attributes for crowd panic detection but neglect the spatial features of the crowd scenes.

Example of crowd shape change due to perspective distortion in the Pets-2009 crowd panic dataset.
In the related research domain of crowd counting [36], scale-invariant features, also known as multi-scale features, have been used to deal with human shape change due to perspective distortion in the crowd scene. The authors of [36] extracted multi-scale spatial features from different layers of a multi-layer CNN for this purpose. However, scale variations appear both within a frame and across frames (see Fig. 1). Hence, both multi-scale spatial and multi-scale temporal features are essential for crowd panic detection to deal with crowd shape change due to perspective distortion.
Hence, by adopting the idea of SaCNN [36] and extending it to multi-scale spatial-temporal feature modeling, we are motivated to design an effective deep model, i.e., MuST-POS, for the CPD.
The main contributions are as follows. First, we design a two-stream multiscale 3D atrous convolutional neural network (CNN) to extract scale-relevant spatial-temporal features (i.e., multiscale features) from the crowd scene. Second, to minimize computational complexity, PCA is used to reduce the dimension of the extracted multiscale features. Third, an OC-SVM is used to model crowd scenes for detecting crowd panic behavior. Fourth, extensive experiments and an ablation study are conducted on three publicly available large-scale crowd panic datasets to show the effectiveness of the proposed model.
The paper is organized as follows: Section 2 reviews the literature, Section 3 explains the proposed work, Section 4 describes the datasets and the performance metrics, Section 5 presents the experiments and analyzes the results, Section 6 discusses the effect of each module of the proposed model, and Section 7 gives the conclusion and future work.
This section briefly reviews research on abnormal event detection (including crowd panic) and on crowd panic detection. Most of these approaches are non-object-centric: the model learns features that describe normal behavior patterns, and the outliers are treated as abnormal or panic events. Both traditional [4, 20] and deep-learning approaches [6–15, 21–25] have been explored for this purpose. Among traditional approaches, Lu et al. [4] proposed a sparsity-based crowd abnormality detection method that uses the inherent redundancy between video frames and achieves a frame rate of around 150 fps. Cheng et al. [5] proposed a hierarchical feature representation framework for local and global anomaly detection. Saligrama et al. [19] proposed a statistical approach that exploits spatial-temporal features for crowd anomaly detection. Recently, Lamba et al. [20] proposed a trajectory-based approach for crowd anomaly detection.
Deep learning approaches, including CNNs [6, 21–25], sequential models [7], generative models (autoencoders or encoder-decoder models [8–13] and Generative Adversarial Networks [14, 15]), and hybrid models [17, 18], have been extensively explored in the literature. Most of these approaches detect anomaly or panic at the frame level or the pixel level.
The One-Class Extreme Learning Machine (OC-ELM) [30], the Gaussian classifier [8, 37], the OC-SVM [9, 38], the reconstruction error [10, 18], the reconstruction-error-based regularity score [18], and the similarity score between the target and balanced distributions [25] have been adopted as classifiers for anomaly or panic detection. Zhou et al. [21] designed a spatial-temporal CNN for both frame-level and pixel-level anomaly detection; it extracts features along the spatial and temporal dimensions from volumes of patches. Bouindour et al. [22] used the first two convolution layers of a pre-trained AlexNet [39] to obtain spatial-temporal features from volumes of patches or frames. The authors then trained an OC-SVM on the normal frames, and the abnormal events are the outliers. Ravanbakhsh et al. [24] proposed a plug-and-play CNN model for crowd anomaly detection and extracted both semantic and motion features from the crowd video to detect local anomalies. The advantage of the plug-and-play CNN is that it does not need any fine-tuning. Bouindour et al. [25] exploited spatial-temporal features from volumes of frames using a modified pre-trained residual 3D-CNN. Song et al. [6] proposed a 3D-CNN for detecting anomalous events such as violent activities in video.
A sequential model, the bidirectional LSTM (BDLSTM), has been designed for violence-like crowd anomaly detection. The authors extracted histogram of oriented gradients (HOG) features from the frames and fed them into the BDLSTM model for real-time detection of violence-like anomalous crowd events in a football stadium.
Generative models like variants of autoencoders or encoder-decoder models and GANs have been exploited for crowd anomaly/panic detection. Sabokrou et al. [8] extracted a set of local and global descriptors for crowd anomaly detection as well as localization in real-time. The global descriptors are obtained using a sparse autoencoder. Two Gaussian classifiers are used for anomaly detection and localization. Xu et al. [9] proposed stacked denoised autoencoders to exploit appearance and motion cues from normal crowd scenes and adopted both early and late fusion followed by OC-SVM for anomaly detection. George et al. [10] exploit the HOFM features from the parallelepipeds of non-uniform spatial-temporal regions from the normal crowd videos. The extracted feature vectors are then fed into an autoencoder model for the detection of abnormal/panic events.
Tran et al. [11] proposed a convolutional autoencoder model to extract motion-related features from normal crowd videos and used an OC-SVM to detect crowd anomalies. CNN-based encoder-decoder models [12] have also been explored in the literature. Chong et al. [12] proposed a CNN-based spatial-temporal autoencoder trained on normal crowd sequences; abnormal events are detected based on the reconstruction error. Sabokrou et al. [13] proposed a deep cascade of 3D CNN-based autoencoders for crowd anomaly detection.
Ravanbakhsh et al. [14] proposed using GANs to detect crowd anomalies at the frame level and the pixel level. The authors use a GAN to learn the inherent motion patterns of normal crowd frames, and abnormal events are detected based on the reconstruction error. Ravanbakhsh et al. [15] also use a GAN for cross-channel crowd anomaly detection at the frame level as well as the pixel level. The model is trained on normal events, and abnormal events are detected based on the reconstruction error.
Many hybrid models have also been explored for anomaly or panic detection. For example, Zhuang et al. [17] proposed a deep end-to-end network with a CNN (Inception-V3) and a stacked differential LSTM for understanding crowd scenes with violence-based abnormal events. Yang et al. [18] proposed a CNN-based autoencoder LSTM model for detecting anomalous crowd events.
Several models have also been proposed specifically for crowd panic detection. Recently, Ammar et al. [31] proposed a real-time detection framework for crowd panic. It is a hybrid model in which handcrafted features extracted from the crowd video are followed by an LSTM that captures the temporal dependencies between frames. The model is trained on normal frames, and frames are classified as panic based on the prediction error. Aldissi et al. [32] proposed a frequency-domain crowd panic detection model that can process frames in real time, together with a clustering-based approach for detecting crowd anomalies. Shehab et al. [33] proposed a statistical feature learning approach to differentiate between crowd panic and non-panic events. However, these recent state-of-the-art approaches fail to address scale variation due to perspective distortion in the video. Moreover, their accuracy leaves room for improvement. Based on these findings, we are motivated to design a simple but effective MuST-POS model for the CPD. The following section explains the model in detail.
Proposed work
The overall block diagram of the proposed model is illustrated in Fig. 2, and the details of the proposed MuST-POS model are displayed in Fig. 3. Table 1 shows the details of the blocks used in the MuST-POS.

Overall block diagram of the proposed model.

The architecture of the proposed MuST-POS.
Block details of the MuST-POS
The following subsections explain the proposed work in detail: architecture details of MuST-3AN, pre-processing, multiscale appearance (spatial) and temporal feature extraction, dimension reduction, and crowd panic detection using OC-SVM.
According to Fig. 3, the proposed MuST-3AN has two streams: the multiscale appearance stream (MAS) and the multiscale temporal stream (MTS). Both streams are built from the same number of 3D atrous blocks, seven each, but differ in the number of kernels. The details of these blocks are given in Table 1. Each 3D_Atrous block has three layers: a dilated Conv3D layer, a Leaky ReLU activation layer, and a batch normalization (BN) layer. Each stream has four 3D average pooling (AP) layers at different stages of the network for multiscale analysis. Each AP layer downscales the incoming feature map to half its size. The network applies 3D Global Average Pooling (GAP) at different levels to extract the global average of the feature maps. We use zero padding in the atrous and average pooling layers.
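The size bookkeeping implied by this architecture can be sketched with a few lines of NumPy. This is only an illustration under stated assumptions: the kernel size (3) and dilation rate (2) are hypothetical, since Table 1 is not reproduced here, and "zero padding" is read as padding that yields an output of size ceil(n / 2) after a stride-2 pooling. The helper names are introduced for this sketch and are not part of the original work.

```python
import math
import numpy as np

def effective_kernel_extent(kernel_size: int, dilation: int) -> int:
    """Span covered along one axis by a dilated kernel: k + (k - 1)(d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def pooled_sizes(size: int, num_pools: int) -> list:
    """Side length after each halving pooling layer with zero padding."""
    sizes = []
    for _ in range(num_pools):
        size = math.ceil(size / 2)
        sizes.append(size)
    return sizes

def global_average_pool(feature_map: np.ndarray) -> np.ndarray:
    """Average a (depth, height, width, channels) map over all non-channel axes."""
    return feature_map.mean(axis=(0, 1, 2))

# Assumed 3-tap kernel with dilation 2: still 3 weights per axis, but a 5-wide span.
extent = effective_kernel_extent(kernel_size=3, dilation=2)
# Four AP layers applied to a 200-pixel side.
sizes = pooled_sizes(200, num_pools=4)
# A dummy 16-channel feature map collapses to a 16-dimensional descriptor.
descriptor = global_average_pool(np.ones((3, 25, 25, 16)))

print(extent)            # 5
print(sizes)             # [100, 50, 25, 13]
print(descriptor.shape)  # (16,)
```

The first helper shows why dilation enlarges the area covered without adding kernel parameters; the last shows how a GAP layer turns a feature map of any spatial size into one value per channel, which is what lets maps from different scales be concatenated.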
Pre-processing
In the pre-processing stage, we extract frames and volumes of frames from the video. The frames are rescaled to [200 × 200 × 3]. Let the N rescaled frames be denoted by the set S = {s_1, s_2, …, s_N}. Similarly, let the N volumes of frames be denoted by the set V = {v_1, v_2, …, v_N}. A volume of frames is obtained by stacking the grayscale frames at timestamps t, t - 1, and t - 2. Each volume in V is rescaled to [200 × 200 × 3].
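A minimal NumPy sketch of the volume construction is given below. It assumes the frames are already rescaled to 200 × 200, uses the ITU-R BT.601 luminance weights for the grayscale conversion (the text does not state which conversion is used), and simply skips the first two frames, which have no complete history. The function names are illustrative only.

```python
import numpy as np

# ITU-R BT.601 luminance weights; the grayscale conversion is an assumption.
GRAY_WEIGHTS = np.array([0.299, 0.587, 0.114])

def to_gray(frames: np.ndarray) -> np.ndarray:
    """Convert RGB frames of shape (..., H, W, 3) to grayscale (..., H, W)."""
    return frames[..., :3] @ GRAY_WEIGHTS

def build_volumes(frames: np.ndarray) -> np.ndarray:
    """Stack grayscale frames t, t-1, t-2 as the three channels of a volume.

    frames:  (N, H, W, 3) RGB frames, already rescaled.
    returns: (N - 2, H, W, 3); channel 0 is frame t, channel 2 is frame t-2.
    """
    gray = to_gray(frames.astype(np.float64))  # (N, H, W)
    return np.stack([gray[2:], gray[1:-1], gray[:-2]], axis=-1)

# Hypothetical clip of five random frames.
frames = np.random.default_rng(0).random((5, 200, 200, 3))
volumes = build_volumes(frames)
print(volumes.shape)  # (3, 200, 200, 3)
```

Because each volume has the same [200 × 200 × 3] shape as a color frame, the same network layout can be reused for the temporal stream.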
Multiscale spatial-temporal feature extraction
The MAS is designed to extract multiscale spatial features from the frame. The multiscale features are used to handle scale variation due to perspective distortion. The set S is the input to the MAS. The MAS has multiple stages of atrous 3D convolution layers. The dilated 3D convolution is used because it covers a larger area of the image or feature map while keeping the number of kernel parameters the same as in a normal convolution layer. For multiscale analysis, we use four AP layers at four different stages of the MAS module. The activated feature maps from 3D_Atrous_Block12, 3D_Atrous_Block14, 3D_Atrous_Block15, 3D_Atrous_Block16, and 3D_Atrous_Block17 are used for multiscale spatial (appearance) feature analysis. The selected feature maps at different scales are fed into GAP layers to obtain statistical features. Let the features obtained from the GAP layers be denoted as
The multiscale temporal features are extracted by the MTS module. The structure of the MTS is the same as that of the MAS except for the number of kernels. The set V is the input to the MTS module. The activated feature maps from 3D_Atrous_Block22, 3D_Atrous_Block24, 3D_Atrous_Block25, 3D_Atrous_Block26, and 3D_Atrous_Block27 are used for multiscale temporal feature analysis. The selected multiscale temporal feature maps are followed by GAP layers, which extract statistical features such as the mean from the input feature maps. Let the multiscale mean features be represented as
Now, these two multiscale spatial-temporal features are concatenated and represented as
The model must now be minimized, i.e.,
To solve the above optimization problem, we have used backpropagation with Adam optimizer [40]. The model is trained with a learning rate of 0.001. We have adopted early stopping to halt the network and also to avoid overfitting.
After training the MuST-3AN on the normal crowd videos, we extract the fused multi-scale features (f) of the video frames for panic detection. Let the multi-scale feature matrix of the video sequence be denoted as F, of size N × L, where N is the total number of sequences and L is the dimension of the multi-scale feature of each frame. The concatenated multi-scale spatial-temporal features (f) are one-dimensional vectors. These feature vectors could be given directly to the OC-SVM for crowd panic detection, but their high dimension would increase the computational cost of the OC-SVM. Hence, we use PCA to reduce the dimension of the multi-scale spatial-temporal features for the CPD.
PCA can be defined as an orthogonal transformation of a set of features from one coordinate system to another such that the greatest variance of any scalar projection of the feature set lies on the first coordinate, the second greatest variance on the second coordinate, and so on. The principal decomposition of the N × L feature matrix F can be formulated as

$$T = F W \qquad (3)$$
where F is the N × L multiscale feature matrix and W is the L × L square matrix whose columns are the eigenvectors of $F^{T}F$. W is also known as the sphering transformation. Retaining only the first l principal components (obtained from the first l eigenvectors, i.e., the first l columns of W, written $W_l$) gives the truncated transformation of Equation 3 as

$$T_l = F W_l$$
Here the dimension of $T_l$ is N × l. Empirically, we set the dimension of the PCA to l = 128.
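The truncated projection can be sketched in NumPy through an eigen-decomposition of the L × L matrix built from the features. This is a sketch under assumptions rather than the implementation used in the experiments: the features are mean-centered before the decomposition (standard for PCA but not stated in the text), and the matrix sizes N = 300 and L = 256 are hypothetical.

```python
import numpy as np

def pca_reduce(F: np.ndarray, l: int):
    """Project the rows of F (N x L) onto the first l principal components.

    Returns the reduced features T_l (N x l), the projection matrix W_l (L x l)
    and the feature mean, which is needed to project unseen samples.
    """
    mean = F.mean(axis=0)
    Fc = F - mean                                  # centering (assumed)
    eigvals, eigvecs = np.linalg.eigh(Fc.T @ Fc)   # symmetric L x L matrix
    order = np.argsort(eigvals)[::-1]              # sort by descending variance
    W_l = eigvecs[:, order[:l]]                    # first l eigenvectors
    return Fc @ W_l, W_l, mean

# Hypothetical feature matrix: N = 300 normal samples, L = 256 fused features.
rng = np.random.default_rng(0)
F = rng.normal(size=(300, 256))
T_l, W_l, mean = pca_reduce(F, l=128)
print(T_l.shape)  # (300, 128)

# At test time an unseen feature vector f is projected with the stored mean and W_l.
f_new = rng.normal(size=(256,))
t_new = (f_new - mean) @ W_l
```

Storing the mean and W_l from the normal training data matters here, because test frames must be projected with the same transformation before they are scored by the one-class classifier.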
The dimensionally reduced feature set is fed into the OC-SVM [41] to model the normal crowd scenes. Only one class of features is available for training the OC-SVM. In general, the OC-SVM maximizes the distance between the separating hyperplane in the feature space and the origin. In this way, a binary function is learned that captures a region surrounding the input data of the normal crowd scenes; it returns +1 if a data point lies within this region and -1 otherwise, the latter being the outliers, i.e., panic. To maximize the distance between the hyperplane and the origin, we solve the following quadratic programming problem [41].
Here k_i is the training sample corresponding to the i-th of the N normal sequences. The mapping function φ(·) maps the original feature space to a higher-dimensional space. ω is the normal to the hyperplane in the feature space, ν ∈ [0, 1] is the upper bound on the fraction of outliers (panic sequences), and ξ is the slack variable.
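For reference, the standard ν-formulation of the one-class SVM primal, written with the symbols defined above, is sketched below. It is given here as the commonly used textbook form and is assumed, not confirmed, to match the formulation of [41]; the offset ρ of the hyperplane is not named in the text and is introduced only for this sketch.

```latex
\min_{\omega,\;\xi,\;\rho}\quad
  \frac{1}{2}\lVert \omega \rVert^{2}
  + \frac{1}{\nu N}\sum_{i=1}^{N}\xi_i
  - \rho
\qquad \text{subject to}\quad
  \omega \cdot \varphi(k_i) \ge \rho - \xi_i,\quad
  \xi_i \ge 0,\quad i = 1,\dots,N

g(k) = \operatorname{sgn}\bigl(\omega \cdot \varphi(k) - \rho\bigr)
```

Under this form, the decision function g(k) returns +1 for samples inside the learned region (normal) and -1 for outliers (panic), and a smaller ν tolerates fewer training samples outside the region.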
This paper adopts three measures to avoid overfitting: first, the L2 norm is used to regularize all the kernel weights; second, early stopping is used to halt training; and third, data augmentation is performed during training. The learning rate is set to 0.001 and the L2 regularization coefficient to 0.01. The momentum of batch normalization (BN) and the alpha of Leaky ReLU are set to 0.95 and 0.1, respectively. We set the batch size to eight. Data augmentation is used during training to avoid overfitting on the small datasets: patches of size [200 × 200 × 3] that contain crowd scenes only are extracted from the original frames as well as from the volumes of frames. One third of the extracted patches are rotated by 25°, 30°, or 45°, chosen at random. The number of augmented samples is 70% of the number of original training samples.
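The patch-based augmentation can be sketched as follows. Several points are assumptions made for illustration: the rotation angles are read as 25°, 30° and 45°, patch positions are sampled uniformly at random (the text only says that patches contain crowd scenes), and the rotation itself is left to an image library, so the sketch only decides which patches get which angle. The helper names are hypothetical.

```python
import numpy as np

ANGLES = (25, 30, 45)  # degrees; assumed reading of the angles listed in the text

def sample_patches(frame: np.ndarray, num_patches: int, size: int = 200, rng=None) -> np.ndarray:
    """Crop random size x size patches from a frame of shape (H, W, C)."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = frame.shape[:2]
    patches = []
    for _ in range(num_patches):
        top = int(rng.integers(0, h - size + 1))    # upper bound is exclusive
        left = int(rng.integers(0, w - size + 1))
        patches.append(frame[top:top + size, left:left + size])
    return np.stack(patches)

def assign_rotations(num_patches: int, rng=None) -> np.ndarray:
    """Return one angle per patch: 0 for unrotated, else a random entry of ANGLES."""
    if rng is None:
        rng = np.random.default_rng()
    angles = np.zeros(num_patches, dtype=int)
    chosen = rng.choice(num_patches, size=num_patches // 3, replace=False)
    angles[chosen] = rng.choice(ANGLES, size=chosen.size)
    return angles

# Hypothetical 320 x 240 frame (the UMN resolution) and nine patches.
frame = np.zeros((240, 320, 3))
patches = sample_patches(frame, 9, rng=np.random.default_rng(0))
angles = assign_rotations(9, rng=np.random.default_rng(0))
print(patches.shape)  # (9, 200, 200, 3)
```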
Datasets and performance metrics
Datasets
For the experiments, we use three publicly available crowd panic datasets: the UMN, the MED, and the Pets-2009. Table 2 shows the properties of these three datasets.
Description of crowd panic datasets
The UMN [26] dataset contains videos having real-world scenarios of normal and crowd escape-like panic behavior. The dataset is captured from three different environments: lawn, indoor, and plaza. It contains eleven crowd videos with medium crowd density. The resolution of these sequences is [320 × 240]. The MED [27] dataset contains both normal as well as panic behavior in the crowd. The videos are captured from the walkways. Among 31 videos, 11 videos contain crowd normal as well as panic situations.
The crowd density varies from sparse to high. The resolution of this dataset is [854 × 480]. The Pets-2009 [28] dataset is captured from a campus view and contains real-world normal and panic crowd events. It contains 855 frames with low crowd densities at a resolution of [768 × 576]. We have performed data augmentation for the MED dataset during training. Fig. 4 shows some sample frames from the three datasets.

Examples of samples of the datasets. Figures (a), (b) and (c) are the examples of normal scenes of UMN S1, S2 and S3 respectively. Figures (d) and (e) are the normal scenes of Pets-2009 dataset. Figure (f) is the example of normal scene of MED dataset. Figures (g), (h) and (i) are the panic scenes of UMN S1, S2 and S3 respectively. Figures (j) and (k) are the panic scenes of Pets-2009. Figure (l) is the example of panic crowd scene of the MED dataset.
Let TP, FP, TN, FN, and ER denote the true positives, false positives, true negatives, false negatives, and error rate, respectively. In the available datasets, each sequence consists of normal frames followed by a panic situation. During training, only normal sequences are used to train the model. Given these facts, the following confusion matrix (Table 3) is the only one that can be drawn from the results on each sequence.
Confusion matrix
The following three performance measures have been used to evaluate the proposed model’s efficiency.
Here K is the total number of frames.
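The equations for the three measures are not reproduced in this text. Assuming they are the accuracy, the error rate (ER), and the F1-score in their standard frame-level definitions, with panic taken as the positive class, they can be computed as in the sketch below. The counts in the usage example are hypothetical and are not results from the paper.

```python
def panic_metrics(tp: int, fp: int, tn: int, fn: int):
    """Return (accuracy, error rate, F1-score) from frame-level counts.

    K = tp + fp + tn + fn is the total number of frames.
    Standard definitions are assumed; panic is the positive class.
    """
    k = tp + fp + tn + fn
    accuracy = (tp + tn) / k
    error_rate = (fp + fn) / k
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, error_rate, f1

# Hypothetical counts for one sequence of 200 frames.
accuracy, error_rate, f1 = panic_metrics(tp=90, fp=5, tn=100, fn=5)
print(round(accuracy, 4), round(error_rate, 4), round(f1, 4))  # 0.95 0.05 0.9474
```

With these definitions the accuracy and the error rate always sum to one, which agrees with the accuracy and ER pairs reported for the proposed model (for example, 99.40% and 0.60 on the UMN dataset).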
The code is written in Python using Keras and TensorFlow. The model was run on a laptop with an 8th-generation Intel i7 processor, 16 GB of RAM, and a 4 GB GPU, as well as on Google Colab.
The UMN dataset
Table 4 compares the performance of the proposed MuST-POS with other state-of-the-art panic detection approaches. The eleven sequences of the UMN dataset contain around 7739 frames in total, including both normal and panic crowd behavior. MuST-POS achieves an average detection accuracy of 99.40%, the same as DeepROD [31]. The ER of the proposed model is 0.60%. The performance of the proposed model is lower on the second scene (S2) of the UMN dataset, where it obtains an accuracy of 94.40%. Table 4 shows that the proposed model performs better than recent crowd panic detection works such as [30, 32–34]. Fig. 5 shows the results obtained on some samples of the UMN dataset. Figure 6 compares the average accuracy and average ER of several approaches on the UMN dataset using a stacked bar graph.
Comparative analysis of results on crowd panic datasets

Examples of output of the proposed model on the UMN dataset. Figures (a), (b), and (c) are normal sequences predicted as normal; figures (d), (e), and (f) belong to the start of panic behavior and are predicted as panic; figures (g), (h), and (i) are panic frames predicted as panic.

Comparison of average accuracy and average error rate between several approaches on three datasets.
From the performance of the proposed model on the UMN dataset, we conclude that extracting multiscale spatial-temporal features from the crowd scenes lets the model deal efficiently with the scale variation issue and improves its performance.
The performance analysis on the MED dataset is given in Table 4. The model is compared with the most recent work [31, 42]. Models such as [31, 32] and [42] obtain average detection accuracies of 95.60%, 94.50%, and 74.82%, respectively, whereas the proposed model obtains a detection accuracy of 97.61% with an error rate of 2.39%. The proposed model therefore ranks first in this comparison. Fig. 7 shows the results obtained on some samples of the MED dataset. Figure 6 shows a bar graph comparing the average accuracy and average ER of several approaches on the MED dataset. The performance of the proposed model on the MED dataset shows that multiscale spatial-temporal feature modeling is used efficiently for crowd panic detection.

Examples of output of the proposed model on the MED dataset. Figure (a) is a normal sequence predicted as normal, figure (b) shows the start of panic behavior predicted as panic, and figure (c) shows a panic frame predicted as panic.
The proposed model achieves a detection accuracy of 98.37% for crowd panic behavior on the Pets-2009 dataset, with an error rate of 1.63%. We followed the same training and testing criteria for the Pets-2009 dataset as in [30, 35]. The proposed model achieves the highest accuracy and the lowest false alarm rate compared with recent crowd panic detection approaches. Thus, the model can handle the scale issue due to perspective distortion in the crowd panic video datasets. Figure 8 shows example results on a few frames of the Pets-2009 dataset. Figure 6 shows a bar graph comparing the average accuracy and average ER of several approaches on the Pets-2009 dataset.

Examples of output of the proposed model on the Pets-2009 dataset. Figure (a) is a normal sequence predicted as normal, figure (b) shows the start of panic behavior predicted as panic, and figure (c) shows a panic frame predicted as panic.
The ablation study shows the effect of each stream of the proposed model. The MAS and MTS streams are the two critical modules designed to extract multiscale spatial and multiscale temporal features from the crowd panic videos, so it is essential to show the influence of each module on panic detection. We split the network into two modules: the multiscale spatial 3D atrous net and PCA guided OC-SVM (MuS-POS) and the multiscale temporal 3D atrous net and PCA guided OC-SVM (MuT-POS). In addition, we have also experimented with single-scale analysis using single-scale
Result analysis on the ablation study
The two streams perform differently on the UMN dataset, where the average accuracies of MuS-POS and MuT-POS are 95.54% and 96.78%, respectively. By these figures, the temporal stream performs better than the spatial stream. Combining the two streams (MuST-POS) improves the performance further. The UMN dataset is very challenging and contains different crowd panic situations in different environments. In contrast, the MED and Pets-2009 datasets have multiple crowd panic situations captured in a single environment. MuS-POS and MuT-POS obtain average detection accuracies of 92.63% and 93.26%, respectively, on the MED dataset, and 92.37% and 94.75%, respectively, on the Pets-2009 dataset.
In the single-scale analysis, the Single-Scale POS obtains (average accuracy, average F1-score) pairs of (94.80%, 96.63%), (89.34%, 93.11%), and (88.98%, 91.25%) on the UMN, MED, and Pets-2009 datasets, respectively. Fig. 9 shows some panic samples with crowd scale variation that are not detected by Single-Scale POS but are accurately detected by MuST-POS. Comparing the results of Single-Scale POS with those of Multi-Scale POS (MuST-POS), we conclude that multi-scale analysis is desirable not only for better accuracy but also for addressing scale change due to perspective distortion. The MuST-OS (without PCA), in turn, obtains pairs of (98.58%, 99.13%), (96.59%, 97.81%), and (96.96%, 97.56%) on the UMN, the MED, and the Pets-2009 datasets, respectively. The performance of MuST-OS (without PCA) is lower than that of MuST-POS.

Samples of panic situations which are detected as Normal by Single-Scale POS but are detected as Panic by the proposed MuST-POS.
This paper proposed a multiscale spatial-temporal 3D atrous net that models multiscale features to address the scale variation caused by perspective distortion in the crowd scene. Because the dimension of the extracted multiscale features is high, we adopted PCA to reduce it. We then adopted an OC-SVM to fit the normal crowd behavior; the outliers of the OC-SVM are the crowd panic events. We used three publicly available datasets for performance evaluation. The UMN dataset is the most challenging of the three, as its sequences are captured in three different environments with some challenging real-world situations.
The proposed model achieved average detection accuracies of 99.40%, 97.61%, and 98.37% on the UMN, MED, and Pets-2009 datasets, respectively. We conducted an ablation study to show the effect of the individual modules of the proposed MuST-POS model and the performance of Single-Scale POS and MuST-OS (without PCA). Table 5 shows that on the three datasets, MuT-POS and MuS-POS could not outperform the recent state-of-the-art approaches on their own, but combining the two models achieved one of the best results. On the MED and Pets-2009 datasets, however, both streams perform better than the state-of-the-art approaches. Single-Scale POS performs worse than MuST-POS in terms of accuracy, ER, and F1-score. Fig. 9 shows some samples with scale variation that Single-Scale POS cannot detect but MuST-POS detects accurately. The ablation study also compared the proposed model with and without PCA; here, too, MuST-POS achieves better results (Table 5) than MuST-OS. Considering these outcomes, we conclude that the proposed model handles crowd scale change due to perspective distortion effectively for crowd panic detection. However, research gaps remain in this domain: the availability of large-scale, varied crowd panic datasets and the exploration of cross-dataset or domain gaps for CPD. Future work will focus on addressing these gaps.
