Abstract
Imaging conditions, including background clutter, uneven crowd distribution, and variations in scale and perspective, make crowd counting difficult. Handling these problems effectively with deep neural networks requires prior information and multi-level contextual representations. Many crowd counting studies estimate density maps without segmentation, so a group of individuals is treated as a single entity. This article presents a hybrid method for crowd counting that combines Mask-RCNN (MRCNN) with a bidirectional convolutional long short-term memory network (ConvLSTM), denoted CC: MRCNN-biCLSTM. CC: MRCNN-biCLSTM builds on Mask-RCNN: it first segments instances and generates density maps, which are passed into adversarial learning during the training phase. The bidirectional convolutional LSTM is then used to return the metrics and counts of individuals within a group. Next, the proposed activity detection technique based on the Bayesian non-linear filter (AD-BNF) is used to identify a person’s activity. The proposed approach resolves the grouping of several people into a single entity and improves metric performance. Extensive experiments show that the proposed method outperforms state-of-the-art techniques on four widely used, challenging benchmarks in terms of counting accuracy and density map quality.
Introduction
Crowd analysis is a challenging subject within computer vision, and basic methods for detecting faces in crowds perform poorly because of occlusion, scene semantics, and overlapping subjects [1, 2]. Crowd counting techniques can also be generalized to other relevant areas, such as vehicle counting and environmental studies, without considering their specific characteristics [3, 8]. As a result, many researchers are working on crowd counting, and many excellent works are emerging. High-quality crowd density map predictions are useful for object recognition, action recognition, crowd analysis, and ultimately violence detection in crowd settings. However, counting groups of people is not easy because of general imaging conditions and crowd behavior. Deep learning techniques help to detect activities and estimate precise counts in large crowds in all weather conditions, and deep convolutional neural networks have been used to address these problems [5]. Most of these methods design feed-forward network architectures and predict density maps in a single pass. Such models do not consider semantic priors during feature extraction, which weakens the feature representation in challenging regions. Some previous methods have incorporated semantic priors through a multi-stage (or refinement) model to address this problem. For example, Liu et al. [6] use class activation maps from a pre-trained classification network to support the refinement of uncongested areas.
C. Liu et al. [7] proposed RAZNet, which actively zooms in on locally dense crowd regions to improve the initial prediction. Recently, Liu et al. [12] developed the cross-stage refinement network (CRNet) for crowd counting. In that work, the density maps are estimated, fed into adversarial learning, and used to calculate an approximate count based on the previous grayscale map. The result is then passed to a convolutional LSTM, which returns the metrics and count. CRNet calculates the density map directly without instance segmentation, so a group of several people may be counted as one object. Existing approaches have made great strides, but there is still room for improvement.
In addition to representing crowd features, monitoring a person’s activity [41] is a critical problem when predicting crowd density. The investigation of unusual human activity in crowded scenes is another critical problem. The motion influence map effectively reflects the speed of movement, the direction of movement, and the size of the objects or subjects and their interactions in a frame sequence [9].
We extend CRNet by Liu et al. [12], which estimates density maps directly without instance segmentation and therefore measures a group of many people as a single entity. Our extension solves this grouping problem of the existing model and improves metric performance. Suspicious human activity [41] is then detected using a model based on multiple CNNs (MCNN). First, we propose a hybrid algorithm for crowd counting using Mask-RCNN (MRCNN) with bidirectional ConvLSTM (CC: MRCNN-biCLSTM), consisting of several fully convolutional networks. Most low- and mid-level features are lost during ROI wrapping and convolution operations, so the top-level predictions alone do not carry enough information. In this article, we apply Mask-RCNN (MRCNN) [10] to create an instance segmentation and improve information flow across the different stages. Once the density maps are generated, they are passed into adversarial learning. The bidirectional convolutional LSTM is then used to return the counts and metrics. Its layers connect all fully convolutional networks at multiple scales with biConvLSTM [11]. This allows multi-scale contextual representations to propagate adaptively within the same network and encourages the current prediction to exploit prior multi-stage and multi-level densities. “Multi-stage” means that the current prediction may draw on the earlier densities of all previous stages rather than only the nearest one, and “multi-level” refers to the earlier outputs and low-level features produced in the previous stages. Despite the uneven distribution of individuals in unstructured crowd scenes, density priors can guide feature extraction so that the network attends to crowded regions and ignores distractors in the background.
Second, to meet the diverse environmental challenges of crowd scenes, we explore different crowd-specific data augmentation methods that enrich the crowd features in various aspects. Specifically, they alter the crowd images in terms of color, size, and camera perspective. In addition, we provide an activity detection method based on the Bayesian non-linear filter (AD-BNF), which detects suspicious human activity. First, the image is passed through a Histogram of Oriented Gradients (HOG) descriptor with a Support Vector Machine (SVM) to extract the image’s features.
The features are then passed to a non-linear Bayesian filter for position refinement, which returns the region of interest (ROI) over which annotations are generated. These annotated regions are passed to multiple CNN-based models (MCNN) to detect suspicious human activity. Extensive experiments on four reference data sets demonstrate that the proposed CC: MRCNN-biCLSTM surpasses state-of-the-art approaches in terms of counting accuracy and density map quality. In addition, AD-BNF outperforms current approaches for detecting human activity. To summarise, our contributions are fivefold. First, we propose a hybrid algorithm for crowd counting using Mask-RCNN (MRCNN) with bidirectional ConvLSTM (CC: MRCNN-biCLSTM) that combines several fully convolutional networks. Second, we apply MRCNN to create instance segmentation, after which refined, high-quality crowd density maps are generated iteratively. Third, we present bidirectional connections with biConvLSTM, in which multi-stage and multi-level density connections link the networks of the different stages. Fourth, to further strengthen the crowd features, we explore different crowd-specific data augmentation methods; in particular, we examine colour, image size, and camera viewpoint with respect to counting accuracy and density map quality, and we provide a comprehensive experimental comparison of the augmentation strategies. Finally, we propose an activity detection method based on the Bayesian non-linear filter (AD-BNF), which detects suspicious human activity using multi-CNN (MCNN).
The rest of this paper is organized as follows. The following section reviews relevant CNN-based and activity detection-based approaches reported in the literature. The proposed CC: MRCNN-biCLSTM and AD-BNF algorithms are explained in Section III. The results of experiments conducted on several datasets are reported in terms of mean absolute error (MAE) and root mean square error (RMSE) in Section V. Finally, conclusions are drawn in Section VI.
Preliminaries
Many approaches have been proposed for the crowd counting task. CNN-based methods can be grouped into two main groups.
CNN-based methods
Several CNN-based methods have been proposed, and crowd counting [42] has made considerable progress. Still, most of them focus on how to deal with large variations in crowd images. Marsden et al. [13] proposed a residual deep learning architecture for crowd counting, violent behaviour detection, and crowd density level classification, and trained and evaluated this multi-task approach. They also created a new dataset of 100 fully annotated images, the first computer vision dataset annotated for crowd counting, violent behaviour detection, and density level classification. Shi et al. [16] proposed a multi-task system built around NetVLAD, which makes a crowd-counting CNN more powerful. Because few training samples are available, related tasks in the same domain are used to learn generalizable representations and improve the fit. Shi et al. also aggregate multi-scale convolutional features extracted from the entire image into a compact single-vector representation for efficient and accurate counting via the Vector of Locally Aggregated Descriptors (VLAD). Zeng et al. proposed DSPNet, a network for dense crowd counting that preserves multi-level features and keeps important information from being lost. Liu et al. [6] build a binary classification network that distinguishes crowd images from background images and removes the non-crowded areas indicated by the class activation map. These two methods rely heavily on coarse partitioning or class activation maps, which can lead to unreliable refinement. C. Liu et al. [7] propose training another network to zoom in on local areas and refine the localization map where local errors are most likely (mainly in areas with dense crowds). This technique can overcome problems associated with large scale variations. Still, the refined regions are limited to high-density areas, ignoring other parts of the scene. Inspired by the work of [15] and [16], we start from the density values of previous time steps, either during feature extraction or on the high-resolution density map.
Data augmentation is crucial for crowd counting. Most previous methods [6, 17–21] use random cropping and flipping to augment the data. These techniques increase the volume of data and improve counting accuracy to a certain extent. However, they do not consider the environmental aspects of crowd scenes, which limits the learned crowd representation. Recently, crowd counting has been addressed with the Cross-stage Refinement Network (CRNet) [12].
In [37], a deep learning model is assessed on the MHEALTH dataset using two separate accuracy validation processes and three different complexity metrics. The method is reported to reach an accuracy of 96.92 percent for human activity recognition (HAR) and 97.06 percent for cardiac stress level. The models for inertial data and ECG-based prediction are also smaller than existing models in the literature, at 0.89 MB and 1.97 MB, respectively.
DeepSeg [38] is a CNN-based activity segmentation algorithm designed to reduce the reliance on experience and address performance deterioration. To improve overall performance, it includes a feedback mechanism in which the segmentation algorithm is updated based on the activity recognition results, and experiments show that DeepSeg outperforms existing approaches. The work in [39] proposes a decoupled approach that eliminates error propagation between two processes and provides higher location estimation performance. It also proposes an amplitude measurement with temporal, spatial, and frequency domain information, which is shown to be competitive for feature extraction compared with other methods; extensive experimental and modelling results demonstrate the system’s benefits.
In [40], SARD, a custom-built dataset that replicates rescue circumstances, was used to assess the dependability of state-of-the-art detectors such as Faster R-CNN, YOLOv4, and Cascade R-CNN, which were compared after training on selected datasets. The YOLOv4 detector was chosen for further study because of its speed, accuracy, and low false detection rate. Network size, detection accuracy, and transfer learning parameters were studied, and resistance to weather and motion blur was also tested. Owing to its excellent results in person detection, the study presents a paradigm for search and rescue (SAR) operations.
As noted above, CRNet [12] estimates density maps, passes them into adversarial learning to compute an approximate count based on the previous grayscale map, and then feeds the result into a convolutional LSTM to return metrics and counts. Because CRNet estimates the density map directly without instance segmentation, a group of several persons may be counted as a single entity. We extend the work of Liu et al. [12] and propose a hybrid algorithm for crowd counting using Mask-RCNN (MRCNN) with bidirectional ConvLSTM (CC: MRCNN-biCLSTM), which solves this grouping problem of the existing model and improves metric performance. The proposed CC: MRCNN-biCLSTM approach is explained in Section III.
Unusual human activity detection in crowd scenes
This section discusses methods for detecting abnormal human activity in crowded scenes. Lee et al. [23] proposed motion influence maps for detecting and localizing unusual human activity in crowded locations. Instead of detecting or segmenting humans, [23] devised an efficient representation of human activity called the motion influence map. Its primary characteristic is that it accurately depicts the speed of movement, direction of motion, and size of objects or people and their interactions in an image sequence. Jin et al. [24] created a multi-CNN sub-action descriptor for real-time activity recognition in video surveillance. To overcome the representation difficulty, [24] proposes a sub-action descriptor for detailed action detection, which is divided into three levels: posture, movement, and gesture. Each level defines a distinct sub-action. Using appearance-based temporal features and several CNNs, the action detection model of [24] simultaneously localizes and recognises the activities of multiple persons in video surveillance, and it has been applied to identify unexpected human behaviour in crowded environments. Three temporal features are extracted for BDI-CNN, MHI-CNN, and WAI-CNN, respectively, depending on the presence of human action regions [23]. The first network, BDI-CNN, accepts BDI as input and estimates the size of the actor. MHI-CNN, the second network, is based on MHI and captures the actor’s movement history. The third network, WAI-CNN, is based on WAI and records both the size and speed history of the actor. The three CNNs are progressively trained on action classification tasks. Choosing the appropriate CNN [43] architecture for a particular problem is difficult since it is application-dependent.
We designed a lightweight CNN architecture to reduce computation time so that human activity can be detected in real time. The proposed activity detection method based on the Bayesian non-linear filter (AD-BNF) is discussed in Section III.
The main points of the approach in this research work are as follows. The dataset is properly split into training, validation, and testing parts. Image preprocessing and augmentation by translation, rotation, scaling, and transformation bring the input images into a regular form. Regularization is added in the form of dropout and L2 layers. A constant filter is defined to generalize feature extraction. A less complex model architecture is selected.
Proposed work
In this section, we first introduce the network architecture of the proposed hybrid algorithm for crowd counting using Mask-RCNN (MRCNN) with bidirectional ConvLSTM (CC: MRCNN-biCLSTM). A person’s activity is then detected using the proposed activity detection method based on the Bayesian non-linear filter (AD-BNF). Fig. 1 shows the architecture of the proposed model.

Architecture of Proposed work.
Module-level instance segmentation in densely populated frames is the key advantage of the proposed model over the referenced work. It is enabled by extracting features and multiplying every patch by a 2×3 filter.
The main contributions (innovations) of our work are as follows: selective search is used; each pixel matrix is multiplied by a constant 2×3 filter to make the model generalize; feature vectors are extracted using ResNet101; N SVMs with polynomial functions are used; and regression is performed on the parameters.
Object Detection: Mask-RCNN (MRCNN)
This paper presents a modification of the Mask-RCNN [31] framework, a state-of-the-art technique for object detection that achieves a high degree of accuracy when observing objects at various stages. As shown in Fig. 3, the whole model is trained with Apache MXNet, which decreases the time needed to train the Amended-Mask-RCNN (A-MRCNN) model. Apache MXNet is represented by the dotted outer line on the right side of the figure.
ROI pooling must be applied in order to get the best results with the original Mask R-CNN [26]. The drawback of ROI pooling is that it decreases image resolution when images are passed through the FC and Softmax layers.
ROI Align was introduced in [26]; however, ROI Align performs better on datasets with smaller bounding boxes and fewer items to detect.
For this reason, ROI wrapping is used in place of ROI pooling in Resnet-152 and of ROI Align in Mask R-CNN to overcome the issue. ROI wrapping means that a defined ROI on the feature map is wrapped and warped to a specific dimension. ROI wrapping and ROI Align are similar in function; the difference is that wrapping changes the contour of the feature map, while ROI Align applies bilinear interpolation to stretch or contract the image to the same dimension. Figure 2 depicts the Resnet-152 architecture [35] in more detail.

ResNet Architecture [35].
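The bilinear interpolation that ROI Align relies on can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the function names `bilinear_sample` and `resize_roi` are hypothetical, the feature map is a plain nested list, and each output cell is sampled once at its bin centre, whereas practical ROI Align implementations usually average several sample points per bin.

```python
import math


def bilinear_sample(feature_map, y, x):
    """Sample a 2-D feature map (nested lists) at a fractional (y, x) location."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp the location so that all four neighbours lie inside the map.
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - dx) * feature_map[y0][x0] + dx * feature_map[y0][x1]
    bottom = (1 - dx) * feature_map[y1][x0] + dx * feature_map[y1][x1]
    return (1 - dy) * top + dy * bottom


def resize_roi(feature_map, roi, out_h, out_w):
    """Resample roi = (y_min, x_min, y_max, x_max) to a fixed out_h x out_w grid."""
    y_min, x_min, y_max, x_max = roi
    bin_h = (y_max - y_min) / out_h
    bin_w = (x_max - x_min) / out_w
    # One sample per output cell, taken at the centre of its bin (simplification).
    return [
        [
            bilinear_sample(feature_map, y_min + (i + 0.5) * bin_h, x_min + (j + 0.5) * bin_w)
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


fmap = [[0.0, 1.0], [2.0, 3.0]]
pooled = resize_roi(fmap, (0.0, 0.0, 1.0, 1.0), 1, 1)
```

Because the sampling positions are fractional, the ROI is never snapped to the integer grid, which is the property that distinguishes this kind of resampling from ROI pooling.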

Facial Mask Detection with Instance Segmentation using extended Mask R-CNN.
The Kalman filter used by the original Mask R-CNN [26] is replaced with a non-linear Bayesian filter. Figure 3 shows facial mask detection with instance segmentation using the Amended Mask R-CNN.
Algorithm 1 lists the steps of the proposed Amended-Mask R-CNN (A-MRCNN). In step 1, the algorithm starts by initiating Apache MXNet. In step 2, the A-MRCNN algorithm begins. In steps 3-5, Resnet-152 extracts features from image I and returns them in F. In steps 6-8, the region proposal network produces candidate regions of the image. In steps 9-10, these regions are passed to ROI-Wrapping, which returns the regions of interest and crops and wraps them. In step 11, the mask is detected. The algorithm stops at step 12, and the Adversarial Algorithm is called in step 13. The subsequent section discusses the Adversarial Learning Algorithm in detail.
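1: Initiate Apache MXNet
2: Begin A-MRCNN
3-5: For each image I { F = Resnet-152(I); extract features from I and return them in F
6-8: Regions = RPN(F); the region proposal network produces candidate regions
9-10: For each region { ROI = ROI-Wrapping(region); crop and wrap the region of interest
11: If (IOU > 0.5)
{
LABEL = Mask
}
Else
{
Skip
}}}
12: Stop
13: Call the Adversarial Learning Algorithm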
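The IOU > 0.5 test in step 11 of Algorithm 1 can be sketched as follows. This is a minimal illustration rather than the authors' code: the box format `(x1, y1, x2, y2)`, the function names `iou` and `label_region`, and the choice to compare a predicted region against a reference box are all assumptions, since the source does not state what the IOU is computed against.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the overlap are clipped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_region(predicted_box, reference_box, threshold=0.5):
    """Return "Mask" when the overlap exceeds the threshold, otherwise None (skip)."""
    return "Mask" if iou(predicted_box, reference_box) > threshold else None


label = label_region((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 1.5))
```

The threshold of 0.5 is the value given in Algorithm 1; everything else in the sketch is illustrative.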
As mentioned below, we use a GAN design that combines the predicted mask M with the original image X as the generator network’s input:
where C is the number of channels, H is the height, and W is the width.
The output Y is the predicted density map. The discriminator network is denoted as:
For this mapping task, G attempts to construct a density distribution that is comparable to the ground truth, while D attempts to discriminate between real and generated densities. Our technique eliminates the need to train two mappings; a single mapping is more efficient. Notably, our objective is to predict a density map that represents the spatial distribution of pedestrians in the original crowd image. Mapping a density map back to an image is not needed for practical applications.
1) Develop a density map
In the annotation data, the pixel of each head is labelled with one and the remaining pixels are labelled with zero. Such sparse information makes training more difficult. The current approach for generating a density map suited to deep learning is based on a Gaussian kernel density distribution: a Gaussian kernel is convolved with each annotated pixel. We examine the adaptive Gaussian kernel under various forms of standard deviation, which is defined as:
where $f(\sigma_i) = \sigma_i^2$ and $a$ denotes the different kernels [36].
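The idea of turning sparse head annotations into a density map can be sketched in pure Python. This is a simplified illustration under stated assumptions: it uses a fixed standard deviation `sigma` and a fixed kernel `radius` instead of the adaptive kernel examined in the paper, and the function names `gaussian_kernel` and `density_map` are hypothetical. Kernel mass that would fall outside the image is renormalised so that every annotated head contributes exactly one to the total.

```python
import math


def gaussian_kernel(sigma, radius):
    """Return a normalised (2 * radius + 1) x (2 * radius + 1) Gaussian kernel."""
    size = 2 * radius + 1
    kernel = [
        [math.exp(-((i - radius) ** 2 + (j - radius) ** 2) / (2.0 * sigma ** 2)) for j in range(size)]
        for i in range(size)
    ]
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]


def density_map(height, width, heads, sigma=1.0, radius=3):
    """Place one Gaussian per annotated head; heads are (row, col) pixels inside the image."""
    kernel = gaussian_kernel(sigma, radius)
    density = [[0.0] * width for _ in range(height)]
    for head_y, head_x in heads:
        # Collect the kernel cells that land inside the image.
        cells = []
        for i, row in enumerate(kernel):
            for j, value in enumerate(row):
                y, x = head_y + i - radius, head_x + j - radius
                if 0 <= y < height and 0 <= x < width:
                    cells.append((y, x, value))
        # Renormalise so that a head near the border still integrates to one.
        mass = sum(value for _, _, value in cells)
        for y, x, value in cells:
            density[y][x] += value / mass
    return density


heads = [(4, 4), (10, 12), (0, 0)]
density = density_map(16, 16, heads)
estimated_count = sum(sum(row) for row in density)
```

Summing the map recovers the number of annotated heads, which is why a density map can serve as a regression target for counting.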
2) Adversarial Loss:
We express the adversarial loss as:
Here x is the original image sample and m is the corresponding mask sample. E denotes the expected value. The predicted density map G(x; m) is denoted by y, and the corresponding ground truth by z. G tries to minimise this loss against the adversary D, which tries to maximise it. In other words, D decides whether a sample is a genuine one or a fake one created by G. The parameters of the network layers of G and D are denoted by θg and θd, respectively. To obtain these values, we minimise the loss of Gnet while simultaneously maximising the loss of Dnet. In our technique, we also apply an L1 loss between the elements of the ground-truth density labels and the predicted density labels. It is denoted as the density map loss:
3) Final Objective:
The final objective, called density loss, is:
where λ balances the two objectives. For training, we set λ = 10 in (3).
The Euclidean distance measures the difference between the ground truth count and the prediction count for training this branch. The loss function called counting loss is defined as follows:
where N is the number of training images and $L_C$ is the counting loss between the regressed count $C$ and the ground truth $C^{GT}$. The loss is minimized using mini-batch gradient descent and back-propagation.
1. For every image mask {
2.
3. rf, fd = Gnet(Xm); Xm → image combined with its mask, rf → ResNet features, fd → fake density map
4. dout = Dnet(fd)
5. fc = Cnet(rf); fc → fake count
6. LDM = L1Loss(fd, di)
7. LA = MSELoss(dout, 1)
8. Lfinal = LA + λLDM
9. LD = MSELoss(dout, 0) + MSELoss(Dnet(di), 1)
10. LC = MSELoss(fc, Ci)
11. Train Gnet, Dnet
12. end()}}
13. call Bi-ConvLSTM(D net or density maps)
Algorithm 2 iterates over image and mask pairs. In each iteration, Gnet takes the image combined with its mask and produces the ResNet features rf and a fake density map fd. fd is passed to Dnet, and Cnet computes the fake count fc from rf. The density map loss LDM and the adversarial loss LA are then computed and combined into the final density loss, and the discriminator loss LD and the counting loss LC are computed as well. Gnet and Dnet are trained on these losses, and the algorithm ends. Afterwards, Bi-ConvLSTM is called with the density maps as its input.
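The loss computations in steps 6 to 10 of Algorithm 2 can be sketched numerically as follows. This is a minimal pure-Python illustration, not the training code: tensors are replaced by flat lists, the network outputs are passed in as plain numbers, the function names are hypothetical, and the real-label target in steps 7 and 9 is taken to be 1, which is an assumption about the notation. The default `lam=10.0` is the value of λ stated in the text.

```python
def l1_loss(pred, target):
    """Mean absolute difference between two equally long lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse_loss(pred, target):
    """Mean squared difference between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def algorithm2_losses(fake_density, true_density, d_out_fake, d_out_real,
                      fake_count, true_count, lam=10.0):
    """Compute LDM, LA, Lfinal, LD and LC for one image-mask pair."""
    ones_fake = [1.0] * len(d_out_fake)
    zeros_fake = [0.0] * len(d_out_fake)
    ones_real = [1.0] * len(d_out_real)
    l_dm = l1_loss(fake_density, true_density)   # step 6: density map loss
    l_a = mse_loss(d_out_fake, ones_fake)        # step 7: generator wants D to output 1
    l_final = l_a + lam * l_dm                   # step 8: final density loss
    # Step 9: discriminator wants 0 on generated maps and 1 on ground-truth maps.
    l_d = mse_loss(d_out_fake, zeros_fake) + mse_loss(d_out_real, ones_real)
    l_c = mse_loss([fake_count], [true_count])   # step 10: counting loss
    return {"LDM": l_dm, "LA": l_a, "Lfinal": l_final, "LD": l_d, "LC": l_c}


losses = algorithm2_losses(
    fake_density=[0.0, 0.5, 0.5],
    true_density=[0.0, 1.0, 0.0],
    d_out_fake=[0.5],
    d_out_real=[1.0],
    fake_count=3.0,
    true_count=1.0,
)
```

In an actual training loop the generator would be updated with `Lfinal` (and the counting branch with `LC`), while the discriminator would be updated with `LD`; the sketch only shows how the scalar losses are formed.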
In contrast to ConvLSTM, BiConvLSTM [25] maintains two sets of hidden and cell states per LSTM cell, one for the forward sequence and one for the backward sequence. With access to context in both directions of the input timeline, BiConvLSTM can build a more comprehensive understanding of the whole video. The function of the BiConvLSTM cell is shown in Fig. 4. It is made up of ConvLSTM cells that hold two sets of hidden and cell states: the first set (h_f, c_f) is for the forward path and the second set (h_b, c_b) is for the backward path. For each time step, the two corresponding hidden states are stacked and passed through a convolution layer, which yields the final hidden representation of that time step. This hidden representation is provided as input to the next BiConvLSTM layer.

Workflow of BiConvLSTM [25].
1.
2.
4.
5.
6.
7.
8.
9.
10. End ()
Algorithm 3 begins with the weights w, then iterates over sample batches and feeds a sequential set of samples into the Bi-LSTM. For each iteration, this returns the hidden state h over the forget gate f and the hidden state h with the bias terms b. These two vectors are concatenated and flattened into a vector G. G is then fed into the Softmax layer, which returns the RMSE, and the weights w are updated with the given function.
This section discusses the proposed activity detection method based on the Bayesian non-linear filter (AD-BNF). Suspicious human activity is detected using the proposed AD-BNF algorithm. First, the image is passed through a histogram of oriented gradients (HOG) with a Support Vector Machine (SVM) to extract features from the image. The features are then given to a non-linear Bayesian filter to refine the position, which returns a region of interest (ROI) over which annotations are generated. These annotated regions are then passed to multi-CNN (MCNN) based models to detect suspicious human activities.
1.
2.
3.
4.
4.1. Extract HOG features from I,
4.2 Append I to features matrix F }
5.
5.1
5.2 Reiterate region Matrix.
6.
7.
O(x, y, t) = α O(x, y, t - 1) + D(x, y, t)
t is time, α is the update rate, 0 < α < 1
7.1 Store O into a motion feature matrix
8. If (motion in motion feature matrix > 0)
{
Message = Abnormal
}
else
{
Message = Normal
}}
9. End ()
Algorithm 4 lists the steps of AD-BNF. In step 2, the dataset is loaded. In step 3, the algorithm iterates over each image of the dataset. In step 4, the image I is passed through a histogram of oriented gradients (HOG) with a support vector machine (SVM) to extract features F from it. In step 5, F is passed to a non-linear Bayesian filter, which refines positions and returns the regions of interest (ROI). In step 6, annotations are generated over these regions. The annotated regions are then passed to multi-CNN (MCNN) appearance-based models, which identify suspicious human activity. In step 8, the activity is reported as abnormal or normal. The algorithm terminates at step 9. Figure 5a shows the workflow of activity detection.

Workflow of activity detection.
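The motion update rule of Algorithm 4, O(x, y, t) = α O(x, y, t - 1) + D(x, y, t), and the decision in step 8 can be sketched in pure Python. The source does not define D(x, y, t); the sketch assumes it is a binary frame difference with a hypothetical intensity `threshold`, and the function names are illustrative only.

```python
def frame_difference(prev_frame, frame, threshold=10):
    """Assumed D(x, y, t): 1.0 where the intensity change exceeds the threshold, else 0.0."""
    return [
        [1.0 if abs(cur - prev) > threshold else 0.0 for prev, cur in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(prev_frame, frame)
    ]


def update_motion(prev_motion, diff, alpha=0.5):
    """Apply O(x, y, t) = alpha * O(x, y, t - 1) + D(x, y, t) to every pixel."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    return [
        [alpha * old + new for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(prev_motion, diff)
    ]


def classify(motion):
    """Step 8: report "Abnormal" if any motion value is positive, otherwise "Normal"."""
    return "Abnormal" if any(value > 0 for row in motion for value in row) else "Normal"


frames = [
    [[0, 0], [0, 0]],
    [[0, 50], [0, 0]],
    [[0, 50], [0, 0]],
]
motion = [[0.0, 0.0], [0.0, 0.0]]
for previous, current in zip(frames, frames[1:]):
    motion = update_motion(motion, frame_difference(previous, current))
message = classify(motion)
```

With the rule as written, past motion decays geometrically but stays positive, so a practical system would probably compare against a small positive threshold instead of zero; that refinement is not part of the source.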
This section first introduces the evaluation metrics and datasets, followed by a study demonstrating each module’s effectiveness. We then evaluate and compare our proposed method with several state-of-the-art techniques.
All regions are passed through the first option with polynomial functions and a constant filter factor, which stabilizes the model architecture and helps avoid overfitting.
Experimental environment
The method was implemented in Python on a laptop with an Intel Core i7-1065G7 processor, 8 GB DDR4 SDRAM, a 512 GB SSD (no optical drive), and Intel Iris Plus graphics, running the 64-bit Windows 10 operating system. The machine has a 15.6-inch high-definition WLED touch screen with a resolution of 1366 x 768 pixels and 10-finger multi-touch, HD audio with stereo speakers, an HP True Vision HD camera, Realtek RTL8821CE 802.11b/g/n/ac WiFi, Bluetooth 4.2, one HDMI 1.4 port, one USB 2.0 port, one SD card reader, three USB 3.1 Gen 1 Type-A ports, and one USB 3.1 Gen 1 Type-C port. Python libraries were used for the implementation.
Datasets description
1) ShanghaiTech (SH Part A and B) dataset: ShanghaiTech [28] is a collection of 1,198 images with 330,165 annotated people. The dataset is divided into two parts: Part A, which contains 482 images, and Part B, which contains 716 images. Following Zhang et al. [28], Part A is split into 300 training images and 182 test images, and Part B into 400 training images and 316 test images.
2) UCF_CC_50 dataset: UCF_CC_50 was released by Idrees et al. [30]. It comprises just 50 crowd images, each with a distinct resolution and viewpoint and an exceptionally high density of people. The number of annotated people per image varies from 94 to 4543, with an average of 1280. The average crowd density of UCF_CC_50 is much greater than that of the ShanghaiTech dataset.
Evaluation criteria
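Following past research on crowd counting, we use the mean absolute error (MAE) and the root mean square error (RMSE) to evaluate the compared techniques. They are defined as:

$$\mathrm{MAE} = \frac{1}{M}\sum_{i=1}^{M}\left|z_i - \hat{z}_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{M}\sum_{i=1}^{M}\left(z_i - \hat{z}_i\right)^2}$$

where M is the total number of test samples, and $z_i$ and $\hat{z}_i$ are the ground-truth count and the estimated count of the i-th sample, respectively.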
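The two evaluation metrics named above, MAE and RMSE, can be computed with a few lines of pure Python. The function names and the sample counts are illustrative; they are not taken from the paper's experiments.

```python
import math


def mae(true_counts, predicted_counts):
    """Mean absolute error between ground-truth and estimated crowd counts."""
    errors = [abs(t - p) for t, p in zip(true_counts, predicted_counts)]
    return sum(errors) / len(errors)


def rmse(true_counts, predicted_counts):
    """Root mean square error between ground-truth and estimated crowd counts."""
    squared = [(t - p) ** 2 for t, p in zip(true_counts, predicted_counts)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative counts for three test images (made up for this sketch).
true_counts = [10, 20, 30]
predicted_counts = [12, 18, 33]
mae_value = mae(true_counts, predicted_counts)
rmse_value = rmse(true_counts, predicted_counts)
```

MAE reflects the average counting accuracy, while RMSE penalises large per-image errors more strongly, so reporting both gives a fuller picture of an estimator's behaviour.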
Results and discussions
This section presents the performance of the proposed CC: MRCNN-biCLSTM and the human activity analysis results for the proposed AD-BNF algorithm.
1. Performance of CC: MRCNN-biCLSTM on the SH Part A, SH Part B, and UCF_CC_50 datasets and comparison with other models using MAE and RMSE.
Figure 6 shows the performance of the proposed model on the ShanghaiTech Part A dataset, compared with other models in terms of mean absolute error (MAE) and root mean square error (RMSE). RRSP gives an MAE of 63.1 and an RMSE of 96.2, ADCrowdNet gives an MAE of 63.2 and an RMSE of 98.9, RAZNet gives an MAE of 65.1 and an RMSE of 106.2, and CRNet gives an MAE of 56.4 and an RMSE of 90.4. Our proposed model gives the lowest errors, with an MAE of 53.1 and an RMSE of 89.8.

Performance analysis of CC: MRCNN-biCLSTM on SH Part A.
Figure 7 shows the performance of the proposed model on the ShanghaiTech Part B dataset, compared with other models in terms of MAE and RMSE. RRSP gives an MAE of 8.7 and an RMSE of 13.5, ADCrowdNet gives an MAE of 7.7 and an RMSE of 12.9, RAZNet gives an MAE of 8.4 and an RMSE of 14.1, and CRNet gives an MAE of 7.4 and an RMSE of 11.9. Our proposed model gives the lowest errors, with an MAE of 7.3 and an RMSE of 11.4.

Performance analysis of CC: MRCNN-biCLSTM on SH Part B.
Figure 8 shows the performance of the proposed model on the UCF_CC_50 dataset, compared with other models in terms of MAE and RMSE. ADCrowdNet gives an MAE of 257.1 and an RMSE of 363.5, and CRNet gives an MAE of 203.3 and an RMSE of 263.4. Our proposed model gives the lowest errors, with an MAE of 203 and an RMSE of 243.

Performance analysis of CC: MRCNN-biCLSTM on UCF_CC_50 dataset.
2. AD-BNF: Video-AP is reported for BDI-CNN, MHI-CNN, WAI-CNN, and the proposed AD-BNF, among other configurations. AD-BNF outperformed the other configurations on the AP measures by a large margin, demonstrating the importance of the combined signals in the action recognition process.
a) SH Part A & B:
Figure 9 compares the proposed method with existing action recognition methods on these datasets.

Action recognition using the SH Part A & B datasets.
In Figure 9, the proposed model's activity detection performance on the Shanghai Part A and Shanghai Part B datasets is compared with other models using average precision (AP) and mean average precision (mAP). BDI-CNN achieves a precision of 77.8 for normal activity and 53.9 for abnormal activity; the average of the two, 65.8, is shown in the mAP column of the table and the mAP bars in the graph. MHI-CNN achieves 70.6 for normal activity and 64.3 for abnormal activity, with an average of 67.4. WAI-CNN achieves 82.9 for normal activity and 77.1 for abnormal activity, with an average of 80. OFSDI, which is used in the proposed model, achieves 83.1 for normal activity and 88.4 for abnormal activity, with an average of 85.7. As the graph shows, the proposed model performs best for both normal and abnormal activities.
b) UCF-CC-50:
Figure 10 shows the proposed model's activity detection performance on the UCF_CC_50 dataset, compared with other models using average precision (AP) and mean average precision (mAP). BDI-CNN achieves a precision of 75.8 for normal activity and 51.9 for abnormal activity; the average of the two, 63.85, is shown in the mAP column of the table and the mAP bars in the graph. MHI-CNN achieves 69.6 for normal activity and 61.3 for abnormal activity, with an average of 65.4. WAI-CNN achieves 79.9 for normal activity and 74.1 for abnormal activity, with an average of 77. OFSDI, which is used in the proposed model, achieves 81.8 for normal activity and 79.9 for abnormal activity, with an average of 80.8. As the graph shows, the proposed model performs best for both normal and abnormal activities.

Action recognition using the UCF-CC-50 dataset.
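The mAP values above are described as the average of the normal-activity and abnormal-activity precision values. The sketch below reproduces that arithmetic for the UCF-CC-50 figures quoted in the text, assuming mAP is the unweighted mean over the two classes. The helper name `mean_average_precision` and the dictionary layout are hypothetical, and this is not the authors' evaluation code.

```python
def mean_average_precision(ap_by_class):
    """Unweighted mean of per-class average precision values."""
    if not ap_by_class:
        raise ValueError("at least one class AP is required")
    return sum(ap_by_class.values()) / len(ap_by_class)


# Per-class AP values for UCF-CC-50 as quoted in the text.
reported_ap = {
    "BDI-CNN": {"normal": 75.8, "abnormal": 51.9},
    "MHI-CNN": {"normal": 69.6, "abnormal": 61.3},
    "WAI-CNN": {"normal": 79.9, "abnormal": 74.1},
    "OFSDI (proposed)": {"normal": 81.8, "abnormal": 79.9},
}

# Rank the methods by mAP, highest first.
ranking = sorted(
    reported_ap,
    key=lambda name: mean_average_precision(reported_ap[name]),
    reverse=True,
)

for name in ranking:
    print(f"{name}: mAP = {mean_average_precision(reported_ap[name]):.2f}")
```

The unrounded means agree with the mAP values quoted in the text to within rounding at one decimal place.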
Figures 11a, 11b, and 11c show illustrative examples of human activity detection in a crowd.

Shows the activity of humans in a crowd.

Shows the activity of humans in a crowd.

Shows the activity of humans in a crowd.
The performance curve and loss function of the proposed work are shown in Figure 12.

Performance curve and loss function of proposed work.
This paper presents a hybrid algorithm for counting people in a crowd using Mask-RCNN (MRCNN) with a bidirectional ConvLSTM (CC: MRCNN-biCLSTM). A person's activity is then detected using the proposed activity detection method based on the Bayesian non-linear filter (AD-BNF). In particular, the bidirectional convolutional LSTM network returns the counts of individuals within a group. We also examined several strategies for improving crowd-specific data and found that they considerably improve crowd counting accuracy. Extensive experiments on four commonly used challenging datasets show that our technique outperforms existing methods in counting accuracy and density map quality. The work also suggests several research directions for future work arising from its application.
Acknowledgments
The authors wish to thank the anonymous referees for their valuable comments and suggestions that contributed to improving the quality of the work in this paper.
