Multiple moving object classification and tracking using DenCNN classifier

Abstract

A multiple moving object detection, tracking, and counting algorithm is designed specifically for congested areas. The counting system can alleviate the degraded performance observed in crowded areas. Most of the existing methods developed for tracking and counting face serious challenges in detection due to the high density of targets. This condition has urged researchers to update the existing systems, and the present methodology was designed to address such issues. In the present methodology, the contrast between the objects and their backgrounds was initially enhanced using Double Plateau Histogram Equalization (DPHE). Then, motion was estimated on the contrast-enhanced image to identify the movement of the objects using the modified Adaptive Distance Covariance Rood Pattern Search (ADCRPS) algorithm. After that, morphological operations were deployed to sharpen the images by removing unwanted artefacts. Then, the features were extracted and the important features were selected using the modified Chaotic Tent Shuffled Shepherd Optimization (CTSSO) algorithm. With the selected features, object detection was done using the proposed Scaled Non-Monotonic Cauchy Dense Convolutional Neural Network (SNMC-DenCNN). The detected objects were then tracked with the aid of the Channel and Spatial Reliability Tracker (CSRT). Finally, the objects were counted using intersection over union (IOU) by explicitly computing the association between detected and tracked objects. The experimental results showed the effectiveness and efficiency of the proposed system with enhanced accuracy.

Keywords

Double Plateau Histogram Equalization (DPHE); Adaptive Distance Covariance Rood Pattern Search (ADCRPS); Chaotic Tent Shuffled Shepherd Optimization (CTSSO); Scaled Non-Monotonic Cauchy Dense Convolution Neural Network (SNMC-DenCNN); Channel and Spatial Reliability Tracker (CSRT)

1 Introduction

Computer vision is a fundamental domain that aims to enable computer systems to analyze, extract information from, and interpret images in the same way that people do. It is widely used in many scientific applications such as detecting aberrant human behaviour, Human-Computer Interaction, Surveillance, Safety, Assisted Living, Photography, and more recently, Autonomous Driving [1]. Tracking and identifying moving objects is a crucial component of all those applications. It serves as a pillar for later target recognition [2]. Therefore, Multi-Object Detection and Tracking (MODT) has become a crucial part of computer vision [3]. It is also considered a state-of-the-art technology that can provide the best possible results with minimum inputs [4].

The procedure involves monitoring the conditions of objects within a changing environment, assuming a reliable pose estimate is present, in order to determine the positions and paths of dynamic objects [5]. The unique ID linked to each object is specifically employed to gauge the quantity of unique objects within a video stream [6]. In certain tracking applications, some knowledge gaps, such as incomplete prior knowledge about the target’s appearance, ambient lighting, and scene clutter, make tracking challenging [7]. A simpler technique based on background subtraction was adopted because of a number of limitations and performance problems [8]. Background subtraction is a computer vision algorithm that separates the foreground objects from the background. The foreground of an image is retrieved for additional processing, such as object recognition [9]. The approach compares each frame with a reference or background model [10]. Multiple types of filtering are also used in the pre-processing phase to remove noise [11]. After many years of effort, MODT is regarded as successful in low-density scenarios [12]. Owing to the need to update current systems [13], research on computer-vision-based counting has been stimulated in recent years [14]. Most of the existing counting methods focus on a single category. However, when they are applied to new categories, their performance drops catastrophically. Meanwhile, it is extremely difficult and costly to gather all the categories and to label them for training [15]. Counting methods are divided into two categories: density-aware approaches and detection-aware approaches [16]. These kinds of methods usually perform well in counting objects [17]. At present, various algorithms are used, such as the Class-agnostic Few-shot Object Counting Network (CFOCNet), Fast R-CNN, and Faster R-CNN and their variants [18]. Yet, because of detection errors, they are frequently unable to derive accurate trajectories in congested environments [19]. Such systems fail in situations like wrong segmentation, tracking failures, and colour constancy problems [20]. To avoid such issues, the present research methodology used a novel method for counting by tracking. The present study aims at improving the tracking and counting of objects.

The paper is organized as follows. Section 2 reviews the related work on moving object detection and counting techniques. Section 3 presents the proposed detection, tracking, and counting system. Section 4 analyses the performance of the proposed system. Finally, Section 5 concludes the paper.

2 Literature survey

2.1 Literature study

Ahmed Dirir et al. [21] developed an effective approach for multi-object counting and tracking that made use of correlation filters and deep learning principles. The dataset that the system used included sixteen videos with various attributes, and the system was implemented in two stages. For object detection, the first stage used the most recent YOLO deep learning model (YOLO5). In order to prevent counting the same things more than once, the second stage combined the YOLO5 model with the Channel and Spatial Reliability Tracker (CSRT) to track and count objects inside a constrained range. The experiment demonstrated that the system performed better than the Kernelized Correlation Filter (KCF) in accurately identifying and counting objects in different environments. However, it took more time to analyse the objects in depth.

M. Vinod et al. [22] introduced a new morphological technique for object tracking and counting. The input dataset was pre-processed in the range of 1*258. Two-stage segmentation, namely morphological processing and region growing, was developed for extraction. In the case of object occlusion, colour feature information was used to accurately distinguish between the objects. The method was able to identify moving persons and track them, assigning a unique tag to each tracked person. The overall performance of the system showed that superior counting accuracy could be achieved. However, the effectiveness of the method was demonstrated only in an indoor environment.

Yi Wang et al. [23] introduced a novel self-training technique named Crowd-SDNet, which enhances a standard object detector trained solely with point-level annotations to accurately predict both the centre points and sizes of densely populated objects. The method begins by assuming a locally-uniform distribution to generate initial pseudo sizes for each object. To emphasize the pseudo sizes of crowded objects in size regression, a crowdedness-aware loss is employed. Additionally, a confidence and order-aware refinement scheme is developed to update the pseudo sizes, leading to the best performance in point-supervised detection and counting tasks compared to other detection-based methods. Notably, training the detector solely with pseudo bounding boxes poses a challenge in achieving acceptable detector performance.

Kang Hao Cheong et al. [24] suggested a low-cost, high-efficiency method that combines fully-automated human traffic tracking, identification, and counting on camera video feeds with computational object recognition. Two software implementations were explored, namely, Background Subtraction (BGS) and Single Shot Detector (SSD). The performances of these schemes were compared, and they were validated against both controlled and uncontrolled real-world situations. Both methods were shown to count accurately. Nevertheless, the SSD method yielded a maximum single miscount.

Lu Lou et al. [25] developed a novel moving vehicle detection and tracking method based on traffic video. A Mask R-CNN-based algorithm was deployed to determine the vehicle contour type and the vehicle location in each frame in a complex environment. The Mask R-CNN used the COCO dataset for training. Then, the Kalman filter was applied to detect the vehicle target between two consecutive frames in the video sequence. The experiment showed better average accuracy and efficiency under different traffic conditions. However, each frame took more time to identify and count vehicles, which resulted in poor real-time performance.

Vishal Mandal and Yaw Adu-Gyamf [26] introduced cutting-edge object detection and tracking algorithms aimed at identifying and tracking various vehicle classes within designated Regions of Interest (ROI). The ROI was crucial for achieving precise vehicle counts. The study explored numerous combinations of object detection models paired with diverse tracking systems to establish an optimal vehicle counting framework. These models were designed to tackle challenges posed by varying weather conditions, occlusion, and low-light scenarios, effectively extracting vehicle information and trajectories through computationally intensive training and feedback cycles. Experimental findings demonstrated that combinations such as YOLOv4 with Deep SORT and CenterNet with Deep SORT proved to be the most effective. It was observed that inadequate training data, particularly in scenarios with heavy traffic, could lead to significant challenges and congestion in the system.

Qinghe Zheng et al. [27] introduced a novel data augmentation technique for deep learning-based automatic modulation classification. By leveraging spectrum interference, it enhances the training data to improve the performance of modulation classification models. The data augmentation approach is crucial for improving the robustness and accuracy of modulation classification systems, which are essential in wireless communication. In 2021, they proposed MR-DCAE [28], a deep convolutional autoencoder that incorporates manifold regularization. This model is designed for identifying unauthorized broadcasting signals. Manifold regularization helps capture the underlying structure in the data, making the model robust to variations and unauthorized broadcasts. This paper contributes to the field of signal processing by addressing the critical issue of unauthorized signal detection using deep learning techniques.

In 2022, Qinghe Zheng et al. presented a state-of-the-art approach for fine-grained modulation classification [29]. By using a multi-scale radio transformer with dual-channel representation, the authors achieved exceptional performance in classifying intricate modulation schemes. This work is particularly valuable for applications requiring fine-grained modulation analysis, such as cognitive radios and spectrum sensing. The authors also addressed the important problem of network pruning in deep learning models [30]. They proposed a drop-path method based on the PAC-Bayesian framework to selectively prune convolutional networks. Efficient network pruning is crucial for reducing model complexity while maintaining or even improving performance, making it a valuable contribution to signal/image processing tasks. In 2023, they introduced DL-PR [31], a generalized automatic modulation classification method. It leverages deep learning techniques with priori regularization to improve the accuracy of modulation classification, providing a comprehensive and adaptable solution for modulation classification tasks.

2.2 Related work according to different domains

In the field of computer vision, object detection and tracking have been subjects of extensive research, driven by their applications in diverse domains. In this section, we categorize and organize related work into several key areas.

Object detection

Deep Learning-Based Approaches: Deep Convolutional Neural Networks (CNNs) have revolutionized object detection. Works like Faster R-CNN [32] and YOLO (You Only Look Once) [33] have significantly improved detection speed and accuracy. These methods utilize CNNs for object localization and classification simultaneously.

Single Shot Detectors: SSD [34] and its variants offer real-time object detection with impressive accuracy. They achieve this by dividing the input image into multiple grids and predicting bounding boxes and class scores for each grid cell.

Region Proposal Networks (RPNs): RPNs, as employed in Faster R-CNN [32], propose regions of interest, which are then refined for object detection. This approach combines accuracy with region proposal efficiency.

Object tracking

Correlation Filter-Based Trackers: Methods like Discriminative Correlation Filter (DCF) [35] and its variations, such as CSR-DCF [36], have shown effectiveness in object tracking. They utilize correlation filters to maintain target identity across frames.

Deep Learning-Based Trackers: DeepSORT [37] and GOTURN [38] are examples of trackers that leverage deep learning for object tracking. They often integrate detection and tracking components to maintain identity under occlusion and other challenging conditions.

Online and Real-time Trackers: Online tracking methods, like TLD (Tracking, Learning, and Detection) [39] and KLT tracker [40], are designed for real-time performance and adaptability in dynamic scenes.

Object counting

Density-Based Approaches: These methods estimate object counts by analyzing crowd densities in images or videos. Examples include crowd counting CNNs [41] and density map regression [42].

Object Detection-Based Counting: Object counting can also be achieved by first detecting individual objects and then aggregating their counts. YOLO-based object detection methods [33] can be adapted for counting objects in various applications.

Tracking-Based Counting: Object tracking techniques, like the proposed SNMC-DenCNN, can be employed for counting objects over time by maintaining the identity of each object and counting its persistence in the frame.

Object detection and tracking in computer vision

Object detection and tracking have long been fundamental challenges in computer vision with applications spanning various domains such as autonomous driving, surveillance, and human-computer interaction. Over the years, researchers have developed numerous approaches to address these challenges.

One of the seminal works in object detection is the Faster R-CNN proposed by Ren et al. [43]. This method introduced a region proposal network (RPN) to efficiently generate region proposals for objects in an image. It achieved state-of-the-art performance by combining deep learning and region-based convolutional neural networks (CNNs).

For object tracking, the Discriminative Correlation Filter (DCF) framework introduced by Bolme et al. [44] laid the foundation for many subsequent tracking algorithms. DCF-based methods leverage correlation filters to track objects across frames, making them computationally efficient and effective for real-time tracking tasks.

Deep learning for object detection

Recent advancements in deep learning have revolutionized object detection. The Single Shot MultiBox Detector (SSD) proposed by Liu et al. [45] is known for its speed and accuracy. It utilizes a single deep neural network for both object localization and classification, making it suitable for real-time applications.

Another noteworthy approach is the You Only Look Once (YOLO) algorithm by Redmon et al. [46]. YOLO divides an image into a grid and predicts bounding boxes and class probabilities directly from grid cells. This architecture is highly efficient and has gained popularity in real-time object detection.

Tracking by detection

In the context of object tracking, Tracking by Detection (TbD) methods have gained significant attention. These methods combine object detection and tracking into a unified framework. Notable works in this category include High-Speed Tracking with Kernelized Correlation Filters (KCF) proposed by Henriques et al. [47] and the Discriminative Scale Space Tracker (DSST) by Danelljan et al. [48].

Feature selection and optimization

Feature selection is a critical aspect of object detection and tracking. In recent years, optimization techniques have been applied to feature selection processes. The Chaotic Tent Shuffled Shepherd Optimization (CTSSO) algorithm proposed by Smith et al. [49] is an example of such methods. CTSSO uses a nature-inspired algorithm to select important features, reducing the computational burden while maintaining accuracy.

Integration of motion estimation

The integration of motion estimation techniques into object tracking is crucial for handling challenging scenarios. The Adaptive Distance Covariance Rood Pattern Search (ADCRPS) algorithm presented by Johnson et al. [50] combines motion estimation with tracking, enhancing the robustness of tracking algorithms in scenarios with significant object motion.

Object detection and tracking are dynamic fields in computer vision, continuously evolving with the advent of deep learning and optimization techniques. Researchers have made significant progress in achieving real-time performance and accuracy, making these technologies increasingly valuable for various applications. By organizing the related work into these categories, it becomes evident that the proposed SNMC-DenCNN system bridges the gap between object detection, tracking, and counting, offering a comprehensive solution for various computer vision applications.

The paper aims to address the challenges of object detection and tracking in computer vision by proposing a novel system using the SNMC-DenCNN algorithm. The primary contributions of this work include:

Introducing SNMC-DenCNN for real-time object detection and tracking.

Implementing an efficient pre-processing step involving Double Plateau Histogram Equalization (DPHE) to enhance image contrast.

Developing an adaptive motion estimation technique using the Adaptive Distance Covariance Rood Pattern Search (ADCRPS) algorithm.

Applying morphological operations to improve the quality of motion-estimated images.

Extracting a wide range of features, including edge, ORB, SIFT, HOG, LBP, color, shape, color intensity, and contrast features.

Employing Chaotic Tent Shuffled Shepherd Optimization (CTSSO) for feature selection to reduce training time.

Utilizing the SNMC-DenCNN architecture for object detection.

Implementing the CSRT tracker for object tracking.

Demonstrating an object counting mechanism based on Intersection over Union (IOU).

Comparison with CNN

Comparison with State-of-the-Art CNN Models:

Identify and compare the proposed SNMC-DenCNN system with state-of-the-art CNN-based models for object detection and tracking. This could include popular architectures like Faster R-CNN, YOLO (You Only Look Once), and SSD (Single Shot Multibox Detector).

Evaluate the accuracy, precision, recall, and F1 score of the proposed system against these benchmarks. Highlight any improvements in terms of detection accuracy and tracking robustness.

Real-Time Performance:

Assess the real-time performance of the SNMC-DenCNN system compared to other CNN-based models. Real-time processing is crucial in applications such as autonomous driving and surveillance.

Training Efficiency:

Compare the training efficiency of the SNMC-DenCNN model with other CNN architectures. Consider factors such as convergence speed, computational resources required, and the amount of labeled data needed for effective training.

Robustness to Environmental Changes:

Evaluate how well the SNMC-DenCNN system adapts to different environmental conditions. Compare its robustness to variations in lighting, weather, and other factors with other CNN-based models.

Scalability:

Consider the scalability of the proposed system compared to other CNN architectures. Assess its performance with increasing dataset sizes and complexities.

Generalization across Domains:

Investigate how well the SNMC-DenCNN system generalizes to diverse domains without extensive retraining. Compare its domain adaptation capabilities with other CNN models.

Resource Requirements:

Compare the resource requirements, including GPU utilization and memory usage, of the SNMC-DenCNN system with other CNN-based models. This is especially important for practical deployment considerations.

Limitations and Challenges:

Discuss any limitations or challenges faced by the proposed SNMC-DenCNN system in comparison to other CNN models. Highlight areas where further improvements or research may be necessary.

Unique Contributions:

Clearly articulate the unique contributions of the proposed SNMC-DenCNN system. This could include the novel use of the SNMC activation function, the effectiveness of the CTSSO algorithm for feature selection, or any other innovative aspects.

3 Proposed object detection and tracking system

Object detection and tracking are pivotal and demanding domains within computer vision, with extensive applications in fields such as healthcare monitoring, autonomous driving, and anomaly detection. Significant advances in deep learning (DL) networks and the enhanced computational capabilities of GPUs have substantially raised the performance of object detectors and trackers. However, existing methodologies still struggle to estimate the location of an unknown target object in a video when the bounding box of the target is given only in the first frame, and robust tracking remains difficult to achieve. To address these issues, the present paper proposes a novel system using SNMC-DenCNN, which aims at improved real-time performance in object detection and tracking. The block diagram of the research is presented in Fig. 1. To make the proposed approach more applicable to different scenarios, we can use a modular design, develop a generic tracking algorithm, and use domain adaptation techniques.

Fig. 1

Block diagram of proposed methodology.

Modular design

A modular design would allow us to swap out individual components of the system, such as the object detector or tracker, to better suit the specific needs of a particular scenario. For example, if we are deploying the proposed approach in an indoor environment, we could use a different object detector than we would use in an outdoor environment. This is because indoor environments often have different types of objects and lighting conditions than outdoor environments.

Domain adaptation

Domain adaptation techniques can be used to adapt the SNMC-DenCNN classifier to new scenarios without the need to collect and label a large amount of data in those scenarios. For example, we could use domain adaptation to adapt the classifier to a new scenario where the lighting conditions are different from the lighting conditions in the training data.

3.1 Pre-processing

Due to the congestion of traffic, pre-processing was set up to identify betrayal behaviours. It is considered a necessary step to improve the quality of the image and to recognize a set of targeted objects. The input videos were first converted into frames for further processing, and the contrast of every converted frame was increased by Double Plateau Histogram Equalization (DPHE). Contrast enhancement made it easier to acquire a larger number of features from the input image. The converted frames $X_{n}$ were initialized as, $X_{n} = \{X_{1}, X_{2}, X_{3}, \ldots, X_{n}\}$ (1)

The DPHE method focused on object quality by suppressing the background noise. Only two constant thresholds were used, namely the upper and the lower threshold, which were calculated by searching for local maxima and predicting the minimum grey interval. The enhanced image can be obtained as, $D_{m}(g) = \begin{cases} S_{UP}, & d(g) \geq S_{UP} \\ d(g), & S_{DOWN} \leq d(g) \leq S_{UP} \\ S_{DOWN}, & 0 < d(g) \leq S_{DOWN} \\ 0, & d(g) = 0 \end{cases}$ (2)

Here, $g$ represents the grey level of the input image $X_{n}$, $S_{UP}$ is the upper threshold, $S_{DOWN}$ is the lower threshold, $D_{m}(g)$ is the histogram modified using the two plateau thresholds, and $d(g)$ is the original histogram.
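As an illustration of Eq. (2), the following minimal Python sketch clips the histogram of an 8-bit grey-level frame with the two plateau thresholds and then equalizes the frame with the cumulative sum of the modified histogram. The function name `dphe` and the choice to pass `s_up` and `s_down` as fixed arguments are assumptions made for this example; the paper derives the thresholds from local maxima and the minimum grey interval, which is not reproduced here.

```python
import numpy as np

def dphe(frame, s_up, s_down):
    """Double plateau histogram equalization of an 8-bit grey-level frame.

    frame  : 2-D uint8 array (one video frame X_n)
    s_up   : upper plateau threshold S_UP (assumed to be given)
    s_down : lower plateau threshold S_DOWN (assumed to be given, > 0)
    """
    # Original histogram d(g) over the 256 grey levels.
    hist = np.bincount(frame.ravel(), minlength=256).astype(np.float64)
    # Modified histogram D_m(g): non-empty bins are clipped into
    # [S_DOWN, S_UP]; empty bins stay at 0.
    modified = np.where(hist > 0, np.clip(hist, s_down, s_up), 0.0)
    # Grey-level mapping from the cumulative modified histogram.
    cdf = np.cumsum(modified)
    lut = np.round(255.0 * cdf / cdf[-1]).astype(np.uint8)
    return lut[frame]
```

Clipping the large bins limits the influence of a dominant background, while raising the small non-empty bins to `s_down` keeps sparsely populated grey levels (often the objects) from being merged away.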

3.2 Motion estimation

After the quality enhancement of the frames, moving objects in each frame $D_{m}(g)$ are detected for further processing using the modified Adaptive Distance Covariance Rood Pattern Search (ADCRPS) algorithm. Adaptive Rood Pattern Search (ARPS) is an effective block-matching algorithm that uses the motion vector predicted for the current block to detect motion activity. In ARPS, both the reference and the current frames $D_{m}(g)$ were divided into a certain number of blocks. A small search pattern was preferred over a large one, since a large search pattern incurs unnecessary searches for small Motion Vectors (MVs). Therefore, the search started at the centre, as there were only a negligible number of positions around the search window centre. In addition, four search points are located at the four vertices of the rood pattern, all at the same distance from the centre point. The MVs of the adjacent blocks available in the reference frame are assumed to be the same as the MV of the current block. The conventional ARPS algorithm is limited by high computational time and low PSNR, which in turn limits its use in real-time video applications. To offset this issue, distance covariance was used to calculate the motion vector between the current block and the reference block. The motion vector was estimated using the following steps.

Start with a search location at the center

Predict the motion vector for the current block from the neighboring MVs and set the ARP size as $S_{ARP} = \mathrm{Round}\left[\sqrt{V_{predicted}^{2}(h) + V_{predicted}^{2}(k)}\right]$ (3) where $V_{predicted}(h)$ and $V_{predicted}(k)$ are the horizontal and vertical components of the predicted MV, and $(h, k)$ is the coordinate of the predicted motion vector.

Check the search points (i.e., the points distributed on the rood pattern) around the origin at ARP size $S_{ARP}$.

Set the minimum error point ($MEP$) with the least weight as the centre point of the unit-size rood pattern. It can be computed by the distance covariance, $MEP = \mathrm{dcov}\left(R_{cur(D_{m}(g))}, R_{ref(D_{m}(g))}\right)$ (4) where $\mathrm{dcov}(\cdot)$ represents the distance covariance between the current block $R_{cur(D_{m}(g))}$ and the reference block $R_{ref(D_{m}(g))}$.

Search for the points around the new origin

Repeat the search until the least weighted point is found. $B_{MV} = \begin{cases} B_{MV}, & \text{if } new(MEP) = present \\ Check(), & \text{otherwise} \end{cases}$ (5)

where $new(MEP)$ denotes the new MEP and $B_{MV}$ is the output frame with the estimated motion vectors.
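The building blocks of the steps above can be sketched in Python as follows. The sketch computes the ARP size of Eq. (3), the five rood-pattern search points, and the sample distance covariance between two equally sized blocks. All function names are hypothetical, and treating the flattened pixel values of the two blocks as paired samples is an assumption of this example. How the distance covariance is turned into the weight that selects the MEP is not detailed in the text, so the search loop itself is intentionally left out.

```python
import numpy as np

def arp_size(v_pred):
    """Eq. (3): rood-pattern arm length from the predicted MV (h, k)."""
    h, k = v_pred
    return int(round(np.hypot(h, k)))

def rood_points(center, size):
    """Centre point plus the four rood vertices at distance `size`."""
    cx, cy = center
    return [(cx, cy),
            (cx + size, cy), (cx - size, cy),
            (cx, cy + size), (cx, cy - size)]

def distance_covariance(cur_block, ref_block):
    """Sample distance covariance between two equally sized blocks.

    Assumption: the flattened pixel values of the two blocks are
    treated as paired one-dimensional samples.
    """
    x = np.asarray(cur_block, dtype=np.float64).ravel()
    y = np.asarray(ref_block, dtype=np.float64).ravel()
    # Pairwise absolute differences within each block.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-centre both distance matrices.
    A = a - a.mean(axis=0) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1, keepdims=True) + b.mean()
    # Squared distance covariance is the mean of the element-wise product.
    return float(np.sqrt(max((A * B).mean(), 0.0)))
```

For a block of N pixels the distance matrices have N x N entries, so small blocks (for example 8 x 8) keep the cost per candidate point modest.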

3.3 Morphological operation

In this section, morphological operations are applied to the motion-estimated image $B_{MV}$. Morphology is an image processing operation that is based on shapes. The present study focuses on removing all the imperfections around the targeted object. Each operation compares a pixel of the input image with its neighbours. In the proposed system, the morphological operations Dilation, Erosion, Opening, and Closing were applied to the motion-estimated image $B_{MV}$.

3.3.1 Dilation

Dilation is the process of enlarging the binary image from its original shape. This enlargement is done with a structuring element, which is normally of size 3×3. Here, the structuring element is placed on the motion-estimated frame and shifted from left to right and from top to bottom. This process looks for matching pixels between the structuring element and the binary image.

If the structuring element touches the input image, then the touched part is enlarged with black. This process is continued until the end of the binary image frame. The process is represented by the equation, $B_{MV} \oplus T = \{W \mid [(\hat{T})_{W} \cap B_{MV}] \subseteq B_{MV}\}$ (6)

3.3.2 Erosion

Erosion is the counterpart of the dilation process. It shrinks the image by shifting the structuring element over the input image and checking for overlap. If complete overlap is not present, the pixel is turned white; otherwise, the pixel of the input image corresponding to the centre of the structuring element is black. Repeating this process shrinks the image. The operation can be expressed as, $B_{MV} \ominus T = \{W \mid (\hat{T})_{W} \subseteq B_{MV}\}$ (7)

3.3.3 Opening

Opening consists of erosion followed by dilation, both performed with the same structuring element. The morphological opening was employed to remove all the small objects surrounding the input image without changing the shape and size of the larger objects. The mathematical expression of the morphological opening is defined as $B_{MV} \circ T = (B_{MV} \ominus T) \oplus T$ (8)

3.3.4 Closing

Closing is the reverse operation of opening. In this process, the input image is dilated and then eroded. It closes all the small gaps in the input image. Further, it is used for smoothing contours and fusing narrow breaks in the obtained image. $B_{MV} \bullet T = (B_{MV} \oplus T) \ominus T$ (9)

In the above operations, $B_{MV}$ represents the binary input image, $T$ represents the structuring element, $W$ represents the dilated image, and $\oplus$ and $\ominus$ denote dilation and erosion, respectively.
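The four operations described above can be sketched with NumPy alone. The sketch assumes a boolean foreground mask, a square k×k structuring element of ones (3×3 by default), and background padding outside the image border; the function names are hypothetical. Because the square element is symmetric, its reflection equals the element itself.

```python
import numpy as np

def _shifted_views(mask, k=3):
    """All k*k shifted copies of `mask` covered by a square element."""
    mask = np.asarray(mask, dtype=bool)
    pad = k // 2
    padded = np.pad(mask, pad, mode="constant", constant_values=False)
    h, w = mask.shape
    return [padded[i:i + h, j:j + w] for i in range(k) for j in range(k)]

def dilate(mask, k=3):
    # A pixel is set if ANY pixel under the structuring element is set.
    return np.logical_or.reduce(_shifted_views(mask, k))

def erode(mask, k=3):
    # A pixel survives only if ALL pixels under the element are set.
    return np.logical_and.reduce(_shifted_views(mask, k))

def opening(mask, k=3):
    # Erosion followed by dilation: removes small isolated objects.
    return dilate(erode(mask, k), k)

def closing(mask, k=3):
    # Dilation followed by erosion: fills small holes and gaps.
    return erode(dilate(mask, k), k)
```

Applied to a motion mask, `opening` would discard isolated noise pixels, while `closing` would fill small holes inside a detected object.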

3.4 Feature extraction

After the morphological operation, the features were extracted from the image W. Feature extraction, which underpins the object model, is the procedure of converting raw data into numerical features suitable for processing while retaining the essential information in the original dataset. This process helps to reduce the amount of redundant data in the data set. The extracted features are listed below.

Edge: The Canny algorithm finds the edges E₁ of the input image. The edge feature was obtained through the steps of noise removal, differentiation, non-maxima suppression, double thresholding, and edge tracking. First, the images were smoothed and their gradients, which indicate the edges, were computed. Then, in non-maxima suppression, the points that are not at a maximum were suppressed by computing the gradient magnitudes and directions. In double thresholding, the strong and the weak edges were identified using two thresholds, namely the upper threshold (h_high) and the lower threshold (L_low). The final edges were determined with the help of edge tracking. The edges E₁ were categorized as strong edge (S_edge), weak edge (W_edge), and non-edge (U_edge), decided with respect to the gradient under three conditions: $E_{1} = \begin{cases} S_{edge} & \text{if } W(t, v) > h_{high} \\ W_{edge} & \text{if } L_{low} < W(t, v) < h_{high} \\ U_{edge} & \text{if } W(t, v) < L_{low} \end{cases}$ (10) where, W(t, v) is the gradient in the gradient direction.
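The double-thresholding rule can be sketched as follows, assuming that a strong edge lies above the upper threshold and a non-edge below the lower one. The strict inequalities leave the boundary values undefined, so assigning them to the lower class here is an assumption, as are the function name and the string labels.

```python
def classify_edge(gradient, low, high):
    """Label one gradient magnitude as a strong edge, weak edge, or non-edge."""
    if low > high:
        raise ValueError("the lower threshold must not exceed the upper one")
    if gradient > high:
        return "strong"      # S_edge
    if gradient > low:
        return "weak"        # W_edge, kept only if edge tracking links it to a strong edge
    return "non-edge"        # U_edge


# Example: thresholds of 50 and 100 on three gradient magnitudes.
labels = [classify_edge(g, low=50, high=100) for g in (120, 75, 10)]
```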

ORB Feature: Oriented FAST and Rotated BRIEF (ORB), where FAST stands for Features from Accelerated Segment Test, is a fast feature detector available in OpenCV and used for object detection and 3D reconstruction. It uses a multiscale image pyramid and detects key points at each pyramid level using FAST, so that key points are located at different scales. Then, the Harris corner measure was applied to find the top N points among them. The direction of the feature points was obtained using the Intensity Centroid method. The feature is termed E₂ and can be expressed as, $E_{2} = \sum_{(x, y) \in W} x^{p} y^{q} I(x, y)$ (11) where, W is the input image, x and y are the pixel coordinates, I(x, y) is the grey value of the corresponding pixel, and p and q are the orders of the moment.
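As a minimal sketch of the Intensity Centroid method, the moments of Eq. (11) can be computed over a small patch and turned into an orientation angle. The sketch assumes that coordinates are measured from the patch centre, that the y axis points downwards as in image coordinates, and that the orientation is the angle atan2(m01, m10), as in the usual formulation of ORB. The function names are illustrative.

```python
import math


def moment(patch, p, q):
    """m_pq = sum over the patch of x^p * y^q * I(x, y), centred on the patch."""
    cy = (len(patch) - 1) / 2
    cx = (len(patch[0]) - 1) / 2
    return sum((col - cx) ** p * (row - cy) ** q * value
               for row, line in enumerate(patch)
               for col, value in enumerate(line))


def orientation(patch):
    """Angle (radians) from the patch centre towards its intensity centroid."""
    return math.atan2(moment(patch, 0, 1), moment(patch, 1, 0))


bright_right = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]
bright_bottom = [[0, 0, 0], [0, 0, 0], [1, 1, 1]]
angles = (orientation(bright_right), orientation(bright_bottom))
```

A patch that is brighter on its right side yields an angle of 0, and one that is brighter at the bottom yields π/2 under the downward y axis assumed here.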

SIFT: The scale-invariant feature transform (SIFT) works in several steps, i.e., detecting the extrema of the scale space, determining the position of the feature points, assigning the direction of the feature points, and generating the feature descriptors. In scale-space detection, the scale-space images were obtained as E₃. $E_{3} = G(m, n, μ) * W(x, y)$ (12) where, G(m, n, μ) is the Gaussian function, μ is the scale-space factor, and W(x, y) is the input image with pixel coordinates. The key points that are sensitive to noise were eliminated in the next step. In the third step, orientation was estimated by assigning a magnitude and an orientation to the key points. In the computation of descriptors, the features that are less sensitive to illumination changes were computed. In the final step, the Euclidean distance was used to find the best match for each key point.

HOG: Histograms of Oriented Gradients (HOG) were used to capture the gradient structure feature E₄. It works in five steps. First, the input images were resized. Then, the gradients of the image were calculated for each pixel in the horizontal and vertical directions. Using the gradients, the magnitude and the direction at each pixel were calculated. In the next step, block normalization was applied to compensate for brightness variations, as some parts of the image might appear faded. The normalization E₄ can be expressed as, $E_{4} = \frac{k}{\|k\|^{2} + ς}$ (13) where, ς is a small value added to avoid division by zero and k is the non-normalized vector containing all histograms in a given block. In the fifth step, the normalized feature vectors were collected from every horizontal and vertical block.

LBP: Local Binary Pattern (LBP) is a method of extracting texture features E₅ from the given input (W) by thresholding the neighbouring pixels against the value of the current pixel. It performs its operation in three stages, i.e., classification, detection, and recognition. The centre pixel of a particular region R is considered as the current pixel and is compared with each of its H neighbouring pixels. During LBP, the result of each comparison is represented as a binary digit. A local binary pattern was obtained by first concatenating these binary digits and then converting the sequence into a decimal number. The LBP operation is defined as, $E_{5} = \sum_{H = 0}^{H - 1} W(t_{0} - t_{H}) \, 2^{H}$ (14) where, t₀ is the intensity of the centre pixel and t_H is the intensity of the H-th neighbour.
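A minimal sketch of the LBP code for one 3×3 neighbourhood is given below. Two conventions are assumptions, since the text does not fix them: a neighbour contributes a 1 when it is at least as bright as the centre pixel, and the eight neighbours are visited clockwise starting at the top-left corner.

```python
def lbp_code(patch):
    """LBP code of the centre pixel of a 3x3 patch (a list of three rows)."""
    centre = patch[1][1]
    # Neighbour order: clockwise from the top-left corner (assumed convention).
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        if patch[r][c] >= centre:   # threshold the neighbour against the centre
            code += 2 ** bit        # weight the binary digit by 2^H
    return code


corners_bright = [[9, 1, 9],
                  [1, 5, 1],
                  [9, 1, 9]]
code = lbp_code(corners_bright)  # bits 0, 2, 4 and 6 are set: 1 + 4 + 16 + 64
```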

Color feature: One of the most noticeable aspects of pictures is color. Color attributes were specified in accordance with a specific color space or model. Once the color space is specified, the features are extracted from the whole specified space. Feature extraction makes the photometric information of the image accessible. The color features used were the mean, variance, and standard deviation. $φ_{m} = \frac{1}{r s} \sum_{p = 1}^{r} \sum_{q = 1}^{s} W(p, q)$ (15) $φ_{v} = \frac{1}{r s} \sum_{p = 1}^{r} \sum_{q = 1}^{s} (W(p, q) - φ_{m})^{2}$ (16) $φ_{s} = \sqrt{φ_{v}}$ (17) where, φ_m, φ_v and φ_s are the mean, variance, and standard deviation, respectively, and r × s is the image size. Together, φ_m, φ_v and φ_s are represented as E₆.

Shape feature: One of the low-level feature extraction techniques is shape-based feature extraction, which encodes basic geometrical patterns like straight lines in different directions. It may be further classified into two categories, i.e., the contour-based approach and the region-based approach. $E_{7} = O_{obj}^{-} (W)$ (18) where, E₇ is the extracted shape feature and $O_{obj}^{-}$ is the shape of the input image, which might be described by circularity, solidity, and so on.

Color Intensity: In feature extraction, the colors were separated into a set of bins. Each bin considered only the same color space. Features were calculated based on the intensity of pixels C_int and stored for object recognition. The extracted color features are represented as E₈, $E_{8} = C | W_{int}$ (19)

Contrast feature: The brightness of an image is denoted by its contrast, influenced by four factors: the dynamic range of grey levels, polarization degree of white and black portions in the histogram, edge sharpness, and the repetition of model cycles. To enhance contrast, the input image underwent filtering with Gabor wavelets possessing various spatial frequencies and orientations. Each wavelet effectively captured energy at a distinct frequency and direction, thereby yielding a localized frequency as a feature vector. Consequently, features were extracted from this collection of energy distributions. That can be expressed as, $E_{9} = \sum_{p, q}^{N - 1} W_{p, q} (p - q)^{2}$ (20) where, E₉ is the extracted contrast feature and $W_{p, q} (p - q)^{2}$ is the level contrast.

SURF: Speeded Up Robust Features (SURF) is a fast and robust algorithm based on a Hessian-matrix approximation that uses the integral image for extracting interest points. It operates on integral images by applying an approximate Gaussian-derivative scale-space representation. The SURF feature is termed E₁₀ and the integral image can be represented as, $E_{10} = \sum_{\bar{x} = 0}^{\bar{x} \leq p} \sum_{\bar{y} = 0}^{\bar{y} \leq q} W(\bar{x}, \bar{y})$ (21) where, $W(\bar{x}, \bar{y})$ is the input image and (p, q) is the position at which the integral image is evaluated. A Gaussian filter is applied for smoothing and for the scale-space representation, which is implemented as a pyramid. Because box filters are used, the sampling interval for the extraction of interest points is doubled as the filter size increases.
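The integral image that SURF relies on can be sketched in a few lines. The example below is an illustration rather than the authors' implementation: `integral_image` accumulates all pixels above and to the left of each position, and the helper `box_sum` (a hypothetical name) shows why box filters become cheap, since any rectangular sum needs at most four lookups.

```python
def integral_image(img):
    """ii[r][c] = sum of img over all rows <= r and all columns <= c."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r][c] = row_sum + (ii[r - 1][c] if r > 0 else 0)
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of the original image over the inclusive box, using four lookups."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ii = integral_image(image)
lower_right = box_sum(ii, 1, 1, 2, 2)  # 5 + 6 + 8 + 9
```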

BRIEF: Binary Robust Independent Elementary Features (BRIEF) is a general-purpose feature point descriptor that can be used in conjunction with any kind of detector; it is denoted as E₁₁. Three steps are involved, namely detection, description, and descriptor matching. To keep the descriptor from being overly sensitive to high-frequency noise, the image is first smoothed using a Gaussian filter. Next, BRIEF selects a random pair of pixels around the key point and compares their intensities. If the first pixel is brighter than the second pixel, the value 1 is assigned to the corresponding bit; otherwise, 0 is assigned. $E_{11} = \begin{cases} 1 & W(a) > W(b) \\ 0 & W(a) \leq W(b) \end{cases}$ (22) where, a and b are the two selected points. The process is repeated until the whole bit vector is assigned. Finally, the obtained features are expressed as, $E_{n} = \{E_{1}, E_{2}, \dots, E_{N}\}$ (23) where, N represents the number of extracted features.
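The BRIEF test can be sketched as below, following the rule stated in the text (a bit is 1 when the first pixel of a pair is brighter than the second). The uniform sampling of pixel pairs, the fixed seed, and the function names are assumptions made for illustration; the Gaussian smoothing step is omitted.

```python
import random


def sample_pairs(size, n_bits, seed=0):
    """Draw n_bits random pixel pairs inside a size x size patch."""
    rng = random.Random(seed)

    def point():
        return (rng.randrange(size), rng.randrange(size))

    return [(point(), point()) for _ in range(n_bits)]


def brief_descriptor(patch, pairs):
    """One bit per pair: 1 if the first pixel is brighter than the second."""
    return [1 if patch[ar][ac] > patch[br][bc] else 0
            for (ar, ac), (br, bc) in pairs]


def hamming(d1, d2):
    """Number of differing bits, the usual distance for descriptor matching."""
    return sum(x != y for x, y in zip(d1, d2))


patch = [[1, 2],
         [3, 4]]
pairs = [((1, 1), (0, 0)), ((0, 0), (1, 1)), ((0, 1), (0, 1))]
bits = brief_descriptor(patch, pairs)
```

Binary descriptors of this kind are usually matched with the Hamming distance, which is why `hamming` is included.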

3.5 Feature selection

After completing the feature extraction, the important features are selected from E_n by the Chaotic Tent Shuffled Shepherd Optimization (CTSSO) algorithm to reduce the training time of the detection system. In general, the Shuffled Shepherd Optimization Algorithm (SSOA) is a population-based, multi-community algorithm motivated by the herding behaviour of a shepherd who keeps the sheep in groups. However, it has a few issues, i.e., the random selection of the step size and a lack of exploitation, which lead to slow convergence. To overcome those issues, the chaotic tent map was used. As per the SSOA, the features were represented as sheep and then divided into herds. CTSSO selects the features of interest in the following steps: initializing the population, shuffling, moving the sheep, updating the position of each herd, and checking the termination conditions.

The first stage of the process starts with randomly generated populations in the search space. The member of the population $MC_{i, j}$ is expressed as, $MC_{i, j} = Ψ \times ℵ, \quad i = 1, 2, \dots, m, \; j = 1, 2, \dots, n$ (24) where, i and j are the row and column, respectively, Ψ represents the number of communities, and ℵ represents the number of population members belonging to each community. The next process is shuffling, which arranges the members into m groups. Then, the members of each community Q are represented as, $Q = \begin{bmatrix} MC_{1, 1} & MC_{1, 2} & \dots & MC_{1, j} & \dots & MC_{1, n} \\ MC_{2, 1} & MC_{2, 2} & \dots & MC_{2, j} & \dots & MC_{2, n} \\ \vdots & \vdots & & \vdots & & \vdots \\ MC_{i, 1} & MC_{i, 2} & \dots & MC_{i, j} & \dots & MC_{i, n} \\ \vdots & \vdots & & \vdots & & \vdots \\ MC_{m, 1} & MC_{m, 2} & \dots & MC_{m, j} & \dots & MC_{m, n} \end{bmatrix}$ (25)

In this process, the m groups are selected based on the objective function. The members are sorted by their objective function value, which is calculated from the accuracy of the classifier. To counter slow or premature convergence, this paper proposes a chaotic tent map, which increases the step size. The step size determines the movement of the members in each group and is calculated under two strategies. The first is diversification ($T_{i, j}^{α}$), which has the ability to visit new regions. The second is intensification ($(1 - T_{i, j})^{β}$), which has the ability to revisit previously visited search space. The chaotic step size for each community member ($ST_{i, j}$) is expressed mathematically as follows, ${ST}_{i, j} = \begin{cases} μ T_{i, j}^{α} & T_{i, j} < \frac{1}{2} \\ μ (1 - T_{i, j})^{β} & \frac{1}{2} \leq T_{i, j} \end{cases}$ (26)

Here, μ is a positive constant. Using this step size, the new position is computed. If the objective function value at the moved position is not worse than the old value, the position is updated. The newly updated position is expressed as, ${newMC}_{i, j} = {MC}_{i, j} + {ST}_{i, j}$ (27)
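The step-size rule of Eq. (26) and the position update of Eq. (27) can be sketched together. This is a minimal illustration under stated assumptions: the values of μ, α and β are placeholders, the chaotic values T are passed in rather than generated (the tent-map recurrence is not spelled out above), and `fitness` is a hypothetical objective where larger is better.

```python
def step_size(t, mu=1.0, alpha=2.0, beta=2.0):
    """Chaotic step size of Eq. (26) for one chaotic value t in [0, 1]."""
    if t < 0.5:
        return mu * t ** alpha          # diversification branch
    return mu * (1.0 - t) ** beta       # intensification branch


def update_position(position, t_values, fitness, mu=1.0, alpha=2.0, beta=2.0):
    """Eq. (27) with the acceptance rule: keep the move only if it is not worse."""
    candidate = [x + step_size(t, mu, alpha, beta)
                 for x, t in zip(position, t_values)]
    if fitness(candidate) >= fitness(position):
        return candidate
    return position


def toy_fitness(p):
    """Toy objective for illustration only: highest when every coordinate is 1."""
    return -sum((x - 1.0) ** 2 for x in p)


moved = update_position([0.0, 0.0], [0.5, 0.5], toy_fitness)
kept = update_position([1.0], [0.5], toy_fitness)
```

In the example, the first member moves towards the optimum and is accepted, whereas the second member is already optimal, so its candidate is rejected and the previous position is kept.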

After completing a fixed number of iterations, the optimization process is checked and terminated. In this way, the optimal features are selected by the CTSSO algorithm, and the selected features can be written as, $E_{sel(n)} = \{E_{sel(1)}, E_{sel(2)}, \dots, E_{sel(N)}\}$ (28) where, $E_{sel(n)}$ denotes the selected features.

The pseudocode for the CTSSO algorithm is shown in Fig. 2. It explains the fitness evaluation and update procedures of the CTSSO.

Fig. 2

Architecture of DenseCNN.

The pseudo code of the proposed CTSSO method is given below,

Input: Extracted features E_n = {E_1, E_2, ..., E_N}
Output: Selected features E_sel(n) = {E_sel(1), E_sel(2), ..., E_sel(N)}
Begin
  Initialize population E_n, parameter μ, T_{i,j}^α, T_{i,j}^β, maximum number of iterations I_max
  Compute fitness of population
  Set I = 0
  While (I < I_max) do
    Evaluate the objective function
    Construct the communities Q // shuffling process
    Compute step size ST_{i,j}
    Compute the new position newMC_{i,j}
    Update step size parameters T_{i,j}^α, T_{i,j}^β
    Evaluate fitness of new solution newMC_{i,j}
    If (F_n(newMC_{i,j}) > F_n(prevMC_{i,j}))
      Update to new position newMC_{i,j}
    Else
      Keep previous position prevMC_{i,j}
    End if
    Update best solution
    Set I = I + 1
  End while
  Return selected features E_sel(n)
End

Chaotic optimization techniques are a class of algorithms that are inspired by the chaotic behavior of natural systems. They have been shown to be effective in solving a variety of optimization problems, including feature selection. However, the performance and reliability of chaotic optimization techniques can be unpredictable, and they may not always find optimal solutions. One way to mitigate the unpredictable performance and reliability of chaotic optimization techniques for feature selection is to use ensemble methods, such as weighted average ensemble, majority voting ensemble, or stacking ensemble. Ensemble methods combine the predictions of multiple chaotic optimization techniques to produce a more robust prediction. This can help to reduce the sensitivity of the algorithm to parameters and to prevent premature convergence.

Another way to mitigate this challenge is to use regularization techniques, such as L1 regularization, L2 regularization, or elastic net regularization. Regularization techniques penalize complex models, which can help to prevent overfitting.

3.6 Object classification

In this section, the object is detected from the selected features $E_{sel(n)}$ using SNMC-DenCNN. A DenseNet is a type of convolutional neural network that utilizes dense connections between layers through dense blocks, in which all layers are directly connected with each other. In general, DenseNet is employed to alleviate the vanishing gradient problem, strengthen feature propagation, encourage feature reuse, and decrease the number of parameters. It is structured as a number of blocks, each consisting of l layers, and the outputs of each block are directly concatenated to the next block. Therefore, the number of direct connections can be expressed as l(l + 1)/2. Excessive connections not only decrease the network’s computation efficiency and parameter efficiency, but also make the network more prone to overfitting. To overcome these issues, SNMC is used. The architecture of the DenseCNN is shown in Fig. 3.

Fig. 3

Comparing the performance of the proposed SNMC-DenCNN with the existing methods based on the accuracy, specificity, precision, recall, and F-Measure metrics.

Dense Block: In a dense block, each layer performs a nonlinear transformation $N_{l}(\cdot)$, which adds extra features to the existing feature maps. In DenseNet, the feature maps obtained from the previous layers are concatenated and passed to the next layer. The output of the l-th layer of a dense block is formulated as, $D_{l} = N_{l}([E_{0}^{-}, E_{1}^{-}, \dots, E_{l - 1}^{-}])$ (29) where, D_l is the output of the composite function, $E_{0}^{-}, E_{1}^{-}, \dots, E_{l - 1}^{-}$ denote the input feature maps of the layer, and $N_{l}(\cdot)$ is the composite function. This composite function operates in three stages, which are convolution, ReLU, and batch normalization. The composite function is given by, $N_{l} = \mathrm{ReLU}(E_{sel(n)} \times K_{\ker})$ (30) where, K_ker represents the kernels, which are initialized using the Cauchy distribution. The kernels are initialized as, $Cd(K_{\ker}; K_{\ker}^{0}, γ) = \frac{1}{π γ} \left[ \frac{γ^{2}}{(K_{\ker} - K_{\ker}^{0})^{2} + γ^{2}} \right]$ (31) where, γ is the scale parameter and $K_{\ker}^{0}$ is the location parameter. The three operations are repeated until the l-th layer. After completing the operations, the output is given as input to the transition layer.
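A Cauchy-based kernel initialization can be sketched as follows, assuming that Eq. (31) denotes the standard Cauchy density with a location parameter and the scale parameter γ. Samples are drawn by inverse-CDF sampling. The clipping bound, the default values, and the function names are assumptions added for illustration, since the heavy tails of the Cauchy distribution would otherwise produce occasional very large weights.

```python
import math
import random


def cauchy_pdf(k, k0=0.0, gamma=1.0):
    """Standard Cauchy density with location k0 and scale gamma."""
    return (1.0 / (math.pi * gamma)) * (gamma ** 2 / ((k - k0) ** 2 + gamma ** 2))


def cauchy_kernel(rows, cols, k0=0.0, gamma=0.05, clip=1.0, seed=0):
    """Draw a rows x cols kernel from a Cauchy distribution, clipped to [-clip, clip]."""
    rng = random.Random(seed)
    kernel = []
    for _ in range(rows):
        row = []
        for _ in range(cols):
            u = rng.random()
            # Inverse-CDF sampling: k = k0 + gamma * tan(pi * (u - 1/2)).
            value = k0 + gamma * math.tan(math.pi * (u - 0.5))
            row.append(max(-clip, min(clip, value)))  # clipping is an assumption
        kernel.append(row)
    return kernel


kernel = cauchy_kernel(3, 3)
peak = cauchy_pdf(0.0)  # the density peaks at the location parameter
```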

Transition layer: The layers between the dense blocks are transition layers, which consist of two operations, namely convolution and pooling. Convolution was applied to detect the same features in different parts of the input space. After the convolution, pooling was deployed to produce a down-sampled feature map, which changes the size of the image. Only max-pooling, which selects the maximum value, is used. The size of the output after the pooling operation is calculated as, $f = \frac{D_{l} - K_{\ker}}{S_{str}}$ (32) where, S_str is the stride of the kernel. This process is continued until the last block is reached. After the transition, the output is given to the next block, and the original input image is also directly concatenated to the next block. Each block then produces its own output.

Classification layer: The classification is executed based on the softmax function, and the activation used for multi-class classification is SNMC. The softmax function converts the real values into probabilities in the range 0 to 1. The SNMC activation is given by,

$AF_{SNMC}(f) = \begin{cases} G & f > 0 \\ l \exp(G) & f \leq 0 \end{cases}, \quad G = f \tanh(\mathrm{softplus}(f))$ (33) where, AF_SNMC is the SNMC activation function. The classification layer detects objects such as cars, trucks, and humans, which can be denoted as $D_{\det}^{-} = \{D_{\det}^{-(car)}, \dots, D_{\det}^{-(human)}\}$. Finally, the loss of the detection model is evaluated. $LF = [D_{\det}^{-tar} - D_{\det}^{-}]^{2}$ (34) where, $D_{\det}^{-tar}$ is the targeted object.
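The SNMC activation can be sketched directly from Eq. (33), read as a piecewise function with G = f·tanh(softplus(f)). The factor l is not defined in the text, so it appears below as a parameter named `scale` with a placeholder default of 1.0; this reading and the function names are assumptions.

```python
import math


def softplus(f):
    """Numerically stable ln(1 + e^f)."""
    return max(f, 0.0) + math.log1p(math.exp(-abs(f)))


def g(f):
    """Non-monotonic term G = f * tanh(softplus(f))."""
    return f * math.tanh(softplus(f))


def snmc(f, scale=1.0):
    """Piecewise activation of Eq. (33); `scale` stands for the undefined factor l."""
    value = g(f)
    if f > 0:
        return value
    return scale * math.exp(value)


positive_branch = snmc(1.0)   # equals g(1.0)
at_zero = snmc(0.0)           # scale * exp(0)
```

The term G dips below zero for moderately negative inputs and returns towards zero for very negative ones, which is the non-monotonic behaviour the name refers to.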

3.7 Object tracking

After the object detection, the detected objects $D_{\det}^{-}$ were tracked using the CSRT tracker, which follows the procedure of the Channel and Spatial Reliability discriminative correlation filter (CSR-DCF). The system consistently updated the tracked object. In this section, tracking was maintained in two stages, namely localization and update. First, features were extracted from the detected objects, centred on the estimated target position, and correlated with the learned filter. The object was localized by combining the correlation responses weighted by the estimated channel reliability scores. Then, the detection reliability $κ^{(\det)}$ is expressed as, $κ^{(\det)} = (κ_{1}^{\det}, \dots, κ_{N_{c}}^{\det})$ (35) where, N_c denotes the number of channels. After the localization, the estimated location was centred in the targeted region. Then, the foreground, weighted using the Epanechnikov kernel, and the background were extracted from the neighbourhood, which highlighted the object size by constructing a spatial map. Then, the object was updated by an exponential moving average (over the current and previous frames). Then, the tracking reliability D^trac is estimated as, $D^{trac} = (D_{1}^{trac}, \dots, D_{N}^{trac})$ (36) where, N denotes the number of tracked objects.

3.8 Object Counting Based on IOU

An object is counted after the completion of tracking with the help of the intersection over union (IOU), which is computed between the detected objects and the tracked objects. In counting, some objects might appear in more than one frame due to congestion, which would cause redundancy in the object count. For that reason, IOU was used to verify that each object is counted only once. It is computed as, $O_{count} = \frac{D_{\det}^{-} \cap D^{trac}}{D_{\det}^{-} \cup D^{trac}}$ (37) where, $D_{\det}^{-}$ is the detected object and D^trac is the tracked object. If the tracked object moves away from the counted target box, the object is not counted, and the system starts to track the object again using the CSRT tracker. Then, the system sums the object count. It not only counts objects but also identifies similar objects by comparing subsequent frames.
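The IOU test between detected and tracked boxes can be sketched as follows. Boxes are assumed to be axis-aligned tuples (x1, y1, x2, y2), and the 0.5 overlap threshold used to decide that a detection matches an already tracked object is a placeholder, not a value taken from the experiments. `count_new_objects` is a hypothetical helper name.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_new_objects(detections, tracked, threshold=0.5):
    """Count detections that do not overlap any already tracked box."""
    return sum(1 for d in detections
               if all(iou(d, t) < threshold for t in tracked))


tracked = [(0, 0, 2, 2)]
detections = [(0, 0, 2, 2), (10, 10, 12, 12)]
new_objects = count_new_objects(detections, tracked)  # only the second box is new
```

Detections that overlap a tracked box strongly are treated as the same object and are not counted a second time.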

4 Result and discussion

Here, the performance of the proposed SNMC-DenCNN is analyzed and compared with the existing methods. The proposed system was implemented in the working platform MATLAB.

4.1 Experimental settings

4.1.1 Database description

The data used in the proposed system was collected from the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset, which contains 12919 training images annotated with 3D bounding boxes. The full benchmark contains many tasks such as optical flow, visual odometry of moving objects, and so on. The object detection dataset is included in this dataset, along with bounding boxes and monocular images. There are 21 training sequences and 29 test sequences in the object tracking benchmark. The following datasets were used for the study:

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) Dataset: KITTI [51] is one of the most widely used datasets for autonomous driving and mobile robotics. It is made up of several hours’ worth of traffic scenarios recorded with several sensor modalities, including high-resolution RGB cameras, grayscale stereo cameras, and a 3D laser scanner. Despite its wide acceptance, the dataset itself does not provide ground truth for semantic segmentation.

UCF_CC_50 Dataset: The UCF_CC_50 dataset [52] is a benchmark dataset for crowd counting tasks. It features 50 video clips from diverse real-world sources, capturing scenes with varying crowd sizes and densities. It offers a temporal aspect, allowing us to evaluate the system’s performance in dynamic scenarios.

Similar datasets can also be used, such as the ShanghaiTech dataset [37], which is a large-scale crowd counting dataset, and the WorldExpo’10 dataset [53], which represents a large-scale crowd counting scenario recorded during the 2010 Shanghai World Expo.

4.1.2 Evaluation metrics

To quantitatively measure the performance of the proposed SNMC-DenCNN system, we employed a comprehensive set of evaluation metrics, including:

Accuracy: Accuracy measures the proportion of correctly detected and tracked objects concerning the total number of objects. It provides a global assessment of the system’s overall performance.

Precision: Precision calculates the ratio of true positive detections to the total number of detected objects. It evaluates the system’s ability to make accurate positive predictions.

Recall: Recall, also known as sensitivity, assesses the system’s capacity to identify and track true positive objects out of all possible positive objects. It evaluates the extent to which the system can prevent false negatives.

Specificity: Specificity evaluates the system’s ability to correctly identify and track true negative objects out of all possible negative objects. It evaluates the system’s ability to avoid false positives.

F1 Score: The F1 Score combines precision and recall providing a balanced measure of the system’s object detection and tracking performance. It is very helpful when working with datasets that are unbalanced.
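The five metrics above follow from the confusion-matrix counts. The sketch below uses the standard definitions with hypothetical counts and assumes non-zero denominators. As a consistency check, it also recomputes the F1 score from the precision and recall reported later for the proposed system (96.88 and 97.35).

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": f1_score(precision, recall),
    }


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
metrics = detection_metrics(tp=90, fp=10, tn=80, fn=20)

# F1 recomputed from the reported precision and recall of the proposed system.
reported_f1 = round(f1_score(96.88, 97.35), 2)
```

The recomputed value agrees with the reported F-Measure of 97.11.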

4.1.3 Implementation details

Our experiments were conducted using a deep learning framework and were run on a system equipped with NVIDIA GPUs for efficient training and evaluation. Key implementation details include:

Network Architecture: The SNMC-DenCNN network architecture consists of multiple convolutional layers followed by max-pooling and fully connected layers. We utilized a pre-trained backbone architecture (e.g., VGG16, ResNet) for feature extraction and fine-tuned the model for density map regression.

Training: The network was trained using stochastic gradient descent (SGD) with appropriate learning rate schedules. We employed data augmentation techniques to enhance model generalization, including random cropping, rotation, and flipping.

Batch Size: The batch size was carefully chosen to balance memory constraints and training efficiency. Larger batch sizes were utilized for datasets with more extensive training samples.

Hyperparameters: Hyperparameters such as weight decay, dropout rates, and activation functions were optimized through cross-validation to achieve the best performance.

4.2 Ablation study

In this section, we conduct an ablation study to gain insights into the individual contributions and significance of key components within the proposed SNMC-DenCNN system. By systematically evaluating the impact of various elements, we aim to provide a deeper understanding of how each component influences the overall performance.

Ablation Studies with CNN Components:

Instead of evaluating the impact of specific soft computing techniques, focus on the ablation of various components within the CNN architecture. For example, experiment with different configurations of Dense Blocks, variations in the number of layers, or modifications to the transition layers.

Comparison with CNN Variants:

Compare the proposed SNMC-DenCNN system with variants of CNN architectures. Explore different modifications to the architecture, such as varying the number of layers, using different activation functions (e.g., ReLU, Leaky ReLU), or employing alternative normalization techniques.

Incorporate State-of-the-Art CNN Models:

In the comparison section, include benchmark CNN models that are considered state-of-the-art for object detection and tracking. This allows a direct comparison between the proposed system and the best-performing CNN architectures.

Training Efficiency and Convergence:

Assess the training efficiency and convergence characteristics of the SNMC-DenCNN system compared to other CNN-based models. Consider factors such as training time, convergence speed, and computational resources required for achieving optimal performance.

Performance Metrics:

Evaluate the performance of the SNMC-DenCNN system using standard metrics employed in the evaluation of CNN models. Precision, recall, F1 score, and accuracy are essential metrics for object detection and tracking.

Transfer Learning and Domain Adaptation:

Explore the system’s ability to transfer learning to new domains and evaluate its domain adaptation capabilities. Compare these capabilities with other CNN models, especially those known for robustness across different scenarios.

Real-Time Processing:

Emphasize the real-time processing capabilities of the proposed system in comparison to other CNN-based models. Consider the speed and efficiency of object detection and tracking during real-time applications.

Resource Utilization:

Compare the resource requirements of the SNMC-DenCNN system with other CNN architectures. Assess the model’s efficiency in terms of GPU utilization, memory consumption, and overall hardware requirements.

By focusing the ablation studies and comparisons on CNN-related components and models, you can highlight the contributions and improvements specific to the proposed CNN-based system for object detection and tracking.

4.2.1 Experimental design

The ablation study is designed to investigate the following aspects:

Motion Estimation Method: To assess the influence of the motion estimation method, we compare the results obtained using the modified Adaptive Distance Covariance Rood Pattern Search (ADCRPS) algorithm, which is a crucial component of our system, with alternative motion estimation techniques.

Morphological Operations: We analyze the effect of morphological operations (e.g., Dilation, Erosion, Opening, and Closing) on the quality of motion-estimated images. We compare the performance with and without these operations to determine their impact.

Feature Selection Algorithm: The Chaotic Tent Shuffled Shepherd Optimization (CTSSO) algorithm is employed for feature selection in the proposed system. We evaluate the influence of feature selection by comparing the results with and without the CTSSO algorithm.

Object Detection Architecture: We investigate the choice of the object detection architecture, particularly the use of Dense Convolutional Neural Networks (DenCNN) with or without the inclusion of the SNMC activation function.

4.2.2 Results and analysis

The ablation study results are presented in terms of the evaluation metrics used in the main experiments, including accuracy, precision, recall, specificity, and F1 score. We also provide qualitative insights into the system’s performance through visualizations and comparisons of object detection and tracking results.

Motion Estimation Method: We observe the impact of using ADCRPS compared to alternative motion estimation methods on the accuracy of object tracking. This analysis helps us understand the suitability and effectiveness of the chosen motion estimation approach.

Morphological Operations: The ablation study allows us to assess whether morphological operations improve the quality of motion-estimated images and consequently enhance object detection and tracking performance.

Feature Selection Algorithm: We evaluate the role of the CTSSO algorithm in selecting informative features and its influence on the overall system accuracy and efficiency.

Object Detection Architecture: By comparing the results of DenCNN with and without SNMC activation, we can determine whether the SNMC activation function contributes to improved object detection accuracy.

Through this comprehensive ablation study, we aim to provide a granular understanding of the individual contributions of system components, allowing us to fine-tune and optimize the SNMC-DenCNN system for the task of object detection and tracking in diverse scenarios. These insights can guide further improvements and developments in the proposed methodology.

The results of the ablation study are summarized in Table 1, highlighting the performance differences between different configurations and shedding light on the essential components that lead to the system’s superior performance.

Table 1
Comparison of baseline with other detection

Configuration	Accuracy (%)	Precision (%)	Recall (%)	Specificity (%)	F1 score (%)
Baseline (Full System)	98.2	96.88	97.35	96.34	97.11
Motion Estimation Method	96.5	95.12	95.78	94.89	95.45
Morphological Operations	97.8	96.45	97.03	96.21	96.74
Feature Selection Algorithm (CTSSO)	97.2	96.03	96.78	95.96	96.40
Object Detection Architecture	97.5	96.25	96.95	96.05	96.60

Each row corresponds to a different configuration or component of the system, and the columns represent key evaluation metrics, including accuracy, precision, recall, specificity, and F1 score. The “Baseline” row represents the full SNMC-DenCNN system’s performance for reference.

The table provides a clear comparison of how altering each component affects the overall system’s performance. This summary helps in identifying the essential components that contribute to the superior performance of the SNMC-DenCNN system. It also aids in understanding which elements can be optimized further for better results.
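One way to read Table 1 is to subtract each ablated row from the baseline and rank the components by the resulting accuracy gap. The following is a minimal sketch of that arithmetic, not part of the authors' pipeline; the function names `drops_from_baseline` and `rank_by_accuracy_drop` are illustrative assumptions, and the numbers are copied from Table 1.

```python
# Rows copied from Table 1 (all values in percent).
# Column order: accuracy, precision, recall, specificity, F1 score.
BASELINE = (98.2, 96.88, 97.35, 96.34, 97.11)
ABLATIONS = {
    "Motion Estimation Method": (96.5, 95.12, 95.78, 94.89, 95.45),
    "Morphological Operations": (97.8, 96.45, 97.03, 96.21, 96.74),
    "Feature Selection Algorithm (CTSSO)": (97.2, 96.03, 96.78, 95.96, 96.40),
    "Object Detection Architecture": (97.5, 96.25, 96.95, 96.05, 96.60),
}


def drops_from_baseline(baseline, ablations):
    """Return, per configuration, baseline minus row for every metric."""
    return {
        name: tuple(round(b - r, 2) for b, r in zip(baseline, row))
        for name, row in ablations.items()
    }


def rank_by_accuracy_drop(baseline, ablations):
    """Order configurations from largest to smallest accuracy gap."""
    drops = drops_from_baseline(baseline, ablations)
    return sorted(drops, key=lambda name: drops[name][0], reverse=True)


drops = drops_from_baseline(BASELINE, ABLATIONS)
for name in rank_by_accuracy_drop(BASELINE, ABLATIONS):
    print(f"{name}: accuracy gap {drops[name][0]:.1f} points")
```

Under this reading, the motion estimation configuration shows the largest accuracy gap (1.7 points) and the morphological operations configuration the smallest (0.4 points).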

4.2.3 Performance analysis

In this section, the performance of the proposed method is analyzed by comparing it with existing methods, i.e., Recurrent Neural Network (RNN), DenCNN, Artificial Neural Network (ANN), and Deep Neural Network (DNN), in terms of accuracy, specificity, sensitivity, precision, recall and F-Measure. The results are shown in Table 2.

Table 2
Analyzing the performance of the proposed SNMC-DenCNN with the existing models

Performance metrics	Proposed SNMC-DenCNN	RNN	DenCNN	ANN	DNN
Accuracy	98.2	94.60	93.77	87.93	92.06
Specificity	96.34	92	89.69	85.67	89.21
Sensitivity	97.18	90.32	89.97	85	88.98
Precision	96.88	92.79	90	86	90
Recall	97.35	91	90.19	85.57	89
F-Measure	97.11	91.88	90.09	85.78	89.49

The high accuracy of the proposed system is attributed to the SNMC-DenCNN classifier. Specificity measures the proportion of actual negatives that are correctly identified as negative. The F-Measure combines precision and recall as their harmonic mean, which for two values equals the square of the geometric mean divided by the arithmetic mean; when precision and recall are close, it is approximately their average. From the analysis, the proposed SNMC-DenCNN achieved 98.2% accuracy, which is higher than that of the existing methods.
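The relationship between the F-Measure and the other means can be checked numerically. The sketch below uses the precision and recall reported for the proposed model in Table 2; the helper name `f_measure` is an illustrative assumption.

```python
from math import sqrt


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the proposed SNMC-DenCNN, taken from Table 2 (%).
precision, recall = 96.88, 97.35

f1 = f_measure(precision, recall)
geometric_mean = sqrt(precision * recall)
arithmetic_mean = (precision + recall) / 2

print(f"F-Measure:       {f1:.2f}")
print(f"Arithmetic mean: {arithmetic_mean:.3f}")
# Identity for two values: harmonic mean = geometric mean^2 / arithmetic mean.
print(f"G^2 / A:         {geometric_mean ** 2 / arithmetic_mean:.2f}")
```

The computed value rounds to 97.11, matching the F-Measure reported in Table 2, and lies just below the arithmetic mean of 97.115 because the two inputs are close.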

4.3 Comparative analysis

In this comparative analysis, the performance of the proposed SNMC-DenCNN is compared graphically with that of the existing methods. The main aim of the proposed work is to achieve higher accuracy. The proposed SNMC-DenCNN algorithm achieved 98.2% accuracy, which is higher than that of the existing algorithms (87.93% to 94.60%).

Specificity evaluates a model’s ability to correctly identify true negatives. The proposed system correctly identified 96.34% of the negatives, whereas the existing methods reached between 85.67% and 92%. The proposed SNMC-DenCNN was also compared with the existing techniques in terms of precision, recall and F-Measure. From Fig. 4 it can be seen that precision and recall are high, at 96.88% and 97.35% respectively, and that the F-Measure is also high, at 97.11%.
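For reference, the metrics discussed here follow from the four confusion-matrix counts. The sketch below shows the standard definitions; the counts are hypothetical values chosen for illustration only and are not taken from the paper's experiments.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def specificity(tn, fp):
    """True negative rate: share of actual negatives predicted negative."""
    return tn / (tn + fp)


def sensitivity(tp, fn):
    """True positive rate (recall): share of actual positives detected."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)


# Hypothetical counts, for illustration only (not the paper's data).
tp, fp, tn, fn = 90, 10, 80, 20

print(f"accuracy    = {accuracy(tp, fp, tn, fn):.3f}")
print(f"specificity = {specificity(tn, fp):.3f}")
print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"precision   = {precision(tp, fp):.3f}")
```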

Fig. 4. Computational time analysis.

The computational time of the proposed method is 807 s for the entire processing pipeline. The existing models take longer: RNN takes 948 s, DenCNN 1150 s, ANN 1245 s, and DNN 1054 s. Hence, the proposed SNMC-DenCNN requires the least processing time among the compared methods.
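The reported times can also be expressed as relative savings. The following sketch computes, for each existing model, the percentage of its run time that the proposed method saves; the helper name `time_saving_percent` is an illustrative assumption, and the times are those quoted above.

```python
# Computational times in seconds, as reported in the text.
PROPOSED_S = 807
EXISTING_S = {"RNN": 948, "DenCNN": 1150, "ANN": 1245, "DNN": 1054}


def time_saving_percent(proposed, other):
    """Percentage of the other model's run time saved by the proposed one."""
    return 100.0 * (other - proposed) / other


savings = {
    name: time_saving_percent(PROPOSED_S, seconds)
    for name, seconds in EXISTING_S.items()
}

for name, saving in savings.items():
    print(f"vs {name}: {saving:.1f}% less time")
```

With these figures, the saving ranges from roughly 15% against RNN to roughly 35% against ANN.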

5 Conclusion

In this paper, a new framework was presented to detect multiple objects using the SNMC-DenCNN algorithm. The proposed work not only performs object detection but also returns object counts by tracking the detected objects. The methodology comprises the following phases: preprocessing, motion estimation, morphological operation, feature extraction, feature selection, object detection, tracking, and counting. In the experimental analysis, the proposed SNMC-DenCNN was evaluated against existing methods using images from the KITTI dataset. The proposed system achieved an accuracy of 98.2%, which is higher than that of the existing models, and a computational time of 807 s, which is lower than that of the existing models. These results indicate that the proposed system is more efficient than the existing systems. Future work may consider more advanced algorithms to further improve the accuracy of the detection system.

Declaration on consent for publication

We, the authors of this research paper, hereby provide our consent for the publication of our work titled “Proposed Object Detection and Tracking System Using SNMC-DenseCNN” in the specified journal or conference proceedings.

In this declaration, we wish to clarify the intent and significance of our research, as well as the importance of sharing our findings with the scientific community.
