Optimization of student behavior recognition based on YOLO and DeepSort in intelligent classroom environment

Abstract

Student behavior recognition is of great significance in the intelligent classroom environment for improving teaching quality, achieving personalized learning, and optimizing classroom management. However, existing technologies remain limited in accuracy and real-time performance in complex scenarios. To address this, the study improves the eighth-generation “You Only Look Once” model (YOLOv8) with the Squeeze-and-Excitation (SE) attention mechanism, a Feature Pyramid Network (FPN) structure, and an anchor box mechanism, enhancing the accuracy and robustness of student behavior recognition. The SE attention mechanism improves the efficiency of feature extraction by modeling the dependencies between feature channels. The FPN structure strengthens multi-scale feature fusion by fusing features at different levels. In addition, the model is combined with DeepSort, a deep-learning-based real-time multi-object tracking algorithm, to mitigate identity switching and trajectory loss and to achieve continuous tracking of students’ behaviors. In the experiments, the improved YOLOv8 model reached a mean average precision (mAP@50) of 82.3% and 83.0% on the SCB-dataset3 and CampusGuard datasets, respectively, outperforming the control models. The multi-target tracking accuracy of the DeepSort algorithm was 84.2% and 83.7%, respectively, and it also performed well in terms of robustness and real-time performance. The results show that the improved YOLOv8 model and the deep-learning-based real-time multi-object tracking algorithm can effectively improve the performance of student behavior recognition and tracking, providing technical support for teaching management and personalized learning in the intelligent classroom environment.

Keywords

intelligent classroom, YOLOv8, DeepSort, behavior recognition, behavior tracking

Introduction

With the continuous advancement of educational informatization, intelligent classrooms have gradually become an essential development direction in education. In the Intelligent Classroom Environment (ICE), Student Behavior Recognition (SBR) is one of the key technologies for efficient teaching management and personalized learning, and it has received widespread attention. By automatically identifying student behaviors in the classroom, such as raising hands, reading, leaning against the table, writing, and playing with mobile phones, SBR lets teachers follow students’ classroom participation and learning status in real time and adjust teaching strategies promptly to improve teaching quality.^1,2 In addition, SBR can provide data support for personalized learning, helping students better grasp their learning progress and improve learning outcomes.^3,4 For example, by identifying whether students actively participate in classroom interactions (such as raising their hands to speak), teachers can judge the students’ comprehension level and provide targeted guidance. Meanwhile, identifying whether students are distracted (such as playing with mobile phones) can help teachers intervene in time and maintain good classroom order. Currently, with the rapid growth of Deep Learning (DL) technology, significant progress has been made in object detection and tracking algorithms. However, two difficulties remain in ICE. First, behavior patterns differ between students, and the behavior of the same student may change over time.^5,6 Second, the accuracy and real-time performance of existing technologies remain limited in complex scenarios. In target tracking in particular, identity switching and trajectory loss occur easily in multi-person scenes and seriously affect the continuity and accuracy of behavior tracking. These problems limit the practical value of SBR in intelligent classrooms.

The You Only Look Once (YOLO) algorithm is a widely used object detection method that has received extensive attention for its fast and accurate detection ability. Its core idea is to treat object detection as a regression problem, directly predicting bounding boxes and category probabilities from the image, thereby achieving end-to-end object detection. YOLOv8, a recent version of the family, has been optimized in feature extraction, model architecture, and training strategy, achieving higher detection accuracy and faster speed. DeepSort is a deep-learning-based real-time multi-target tracking algorithm. By combining Kalman filtering with appearance feature matching, it tracks targets continuously, and its core idea is to reduce identity switching and trajectory loss through prediction and matching. Its optimized algorithm structure and computing process give it good real-time performance, which makes it suitable for large-scale target tracking scenarios.

At present, several scholars have studied student behavior recognition. Jia et al. proposed a DL-based student behavior detection framework to address the difficulty of real-time and accurate detection of student behavior in traditional classroom environments. The framework consisted of the You Only Look Once Version 5-Advanced (YOLOv5-A) network and OpenPose. It achieved a mean Average Precision (mAP) of 0.821 on multi-class tasks, which was superior to the control model, and could accurately detect various behaviors of students in the classroom.⁷ Alruwais proposed a DL-based framework for student recognition and behavior monitoring in online classrooms. The framework relied mainly on Convolutional Neural Network (CNN) models for student recognition and emotion monitoring to detect students’ emotional states. It could effectively identify students and monitor their behavior, thereby enhancing student engagement and learning outcomes in online learning environments.⁸ Pabba et al. proposed a visual intelligence system based on body posture and facial features for student behavior classification in intelligent classrooms. The system combined OpenPose and a CNN, recognized multiple behaviors and emotional states, and improved classification accuracy through multi-modal data fusion. It performed well in student behavior classification tasks, providing real-time feedback to teachers, helping optimize the teaching process, and enhancing student participation and learning outcomes.⁹ Sharma et al. proposed a 3DCNN-based approach to recognize the behavior of students and teachers in classroom environments. The algorithm first used a single detection method to identify whether the scene belonged to students or teachers and then classified the behavior through a 3DCNN. It achieved an average recognition accuracy of 83.5% on the dataset and recognized single-person behaviors effectively, but its accuracy was relatively low in multi-person behavior recognition.¹⁰ Wang et al. proposed a 3DCNN-based SBR algorithm for ICE. The algorithm modeled the spatiotemporal features in video data using a 3DCNN and could efficiently and accurately recognize the diverse and subtle behaviors of students in actual learning environments. It performed well in SBR tasks, giving educators a tool to gain a deeper understanding of students’ learning habits and behavior patterns, and it provided data support for classroom management, personalized learning path design, and innovative teaching interactions.¹¹ In conclusion, this research aims to fill the gap of existing technologies in student behavior recognition and tracking in intelligent classrooms by optimizing the YOLOv8 model and combining it with the improved DeepSort algorithm. The improved YOLOv8 model enhances detection accuracy and robustness by introducing the SE attention mechanism and the feature pyramid network structure. The improved DeepSort algorithm addresses the key problems in target tracking by optimizing the matching strategy and the appearance feature update mechanism. These innovations provide stronger technical support for teaching management and personalized learning in the intelligent classroom environment.

Although the above research has achieved good experimental results, issues with identity switching and trajectory loss remain in target tracking. On this basis, this study proposes an improved YOLOv8 model combined with the DeepSort algorithm to identify and track students’ classroom behavior. The paper aims to improve the precision and Real-Time Performance (RTP) of SBR by optimizing the YOLOv8 model and combining it with the DeepSort algorithm, while achieving continuous tracking of student behavior. The innovation lies in introducing the Squeeze-and-Excitation (SE) attention mechanism and the Feature Pyramid Network (FPN) structure, which enable the improved YOLOv8 to better capture the features of student behavior, thereby improving the accuracy and robustness of detection.

Methods and materials

Intelligent classroom SBR based on improved YOLOv8

In today’s education field, digital transformation has become a key strategy for improving education quality. By introducing an artificial-intelligence-based classroom feedback system, data such as teacher-student voice, behavior trajectories, and interaction frequency can be collected and analyzed in real time, making teaching management more rigorous and precise.¹² Additionally, artificial intelligence technology has promoted innovation in educational evaluation and teaching research models, providing teachers with a more scientific basis for management.^13,14

To improve the accuracy of SBR, an intelligent classroom SBR method based on an improved YOLOv8 is proposed. By refining the model architecture and training strategy, the model can more accurately detect and recognize various behaviors of students in the classroom. The core idea of YOLOv8 is to treat target detection as a regression problem. In the context of smart classrooms, students’ behaviors are diverse and complex, including raising hands, reading, leaning against the table, writing, and playing with their phones.^15,16 The relevant diagram is shown in Figure 1.

Figure 1.

Intelligent classroom student behavior diagram. (a) Raise one's hand and (b) Read.

Based on the input image in Figure 1, the YOLOv8 model directly predicts the bounding box and category probability of each target from the behavioral image. The YOLOv8 model mainly comprises a backbone network, a neck network, and a head network. The backbone network extracts features from student behavior images, the neck network fuses and enhances these features, and the head network generates the final detection results.^17,18 Firstly, this study introduces the SE module into the backbone network. The module learns the dependency relationships between channels and reweights them, enhancing the response of important feature channels while suppressing unimportant ones. The SE attention mechanism first performs global average pooling on the input feature map, compressing each channel into a scalar and yielding a one-dimensional feature vector whose length equals the number of channels. Secondly, this vector is passed through two fully connected layers with nonlinear activation functions to generate a weight vector. Finally, the weight vector is multiplied channel by channel with the input feature map to obtain the weighted feature map. Through these steps, the SE module dynamically adjusts the weight of each channel, enabling the model to pay more attention to the feature channels related to students’ behaviors and thereby improving detection accuracy. In the improved backbone, each residual block is followed by an SE module to enhance the feature representation. Specifically, the dimension of the student behavior input Feature Map (Fmap) is defined as $C \times H \times W$, where $C$ denotes the number of channels and $H$ and $W$ are the height and width of the Fmap. A global average pooling operation is first performed on the Fmap to obtain a $C \times 1 \times 1$ feature vector, as expressed in formula (1).

z_{i} = \frac{1}{H \times W} \sum_{j = 1}^{H} \sum_{k = 1}^{W} x_{i j k}

(1)

In formula (1), $x_{i j k}$ is the element of the input Fmap in channel $i$, row $j$, column $k$. Subsequently, the pooled vector $z = (z_{1}, \ldots, z_{C})$ is fed into two Fully Connected Layers (FCLs) to obtain the channel weight vector $u$ with components $u_{i}$, as shown in formula (2).

u = \sigma (W_{2} \cdot \delta (W_{1} \cdot z))

(2)

In formula (2), $W_{1}$ and $W_{2}$ are the weight matrices of the two FCLs, and $\delta$ and $\sigma$ are the ReLU and Sigmoid activation functions, respectively. Finally, each weight $u_{i}$ is multiplied with channel $i$ of the input Fmap to obtain a weighted Fmap, as shown in formula (3).

\hat{x}_{i j k} = x_{i j k} \cdot u_{i}

(3)

In formula (3), $\hat{x}_{i j k}$ is the element of the weighted Fmap. By introducing the SE module into the YOLOv8 backbone network, the model can better focus on feature channels related to student behavior, thereby improving the detection accuracy.^19,20 In the neck network of the YOLOv8 model, this study introduces the FPN structure, as shown in Figure 2.
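The squeeze, excitation, and scaling steps of formulas (1) to (3) can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this study: the function name `se_block`, the omission of bias terms, and the toy weight values are assumptions made only for illustration.

```python
import math


def se_block(fmap, w1, w2):
    """Reweight a C x H x W feature map with a Squeeze-and-Excitation step.

    fmap: list of C channels, each a list of H rows holding W floats.
    w1:   hidden x C weight matrix of the first fully connected layer.
    w2:   C x hidden weight matrix of the second fully connected layer.
    Bias terms are omitted (an assumption made to keep the sketch short).
    """
    # Formula (1): global average pooling gives one scalar z_i per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    # Formula (2): u = sigmoid(W2 . relu(W1 . z)).
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    u = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w2]
    # Formula (3): every element of channel i is scaled by u_i.
    scaled = [[[x * ui for x in row] for row in ch] for ch, ui in zip(fmap, u)]
    return z, u, scaled


# Toy input with C = 2, H = W = 2 and hypothetical weights.
fmap = [
    [[1.0, 3.0], [5.0, 7.0]],  # channel 0, spatial mean 4.0
    [[0.0, 0.0], [0.0, 2.0]],  # channel 1, spatial mean 0.5
]
w1 = [[0.5, -1.0]]    # reduces 2 channels to 1 hidden unit
w2 = [[1.0], [-1.0]]  # expands back to 2 channel weights
z, u, scaled = se_block(fmap, w1, w2)
print(z, u)
```

In this toy case the first channel receives a weight above 0.5 and the second a weight below 0.5, which is the "enhance important channels, suppress unimportant ones" behavior described above.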

Figure 2.

FPN structure diagram.

In Figure 2, the FPN structure fuses features from different levels of the backbone network to generate a feature pyramid with rich semantic and spatial information. It first extracts multi-scale feature maps of the input image through the backbone network. The low-resolution feature map is then upsampled to the size of the high-resolution feature map and added to it element by element to achieve feature fusion. In this way, the model can better capture the characteristics of students’ behaviors at different scales, improving the accuracy and robustness of detection. The FPN structure is integrated into the neck network of YOLOv8. Specifically, the Fmaps output by the backbone network are defined as $Q_{3}$, $Q_{4}$, and $Q_{5}$, with $Q_{3}$ having the highest resolution and $Q_{5}$ the lowest. Firstly, $Q_{5}$ is upsampled to the resolution of $Q_{4}$ and added element by element to $Q_{4}$ to obtain the fused Fmap, as expressed in formula (4).

Q_{4}^{\sim} = Q_{4} + U_{p} (Q_{5})

(4)

In formula (4), $Q_{4}^{\sim}$ is the fused Fmap, and $U_{p}$ is the upsampling operation. Subsequently, $Q_{4}^{\sim}$ is upsampled to the same resolution as $Q_{3}$ , and then added element by element to $Q_{3}$ to obtain the fused Fmap $Q_{3}^{\sim}$ , as shown in formula (5).

Q_{3}^{\sim} = Q_{3} + U_{p} (Q_{4}^{\sim})

(5)
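The top-down fusion of formulas (4) and (5) can be sketched as follows. This is a simplified illustration under several assumptions that are not stated in the text: each map has a single channel, adjacent levels differ by a factor of 2 in resolution, upsampling is nearest-neighbour, and no lateral convolution is applied before the addition. The helper names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a single-channel H x W map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def add_maps(a, b):
    """Element-by-element addition of two maps with identical shape."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("feature maps must have the same shape")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fpn_top_down(q3, q4, q5):
    """Return the fused maps of formulas (5) and (4), in that order."""
    q4_fused = add_maps(q4, upsample2x(q5))        # formula (4)
    q3_fused = add_maps(q3, upsample2x(q4_fused))  # formula (5)
    return q3_fused, q4_fused


# Toy pyramid: Q5 is 1x1, Q4 is 2x2, Q3 is 4x4.
q5 = [[1.0]]
q4 = [[1.0, 2.0], [3.0, 4.0]]
q3 = [[0.0] * 4 for _ in range(4)]
q3_fused, q4_fused = fpn_top_down(q3, q4, q5)
print(q4_fused)
print(q3_fused)
```

Because `q3` is all zeros here, the fused highest-resolution map shows directly how coarse information from the lower-resolution levels is propagated downward.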

By introducing the FPN structure into the neck network, the improved YOLOv8 can better capture the features of student behavior at different scales, thereby improving detection robustness. Finally, in the head network, the paper introduces an anchor box mechanism, as shown in Figure 3.

Figure 3.

Anchor box mechanism.

In Figure 3, anchor boxes are introduced into the head network so that the YOLOv8 model can more accurately predict the location and size of the target’s bounding box.^21,22 The mechanism predefines a set of anchor boxes of different sizes and aspect ratios. Their sizes and aspect ratios are set according to the distribution of the targets in the dataset so as to cover the majority of targets, and their number and distribution are adjusted through experiments to optimize detection performance. The anchor box mechanism is integrated into the head network of YOLOv8 to predict the category probability and bounding box offset at each position. The calculation process of the head network is as follows: for each input Fmap, each position is assigned $K$ anchor boxes, and for each anchor box the head network predicts its category probability and bounding box offset, as expressed in formula (6).

\begin{cases} p_{i} = \operatorname{softmax} (f_{c l s} (x)) \\ t_{x} = \frac{x - a_{x}}{a_{w}} \\ t_{y} = \frac{y - a_{y}}{a_{h}} \\ t_{w} = \log \frac{w}{a_{w}} \\ t_{h} = \log \frac{h}{a_{h}} \end{cases}

(6)

In formula (6), $p_{i}$ is the category probability and $(t_{x}, t_{y}, t_{w}, t_{h})$ is the offset of the bounding box. $f_{c l s} (x)$ is a convolutional layer used in the head network to predict class probabilities. $(x, y, w, h)$ is the predicted bounding box position and size, and $(a_{x}, a_{y}, a_{w}, a_{h})$ is the location and size of the corresponding anchor box. Introducing the anchor box mechanism into the head network enables the model to more accurately predict the position and size of bounding boxes for student behavior. In addition, to prevent overfitting, this study adopts L2 regularization, which adds a penalty term proportional to the sum of squared weights to the loss function, as shown in formula (7).

\mathit{Loss}_{l} = \mathit{Loss} + \lambda \sum_{i} \omega_{i}^{2}

(7)

In formula (7), $\mathit{Loss}$ is the original loss function, $\mathit{Loss}_{l}$ is the regularized loss, $\lambda$ is the regularization coefficient, and $\omega_{i}$ is the $i$-th weight of the model. Finally, the intelligent classroom SBR process built on the improved YOLOv8 model is displayed in Figure 4.
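As a worked illustration of the anchor-relative offsets in formula (6) and the L2 penalty in formula (7), the following sketch encodes a box against an anchor, decodes it back, and adds the penalty to a loss value. The function names and the example numbers are assumptions chosen for illustration; only the regularization coefficient 0.0005 is taken from Table 1.

```python
import math


def encode_box(box, anchor):
    """Offsets (t_x, t_y, t_w, t_h) of a box relative to an anchor, formula (6).

    Both box and anchor are (center_x, center_y, width, height).
    """
    x, y, w, h = box
    ax, ay, aw, ah = anchor
    return ((x - ax) / aw, (y - ay) / ah, math.log(w / aw), math.log(h / ah))


def decode_box(offsets, anchor):
    """Invert encode_box: recover (x, y, w, h) from offsets and the anchor."""
    tx, ty, tw, th = offsets
    ax, ay, aw, ah = anchor
    return (ax + tx * aw, ay + ty * ah, aw * math.exp(tw), ah * math.exp(th))


def l2_regularized_loss(loss, weights, lam):
    """Formula (7): add lam times the sum of squared weights to the loss."""
    return loss + lam * sum(w * w for w in weights)


anchor = (100.0, 100.0, 50.0, 80.0)  # hypothetical anchor box
box = (110.0, 92.0, 100.0, 40.0)     # hypothetical target box
offsets = encode_box(box, anchor)
restored = decode_box(offsets, anchor)
total = l2_regularized_loss(1.0, [1.0, -2.0], 0.0005)
print(offsets, restored, total)
```

The log-ratio form of $t_{w}$ and $t_{h}$ keeps the decoded width and height positive for any predicted offset, which is one common motivation for this parameterization.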

Figure 4.

Flow chart of intelligent classroom SBR.

In Figure 4, the input stage takes student behavior images from an intelligent classroom scenario. The images contain behaviors such as raising hands, reading, leaning against a table, writing, and playing with a phone. In the backbone network stage, features are extracted from the input image to obtain an Fmap, and the SE mechanism is applied to obtain a weighted Fmap. In the neck network stage, the FPN structure fuses features at different scales in the Fmap. In the head network stage, the anchor box mechanism allows the YOLOv8 model to predict the bounding box position and size of student behavior more accurately. Finally, based on the predicted category probability and bounding box offset, the final prediction result is generated, including the category and location information of student behavior.^23,24

Student target behavior tracking based on DeepSort algorithm

After student behavior is detected by the improved YOLOv8, this study introduces the DeepSort algorithm to achieve continuous tracking of student target behavior. The algorithm is based on appearance features and the Kalman Filter (KF). The core of DeepSort lies in combining the KF and the Hungarian Algorithm (HA) to track student targets continuously through prediction and matching.^25,26 Specifically, the workflow of DeepSort can be divided into four steps: object detection, KF prediction, object matching, and trajectory update and management. Firstly, object detection uses the improved YOLOv8 model described above to detect student behavior targets in each frame, obtaining the bounding boxes and feature information of the targets. Secondly, the KF predicts the position and velocity of each tracked target in the next frame. The state vector of a student target is defined as $X = [x, y, v_{x}, v_{y}]$, where $x$ and $y$ are the center coordinates of the target, and $v_{x}$ and $v_{y}$ are its velocities. The KF prediction step is calculated as shown in formula (8).

\begin{cases} X_{k | k - 1} = F X_{k - 1 | k - 1} \\ P_{k | k - 1} = F P_{k - 1 | k - 1} F^{T} + E \end{cases}

(8)

In formula (8), $X_{k | k - 1}$ is the predicted state at time step $k$, $F$ is the state transition matrix, $P_{k | k - 1}$ is the covariance matrix of the predicted state, and $E$ is the covariance matrix of the process noise. In the DeepSort algorithm, the state transition matrix is given by formula (9).

F = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}

(9)
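The prediction step of formulas (8) and (9) can be sketched in pure Python for the four-dimensional state $[x, y, v_{x}, v_{y}]$. The initial covariance, the process noise values, and the helper names below are hypothetical and serve only to make the example runnable; the time step is assumed to be one frame.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def transition_matrix(dt):
    """Constant-velocity state transition matrix F of formula (9)."""
    return [[1.0, 0.0, dt, 0.0],
            [0.0, 1.0, 0.0, dt],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


def kf_predict(x, p, f, e):
    """Formula (8): propagate the state x and covariance p by one step."""
    x_pred = [sum(fij * xj for fij, xj in zip(row, x)) for row in f]
    fpft = matmul(matmul(f, p), transpose(f))
    p_pred = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(fpft, e)]
    return x_pred, p_pred


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
x = [10.0, 20.0, 2.0, -1.0]                       # position and velocity
p = identity                                      # assumed initial covariance
e = [[0.01 * v for v in row] for row in identity] # assumed process noise
x_pred, p_pred = kf_predict(x, p, transition_matrix(1.0), e)
print(x_pred)
```

With a one-frame step the predicted position is simply the previous position plus the velocity, while the position uncertainty grows because the velocity uncertainty is folded into it.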

In formula (9), $Δ t$ is the time interval, usually one frame. Target matching is the core of the DeepSort algorithm and aims to match each predicted trajectory with a detected target in the current frame. The matching process is divided into cascade matching and Intersection over Union (IoU) matching. Cascade matching is based on the appearance features of the target and is performed by calculating the cost matrix between the predicted trajectories and the detected targets, as shown in formula (10).

C_{i, j} = d (f_{i}, t_{j})

(10)

In formula (10), $C_{i, j}$ is the entry of the cost matrix for trajectory $i$ and detection $j$, $f_{i}$ is the feature information of the trajectory, and $t_{j}$ is the detection target. $d$ is a distance measure between features, usually the cosine distance, as shown in formula (11).

d (m, n) = 1 - \frac{m \cdot n}{\lVert m \rVert \cdot \lVert n \rVert}

(11)

In addition, in cascade matching, besides cosine distance, this study introduces Mahalanobis distance to filter out unreasonable matching terms, as shown in formula (12).

M_{a} (m, n) = \sqrt{{(m - n)}^{T} S^{- 1} (m - n)}

(12)
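Formulas (10) to (12) can be combined into a gated cost matrix, sketched below. To stay self-contained, the sketch assumes a diagonal covariance matrix $S$ (so no matrix inversion is needed), applies the Mahalanobis distance to hypothetical 2D positions, and uses an arbitrary gate threshold; none of these choices is taken from the study.

```python
import math


def cosine_distance(m, n):
    """Formula (11): one minus the cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(m, n))
    norm = math.sqrt(sum(a * a for a in m)) * math.sqrt(sum(b * b for b in n))
    return 1.0 - dot / norm


def mahalanobis_diag(m, n, variances):
    """Formula (12) for a diagonal covariance S (assumption of this sketch)."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(m, n, variances)))


def gated_cost_matrix(track_feats, det_feats, track_pos, det_pos,
                      variances, gate):
    """Formula (10) with implausible pairs set to infinity by the gate."""
    cost = []
    for feat, pos in zip(track_feats, track_pos):
        row = []
        for det_feat, det_p in zip(det_feats, det_pos):
            plausible = mahalanobis_diag(pos, det_p, variances) <= gate
            row.append(cosine_distance(feat, det_feat)
                       if plausible else float("inf"))
        cost.append(row)
    return cost


# Two trajectories and two detections with hypothetical features/positions.
track_feats = [[1.0, 0.0], [0.0, 1.0]]
det_feats = [[1.0, 0.0], [1.0, 1.0]]
track_pos = [[0.0, 0.0], [10.0, 10.0]]
det_pos = [[1.0, 0.0], [11.0, 9.0]]
cost = gated_cost_matrix(track_feats, det_feats, track_pos, det_pos,
                         variances=[4.0, 4.0], gate=3.0)
print(cost)
```

The resulting matrix can then be passed to an assignment solver such as the Hungarian Algorithm; pairs marked as infinite are never selected, which is how the gate filters out unreasonable matches.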

In formula (12), $S$ is the covariance matrix. Trajectories left unmatched by cascade matching are further matched using IoU matching. IoU is the ratio of the intersection to the union of two bounding boxes and characterizes their similarity, as shown in formula (13).

\mathrm{IoU} (B_{1}, B_{2}) = \frac{A (B_{1} \cap B_{2})}{A (B_{1}) + A (B_{2}) - A (B_{1} \cap B_{2})}

(13)
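A minimal sketch of the IoU computation is given below. It uses the standard definition, in which the union area equals the sum of the two box areas minus the area of their intersection, and it assumes boxes in corner format $(x_{min}, y_{min}, x_{max}, y_{max})$, which is a convention chosen for this example.

```python
def area(box):
    """Area of a corner-format box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(b1, b2):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_box = (max(b1[0], b2[0]), max(b1[1], b2[1]),
                 min(b1[2], b2[2]), min(b1[3], b2[3]))
    inter = area(inter_box)
    union = area(b1) + area(b2) - inter
    return inter / union if union > 0 else 0.0


overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))   # 1 / 7
disjoint = iou((0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0))  # 0.0
same = iou((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0))      # 1.0
print(overlap, disjoint, same)
```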

In formula (13), $B_{1}$ and $B_{2}$ are two bounding boxes, and $A (\cdot)$ denotes the area of a region. Based on the matching results, DeepSort then updates the status of each trajectory: successfully matched trajectories are updated with the corresponding detection results, and unmatched trajectories are marked as lost. If a trajectory remains lost for more than a set number of consecutive frames, it is deleted. For each unmatched detection result, a new trajectory is initialized.^27,28 Finally, DeepSort periodically updates the appearance features of each target within a certain time range to adapt to changes in the target’s appearance, as shown in formula (14).

F_{o l d} \leftarrow F_{o l d} + \alpha (F_{n e w} - F_{o l d})

(14)

In formula (14), $α$ is the learning rate, and $F_{n e w}$ and $F_{o l d}$ are the newly detected features and the previously stored features, respectively. The student target behavior tracking process based on DeepSort is shown in Figure 5.
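The appearance update of formula (14), which moves the stored feature toward the newly detected one by a fraction $α$, and the trajectory management rules described above can be sketched together as follows. The `Track` class, the `step_tracks` function, and the small example values are hypothetical; Table 1 lists the values actually used in the experiments (a lost-frame threshold of 70 and a learning rate of 0.001).

```python
def update_feature(f_old, f_new, alpha):
    """Move the stored feature toward the new detection by a fraction alpha."""
    return [o + alpha * (n - o) for o, n in zip(f_old, f_new)]


class Track:
    """Minimal trajectory record: identity, appearance, lost-frame counter."""

    def __init__(self, track_id, feature):
        self.track_id = track_id
        self.feature = list(feature)
        self.lost_frames = 0


def step_tracks(tracks, matches, unmatched_dets, alpha, max_lost, next_id):
    """Apply one frame of trajectory update and management.

    matches:        dict mapping track_id to the matched detection feature.
    unmatched_dets: features of detections that matched no trajectory.
    """
    kept = []
    for tr in tracks:
        if tr.track_id in matches:
            tr.feature = update_feature(tr.feature, matches[tr.track_id], alpha)
            tr.lost_frames = 0
        else:
            tr.lost_frames += 1        # trajectory is in the lost state
        if tr.lost_frames <= max_lost:  # delete once lost for too long
            kept.append(tr)
    for feat in unmatched_dets:         # start a new trajectory per detection
        kept.append(Track(next_id, feat))
        next_id += 1
    return kept, next_id


tracks = [Track(1, [0.0, 0.0]), Track(2, [1.0, 1.0])]
tracks[1].lost_frames = 2               # already lost for two frames
kept, next_id = step_tracks(tracks, {1: [1.0, 0.0]}, [[0.5, 0.5]],
                            alpha=0.5, max_lost=2, next_id=3)
print([t.track_id for t in kept], next_id)
```

In this toy frame, trajectory 1 is refreshed, trajectory 2 exceeds the lost-frame limit and is removed, and the unmatched detection starts trajectory 3.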

Figure 5.

Flow chart of student target behavior tracking.

Figure 5 shows the tracking process. Firstly, the student behavior targets in each frame are detected by the improved YOLOv8 model to obtain the bounding boxes and feature information of the targets. Subsequently, taking the state of each tracked target in the previous frame as the basis, the KF predicts the position and velocity of the target in the next frame and outputs the predicted trajectory state and covariance matrix. The trajectories are then matched with the detections through cascade matching and IoU matching, and the successful and unsuccessful matches are output. Finally, based on the matching results, the trajectories are updated or initialized, and the updated trajectory information is output, completing the continuous tracking of student target behavior by the DeepSort algorithm.

Results

Improvement of YOLOv8 model performance and analysis of SBR results

This study first validates the performance of the improved YOLOv8 and analyzes its ability to recognize student behavior. Table 1 shows the experimental configuration and algorithm parameter information.

Table 1.

Experimental environment and algorithm parameter information.

Computer configuration	Specification	YOLOv8 parameter	Value	DeepSort parameter	Value
CPU	Intel Core i7-10700K	Initial learning rate	0.001	IoU threshold	0.5
GPU	NVIDIA GeForce RTX 3080	Weight decay coefficient	0.01	Lost-trajectory frame threshold	70
Memory (RAM)	32 GB	Regularization coefficient	0.0005	Appearance feature update learning rate	0.001
Storage	1 TB	Confidence threshold	0.5	/	/
Python version	Python 3.8	IoU threshold	0.5	/	/
Deep learning framework	PyTorch 1.12.1	Input image size	640 × 640	/	/
CUDA version	CUDA 11.3	/	/	/	/

Under the configuration in Table 1, SCB-dataset3 and CampusGuard are selected as the datasets. The SCB-dataset3 dataset is specifically designed for student behavior detection research and contains student behavior annotations in various classroom scenarios. It contains approximately 5000 images covering seven behavioral categories: raising hands, reading, leaning against the table, writing, playing with mobile phones, lowering the head, and lying on the table. The image resolution is 1920 × 1080, and the annotations include the bounding box and category label of each target, providing a high-quality data basis for model training and evaluation. The CampusGuard dataset focuses on student behaviors in campus environments and contains 3000 images labeled with six behavioral categories: walking, standing, talking, using mobile phones, reading, and writing. The image resolution is 1280 × 720, and the annotations also include the bounding box and category label of each target. The CampusGuard dataset is publicly available, which facilitates the reproduction and comparison of experimental results by other researchers and enhances the verifiability and transparency of the research. To improve the training effect and detection performance of the model, the following preprocessing steps were carried out on the two datasets: (1) all images were resized to a uniform input size of 640 × 640; (2) data augmentation methods such as random cropping, color jitter, and horizontal enhancement were adopted; (3) the original labeled bounding box coordinates were converted to the normalized format required by the YOLOv8 model. Actions such as Hand Raising, Reading, Leaning on Desk, and Writing were tracked in the two datasets.

Training follows the parameter settings and strategies shown in Table 1, with a batch size of 16 images and a total of 300 training rounds. The evaluation indicators are mean Average Precision (mAP), Accuracy, Recall, Frames Per Second (FPS), and robustness score. mAP is a commonly used performance metric in object detection and tracking tasks that measures the comprehensive detection accuracy of the model under different confidence thresholds. In real-time behavior tracking, it helps evaluate whether the model balances high precision and high efficiency. Accuracy is an intuitive indicator that directly reflects the performance of the model in practical applications, since it directly affects teachers’ judgment of the classroom situation. Recall evaluates whether the model can detect as many behaviors as possible and avoid missed detections, which matters in real-time tracking because behaviors must be detected within a short period of time. FPS measures the real-time performance of a model as the number of image frames it can process per second, and in an intelligent classroom environment it also reflects how efficiently the model uses computing resources. Finally, the robustness score is a comprehensive indicator of the stability and reliability of the model in complex scenarios, helping to ensure stable operation in practical applications. The traditional YOLOv8, YOLOv5, and YOLOX models are selected for comparison. The accuracy and recall curves of the four network models are shown in Figure 6.
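The label conversion in preprocessing step (3) can be illustrated with the sketch below. It assumes that the original annotations store pixel coordinates in corner format $(x_{min}, y_{min}, x_{max}, y_{max})$ and that the target is the normalized center format $(c_{x}, c_{y}, w, h)$ commonly used for YOLO labels; the function name and the example box are hypothetical.

```python
def to_yolo_format(box, img_w, img_h):
    """Convert a pixel corner-format box to normalized (cx, cy, w, h).

    box: (x_min, y_min, x_max, y_max) in pixels of an img_w x img_h image.
    All returned values lie in [0, 1] when the box is inside the image.
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0 / img_w
    cy = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return (cx, cy, w, h)


# Hypothetical annotation in a 1920 x 1080 image (the SCB-dataset3 resolution).
label = to_yolo_format((480.0, 270.0, 960.0, 810.0), 1920, 1080)
print(label)
```

Because the coordinates are expressed as fractions of the image size, the labels stay valid when an image is resized to 640 × 640 by plain stretching; if padding were used instead, the labels would need to be adjusted accordingly.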

Figure 6.

Accuracy curve and recall curve comparison. (a) Accuracy curve comparison and (b) Recall curve comparison.

In the comparison of accuracy curves in Figure 6(a), the accuracy of the improved YOLOv8 (solid red line) rose rapidly in the early stage of training and stabilized after approximately 100 iterations, eventually reaching a value close to 0.931. This indicates that the improved YOLOv8 model has a very high classification accuracy in the task of student behavior recognition. The accuracy of the traditional YOLOv8 (solid blue line) also rose rapidly in the early stage of training but stabilized at a lower level, approximately 0.823. The accuracy of YOLOv5 (green dotted line) increased gradually during training and stabilized at approximately 0.722, lower than that of both YOLOv8 variants. The accuracy of YOLOX (purple dotted line) rose slowly during training and stabilized at approximately 0.652, the lowest among the four models, indicating that YOLOX performs relatively poorly in student behavior recognition. In the recall curves of Figure 6(b), the improved YOLOv8 model is above the comparison models, and its recall approaches 0.952 after convergence, indicating that it detects most student behaviors with a low rate of missed detection. The traditional YOLOv8 and YOLOv5 models converge to similar recall values of 0.784 and 0.772, respectively. YOLOX converges after about 20 iterations to the lowest recall, approximately 0.692, indicating relatively poor recall performance. A further comparison of the mAP of the four models is shown in Figure 7.

Figure 7.

Comparison of mAP values of network models.

Figure 7 compares the four models on mAP@50 and mAP@50:95. The two color bands for each model represent the SCB-dataset3 and CampusGuard datasets. YOLOX (red) performs relatively poorly, with mAP@50 of 75.8%/73.5% and mAP@50:95 of 69.3%/59.0% on the two datasets. YOLOv5 (blue) performs well, with mAP@50 of 80.3%/79.0% and mAP@50:95 of 75.9%/62.0%. YOLOv8 (yellow) reaches mAP@50 of 76.4%/77.0% and mAP@50:95 of 71.2%/60.5%, a comparatively weak result. The improved YOLOv8 (green) performs best overall, with mAP@50 of 82.3%/83.0% and mAP@50:95 of 75.0%/66.0%. In summary, the improved YOLOv8 achieves the highest mAP@50 on both datasets and the highest mAP@50:95 on CampusGuard, while its mAP@50:95 on SCB-dataset3 (75.0%) is marginally below that of YOLOv5 (75.9%). This indicates higher detection accuracy and robustness in the task of student behavior recognition. By introducing the SE attention mechanism and the feature pyramid network structure, the improved YOLOv8 significantly enhances the recognition accuracy and robustness of the model for student behaviors in complex scenarios. The processing speeds of the four models are further compared to evaluate their RTP under ICE, as shown in Figure 8.
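To make the two mAP variants concrete, the sketch below computes average precision (AP) for one class from a confidence-ranked list of detections and then averages AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95. It assumes all-point interpolation of the precision-recall curve, as used in COCO-style evaluation; the function names are hypothetical, and the paper does not state which interpolation its tooling applies. A full mAP additionally averages AP over all behavior classes.

```python
from typing import List


def average_precision(hits: List[bool], n_truth: int) -> float:
    """Area under the precision-recall curve (all-point interpolation).

    hits[i] is True if the i-th detection, sorted by descending confidence,
    matched a previously unmatched ground-truth box at the chosen IoU threshold.
    """
    if n_truth == 0:
        return 0.0
    precisions, recalls = [], []
    tp = 0
    for rank, hit in enumerate(hits, start=1):
        tp += int(hit)
        precisions.append(tp / rank)
        recalls.append(tp / n_truth)
    # Enforce a precision envelope that never increases as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# IoU thresholds used by mAP@50:95 (0.50 to 0.95 in steps of 0.05).
IOU_THRESHOLDS = [0.50 + 0.05 * k for k in range(10)]


def mean_over_thresholds(ap_per_threshold: List[float]) -> float:
    """mAP@50:95 for one class: mean of the AP values, one per IoU threshold."""
    return sum(ap_per_threshold) / len(ap_per_threshold)
```

Because higher IoU thresholds reject loosely localized boxes, mAP@50:95 is normally lower than mAP@50 for the same model, which is the pattern seen in Figure 7.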

Figure 8.

The processing speed of four models. (a) Real-time performance comparison of the SCB-dataset3 dataset and (b) Real-time performance comparison of the CampusGuard dataset.

Figure 8(a) and (b) compare the RTP of the four network models in parallel experiments 1 and 2 on the SCB-dataset3 and CampusGuard datasets. The improved YOLOv8 model reaches CPU processing speeds of 248 FPS and 250 FPS and GPU processing speeds of 347 FPS and 350 FPS, which are significantly higher than those of the other models. YOLOv5 has the slowest CPU and GPU processing speeds and therefore the weakest RTP. The proposed model thus outperforms YOLOX, YOLOv5, and the traditional YOLOv8 in RTP on both datasets, combining excellent detection accuracy with a significant speed advantage. The model complexity and robustness scores of the four models are compared in Table 2.
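A throughput figure of this kind is usually obtained by timing repeated inference calls. The following sketch shows one way to do so; `measure_fps`, the `infer` callable, and the warm-up count are hypothetical names and settings for illustration, not the measurement procedure used in the study.

```python
import time
from typing import Callable, Iterable


def measure_fps(infer: Callable[[object], object],
                frames: Iterable[object],
                warmup: int = 5) -> float:
    """Return frames processed per second by `infer` over `frames`."""
    frames = list(frames)
    for frame in frames[:warmup]:
        infer(frame)                      # warm-up calls are not timed
    start = time.perf_counter()
    for frame in frames:
        infer(frame)                      # timed pass over every frame
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")
```

When inference runs on a GPU, execution is typically asynchronous, so the device has to be synchronized before the clock is read; otherwise the measured FPS is overstated.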

Table 2.

Comparison of model complexity and robustness scores.

Experiment	Model name	Params (million)	GFLOPs	Robustness score
Experiment 1	YOLOX	10.8	12.5	7
Experiment 1	YOLOv5	7.3	15.8	7.5
Experiment 1	YOLOv8	31.6	8.9	8.4
Experiment 1	Improved YOLOv8	35	10.5	9.3
Experiment 2	YOLOX	10.1	12.1	7.2
Experiment 2	YOLOv5	7.2	15.1	7.6
Experiment 2	YOLOv8	31.9	8.7	8.6
Experiment 2	Improved YOLOv8	35.6	10.2	9.5

In terms of model complexity, Table 2 shows that the improved YOLOv8 has an average parameter count of 35.3 million and an average computational cost of 10.4 GFLOPs, compared with averages of about 31.8 million parameters and 8.8 GFLOPs for YOLOv8. The parameter count and computational cost therefore increase only modestly relative to YOLOv8, indicating that the model structure remains relatively stable. In terms of robustness, the improved YOLOv8 has the highest scores, 9.3 and 9.5, a slight improvement over YOLOv8 in complex scenarios. Overall, the proposed model performs well in both model complexity and robustness, meeting the high requirements for model accuracy and stability under ICE. Finally, this study compares the recognition effects of the four models on student behaviors such as raising hands and reading, as shown in Figure 9.
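The averages quoted for Table 2 can be reproduced with a few lines of arithmetic. The sketch below uses only the values from the table; the variable and function names are illustrative.

```python
# Rows of Table 2: (params in millions, GFLOPs, robustness score)
improved = {"Experiment 1": (35.0, 10.5, 9.3), "Experiment 2": (35.6, 10.2, 9.5)}
baseline = {"Experiment 1": (31.6, 8.9, 8.4), "Experiment 2": (31.9, 8.7, 8.6)}


def column_means(rows):
    """Mean of each of the three columns across the experiments."""
    n = len(rows)
    return tuple(sum(r[i] for r in rows.values()) / n for i in range(3))


imp_params, imp_gflops, imp_score = column_means(improved)
base_params, base_gflops, base_score = column_means(baseline)

# Relative growth of the improved model over the YOLOv8 baseline.
param_growth = (imp_params - base_params) / base_params
gflops_growth = (imp_gflops - base_gflops) / base_gflops

print(f"params:  {imp_params:.2f} M vs {base_params:.2f} M ({param_growth:.1%})")
print(f"GFLOPs:  {imp_gflops:.2f} vs {base_gflops:.2f} ({gflops_growth:.1%})")
print(f"score:   {imp_score:.2f} vs {base_score:.2f}")
```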

Figure 9.

Student typical behavior identification comparison.

Figure 9 compares the recognition accuracy of the four models. YOLOX performs well in recognizing “playing with mobile phones,” with a recognition accuracy of 95.8%, which may be due to the visual uniqueness of this behavior and the ease with which the model can distinguish it. For the other behaviors, its average recognition accuracy is around 80%. YOLOv5 has a slight advantage in recognizing “raising hands” and “leaning against the table,” with recognition accuracies of 86.3% and 84.1%, which may be related to the sensitivity of the model to target shape and position. The recognition accuracy of YOLOv8 is generally above 85%, demonstrating higher detection accuracy and robustness. It performs particularly well on “raising hands” and “leaning against the table,” with recognition accuracies of 88.9% and 87.3%. The improved YOLOv8 performs well on all behaviors, with recognition accuracy generally around 90%, significantly higher than the other three models. It performs best on “raising hands” and “reading,” with recognition accuracies of 91.4% and 88.7%. This means that the improved model has stronger discriminative ability when dealing with complex scenes and similar behaviors. In summary, the improved YOLOv8 is excellent at identifying the five behaviors of raising hands, reading, leaning against the table, writing, and playing with the phone, and its high recognition accuracy and robustness make it well suited to SBR tasks under ICE.

Performance evaluation and behavior tracking analysis of DeepSort algorithm

After analyzing the performance of the improved YOLOv8 and its SBR effect, this study further validates the performance of the DeepSort algorithm and its effectiveness in tracking student behavior. The SORT, StrongSORT, and BOTSORT algorithms are introduced for comparison. Table 3 compares the tracking accuracy of the four algorithms.

Table 3.

Comparison of tracking accuracy of four algorithms.

Dataset	Algorithm	MOTA (%)	MOTP (%)	ID switches	Trajectory loss rate (%)
SCB-dataset3	SORT	68.7	72.4	120	10.2
SCB-dataset3	BOTSORT	79.5	81.3	78	6.8
SCB-dataset3	DeepSort	84.2	85.6	55	5.1
SCB-dataset3	StrongSORT	82.9	84.7	60	5.5
CampusGuard	SORT	67.9	71.8	125	11.0
CampusGuard	BOTSORT	78.9	80.7	82	7.3
CampusGuard	DeepSort	83.7	85.1	58	5.8
CampusGuard	StrongSORT	81.8	83.9	63	6.2

In Table 3, MOTA measures the overall accuracy of the tracking algorithm, and MOTP measures how closely the tracked trajectory agrees with the real trajectory. ID switches counts the number of times a target's identity is mistakenly switched during tracking. Trajectory loss rate is the proportion of targets that are not successfully tracked in consecutive frames. DeepSort performs best in tracking accuracy and robustness, significantly outperforming the SORT and BOTSORT algorithms. By introducing appearance features and Kalman filtering, it reduces the number of identity switches and the trajectory loss rate, which improves the accuracy and robustness of tracking. StrongSORT also performs well in tracking accuracy and robustness but is slightly below DeepSort. SORT performs poorly in tracking accuracy, especially in complex scenarios, where it produces a large number of identity switches and a high trajectory loss rate. This study further compares the trajectory integrity of the four algorithms, as shown in Figure 10.
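For reference, MOTA is conventionally defined in the CLEAR MOT metrics as one minus the ratio of all tracking errors (missed targets, false positives, and identity switches) to the number of ground-truth objects. The sketch below applies that definition; it also assumes that MOTP is reported as the mean overlap (IoU) of matched boxes, which is one common convention and not confirmed by the paper. All counts in the usage note are hypothetical.

```python
from typing import Sequence


def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_ground_truth: int) -> float:
    """CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / GT, summed over all frames."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


def motp_overlap(matched_ious: Sequence[float]) -> float:
    """Mean IoU of matched track/ground-truth pairs (assumed MOTP convention)."""
    return sum(matched_ious) / len(matched_ious) if matched_ious else 0.0
```

For example, with hypothetical totals of 120 misses, 60 false positives, and 20 identity switches over 1000 ground-truth objects, `mota(120, 60, 20, 1000)` evaluates to 0.8. Because identity switches enter the numerator directly, a tracker with fewer switches scores a higher MOTA when the detection errors are equal.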

Figure 10.

Trajectory integrity of four algorithms. (a) SCB-dataset3 dataset comparison and (b) CampusGuard dataset comparison.

Figure 10(a) shows the comparison of average trajectory length and trajectory coverage under the SCB-dataset3 dataset. DeepSort performs the best in trajectory integrity, with an average trajectory length and coverage of 120 frames and 92.5%. The algorithm has the longest average trajectory length and the highest trajectory coverage, indicating its ability to generate continuous and complete trajectories, effectively reducing trajectory loss and breakage. SORT shows weak performance in trajectory integrity, with a short average trajectory length and low trajectory coverage, indicating that it is prone to losing targets in complex scenes. Figure 10(b) shows the comparison under the CampusGuard dataset. DeepSort still performs the best, with an average trajectory length and trajectory coverage of 118 frames and 91.5%. Figure 11 compares the RTP and appearance feature updating ability of four algorithms.

Figure 11.

Comparison of real-time performance and appearance feature updating ability of the algorithms. (a) SCB-dataset3 dataset comparison and (b) CampusGuard dataset comparison.

Figure 11(a) and (b) compare the RTP and appearance feature update capabilities of the four algorithms on the SCB-dataset3 and CampusGuard datasets, respectively. In terms of RTP, DeepSort has average CPU and GPU processing speeds of 235 FPS and 315 FPS, respectively, significantly higher than the other algorithms. In terms of appearance feature updating, DeepSort's feature update frequency is 10.2 times/second and 9.7 times/second on the two datasets, with feature update effects of 8.7% and 8.3%, respectively. This indicates that DeepSort updates appearance features more frequently and more effectively than the other algorithms and can adapt to changes in target appearance in a timely manner, which improves tracking accuracy and robustness. Figure 12 compares the trajectory management capabilities of the four algorithms.
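One way to picture appearance feature updating is the per-track gallery used in DeepSort-style trackers: each track stores a bounded number of recent appearance embeddings, and a new detection is scored by its smallest cosine distance to that gallery. The sketch below is a simplified illustration under that assumption; the class name, the default budget of 100, and the plain-list embeddings are hypothetical and do not reflect the configuration used in the experiments.

```python
import math
from collections import deque
from typing import Deque, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


class AppearanceGallery:
    """Bounded store of recent appearance embeddings for a single track."""

    def __init__(self, budget: int = 100):  # budget is an assumed setting
        self.features: Deque[List[float]] = deque(maxlen=budget)

    def update(self, feature: Sequence[float]) -> None:
        """Add the newest embedding; the oldest is dropped once the budget is full."""
        self.features.append(list(feature))

    def distance(self, feature: Sequence[float]) -> float:
        """Smallest cosine distance between `feature` and any stored embedding."""
        if not self.features:
            return 1.0
        return min(cosine_distance(f, feature) for f in self.features)
```

Keeping several recent embeddings rather than a single average lets a track still match after a student changes posture, because an older view of the same student remains in the gallery.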

Figure 12.

Comparison of trajectory management capabilities. (a) SCB-dataset3 dataset comparison and (b) CampusGuard dataset comparison.

Figure 12(a) and (b) compare the trajectory management capabilities of the four algorithms on the SCB-dataset3 and CampusGuard datasets. DeepSort has the shortest trajectory initialization times, 0.020 s and 0.022 s, indicating the highest efficiency in initializing new trajectories. SORT has the longest initialization times, 0.035 s and 0.038 s, indicating lower efficiency. DeepSort also has the shortest trajectory deletion times, 0.010 s and 0.011 s, indicating the highest efficiency in deleting lost trajectories. The study further conducted ablation experiments on each component, and the results are shown in Table 4.
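Trajectory initialization and deletion can be sketched as a small state machine. In the commonly described DeepSort design, a new track starts as tentative, becomes confirmed after a fixed number of consecutive matches, and is deleted either on its first miss while tentative or after too many consecutive misses once confirmed. The thresholds below (`n_init = 3`, `max_age = 30`) are assumed illustrative defaults, not values reported in the paper.

```python
from dataclasses import dataclass
from enum import Enum


class TrackState(Enum):
    TENTATIVE = 1
    CONFIRMED = 2
    DELETED = 3


@dataclass
class Track:
    track_id: int
    n_init: int = 3            # assumed: hits needed before confirmation
    max_age: int = 30          # assumed: misses tolerated once confirmed
    hits: int = 1              # the detection that created the track
    time_since_update: int = 0
    state: TrackState = TrackState.TENTATIVE

    def mark_matched(self) -> None:
        """Call when a detection was associated with this track in the frame."""
        self.hits += 1
        self.time_since_update = 0
        if self.state is TrackState.TENTATIVE and self.hits >= self.n_init:
            self.state = TrackState.CONFIRMED

    def mark_missed(self) -> None:
        """Call when no detection was associated with this track in the frame."""
        self.time_since_update += 1
        if self.state is TrackState.TENTATIVE:
            self.state = TrackState.DELETED      # unconfirmed tracks die at once
        elif self.time_since_update > self.max_age:
            self.state = TrackState.DELETED      # confirmed track lost too long
```

Under these rules, spurious detections are discarded quickly, while a confirmed student track survives short occlusions of up to `max_age` frames.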

Table 4.

Ablation experiment results.

Model variant	mAP@50 (%)	mAP@50:95 (%)	FPS	Model size (MB)
Basic YOLOv8	73.5	59.0	45	55
Basic YOLOv8 + SE	77.0	62.0	44	56
Basic YOLOv8 + SE + FPN	79.5	64.5	43	57
Basic YOLOv8 + SE + FPN + Anchor box mechanism	80.5	65.5	42	58
Fully improved YOLOv8	83.0	66.0	50	58

In Table 4, the basic YOLOv8 achieves 73.5% and 59.0% on mAP@50 and mAP@50:95, with an FPS of 45 and a model size of 55 MB. After the SE attention mechanism is added to the basic YOLOv8 model, mAP@50 and mAP@50:95 increase to 77.0% and 62.0%, respectively, and the FPS decreases slightly to 44, indicating that the SE attention mechanism significantly improves the model's ability to focus on key features and thereby enhances detection accuracy. When the FPN structure is introduced on top of the SE attention mechanism, mAP@50 and mAP@50:95 increase to 79.5% and 64.5%, respectively, and the FPS drops to 43. The FPN structure enhances the model's ability to detect targets of different sizes through multi-scale feature fusion. When the anchor box mechanism is added on top of the FPN structure, mAP@50 and mAP@50:95 further increase to 80.5% and 65.5%, respectively, and the FPS drops to 42. The anchor box mechanism further improves detection accuracy through more accurate bounding box prediction. Finally, the complete model integrating all the improvements (SE attention mechanism, FPN structure, and anchor box mechanism) achieves 83.0% and 66.0% on mAP@50 and mAP@50:95, with an FPS of 50 and a model size of 58 MB, indicating that the combination of these improvements significantly enhances the overall performance of the model. The study further introduces the YOLOv10-X, YOLOv9, and PP-YOLOE models, which represent recent progress in the field of object detection. The performance indicators of the four models are compared in Table 5.
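The SE step in the ablation can be illustrated with a dependency-free sketch of a squeeze-and-excitation block: global average pooling squeezes each channel to one value, a small bottleneck network produces a gate in (0, 1) per channel, and each channel is rescaled by its gate. The weights, the omission of bias terms, and the nested-list tensor layout are simplifying assumptions for this example and are not taken from the model in the paper.

```python
import math
from typing import List

Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dense(weights: Matrix, vec: List[float]) -> List[float]:
    """Fully connected layer without bias: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def squeeze_excite(feature_maps: List[Matrix],
                   w_reduce: Matrix, w_expand: Matrix) -> List[Matrix]:
    """Apply channel attention to a list of 2D feature maps (one per channel)."""
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_maps]
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid yields channel gates.
    hidden = [max(0.0, h) for h in dense(w_reduce, squeezed)]
    gates = [sigmoid(g) for g in dense(w_expand, hidden)]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[[v * g for v in row] for row in ch]
            for ch, g in zip(feature_maps, gates)]
```

With C channels and a reduction ratio r, `w_reduce` has shape (C/r) x C and `w_expand` has shape C x (C/r), so the block adds only a small number of parameters, which is consistent with the 1 MB increase in model size reported for the SE variant in Table 4.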

Table 5.

Comparison of performance indicators of four models.

Model	mAP@50 (%)	mAP@50:95 (%)	FPS	Model size (MB)
Improved YOLOv8	83	66	50	58
YOLOv9	81.5	64.5	48	57
YOLOv10-X	84.7	68.2	52	62
PP-YOLOE	82.1	65.7	47	59

In Table 5, the improved YOLOv8 achieves 83.0% and 66.0% on mAP@50 and mAP@50:95, respectively, while maintaining a high FPS (50) and a model size of 58 MB, indicating a good balance between accuracy and efficiency. YOLOv10-X achieves the highest values in Table 5 on both mAP@50 (84.7%) and mAP@50:95 (68.2%), with an FPS of 52, but it also has the largest model size (62 MB), making it better suited to scenarios with extremely high precision requirements. YOLOv9 achieves 81.5% and 64.5% on mAP@50 and mAP@50:95, respectively, with an FPS of 48 and a model size of 57 MB, demonstrating good overall performance. PP-YOLOE achieves 82.1% on mAP@50 and 65.7% on mAP@50:95, with an FPS of 47 and a model size of 59 MB, showing a good balance between processing speed and accuracy. In summary, the improved YOLOv8 has strong performance in the intelligent classroom environment, especially when dealing with dynamic and complex scenarios. The comparison with other advanced methods shows more clearly the balance that the improved YOLOv8 strikes among accuracy, speed, and model size.

To comprehensively evaluate the generalization ability of the algorithm, experiments were finally designed to test the student behavior recognition effect in different scenarios (daytime, nighttime, strong light, and weak light). In the parameter settings of the comparison algorithms, the learning rate was 0.001, the batch size was 16, the number of training epochs was 300, and the regularization coefficient was 0.0005 for all models. All experiments were conducted in the experimental environment described above. The comparison of behavior recognition effects in different scenarios is shown in Table 6.

Table 6.

Comparison of behavior recognition effects in different scenarios.

Scene type	YOLOv5 (mAP@50, %)	YOLOX (mAP@50, %)	Improved YOLOv8 (mAP@50, %)
Day	78.5	80.2	83
Night	77	78.8	81.5
Strong light	77.8	79.5	82
Weak light	75.5	77	80

In Table 6, the improved YOLOv8 model achieves the highest mAP@50 in all test scenarios, demonstrating strong generalization ability and robustness under different lighting conditions. YOLOX performs slightly better than YOLOv5 in all scenarios but still lags behind the improved YOLOv8. In well-lit daytime scenarios, all models have relatively high mAP@50 values, with the improved YOLOv8 reaching the best value of 83.0%. In the night scene, the mAP@50 of all models declines, but the improved YOLOv8 still maintains 81.5%. Under strong light, the improved YOLOv8 also performs best, with an mAP@50 of 82.0%, which indicates good robustness to overexposure caused by strong light. Under weak light, the improved YOLOv8 has an mAP@50 of 80.0%, its lowest value but still higher than those of YOLOv5 and YOLOX, indicating high recognition accuracy even in low-light conditions.
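The size of the lighting effect can be read directly from Table 6 by computing, for each model, the mean mAP@50 over the four scenes and the gap between its best and worst scene. The sketch below uses only the table values; the helper names are illustrative.

```python
# mAP@50 (%) per scene, copied from Table 6.
table6 = {
    "YOLOv5": {"Day": 78.5, "Night": 77.0, "Strong light": 77.8, "Weak light": 75.5},
    "YOLOX": {"Day": 80.2, "Night": 78.8, "Strong light": 79.5, "Weak light": 77.0},
    "Improved YOLOv8": {"Day": 83.0, "Night": 81.5, "Strong light": 82.0, "Weak light": 80.0},
}


def scene_mean(scores: dict) -> float:
    """Average mAP@50 across the four lighting scenes."""
    return sum(scores.values()) / len(scores)


def scene_spread(scores: dict) -> float:
    """Gap in percentage points between the best and worst scene."""
    return round(max(scores.values()) - min(scores.values()), 1)


for model, scores in table6.items():
    print(f"{model}: mean {scene_mean(scores):.2f}, spread {scene_spread(scores):.1f}")
```

The spread is about 3 percentage points for all three models, so the advantage of the improved YOLOv8 in Table 6 appears as a roughly constant offset of 3 to 5 points across scenes rather than a smaller sensitivity to lighting.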

Discussion

To improve the accuracy and RTP of SBR under ICE, this study improved the YOLOv8 model by introducing the SE mechanism, FPN structure, and anchor box mechanism, and used the DeepSort algorithm to achieve continuous tracking of student behavior. In the numerical results, the improved YOLOv8 model achieved mAP values of 82.3% and 83.0% on the SCB-dataset3 and CampusGuard datasets, significantly better than the YOLOX, YOLOv5, and YOLOv8 models. This result was consistent with the study by Mu et al., who proposed an improved lightweight YOLOv8 model with an average accuracy of 90.2%, a parameter count reduced to 1.4 × 10⁶, and floating-point computational complexity reduced to 4.8 G, significantly better than 12 existing object detection models.²⁹ However, Mu et al.’s research mainly focused on the lightweighting and computational efficiency of the model, while this study focused more on improving the recognition accuracy and robustness of the model for student behavior in complex scenarios by introducing SE and FPN structures. The introduced SE mechanism enabled the model to focus better on feature channels related to student behavior, thereby improving detection accuracy in complex scenarios, whereas the study by Mu et al. did not involve a similar optimization of attention mechanisms. The DeepSort algorithm performed well in tracking accuracy, with MOTA values of 84.2% and 83.7%, and also performed well in robustness and RTP, which was consistent with the conclusion of Shili et al., who proposed a customer behavior tracking and heatmap analysis model based on YOLOv5 and DeepSort that accurately tracked customer behavior. Although Shili et al. also used the DeepSort algorithm, they did not conduct a detailed experimental analysis of DeepSort’s capabilities in trajectory integrity, appearance feature updating, and trajectory management. Moreover, this study demonstrated excellent RTP, with average DeepSort CPU and GPU processing speeds of 235 FPS and 315 FPS, significantly higher than the other algorithms, whereas the study by Shili et al. did not report the specific real-time performance of DeepSort.³⁰

In summary, the improved YOLOv8 model and the DeepSort algorithm perform well in intelligent classroom SBR and tracking tasks. The improved YOLOv8 significantly improves detection accuracy and robustness by optimizing the model architecture and training strategies, and it adapts better to complex scenarios. DeepSort effectively addresses identity switching and trajectory loss in target tracking by combining Kalman filtering and appearance feature matching, improving the accuracy and stability of tracking. The application of these technologies gives strong technical support for teaching management and personalized learning under ICE. However, in an intelligent classroom environment, multiple students may exhibit similar behaviors simultaneously (for example, several students raising their hands at the same time). This situation challenges the model’s discrimination ability because similar behaviors make it difficult to identify the specific behavior of each student accurately. To improve discrimination between similar behaviors, visual information can be combined with data from other modalities (such as audio or sensor data) to provide richer context. In addition, the time series information of students’ behaviors can be used to analyze how behaviors change over time, further improving the model’s ability to recognize similar behaviors.

Furthermore, the number of students and the size of the classroom may vary greatly between classroom environments, which challenges the model’s detection and tracking capabilities. The model can adapt to classrooms of different scales by dynamically adjusting detection parameters (such as the size and quantity of anchor boxes). For example, in large classrooms, the number of anchor boxes can be increased and their sizes adjusted to better detect and track more students. In large classrooms, multiple cameras can also be deployed to monitor students’ behaviors from different angles. Through the collaborative work of multiple cameras, students’ behaviors can be captured more comprehensively, reducing blind spots and missed detections. Students’ behaviors are also diverse because of individual differences and differences in classroom activities and teaching content, and the model needs to adapt to and identify these different behaviors. Combining visual information with data from other modalities (such as audio or sensor data) can provide richer contextual information to help the model better understand and identify different behaviors.

Conclusion

This study used an improved YOLOv8 model and DeepSort to identify and track students’ behavior. The proposed algorithm was found to have excellent performance in SBR and tracking scenarios. In different classroom settings, the number of students can vary greatly, ranging from small seminars to large lectures, and more students mean more targets that need to be identified and tracked. The improved YOLOv8 model effectively handles multi-scale object detection through the Feature Pyramid Network (FPN) structure. FPN extracts information from feature maps at different levels, thereby better identifying and tracking targets of different sizes and distances. Meanwhile, the model can dynamically adjust its detection and tracking strategies. When more targets are detected, the model can automatically increase the detection frequency or adjust the size of the anchor boxes to ensure that the behaviors of all students are accurately captured. The DeepSort algorithm combines Kalman filtering and appearance feature matching, effectively addressing identity switching and trajectory loss in target tracking and improving the accuracy and stability of tracking.

Although significant achievements have been made in SBR and tracking in this study, some limitations remain. Firstly, the robustness of the improved YOLOv8 to extreme lighting changes and background interference still needs improvement. Secondly, the RTP of the DeepSort algorithm when handling large numbers of targets still needs to be optimized. Future research will focus on further improving the model’s robustness and RTP, exploring more efficient object detection and tracking algorithms, and applying these technologies to a wider range of educational scenarios to achieve more intelligent teaching management and personalized learning. To improve the robustness of the model on more complex or noisy data, data augmentation can introduce more noisy samples (such as blurring, illumination changes, and occlusion), enabling the model to adapt to various noisy environments during training. Robustness-oriented techniques, such as adversarial training or noise-robust loss functions, can also be introduced during training to enhance the model’s resistance to noise. In addition, more advanced deep learning techniques, such as self-supervised learning and graph neural networks, will be explored to further enhance the generalization ability and recognition accuracy of the model. After model optimization, the research results will be applied to more diverse educational scenarios, such as online education, distance teaching, and personalized tutoring, to achieve a wider application of educational technology.

Footnotes

ORCID iD

Suying Li

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Zhang, Yang, Liu, et al. A time-series neural network for pig feeding behavior recognition and dangerous detection from videos. Comput Electron Agric 2024; 218(4): 108710–108731.

2. Bhosle, Musande. Evaluation of deep learning CNN model for recognition of Devanagari digit. Artif Intell Appl 2023; 1(2): 98–102.

3. Karwan MHR, Mohammed, Ayub, et al. A comparative study using 2D CNN and transfer learning to detect and classify Arabic-Script-Based sign language. Acta Informatica Malaysia 2023; 7(1): 08–14.

4. Varughese John. An empirical investigation into the extent university students utilise mobile educational applications for learning. Acta inform Malays 2022; 6(1): 25–33.

5. Luo, Chen, Guo, et al. Teaching reform of machine learning course based on the integration mode of PBL and LBL. Adv Ind Eng Manag 2023; 12(2): 74–78.

6. Shao, Zhang. The application of computer simulation technology in college physical education teaching. Adv Ind Eng Manag 2024; 13(3): 275–277.

7. Zhang. On the reform of public elective course based on online course in the era of artificial intelligence education. Adv Ind Eng Manag 2023; 12(1): 34–36.

8. Alruwais, Zakariah. Student recognition and activity monitoring in E-classes using deep learning in higher education. IEEE Access 2024; 12(4): 66110–66128.

9. Pabba, Bhardwaj, Kumar. A visual intelligent system for students’ behavior classification using body pose and facial features in a smart classroom. Multimed Tool Appl 2024; 83(12): 36975–37005.

10. Sharma, Gupta, Kumar, et al. Star-3d: a holistic approach for human activity recognition in the classroom environment. Information 2024; 15(4): 179–199.

11. Wang. Research on student movement behavior recognition based on 3D-CNN algorithm. Proc 2024 IEEE 2nd Int Conf Image Process Comput Appl (ICIPCA) 2024; 4(1): 1791–1795.

12. Yang, Yeter, et al. Artificial intelligence education for young children: a case study of technology-enhanced embodied learning. J Comput Assist Learn 2024; 40(2): 465–477.

13. Zhan, Tong, Lan, et al. A systematic literature review of game-based learning in Artificial Intelligence education. Interact Learn Environ 2024; 32(3): 1137–1158.

14. Lim. Metaphor analysis on pre-service early childhood teachers’ conception of AI (Artificial Intelligence) education for young children. Think Skills Creativ 2024; 51(3): 101455–101478.

15. Zhao, Zhang, et al. A quality grade classification method for fresh tea leaves based on an improved YOLOv8x-SPPCSPC-CBAM model. Sci Rep 2024; 14(1): 4166–4188.

16. Xiao, Zhao. Lightweight sea cucumber recognition network using improved YOLOv5. IEEE Access 2023; 11(4): 44787–44797.

17. Liu, Hong, et al. Identification of varieties of wheat seeds based on multispectral imaging combined with improved YOLOv5. Food Phys 2025; 2(3): 100042–100066.

18. Pandey, Bhandari. Morphological transfer learning based brain tumor detection using YOLOv5. Multimed Tool Appl 2024; 83(16): 49343–49366.

19. Liang, Ren, Song. Wheat FHB resistance assessment using hyperspectral feature band image fusion and deep learning. Int J Agric Biol Eng 2024; 17(2): 240–249.

20. Ali, Dhanaraj, Kadry. AI-enabled IoT-based pest prevention and controlling system using sound analytics in large agricultural field. Comput Electron Agric 2024; 220(2): 108844–108869.

21. Shin. Development of YOLOv8 and segment anything model algorithm-based Hanok object detection model for sustainable maintenance of Hanok architecture. Sustainability 2024; 16(9): 3775–3794.

22. Mishra, Yadav. Vehicle detection in high density traffic surveillance data using YOLOv5. Recent Adv Electr Electron Eng 2024; 17(2): 216–227.

23. Host, Pobar, Ivasic-Kos. Analysis of movement and activities of handball players using deep neural networks. J Imaging 2023; 9(4): 80–99.

24. Saudagar AKJ, et al. Student behavior recognition for interaction detection in the classroom environment. Image Vis Comput 2023; 136(4): 104726–104746.

25. Huang, Xiao, Liu, et al. An improved pig counting algorithm based on YOLOv5 and DeepSORT model. Sensors 2023; 23(14): 6309–6332.

26. Liu, Jiang, et al. Dynamic identification and automatic counting of the number of passing fish species based on the improved DeepSORT algorithm. Front Environ Sci 2023; 11(4): 1059217–1059236.

27. Yang, Miao, Liu, et al. Improved foreign object tracking algorithm in coal for belt conveyor gangue selection robot with YOLOv7 and DeepSORT. Measurement 2024; 228(4): 114180–114193.

28. Ngeni, Mwakalonge, Siuhi. Solving traffic data occlusion problems in computer vision algorithms using DeepSORT and quantum computing. J Traffic Transport Eng 2024; 11(1): 1–15.

29. Wang, et al. Research on the behavior recognition of beef cattle based on the improved lightweight CBR-YOLO model based on YOLOv8 in multi-scene weather. Animals 2024; 14(19): 2800–2823.

30. Shili, Jayasingh, Hammedi. Advanced customer behavior tracking and heatmap analysis with YOLOv5 and DeepSORT in retail environment. Electronics 2024; 13(23): 4730–4753.