Abstract
To improve the application of behavior detection technology in college education, this study proposes a new model built on a deep convolutional neural network (CNN) for detecting and analyzing student behavior in college labor education courses. The study first analyzes target detection algorithms, then optimizes the selected You Only Look Once version 5 (YOLOv5) algorithm and its network structure, and finally embeds an attention module into the structure to obtain a new model, YOLOv5-O. In experiments, YOLOv5-O reached a mean average precision of 90.1% on the test set, while the application test in an actual teaching environment showed an average accuracy of 86.7%. These results exceed those of the compared algorithms, which supports the validity of the study and provides data support for the automatic detection of student behavior. In addition, in the teaching experiment, YOLOv5-O assisted teaching achieved the most significant teaching effect and the largest improvement in student achievement, which verifies the feasibility of the method.
Keywords
Introduction
As one of the current research hot spots in computer vision, target detection is widely used in diverse fields [1]. In education, intelligent technology has become a key force driving the innovation of modern teaching methods. Through intelligent monitoring and behavior analysis, educators can evaluate and improve instructional strategies more effectively and provide students with a more personalized learning experience. Target detection technology has shown great potential in monitoring student engagement, behavior analysis and classroom management. The rapid development of technology, especially breakthroughs in artificial intelligence and computer vision, has brought new possibilities to labor education courses. To improve learning quality, some schools have gradually adopted student behavior detection technology to supervise students’ learning. The main problem with existing student behavior detection technologies is that, in a complex educational environment, they often fail to achieve ideal detection accuracy because of complex environmental factors and the difficulty of collecting student behavior data [2, 3]. This problem limits the application of the technology in real educational settings and hinders both the improvement of education quality and the effective adjustment of teaching strategies. In view of this, a model based on an optimized YOLOv5 network structure is proposed to improve the accuracy of student behavior detection. The network structure of YOLOv5 is improved by adding a new feature fusion layer and by optimizing the feature fusion strategy according to the bidirectional cross-scale principle.
These improvements aim to take full advantage of the powerful capabilities of deep learning in complex environments to more accurately identify and analyze student behavior in college labor education courses. This research has developed a behavior detection model with strong adaptability and high accuracy, which is of great significance for promoting the modernization of teaching activities and the scientific and technological management of college curriculum. In addition, the successful application of YOLOv5-O model also provides a feasible reference for the future study of student behavior analysis in similar environments.
Related works
Driven by advances in deep learning theory and in computing hardware, CNNs have developed extremely fast and are widely used in computing, language processing, medicine and other fields, and many researchers have studied their characteristics [4]. Wan et al. constructed a convolutional neural network model with a more efficient sparse kernel in order to save parameters. The study proposes a scheme for space reduction in terms of components, performance parameters, and efficiency, and a new model transformation based on this scheme. It concluded that the accuracy and efficiency of the model exceed those of state-of-the-art networks and that the approach can be used to improve existing state-of-the-art models [5]. Wang et al. proposed a CNN-based target detection model for skill transformation in manufacturing scenarios. The study evaluated model performance by analyzing tool assembly using two cameras. The model is 94% accurate for target objects and combines two learning modes to detect targets and recognize actions simultaneously, which to some extent speeds up skill transformation in manufacturing systems [6]. Irmak designed a novel automatic CNN model for disease severity classification. It takes lung images as input, classifies the degree of disease into four levels, and reaches 95% accuracy. The experimental results confirmed the practical feasibility of combining a CNN with a sufficient amount of image data for accurate assessment of disease degree at different stages [7]. Shen et al. proposed a method for extracting building roofs from echo images to speed up geographic information extraction and data acquisition. The study used an aggregation module to improve the YOLOv5 algorithm. It concluded that the improved YOLOv5s reaches an average accuracy of 95.3% and runs at 11.75 frames per second on a CPU, acquiring building roofs from images more efficiently [8]. Xu et al. designed a new dedicated deep learning network for efficient push-scan imaging. The study integrates bilateral filters into the convolutional neural network to generate the learning architecture and then evolves YOLOv5 into YOLOvf51 by refining the granularity. The results show that the proposed algorithm can detect in real time and reaches 93.47% detection accuracy [9].
To address the problems of current mechanisms for distinguishing live irises from attack patterns, Choudhary et al. introduced a new framework that uses the YOLO algorithm to locate the iris region and then uses this region for image selection and core texture detail processing. The YOLO method can accurately delimit the iris region without loss and can effectively reduce the error of existing techniques, which to some extent removes the threat of forged irises confusing recognition systems [10]. Zheng et al. proposed a model that combines a CNN with a scale transfer module to improve the diagnosis of small cell lung adenocarcinoma. The study extracted data from a computed tomography database and used the proposed model for training and testing. Compared with current methods, the model increases lung nodule classification accuracy to some extent, providing an effective means of predicting risk for early diagnosis [11]. Tang et al. proposed a novel pose detection framework to address the inability of current computer vision methods to detect targets under occlusion. The study combines the region of interest with the characteristics of the convolutional feature map, which makes the combined features more effective. The proposed framework brings features of the same class closer together in feature space, exhibits stronger classification ability, and is more feasible than existing detection methods [12].
As an advanced target detection network, YOLO has proven its performance in a variety of scenarios, especially in processing real-time video streams and detecting targets in highly dynamic environments [13]. Its fast and accurate detection makes it well suited to behavioral analysis in surveillance videos, which is crucial for the analysis of classroom environments [14]. In the field of education, studies have shown that deep learning methods, especially CNNs, have significant advantages in behavior recognition and analysis. Li et al. used a CNN model to analyze students’ behavior on online learning platforms and achieved good results [15], which demonstrates the feasibility and effectiveness of similar technologies in education. In addition, YOLO is adaptable and flexible and can handle complex and changing classroom environments. In the study by Liu et al., YOLO was successfully applied to monitor students’ participation and behavior patterns in the classroom, demonstrating its practicability in educational scenarios [16]. To sum up, YOLO is chosen as the research method not only because of its maturity in object detection but also because of its previous successful applications in education. These factors together support the applicability and validity of the YOLO network in this study. To improve the current teaching level of colleges and universities, this study adopts a new model based on the YOLO algorithm and applies it to the analysis of students’ classroom behavior, so as to help teachers manage the classroom more effectively and improve the quality of students’ classroom learning.
Methodological design of student classroom behavior detection and analysis based on the YOLOv5 network
YOLOv5 algorithm optimization
The You Only Look Once (YOLO) algorithm model is currently the most representative single-stage target detection algorithm, with good detection accuracy and an excellent detection rate [17]. This study uses the YOLO algorithm model as the network model for target detection and classification to improve the recognition and analysis of student behavior in college labor education courses. YOLO is an end-to-end object detection model based on a CNN. It has 24 convolutional layers and 2 fully connected layers. The specific structure is shown in Fig. 1.
Structure diagram of YOLO algorithm model.
The YOLO algorithm overcomes the drawbacks of slow target detection and large number of parameters. But in a classroom setting, student behavior detection often involves multiple students and their actions, which can make it difficult for algorithms to accurately identify each student’s specific behavior. To this end, the study improves on YOLOv5 by adding a new feature fusion layer to the original structure, which enables the feature map of the prediction layer to contain more detailed information. The improved network structure is named YOLOv5-O, and its feature fusion schematic is shown in Fig. 2.
Diagram of YOLOv5-O feature fusion.
YOLOv5-O divides the image into a grid. In computer vision more generally, convolution can be thought of as cutting the image into squares of the same size and mapping those squares into a new image by means of filters. Convolution is generally divided into two categories, single-channel and multi-channel, and its general expression is given in Eq. (1).
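The sliding-window view of convolution described above can be illustrated with a minimal sketch. This is not the paper's Eq. (1), which is not reproduced here; it assumes stride 1, no padding, and the unflipped-kernel form (cross-correlation) commonly used in deep learning. The function names and the toy image and kernel are hypothetical.

```python
def conv2d_single(image, kernel):
    """Single-channel case: slide the kernel over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # Weighted sum of the image patch under the kernel
            for u in range(kh):
                for v in range(kw):
                    acc += image[i + u][j + v] * kernel[u][v]
            row.append(acc)
        out.append(row)
    return out


def conv2d_multi(channels, kernels):
    """Multi-channel case: filter each channel with its own kernel, then sum the results."""
    per_channel = [conv2d_single(c, k) for c, k in zip(channels, kernels)]
    h, w = len(per_channel[0]), len(per_channel[0][0])
    return [[sum(m[i][j] for m in per_channel) for j in range(w)] for i in range(h)]


# Toy 3x3 image and 2x2 kernel (illustrative values only)
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature_map = conv2d_single(image, kernel)
two_channel_map = conv2d_multi([image, image], [kernel, kernel])
```

Each output value depends only on a small square of the input, which is why the same filter can be reused across the whole image.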
In Eq. (1),
From Eq. (2),
In Eq. (3),
In Eq. (4),
From Eq. (5), it can be seen that the derivative of this function is obtained directly, which improves the convergence speed. However, because back propagation cannot proceed when the derivative is 0, the Swish function was designed; its formula is shown in Eq. (6).
In Eq. (6),
From Eq. (7), it can be seen that since this function does not have an upper limit and is non-monotonic, it can update a large number of neurons and possesses better stability than the other functions mentioned above.
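The properties attributed to Swish above (no upper bound, non-monotonic, non-zero gradient for negative inputs) can be checked numerically. The sketch below uses the standard definition swish(x) = x · sigmoid(βx) with β = 1 as an assumption, and compares it with ReLU on the assumption that the zero-derivative function discussed earlier is ReLU-like; the paper's own equations are not reproduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu_grad(x):
    # ReLU's gradient is exactly zero for every negative input
    return 1.0 if x > 0 else 0.0


def swish(x, beta=1.0):
    # Standard Swish definition; beta = 1 is an assumed default
    return x * sigmoid(beta * x)


def swish_grad(x, beta=1.0):
    # d/dx [x * sigmoid(beta * x)] = s + beta * x * s * (1 - s), with s = sigmoid(beta * x)
    s = sigmoid(beta * x)
    return s + beta * x * s * (1.0 - s)


for x in (-5.0, -1.278, 0.0, 2.0, 50.0):
    print(f"x={x:>7}: swish={swish(x):.4f}, swish'={swish_grad(x):.4f}, relu'={relu_grad(x)}")
```

The dip below zero around x ≈ -1.28 is what makes the function non-monotonic, and the non-zero gradient for negative inputs is what lets those neurons keep updating.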
Because students’ classroom behaviors are complex and diverse, and in order to cover the detection targets comprehensively while considering the overall characteristics of classroom behavior, this study divides students’ classroom behaviors into four types on the basis of qualitative analysis and preliminary observation: learning, sleeping, drinking and using electronic devices. These four types are widely considered to be the most common behavior patterns of students in the classroom. By analyzing instructional videos and discussing them with education experts, we confirmed that these behaviors are representative of students’ primary activities in the classroom. They not only reflect student participation but also directly affect the teaching effect and students’ learning outcomes. Moreover, these behaviors can be easily observed and identified in video surveillance data, which is particularly important for automated detection using computer vision. The criteria for determining these four types of behaviors are shown in Table 1.
Student behavior judgment and accuracy
As shown in Table 1, student behavior can be roughly divided into four types. In practice, however, student behavior detection is affected by factors such as the dataset used and the accuracy of the algorithm. Many algorithms for student behavior detection exist, but most of them rely on relatively homogeneous data, and their detection accuracy and speed do not yet meet the requirements [19]. To address these problems, the study builds a detection dataset from classroom videos and uses image semantic target detection methods for student behavior detection. Because the YOLOv5-O algorithm can be used for behavior detection within a given space, it is used as the base algorithm for detecting the four behaviors mentioned above. The general flow of detection is shown in Fig. 3.
Student behavior detection process.
As shown in Fig. 3, YOLOv5-O is first trained on preprocessed videos with classified annotations; image features are then extracted through the local network and fused at different scales; finally, classification and regression output the prediction boxes used to detect student behavior. Behavior detection suffers from irregular student seating, weak light and obscured views, all of which can degrade detection performance [20]. Therefore, to preserve the effectiveness of YOLOv5-O as far as possible, the study improves the algorithm’s detection accuracy with the characteristics of classroom behavior detection in mind. Based on the bidirectional cross-scale principle of the Bidirectional Feature Pyramid Network (BiFPN), the YOLOv5-O structure is optimized by removing nodes that perform no fusion and adding skip edges between the input and output nodes so that more features are fused. Because the study uses only the cross-scale principle, this both improves detection accuracy and reduces the network size, which improves detection speed. Treating a bidirectional path as one module, the study analyzed the effect achieved by the improved feature network layer, as shown in Fig. 4.
PANet structure diagram and improved feature network module.
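The effect of adding a skip edge from an input node to the output node of the same level can be sketched as follows. The weighted average used here is the fast normalized fusion from the original BiFPN design; whether YOLOv5-O uses learnable fusion weights is not stated in the text, so the weights, the function name `fuse`, and the toy feature vectors are all assumptions, and resizing between scales is omitted.

```python
def fuse(features, weights, eps=1e-4):
    """BiFPN-style fast normalized fusion of same-shaped feature vectors."""
    w = [max(0.0, x) for x in weights]  # keep fusion weights non-negative
    total = sum(w) + eps                # eps avoids division by zero
    length = len(features[0])
    return [sum(wi * f[k] for wi, f in zip(w, features)) / total for k in range(length)]


# Hypothetical features for one pyramid level, already resized to a common shape
p_in = [1.0, 2.0, 3.0]         # input node of this level
p_td = [0.5, 0.5, 0.5]         # intermediate (top-down) node
p_lower_out = [2.0, 0.0, 1.0]  # output node of the adjacent level

# Without the skip edge, the output node sees only two sources
without_skip = fuse([p_td, p_lower_out], [1.0, 1.0])
# With the skip edge, the original input feature is fused in as well
with_skip = fuse([p_in, p_td, p_lower_out], [1.0, 1.0, 1.0])
```

The skip edge adds one extra term to the fusion at each level, so detail from the input feature map reaches the output without passing through the intermediate node.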
As shown in Fig. 4, the improved model of the study is illustrated with one feature network layer and with three feature network layers, respectively. In Fig. 4, circles of different colors represent the different feature levels in the network and how they interact through the skip edges. The white circles represent the initial feature layer of the input data; the purple circles represent the abstract layer of low-level features; the green circles represent the intermediate feature layer; and the orange circles represent the advanced feature layer. The arrows show the direction of feature flow in the network: red arrows indicate upward feature flow, blue arrows indicate downward feature flow, and yellow arrows indicate the skip edge connections, which allow features at different levels to exchange information directly. Because students sitting in the back row appear small in the image, they are likely to be confused with the background, which leads to missed detections. Therefore, the study adds an attention module to ensure the accuracy of the algorithm while also taking spatial information into account. The convolutional block attention module (CBAM) is not complicated; it is an effective attention module for CNNs that can be embedded in most CNN networks to perform adaptive feature refinement on the input feature maps [21]. In its channel attention module, information is first extracted using average pooling and maximum pooling, after which the generated feature maps are merged and output. The formulation of this channel attention module is shown in Eq. (8).
In Eq. (8),
In Eq. (9),
Diagram of CBAM module embedded in YOLOv5-O.
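The channel attention step described above (average pooling and maximum pooling, followed by merging) can be sketched in plain Python. The sketch follows the commonly published CBAM formulation, in which both pooled vectors pass through a shared two-layer perceptron, are summed, and are squashed by a sigmoid. It is an illustration under that assumption rather than the paper's Eq. (8), and the weights `w1` and `w2` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(vec, w1, w2):
    """Two-layer perceptron with ReLU; the same weights serve both pooled vectors."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(feature, w1, w2):
    """feature: list of channels, each a 2D list. Returns one weight in (0, 1) per channel."""
    flat = [[v for row in ch for v in row] for ch in feature]
    avg_pooled = [sum(ch) / len(ch) for ch in flat]  # global average pooling
    max_pooled = [max(ch) for ch in flat]            # global maximum pooling
    a = shared_mlp(avg_pooled, w1, w2)
    m = shared_mlp(max_pooled, w1, w2)
    return [sigmoid(x + y) for x, y in zip(a, m)]    # merge the two branches


def apply_attention(feature, weights):
    """Rescale every value in a channel by that channel's attention weight."""
    return [[[v * w for v in row] for row in ch] for ch, w in zip(feature, weights)]


feature = [
    [[1.0, 2.0], [3.0, 4.0]],  # channel 0: strong response
    [[0.0, 0.0], [0.0, 1.0]],  # channel 1: weak response
]
w1 = [[0.5, -0.5]]             # hypothetical weights: 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]           # hypothetical weights: 1 hidden unit -> 2 channels
weights = channel_attention(feature, w1, w2)
refined = apply_attention(feature, weights)
```

With these toy weights the stronger channel receives the larger attention weight, which is the adaptive refinement the module is meant to provide.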
In this study, model performance was evaluated in terms of both precision and detection speed. For model accuracy, in addition to the conventional accuracy, recall, average precision (AP) and mean AP (mAP), the study introduces the F1 score, which accounts for both precision and recall. For detection speed, the study uses the time the model takes to process a single video frame.
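The evaluation metrics listed above can be computed as in the following minimal sketch. The counts and the helper names are hypothetical; `seconds_per_frame` accepts any callable as a stand-in for the detector.

```python
import time


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1) from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


def seconds_per_frame(detect, frames):
    """Average wall-clock time spent by `detect` on one frame."""
    start = time.perf_counter()
    for frame in frames:
        detect(frame)
    return (time.perf_counter() - start) / len(frames)


# Hypothetical counts for a single behavior class
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(f"precision={p:.2f}, recall={r:.2f}, F1={f1:.2f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it complements the two individual measures.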
Analysis of algorithm performance detection results
To reduce the interference of systematic errors, all experiments were conducted in the same test environment: a computer with a GTX 1080 Ti graphics card, an Intel Xeon E5 CPU, 64 GB of memory, and the Windows 10 operating system. This study used a dataset composed of teaching and surveillance videos from a labor education course at an institution of higher learning. These videos record students’ behaviors and interactions in class in detail, providing rich visual information for behavior detection and analysis. The dataset contains 600 hours of video covering a variety of classroom environments and teaching activities. The videos were split into a total of 900 short videos, each about 40 minutes long. All videos were recorded in HD format, ensuring clear and detailed images. To train and test the model, these video segments were divided into a training set and a test set. The training set accounts for 80% of the data and contains 720 videos, while the test set accounts for 20% and contains 180 videos. The training loss curves of the YOLOv5 algorithm before and after improvement are compared in Fig. 6.
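The dataset figures above are internally consistent, as the following sketch shows. It assumes the split is made at the level of whole videos and is random, neither of which the text states; the file names and the seed are hypothetical.

```python
import random


def split_videos(video_ids, train_fraction=0.8, seed=0):
    """Shuffle the video ids reproducibly and cut them into train and test lists."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


video_ids = [f"video_{i:03d}" for i in range(900)]  # hypothetical file names
train, test = split_videos(video_ids)
total_hours = len(video_ids) * 40 / 60              # 900 clips of about 40 minutes each

print(len(train), len(test), total_hours)
```

Splitting by video rather than by frame keeps frames from the same recording out of both sets, which avoids near-duplicate images leaking into the test set.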
The loss curves of YOLOv5 and YOLOv5-O.
Figure 6 compares the training loss curves of the algorithm before and after the improvement. Figure 6(a) shows the loss curve of the YOLOv5 model: when the number of iterations reaches 1000, the loss value of YOLOv5 is around 0.7. Figure 6(b) shows the loss curve of the YOLOv5-O model: when the number of iterations reaches 1000, the loss value of YOLOv5-O has already reached the target requirement. The time YOLOv5-O needs to reach the target loss value is only half that of YOLOv5, and YOLOv5-O is also more stable in the iterations after reaching the target loss value. This supports the validity of the improvements.
Ablation experiments were conducted on the location of embedded CBAM, and AP and mAP were compared. The results are shown in Table 2.
CBAM insertion site ablation experiment
The ablation experiment shows that the mAP without CBAM is 79.5%. The mAP with CBAM embedded in the early layer is 85.7%, in the middle layer 84.9%, and in the deep layer 90.1%. Deep embedding has the best effect, so the deep layer is chosen as the embedding position of CBAM in this study. To validate the ability of YOLOv5-O to detect student behavior, the study selected algorithms commonly used in behavior detection for comparison, including Faster Region-based Convolutional Neural Network (Faster R-CNN), Single Shot MultiBox Detector (SSD), YOLOv4, and YOLOv5. The AP and mAP comparison results of the different algorithms are shown in Table 3.
Comparative experiments on different algorithms in the dataset
Note: Bold numbers indicate the best results.
Recall rates under different algorithms.
The comparison in Table 3 shows that the algorithms differ in accuracy on the dataset. YOLOv5-O has the highest AP, reaching 91.3%; its mAP is 90.1%, an improvement of 20.77% over YOLOv5 before improvement and of 17.47% over the widely adopted Faster R-CNN. To verify the algorithm’s practical feasibility in detecting students’ classroom behaviors, classroom behavior data from two different classes of university labor education courses were collected as Dataset A and Dataset B for application analysis. The recall curves of the different algorithms are shown in Fig. 7.
Figure 7 shows the recall rates of the different algorithms on the two datasets. The horizontal axis represents the total number of test and training samples, and the vertical axis represents the recall rate. The recall rates of the improved YOLOv5 algorithm and of the other four traditional algorithms all decrease as the number of samples increases. The recall rate of the improved YOLOv5 algorithm decreases to some extent but remains comparatively stable overall. Figure 7(a) shows the recall curves for Dataset A. When the number of samples is 300, the recall rates of the five algorithms in the figure, from top to bottom, are 61.6%, 53.3%, 44.9%, 41%, and 37.4%, respectively. Figure 7(b) shows the recall curves for Dataset B. When the number of samples is 300, the recall rates from top to bottom are 60.5%, 52.7%, 43.8%, 42.3% and 34.4%. YOLOv5-O has the best recall performance on both datasets. The study then tested the above datasets several times and recorded the accuracy of each algorithm for comparison, as shown in Fig. 8.
Accuracy under different algorithms.
Figure 8 shows the accuracy of the five algorithms in practical applications. Figure 8(a) shows the accuracy curves for the Dataset A tests, where YOLOv5-O has the highest accuracy in all 12 tests, with an average accuracy of 85.5%. Figure 8(b) shows the accuracy curves for the Dataset B tests, where YOLOv5-O also has the highest accuracy, with an average accuracy of 87.9%. The mean of these two averages is 86.7%. The figure shows that the accuracy of YOLOv5-O is considerably higher than that of the other four algorithms, which indicates that the proposed YOLOv5-O provides better results in practical applications.
To examine the relationship between the precision and recall of the YOLOv5-O algorithm in detecting student behavior in university labor education courses, and to verify the effectiveness of the behavior detection analysis, its PR curve was drawn and compared with that of Faster R-CNN, which also performs well in application. The details are shown in Fig. 9.
PR curve of YOLOv5 algorithm before and after improvement.
The PR curves of the two algorithms on the dataset are shown in Fig. 9. The horizontal axis indicates the recall rate and the vertical axis indicates the precision rate. As the recall rate increases, the precision rate decreases, and the area enclosed by the curve can be used to judge the performance of an algorithm. Fig. 9 shows that the area under the PR curve of the YOLOv5-O algorithm is larger than the green area of the YOLOv5 algorithm, and the visual comparison shows intuitively that the algorithm in the study performs better after the improvements. Finally, the PR curves of the improved YOLOv5 algorithm for the different behaviors were drawn to verify the effectiveness of the detection analysis, as shown in Fig. 10.
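The idea of judging an algorithm by the area under its PR curve can be made concrete with a short sketch. It uses the all-point interpolation common in object detection benchmarks, which is an assumption because the text does not state how the area is computed; the two sets of curve points are hypothetical and are not taken from Fig. 9.

```python
def average_precision(recalls, precisions):
    """Area under a PR curve (recalls sorted ascending), all-point interpolation."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision with the maximum precision at any higher recall
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# Hypothetical PR points for a stronger and a weaker detector
recalls = [0.2, 0.4, 0.6, 0.8]
ap_strong = average_precision(recalls, [1.0, 0.9, 0.7, 0.5])
ap_weak = average_precision(recalls, [1.0, 0.7, 0.5, 0.3])

print(f"stronger detector AP={ap_strong:.2f}, weaker detector AP={ap_weak:.2f}")
```

A curve that stays high as recall grows encloses a larger area, so a larger AP corresponds to the visually larger region described in the text.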
PR curve of improved YOLOv5 algorithm.
Figure 10 shows the PR curves of the YOLOv5-O algorithm for the different behavior categories. The area enclosed by the PR curves is large for every behavior category, which confirms that the algorithm adopted in the study detects and analyzes students’ classroom behaviors well and achieves higher accuracy than the other common algorithms. The YOLOv5-O model constructed in the study can therefore meet the needs of detecting and analyzing students’ classroom behavior while ensuring the accuracy and validity of the results. To further verify the effect of the method on student achievement in practice, the study conducted a teaching experiment among students of the same grade at a university. A total of 162 students participated, and the subject tested was labor education. To ensure the fairness and effectiveness of the experiment, students were randomly divided into three groups: one taught with traditional methods, one with Faster R-CNN assisted teaching, and one with YOLOv5-O assisted teaching. The traditional group followed the regular teaching plan and activities. The Faster R-CNN and YOLOv5-O groups each used the corresponding classroom behavior recognition technology to track student participation and provide immediate feedback. The teaching experiment lasted one semester and focused on changes in student achievement, as shown in Fig. 11.
Change curve of student achievement.
Figure 11 compares the changes in students’ scores over time under the different methods, which is the most intuitive reflection of the teaching effect. There was no significant difference between the scores of the three groups in the first two achievement tests. In the last three tests, both Faster R-CNN assisted teaching and YOLOv5-O assisted teaching improved students’ scores more than traditional teaching methods. In the final test, the average score was 89.12 points under YOLOv5-O assisted teaching, 79.76 points under Faster R-CNN assisted teaching, and 74.27 points under the traditional teaching method. These results indicate that using a behavior detection algorithm to assist teaching enables teachers to understand students’ learning status in time and adjust their teaching accordingly. In addition, because the detection performance of YOLOv5-O is better than that of Faster R-CNN, YOLOv5-O assisted teaching achieves the most significant teaching effect, that is, the largest improvement in students’ scores.
Modern education increasingly relies on smart computers and other devices to improve students’ classroom learning quality and efficiency as well as teachers’ teaching quality. This study proposes a model based on the YOLOv5-O algorithm for detecting student behavior in labor courses, aiming at more effective detection and analysis of student behavior. The study optimizes the network structure of the YOLOv5 algorithm and embeds the attention module at the best-performing position to improve accuracy. Model performance is analyzed by comparing recall, average precision, accuracy, and PR curves across different algorithms. The experimental data show that YOLOv5-O has the highest AP on the test set, reaching 91.3%; its mAP is 90.1%, an improvement of 20.77% over YOLOv5 before improvement and of 17.47% over the widely used Faster R-CNN. In practical applications, the average accuracy of YOLOv5-O is 86.7%, and the area enclosed by its PR curve is large for every behavior category. In summary, the YOLOv5-O model can detect and analyze students’ classroom behaviors efficiently and accurately, which supports the validity and practical value of this study. Although the study has achieved certain results, the model currently implements only some of the intended functions and requires a high equipment configuration, which makes it unsuitable for widespread use; subsequent studies will address these limitations.
Footnotes
Funding
This study is supported by The Project of Universities’ Philosophy and Social Science Research in Jiangsu Province: Research on the Construction of Labor Education Model in Applied Undergraduate Universities under the Background of Artificial Intelligence (Project NO.2023SJYB2211).
