Hazardous behavior detection based on improved Yolov5s model and human posture estimation in chemical and bioengineering laboratory safety management

Abstract

With the increasing importance of safety management in chemical and bioengineering laboratories, how to quickly and accurately detect and warn dangerous behaviors is becoming a popular topic. In this study, a dangerous behavior detection method based on improved You Only Look Once v5s model and human posture estimation is proposed. It is applied to the safety management of chemical and bioengineering laboratories. First, by improving the You Only Look Once v5s model, the detection accuracy and speed are improved. Then, the detected dangerous behaviors are further analyzed and judged by combining the human posture estimation technique. Finally, an indoor personnel safety video surveillance system is developed by adopting a distributed architecture and modularized design strategy. The results validated that the precision of the improved model was 97.68%, the recall rate was 96.79%, the detection time consumed was 0.022 s, and the GPU performance was 10.1. The method can quickly and accurately detect and warn dangerous behaviors. The optimized model designed by the study has important practical application value.

Keywords

chemical and bioengineering laboratories; safety management; risky behavior detection; Yolov5s model; human posture estimation

Introduction

As important places for scientific research, the safety management of chemical and biological engineering laboratories is of vital importance. However, the current laboratory safety management is facing many challenges. On the one hand, laboratories involve a large number of hazardous chemicals, biological agents and complex experimental equipment.¹ A slight mistake may cause safety accidents, such as chemical leakage, fire and explosion, posing a serious threat to personnel safety and laboratory property. On the other hand, the traditional safety management approach mainly relies on manual monitoring and regular inspections.² This method is not only inefficient but also prone to causing potential safety hazards to be overlooked due to human negligence. For instance, during the experiment, researchers may violate the operation rules due to fatigue or negligence, such as not wearing protective equipment correctly or having improper postures when handling hazardous chemicals. All these behaviors may cause danger. In addition, there is frequent personnel turnover in the laboratory, and different personnel have varying degrees of understanding and implementation of safety regulations, which also increases the difficulty of safety management.³ With the rapid development of computer vision technology, object detection and human pose estimation techniques based on deep learning have gradually matured and have been widely applied in multiple fields. The Yolov5s model, as an efficient and lightweight object detection algorithm, has the advantages of fast detection speed and high accuracy, and is capable of real-time identification and location of personnel and objects in the laboratory. Human pose estimation technology can detect the key points of the human body, analyze and judge the movements and postures of personnel, thereby providing an important basis for the identification of dangerous behaviors. 
Introducing these two technologies into the safety management of chemical and biological engineering laboratories can not only make up for the deficiencies of traditional manual monitoring, but also achieve real-time and automated monitoring of the laboratory environment, promptly detect and warn of potential dangerous behaviors, effectively reduce the probability of accidents, and enhance the overall safety of the laboratory.⁴,⁵ Therefore, it is of great practical significance and urgency to study the detection method of dangerous behaviors based on the improved Yolov5s model and human pose estimation, and apply it to the safety management of chemical and biological engineering laboratories.

The contributions of this study to laboratory safety management are mainly reflected in the following aspects. First, detection accuracy is improved: the Yolov5s model is modified to strengthen feature extraction and target recognition, which raises the accuracy of dangerous behavior detection (DBD) and reduces false and missed alarms. Second, safety is enhanced: detecting and warning of dangerous behaviors in a timely manner provides a more reliable guarantee for laboratory safety management, lowering the probability of accidents and the risk of casualties and property damage. Third, human posture estimation (HPE) is used as an auxiliary judgment, so that whether a behavior is dangerous can be determined more accurately. By identifying multiple types of human postures, the classification and early warning of behaviors can be further refined to provide more detailed and accurate information. Fourth, intelligent management is promoted: the deep learning-based DBD method provides technical support for the intelligent management of chemical and bioengineering laboratories (CBL), reducing the burden of manual monitoring, improving management efficiency, and decreasing the risk of human error and omission. Fifth, the approach provides a research basis for detecting risky behavior and offers new methods and ideas for subsequent research.

The innovation of the research is mainly reflected in the following aspects. Firstly, in response to the actual needs of safety management in chemical and biological engineering laboratories, targeted improvements were made to the Yolov5s model. By introducing depth-separable convolution and grouped convolution, the computational complexity and the number of parameters of the model were optimized. Meanwhile, the CBAM attention mechanism was introduced to further enhance the model’s ability to focus on the target features, thereby improving the accuracy and real-time performance of dangerous behavior detection. Secondly, the research combined the human pose estimation technology with the improved Yolov5s model to form a complete dangerous behavior detection system. This combination not only makes full use of the advantages of human pose estimation in action analysis, but also achieves accurate recognition and classification of human poses in complex environments through multi-scale feature fusion and spatio-temporal graph convolutional neural network (ST-GCN), further improving the accuracy and reliability of dangerous behavior detection. In addition, the research has developed an indoor personnel safety video monitoring system based on a distributed architecture and modular design. This system can monitor the personnel activities in the laboratory in real time and provide rapid early warnings for the detected dangerous behaviors, providing intelligent technical support for laboratory safety management.

The paper is segmented into four parts. The first part reviews the literature, which discusses and analyzes the current research status of human area detection algorithms and HPE algorithms in video dangerous behavior judgment at home and abroad. The second part proposes an improved Yolov5s object detection model and designs the overall process of HPE. The third part verifies the effectiveness and performance of this method through experiments. The fourth part summarizes the research results.

Related works

There are many studies on the Yolov5 algorithm. Yolov5s is an efficient and lightweight object detection algorithm (ODA) suitable for many real-time detection (RTD) scenarios. To address the low accuracy of underwater target detection, Wen et al. proposed an enhanced YOLOv5s network that increases the number of bottlenecks and embeds a coordinate attention module and a squeeze-and-excitation module to improve attention to targets. According to the experimental results, the improved network's average accuracy increased by 2.4%.⁶ Dai et al. proposed GCD-Yolov5 for armored target recognition in complex battlefield conditions, aiming at real-time performance with fewer missed and false alarms. GCD-Yolov5 was accurate and significantly improved the recognition of armored targets.⁷ To address the low accuracy of small target detection, Tan et al. adopted an improved method based on Yolov5s. The improved algorithm significantly increased the detection accuracy of small aircraft and ship targets in complex environments and greatly reduced the missed detection rate, especially for small targets.⁸ Xu et al. proposed an improved YOLOv5 deep convolutional neural network (CNN) for real-time hand detection and recognition in intelligent service robots. Adding an SE attention module to the neck detection layer enhanced the network's feature extraction capability for small targets and improved detection performance. The experimental results indicated that this method had an accuracy of 99.02%, which is 6.54% higher than the original YOLOv5.⁹ Shen et al. adopted an improved ASFF-Yolov5s-based RTD algorithm for small targets in unmanned aerial vehicle (UAV) imagery to address issues such as high flight altitude, large changes in target scale, and dense occlusion of targets. The proposed method achieved an accuracy of 32.55% and an F1 score of 39.62%, meeting the needs of RTD in UAV aerial images.¹⁰ Zhang et al. adopted a vehicle detection method based on Yolov5 to address the poor generalization ability and robustness of traditional ODAs based on hand-crafted features. This algorithm could accurately segment and recognize vehicles based on their edge contours.¹¹

Currently, there are many research methods for HPE. The methods built on deep learning have made significant progress in HPE. Carpenter et al. used computer vision analysis to address the issues of high cost, time-consuming behavior observation, and the need for a large amount of professional knowledge to complete. By calculating and encoding facial movements and expressions based on tablet evaluation, differences in emotional expression could be detected.¹² Pereira et al. adopted a machine learning system for tracking the posture of multiple animals in response to research issues related to their social behavior or natural environment. This method achieved higher accuracy, with a speed exceeding 800 frames per second.¹³ To address the issue of existing methods only focusing on scene images or the driver’s gaze or head posture, Hu et al. adopted a dual view scene approach based on uncalibrated gaze direction. This was feasible and superior to the most advanced methods.¹⁴ Chen et al. adopted a fully convolutional propagation architecture with long jump connections to address the issue of three-dimensional HPE in videos. The model performed better than the original best results on the Human3.6M and MPI-INF-3DHP, and the effectiveness of the model was verified.¹⁵ To address the issue of finding cross perspective correspondence in noisy and incomplete 2D pose prediction, Dong et al. used a multi-directional matching algorithm to cluster the detected 2D poses in all views. The proposed method achieved state-of-the-art on the Campus and Shelf datasets.¹⁶ In response to the problem of HPE, Zheng et al. conducted a comprehensive review of 2D and 3D pose estimation solutions based on deep learning. They investigated and analyzed the performance comparison of different methods, aiming to address the problems of insufficient training data, depth blur, and occlusion. 
The results indicated that although deep learning methods have achieved high performance in HPE, they still face challenges.¹⁷

In summary, great progress has been made in Yolov5s model and HPE in recent years. Different from the above studies, this study focuses on improving the integration and optimization of Yolov5s model and HPE technology, so as to form an efficient and reliable risk behavior detection system. Through the integration and optimization of the system, the advantages of the two technologies can be fully utilized to achieve more accurate and efficient DBD. The computer vision and HPE techniques are applied to the safety management of CBL. This cross-field application not only broadens the practical application range of related technologies, but also provides new ideas and methods for laboratory safety management.

Laboratory risk behavior detection based on improved Yolov5s

This study improves the Yolov5s model by applying depthwise separable convolution and grouped convolution to reduce the number of network parameters. In addition, spatial pyramid pooling (SPP) is retained and the prediction box loss function is optimized. Human posture detection is then integrated for efficient safety management in CBL.

Improving the construction of Yolov5s ODA

Yolov5s is a deep learning-based ODA. Although it performs well in object detection, some problems remain in laboratory scenarios. In this study, the model is improved by introducing an attention mechanism, enhancing the backbone feature extraction network (BFEN), and applying multi-scale feature fusion. These modifications improve the model's detection accuracy and real-time performance, which in turn improves the accuracy of hazardous behavior detection and provides a more effective tool for the safety management of CBL.¹⁸ Figure 1 shows the structure of Yolov5.

Figure 1.

Framework structure of the Yolov5 model.

In Figure 1, Yolov5s uses an improved CSPDarknet53, a variant of Darknet53, as the backbone network. The CSP structure splits the backbone into two branches: one extracts low-level features and the other extracts high-level features. These features are then integrated through cross-stage connections, which improves feature expression ability and model efficiency. For the detection head, Yolov5s adopts the Path Aggregation Network (PAN) structure, which fuses feature maps at different levels through multi-scale feature fusion and thereby enhances the model's ability to detect objects at different scales. For an ordinary (direct) convolution operation, the computational cost is given by equation (1).

FLOPs = K_{h} \times K_{w} \times C_{in} \times C_{out} \times H \times W

(1)

In equation (1), $K_{h}$ and $K_{w}$ are the height and width of the convolutional kernel, $C_{in}$ and $C_{out}$ are the numbers of input and output channels, and $H$ and $W$ are the height and width of the output feature map. The Squeeze-and-Excitation (SE) layer originated as a submodule of image classification networks and achieves better detection results at low computational cost.¹⁹ The squeeze-and-excitation module can be seen as a simple mapping between input and output. Taking the convolution operation in 2D image detection as an example, the specific formula is

u_{c} = v_{c} * x = \sum_{s = 1}^{C_{in}} v_{c}^{s} * x^{s}

(2)

In equation (2), $v_{c}$ represents the $c$-th convolutional kernel, $x$ represents the input, $u_{c}$ represents the output, $*$ represents the convolution operation, and the sum runs over the $C_{in}$ input channels indexed by $s$. Global pooling is used to obtain features with a global field of view, compressing the feature map of each channel so that the output has size 1×1×C. The SE module then assigns weight parameters to the feature maps of the different channels. Different convolutional kernels generate the feature layers of different channels, and each input feature layer is processed by its corresponding convolutional kernel.²⁰ The specific formula is

z_{c} = F_{sq}(u_{c}) = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} u_{c}(i, j)

(3)

In equation (3), $W$ represents the width, $H$ represents the height, and $u_{c}(i, j)$ represents the pixel value at position $(i, j)$. The second part of the SE module is the excitation operation. The 1×1×C vector obtained above is passed through two layers of learnable weights. To improve the generalization performance of the model, the first weight layer reduces the dimension and the second restores it, as shown in equation (4).

s = F_{ex}(z, W) = σ(g(z, W)) = σ(W_{2} δ(W_{1} z))

(4)

In equation (4), $σ$ represents the sigmoid activation function, $δ$ represents the ReLU activation function, $W_{1}$ represents the weight layer that reduces the dimension, and $W_{2}$ represents the weight layer that increases the dimension. The Convolutional Block Attention Module (CBAM) is an evolution of the channel attention mechanism described above and is used to strengthen the perception ability of the CNN. Its structure is shown in Figure 2.

Figure 2.

CBAM attention structure map.
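As a minimal sketch of the squeeze and excitation steps in equations (3) and (4), the following NumPy example applies channel reweighting to a random feature map. The reduction ratio `r`, the tensor sizes, and the random weights are illustrative assumptions and are not taken from the study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(u, w1, w2):
    """Channel reweighting following equations (3) and (4).

    u  : feature map of shape (C, H, W)
    w1 : dimension-reducing weights W_1, shape (C // r, C)
    w2 : dimension-restoring weights W_2, shape (C, C // r)
    """
    z = u.mean(axis=(1, 2))            # squeeze, eq. (3): one value per channel
    hidden = np.maximum(w1 @ z, 0.0)   # delta = ReLU after dimension reduction
    s = sigmoid(w2 @ hidden)           # sigma = sigmoid, eq. (4): weights in (0, 1)
    return u * s[:, None, None], s     # scale every channel by its weight

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                # r is a hypothetical reduction ratio
u = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
out, s = se_block(u, w1, w2)
print(out.shape, s.round(3))
```

The output keeps the shape of the input feature map; only the relative strength of the channels changes.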

The improved CSPNet adopts the Bottleneck * N structure, in which the N value is optimized according to the specific configuration of the model. In the improved version of Yolov5s, the N value is set to 3 to balance the computational efficiency and detection accuracy of the model. The CBAM module is integrated in the middle layer of the backbone network, that is, after passing through several Bottleneck modules, to enhance the discrimination ability of the characteristics of the middle layer. The processed feature layer will be fused multiple times with other feature layers, making better use of the weighted feature layer information, as shown in equation (5).

M_{s}(F) = σ(f^{7 \times 7}([AvgPool(F); MaxPool(F)]))

(5)

In equation (5), $f^{7 \times 7}$ is the convolution operation with a 7×7 convolution kernel, and $AvgPool(F)$ and $MaxPool(F)$ represent mean and maximum pooling, respectively. To demonstrate more clearly the influence of the Sigmoid and ReLU activation functions on the attention weights, the study recorded the mean attention weight, the loss value, and the accuracy obtained with each activation function at different iteration stages, as shown in Table 1.

Table 1.

The influence of sigmoid and ReLU activation functions on attention weights.

Number of iterations	Activation function	Mean value of attention weight	Loss value	Accuracy (%)
10	Sigmoid	0.62	0.85	91.2
10	ReLU	0.58	0.87	90.8
50	Sigmoid	0.71	0.65	93.5
50	ReLU	0.68	0.67	93.2
100	Sigmoid	0.75	0.55	94.8
100	ReLU	0.73	0.57	94.5
150	Sigmoid	0.78	0.48	95.2
150	ReLU	0.76	0.50	95.0

It can be seen from Table 1 that the mean attention weight obtained with the Sigmoid activation function is higher than that obtained with ReLU at every recorded stage (for example, 0.62 versus 0.58 at 10 iterations and 0.78 versus 0.76 at 150 iterations), indicating that Sigmoid performs better in the normalization of attention weights and helps the model focus on key features more accurately. The Sigmoid activation function also achieves lower loss values and higher accuracy as the number of iterations increases. ReLU is generally associated with faster training, but at the iteration stages recorded in Table 1 its loss remains slightly higher than that of Sigmoid. In the improved model, the backbone feature extraction network (BFEN) of Yolov5 is replaced with a network structure composed of depthwise separable convolution and grouped convolution parts.²¹ The structure is shown in Figure 3.

Figure 3.

Improved Yolov5s partial model.

In Figure 3, CBL denotes a standard convolution module (distinct from the laboratory abbreviation used elsewhere in this paper), consisting of a convolution layer (Conv), a batch normalization layer (BN), and a LeakyReLU activation layer. Grouped CSPNet (G_CSP) divides the original input into two branches, each of which performs a convolution that halves the number of channels. One branch then passes through Bottleneck * N operations, and finally the two branches are concatenated, so that the input and output have the same size, as in Bottleneck CSP. SPP refers to spatial pyramid pooling, and CNAM is a channel attention module. To reduce computational complexity, the original Focus layer is replaced with a basic convolutional layer, which reduces the original 640 * 640 * 3 image to a 320 * 320 * 32 feature layer. In addition, CBL standard convolution is used for image downsampling. The SPP module is retained: pooling operations at different scales produce local features at different scales, effectively expanding the receptive field of the BFEN. Depthwise separable convolution and grouped convolution decrease the number of parameters in the overall network by decomposing a standard convolution into cheaper steps. In addition, the CBAM has been introduced to reduce computational parameters and improve target detection accuracy.
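To make the parameter savings concrete, the following sketch counts the weights and multiply-accumulate operations of a standard, a grouped, and a depthwise separable convolution, using the cost model of equation (1). The layer sizes (3×3 kernel, 64 to 128 channels, 80×80 output map) and the group count are hypothetical values chosen for illustration, not the configuration used in the study.

```python
def conv_params(k_h, k_w, c_in, c_out, groups=1):
    """Number of weights in a (grouped) convolution, bias ignored."""
    assert c_in % groups == 0 and c_out % groups == 0
    return k_h * k_w * (c_in // groups) * c_out

def conv_flops(k_h, k_w, c_in, c_out, h, w, groups=1):
    """Cost following equation (1); h and w are the output feature map size."""
    return conv_params(k_h, k_w, c_in, c_out, groups) * h * w

def depthwise_separable_params(k_h, k_w, c_in, c_out):
    depthwise = conv_params(k_h, k_w, c_in, c_in, groups=c_in)  # one filter per input channel
    pointwise = conv_params(1, 1, c_in, c_out)                   # 1x1 convolution mixes channels
    return depthwise + pointwise

# Hypothetical layer used only for illustration.
standard = conv_params(3, 3, 64, 128)
grouped = conv_params(3, 3, 64, 128, groups=4)
separable = depthwise_separable_params(3, 3, 64, 128)
standard_flops = conv_flops(3, 3, 64, 128, 80, 80)

print(standard, grouped, separable)   # 73728 18432 8768
print(standard_flops)                 # 471859200
```

For this hypothetical layer, grouping with four groups divides the weight count by four, and the depthwise separable form needs roughly one eighth of the weights of the standard convolution.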

In conclusion, the improved model replaces the original backbone feature extraction network CSPDarknet53 with a network structure consisting of depthwise separable convolution and grouped convolution parts to reduce the computational complexity and the number of network parameters. The CBAM attention module is embedded in the last layer of the backbone feature network. By combining channel attention and spatial attention, it enhances the model's ability to focus on target features. For feature fusion, the improved Yolov5s retains SPP and fuses feature maps at different levels through multi-scale feature fusion, thereby enhancing the detection of targets at different scales. Figure 4 shows the pseudocode for the improved Yolov5s + HPE pipeline.

Figure 4.

Pseudocode for the improved Yolov5s + HPE pipeline.

In Figure 4, the provided pseudocode encapsulates the entire workflow of the proposed Yolov5s + HPE pipeline, starting with object detection using the improved Yolov5s model to identify persons within the input frame, followed by human pose estimation on the detected person regions using AlphaPose, and then classifying the detected poses into behaviors with ST-GCN, where dangerous behaviors trigger alerts and logging for further action.
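The workflow described above can be sketched in Python as follows. This is only an illustrative outline under stated assumptions: `detect`, `estimate_pose`, and `classify` are hypothetical stand-ins for the improved Yolov5s detector (with tracking), AlphaPose, and ST-GCN, the per-person identifier and the label `"fall"` are invented for the example, and the 30-frame window follows the key point window mentioned in the description of the detection process.

```python
from collections import deque

WINDOW = 30  # frames of key points collected before one classification

def process_frame(frame, detect, estimate_pose, classify, buffers, dangerous_labels):
    """Run one frame through detection -> pose estimation -> behavior classification."""
    alerts = []
    for person_id, box in detect(frame):            # stand-in for improved Yolov5s + tracking
        keypoints = estimate_pose(frame, box)        # stand-in for AlphaPose
        buffer = buffers.setdefault(person_id, deque(maxlen=WINDOW))
        buffer.append(keypoints)
        if len(buffer) == WINDOW:                    # enough temporal context collected
            label = classify(list(buffer))           # stand-in for ST-GCN
            if label in dangerous_labels:
                alerts.append((person_id, label))    # caller issues the warning and logs it
    return alerts

# Stub components so that the sketch runs end to end.
def detect(frame):
    return [(1, (0, 0, 10, 10))]

def estimate_pose(frame, box):
    return [(5.0, 5.0)] * 17

def classify(sequence):
    return "fall"

buffers = {}
history = [
    process_frame(i, detect, estimate_pose, classify, buffers, {"fall"})
    for i in range(WINDOW)
]
print(history[-1])   # [(1, 'fall')]
```

With these stubs, no alert is produced until 30 frames of key points have been collected for the tracked person; the alert appears on the 30th frame.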

Construction of laboratory monitoring system based on AlphaPose

After making a series of improvements to the Yolov5s model to enhance its detection performance, the research further applied the improved model to the monitoring system of chemical and biological engineering laboratories, and combined with human pose estimation technology to construct a complete dangerous behavior detection system. Figure 5 shows the specific detection process.

Figure 5.

Yolov5s target detection model flowchart.

In Figure 5, DeepSORT uses the Kalman filtering algorithm to predict the personnel targets detected by the ODA in previous frames, yielding more stable tracking targets. The tracked indoor personnel area is then cropped and input into the key point detection module, and the detected key point information is fed into the graph convolutional network, which classifies the current action based on the key point information from the preceding 30 frames. Next, the camera transmits the results to the mobile end. If non-standard or dangerous behaviors, such as accidents during an experiment, are detected, the target is marked with a red label and a warning is issued. If rescue is needed, the relevant personnel are contacted. In the assignment problem, DeepSORT considers both motion features and target appearance features, and incorporates motion information through the squared Mahalanobis distance, as displayed in equation (6).

d^{(1)}(i, j) = (d_{j} - y_{i})^{T} S_{i}^{-1} (d_{j} - y_{i})

(6)

In equation (6), $d_{j}$ represents the $j$-th target detection box, and $(y_{i}, S_{i})$ represents the projection of the $i$-th trajectory distribution into the measurement space. DeepSORT thresholds the Mahalanobis distance at a 95% confidence interval, as shown in equation (7). The Mahalanobis distance is based on the Kalman filter motion model and is used to predict and match the short-term motion trajectory of the target, while the cosine distance focuses on the appearance features of the target and provides a reliable matching basis when the target is occluded or has not been detected for a long time. By linearly weighting the two metrics, the model can better adapt to the target tracking requirements of different scenarios and improve the accuracy and robustness of tracking.

b_{i, j}^{(1)} = \mathbb{1}[d^{(1)}(i, j) \leq t^{(1)}]

(7)

In equation (7), $t^{(1)} = 9.4877$, and the indicator takes the value 1 when the association between the $i$-th trajectory and the $j$-th detected target is admissible. This threshold is the 0.95 quantile of the chi-square distribution with four degrees of freedom, which corresponds to the 95% confidence interval of the four-dimensional measurement space. According to the study, on the pre-training dataset this threshold balances the importance of motion information and appearance features in the combined Mahalanobis and cosine distance measurement, and its selection also takes into account the generalization ability of the model, so that high tracking accuracy and stability can be maintained in different laboratory monitoring environments. The Mahalanobis distance is a suitable metric when motion uncertainty is low. When motion uncertainty is high, for example under occlusion or abrupt camera motion, it becomes unreliable, and DeepSORT uses a second association metric, as shown in equation (8).

d^{(2)}(i, j) = \min \{ 1 - r_{j}^{T} r_{k}^{(i)} \mid r_{k}^{(i)} \in R_{i} \}

(8)

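In equation (8), $r_{j}$ is the appearance feature vector of the $j$-th detection, and $R_{i}$ is the set of feature vectors $r_{k}^{(i)}$ stored for the $i$-th trajectory. As with the motion metric, a binary threshold is applied when the Kalman prediction is associated with the detected target, as shown in equation (9).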

b_{i, j}^{(2)} = \mathbb{1}[d^{(2)}(i, j) \leq t^{(2)}]

(9)

In equation (9), the threshold $t^{(2)}$ is a suitable value selected on the pre-training set, on which DeepSORT uses a CNN to generate appearance features. The two metrics address the association problem from different perspectives. The Mahalanobis distance performs well in short-term prediction and matching based on motion data, while the cosine distance focuses on the appearance features of the target and handles matching under long-term occlusion. The final association cost is the linear weighting of the two metrics, as shown in equation (10), and an association is accepted only if it lies within both the $t^{(1)}$ and $t^{(2)}$ thresholds.²²

c_{i, j} = γ d^{(1)} (i, j) + (1 - γ) d^{(2)} (i, j)

(10)
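Equations (6) to (10) can be illustrated with the following NumPy sketch, which computes the squared Mahalanobis distance, the minimum cosine distance, the two gates, and the weighted cost for one trajectory and one detection. The numeric state, the weight `gamma = 0.5`, and the appearance threshold `t2 = 0.2` are hypothetical values chosen for the example; only the gate 9.4877 comes from the text.

```python
import numpy as np

T1 = 9.4877  # gate t^(1) from equation (7)

def mahalanobis_sq(d_j, y_i, S_i):
    """Equation (6): squared Mahalanobis distance in measurement space."""
    diff = d_j - y_i
    return float(diff @ np.linalg.solve(S_i, diff))

def min_cosine_distance(r_j, gallery_i):
    """Equation (8): smallest cosine distance to the stored features (unit-norm rows assumed)."""
    return float(np.min(1.0 - gallery_i @ r_j))

def association_cost(d_j, y_i, S_i, r_j, gallery_i, gamma, t2):
    d1 = mahalanobis_sq(d_j, y_i, S_i)
    d2 = min_cosine_distance(r_j, gallery_i)
    admissible = (d1 <= T1) and (d2 <= t2)        # both gates, equations (7) and (9)
    cost = gamma * d1 + (1.0 - gamma) * d2        # equation (10)
    return cost, admissible

# Hypothetical trajectory projection (y_i, S_i) and appearance gallery R_i.
y_i = np.array([100.0, 50.0, 0.5, 80.0])
S_i = np.diag([4.0, 4.0, 0.01, 4.0])
gallery_i = np.array([[1.0, 0.0], [0.0, 1.0]])
r_j = np.array([1.0, 0.0])

near = np.array([102.0, 51.0, 0.5, 82.0])
far = np.array([130.0, 50.0, 0.5, 80.0])
near_cost, near_ok = association_cost(near, y_i, S_i, r_j, gallery_i, gamma=0.5, t2=0.2)
far_cost, far_ok = association_cost(far, y_i, S_i, r_j, gallery_i, gamma=0.5, t2=0.2)
print(near_cost, near_ok)
print(far_cost, far_ok)
```

In this example the nearby detection passes both gates, while the distant detection is rejected by the motion gate even though its appearance matches.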

In the matching stage, DeepSORT introduces a cascade matching method, which gives matching priority to the most recently observed targets. For human key point detection, the study uses the AlphaPose algorithm. To address inaccurate human body positions produced by detectors, AlphaPose uses a network structure called regional multi-person pose estimation (RMPE), which can detect multiple human bodies simultaneously and thereby improves detection efficiency.²³ The RMPE network structure is shown in Figure 6.

Figure 6.

AlphaPose network structure diagram.

In Figure 6, the AlphaPose process is roughly as follows: the cropped human body region is input, features are processed by the RMPE network, and the human body key point information is output. AlphaPose improves the single-person pose estimation stage by introducing a pair of symmetric networks, namely the Spatial Transformer Network (STN) and the Spatial De-Transformer Network (SDTN). Mathematically, the two-dimensional affine transformation is expressed as equation (11).

\begin{pmatrix} x_{i}^{t} \\ y_{i}^{t} \end{pmatrix} = \begin{bmatrix} γ_{1} & γ_{2} & γ_{3} \end{bmatrix} \begin{pmatrix} x_{i}^{s} \\ y_{i}^{s} \\ 1 \end{pmatrix}

(11)

In equation (11), $[γ_{1} γ_{2} γ_{3}]$ represents the parameter vectors of the inverse transformation, ${x_{i}^{t}, y_{i}^{t}}$ represents the coordinates before the inverse spatial transformation, and ${x_{i}^{s}, y_{i}^{s}}$ represents the coordinates after it. Because the SDTN is the inverse of the STN, whose parameters are denoted $[θ_{1} θ_{2} θ_{3}]$, the relationship shown in equation (12) holds.

\begin{cases} \begin{bmatrix} γ_{1} & γ_{2} \end{bmatrix} = \begin{bmatrix} θ_{1} & θ_{2} \end{bmatrix}^{-1} \\ γ_{3} = -1 \times \begin{bmatrix} γ_{1} & γ_{2} \end{bmatrix} θ_{3} \end{cases}

(12)
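The relationship in equation (12) can be checked numerically. The sketch below assumes that the STN applies an affine map with parameters $[θ_{1} θ_{2} θ_{3}]$ in the same form as equation (11), derives the SDTN parameters from them, and verifies that applying both maps returns the original point. The values of `theta` and the test point are hypothetical.

```python
import numpy as np

def apply_affine(params, point):
    """Apply a 2x3 affine map [a1 a2 a3] to a 2D point: [a1 a2] @ point + a3."""
    return params @ np.append(point, 1.0)

def sdtn_from_stn(theta):
    """Equation (12): [g1 g2] = [t1 t2]^-1 and g3 = -[g1 g2] t3."""
    gamma_12 = np.linalg.inv(theta[:, :2])
    gamma_3 = -gamma_12 @ theta[:, 2]
    return np.column_stack([gamma_12, gamma_3])

# Hypothetical STN parameters: scale x by 2, y by 0.5, then translate.
theta = np.array([[2.0, 0.0, 10.0],
                  [0.0, 0.5, -4.0]])
gamma = sdtn_from_stn(theta)

point = np.array([3.0, 8.0])
transformed = apply_affine(theta, point)        # [16.0, 0.0]
round_trip = apply_affine(gamma, transformed)   # back to [3.0, 8.0]
print(gamma)
print(transformed, round_trip)
```

The round trip recovers the input point, which is the property that lets the SDTN map the estimated pose back to the original image coordinates.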

Pose non-maximum suppression (NMS) solves the problem of redundant human pose detections, and the elimination criterion is shown in equation (13).

f(P_{i}, P_{j} | \Lambda, δ) = \mathbb{1}[d(P_{i}, P_{j} | \Lambda, μ) \leq δ]

(13)

In equation (13), $d(P_{i}, P_{j} | \Lambda)$ is a measure of the similarity distance between two poses $P_{i}$ and $P_{j}$, and $\Lambda$ is the set of parameters of the distance $d$. When the value of $d(\cdot)$ is less than or equal to the threshold $δ$, the output $f(\cdot)$ takes the value 1, which means that pose $P_{i}$ is redundant with respect to pose $P_{j}$ and is eliminated. AlphaPose follows this standard non-maximum suppression process, but defines the distance between two poses using both a pose distance and a spatial distance. The pose distance is given by equation (14).

K_{Sim}(P_{i}, P_{j} | σ_{1}) = \begin{cases} \sum_{n} \tanh \frac{c_{i}^{n}}{σ_{1}} \cdot \tanh \frac{c_{j}^{n}}{σ_{1}}, & \text{if } k_{j}^{n} \text{ is within } B(k_{i}^{n}) \\ 0, & \text{otherwise} \end{cases}

(14)

In equation (14), $B(k_{i}^{n})$ represents the box centered at key point $k_{i}^{n}$, whose size is one tenth of the original target box $B_{i}$, and $c_{i}^{n}$ is the confidence of the $n$-th key point of pose $P_{i}$. The $\tanh$ function filters out poses with low confidence. When two matching key points both have high confidence, the corresponding term of the equation is approximately 1. The spatial distance measures how far apart the corresponding key points of the two poses are, and the specific formula is equation (15).

H_{Sim}(P_{i}, P_{j} | σ_{2}) = \sum_{n} \exp \left[ - \frac{(k_{i}^{n} - k_{j}^{n})^{2}}{σ_{2}} \right]

(15)

In equation (15), $k_{i}^{n}$ and $k_{j}^{n}$ are the feature centers (key point positions) of the $n$-th key point in poses $P_{i}$ and $P_{j}$, respectively.
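The two distances in equations (14) and (15) can be sketched in NumPy as follows. The sketch evaluates the "within $B(k_{i}^{n})$" condition separately for each key point and takes "one tenth" to mean one tenth of the box width and height; both readings, as well as the values of `sigma1`, `sigma2`, and the key point data, are assumptions made for illustration.

```python
import numpy as np

def pose_distance(kp_i, conf_i, kp_j, conf_j, box_i_wh, sigma1):
    """Equation (14): confidence-weighted count of matching key points."""
    half = np.asarray(box_i_wh, dtype=float) / 10.0 / 2.0    # half-size of B(k_i^n)
    inside = np.all(np.abs(kp_j - kp_i) <= half, axis=1)      # k_j^n within B(k_i^n)
    score = np.tanh(conf_i / sigma1) * np.tanh(conf_j / sigma1)
    return float(np.sum(score[inside]))

def spatial_distance(kp_i, kp_j, sigma2):
    """Equation (15): sum of Gaussian-like terms on key point offsets."""
    sq = np.sum((kp_i - kp_j) ** 2, axis=1)
    return float(np.sum(np.exp(-sq / sigma2)))

# Hypothetical two-key-point poses inside a 100 x 200 target box.
kp_a = np.array([[10.0, 10.0], [20.0, 20.0]])
conf_a = np.array([0.9, 0.9])
kp_far = kp_a + 50.0

same_pose = pose_distance(kp_a, conf_a, kp_a, conf_a, (100, 200), sigma1=0.3)
far_pose = pose_distance(kp_a, conf_a, kp_far, conf_a, (100, 200), sigma1=0.3)
same_spatial = spatial_distance(kp_a, kp_a, sigma2=100.0)
far_spatial = spatial_distance(kp_a, kp_far, sigma2=100.0)
print(same_pose, far_pose, same_spatial, far_spatial)
```

Identical poses with confident key points score close to the number of key points in both measures, while poses that are far apart score close to zero.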

The spatio-temporal graph convolutional neural network (ST-GCN) was introduced in the human pose recognition and classification stage of the study. ST-GCN focuses more on the relationship between bone connections in space and the changes in key points of the human body in time series, as shown in Figure 7.

Figure 7.

ST-GCN training general flow chart.

In Figure 7, the convolution kernel extracts features from the input data. In particular, the connections between the blue nodes in the convolution kernel capture time-series information; in video or sequence data, temporal continuity is crucial for tasks such as motion recognition. The connections between the blue nodes and the green nodes represent the spatial connections of the bones. Through the convolution operation, the network learns the relative positions and relationships of the parts of the human body in space. The input data first passes through multiple layers of spatio-temporal graph convolution, which gradually transform the original spatio-temporal structure into more abstract and representative feature maps. These high-level feature maps are finally fed into a standard softmax classifier. Training is end-to-end, which increases efficiency by eliminating intermediate data transfer and format conversion and streamlines data processing and feature extraction into a single process. In this way, ST-GCN can efficiently process time series data, extract key information, and perform accurate classification, ensuring fast model response during inference while maintaining high recognition accuracy. Table 2 lists the mathematical symbols used in the study and their meanings.

Table 2.

Mathematical symbols and their meanings.

Mathematical symbols	Meaning
$K_{h}$	The height of the convolution kernel
$K_{w}$	The width of the convolution kernel
$C_{in}$	The number of input channels of the convolution kernel
$C_{out}$	The number of output channels of the convolution kernel
$v_{c}$	The $c$-th convolution kernel
$x$	Input of the squeeze-and-excitation module
$u_{c}$	Output of the squeeze-and-excitation module
$*$	Convolution operation
$W$	Width
$H$	Height
$u_{c} (i, j)$	Pixel value
$σ$	Sigmoid activation function
$δ$	ReLU activation function
$W_{1}$	Weights of the dimension-reduction connection
$W_{2}$	Weights of the dimension-raising connection
$f^{7 \times 7}$	Convolution operation with a 7×7 convolution kernel
$AvgPool (F)$	Average pooling
$MaxPool (F)$	Maximum pooling
$d_{j}$	The $j$-th target detection box
$(y_{i}, S_{i})$	The projection of the $i$-th trajectory distribution into the measurement space
$r_{i}$	Feature vector
$[γ_{1} γ_{2} γ_{3}]$	The space vector of the inverse transformation
$\{x_{i}^{t}, y_{i}^{t}\}$	Coordinates before the inverse spatial transformation
$\{x_{i}^{s}, y_{i}^{s}\}$	Coordinates after the inverse spatial transformation
$B (k_{i}^{n})$	Target box with center point $k_{i}^{n}$
$k_{i}^{n}$	Feature center

Experimental results

This section first trains the improved Yolov5s model and compares it with the original model and other relevant models in terms of speed, precision, and recall. It then analyzes the HPE process designed in the previous section and compares its training results and errors.

Training results of improved Yolov5 network model

To verify the effectiveness of the proposed method, experiments were conducted on the detection of laboratory personnel. The experimental dataset consists of a laboratory personnel dataset and data collected from the internet, 800 images in total. The dataset was split into a training set and a validation set in a 9:1 ratio. The VOC-type dataset was converted into the COCO dataset format, with the information saved in txt files. To improve the generalization ability and robustness of the model, a comprehensive data augmentation strategy was applied to the 800 images. This strategy includes random rotation (from −30° to 30°), scaling (70%–130% of the original size), cropping, and color transformation (random adjustment of brightness, contrast, and saturation) to simulate different lighting conditions and viewing angles. Furthermore, artificial shadows and highlight areas were introduced into the images to enhance the adaptability of the model to illumination changes. For the occlusion problem, partial and total occlusion of objects was simulated, and the diversity of occlusion scenes was increased by adding other objects to the images. To handle class imbalance, the minority classes were oversampled and the majority classes were undersampled so that the proportion of each class in the training set was approximately the same. The simulation configuration parameters are shown in Table 3.
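The data preparation steps above can be sketched as follows. This is a minimal illustration, not the study's pipeline: the file names and class sizes are hypothetical, the brightness, contrast, and saturation ranges are assumed (the text gives ranges only for rotation and scaling), and resampling every class to the mean class size is just one possible way to equalize class proportions.

```python
import random

def split_train_val(items, train_ratio=0.9, seed=0):
    """Shuffle and split items into training and validation subsets (9:1 by default)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def sample_augmentation(rng):
    """Draw one set of augmentation parameters."""
    return {
        "rotation_deg": rng.uniform(-30.0, 30.0),  # range stated in the text
        "scale": rng.uniform(0.7, 1.3),            # 70%-130% of the original size
        "brightness": rng.uniform(0.8, 1.2),       # assumed range
        "contrast": rng.uniform(0.8, 1.2),         # assumed range
        "saturation": rng.uniform(0.8, 1.2),       # assumed range
    }

def balance_classes(samples_by_class, rng):
    """Resample every class to the mean class size (assumed target)."""
    target = round(sum(len(v) for v in samples_by_class.values()) / len(samples_by_class))
    balanced = {}
    for label, samples in samples_by_class.items():
        if len(samples) >= target:
            balanced[label] = rng.sample(samples, target)                 # undersample
        else:
            extra = rng.choices(samples, k=target - len(samples))         # oversample
            balanced[label] = samples + extra
    return balanced

image_ids = [f"img_{i:04d}.jpg" for i in range(800)]   # hypothetical file names
train_ids, val_ids = split_train_val(image_ids)

rng = random.Random(1)
params = sample_augmentation(rng)

# Hypothetical class sizes within the training set
samples_by_class = {"safe": list(range(600)), "dangerous": list(range(600, 720))}
balanced = balance_classes(samples_by_class, rng)
print(len(train_ids), len(val_ids), params, {k: len(v) for k, v in balanced.items()})
```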

Table 3.

The simulation configuration parameters.

Parameter name	Parameter value
Backbone network	CSPNet
Data set	Laboratory personnel data set
Training strategy	Stochastic gradient descent optimizer, learning rate attenuation
Training duration	800 epochs
Input size	640 × 640 pixels
Batch size	32
Learning rate	Start at 0.001, adjusted for attenuation
Regularization method	L2 regularization
Loss function	Classification, regression, and confidence losses

As Table 3 shows, a CBL risk behavior dataset is used for training and validation. The model is trained with a stochastic gradient descent optimizer and a suitable learning rate attenuation strategy. Training runs for approximately 800 epochs to ensure that the model is fully trained. The input size is set to 640 × 640 pixels to accommodate different monitoring scenarios, and the batch size is set to 32 to balance training speed and memory usage. To prevent over-fitting, L2 regularization is adopted. The loss function is a mixed loss comprising classification loss, regression loss, and confidence loss, so that all aspects of target detection are considered. The comparative experiment covers the improved Yolov5s and the original Yolov5s. The loss curves of the two models over the training iterations are shown in Figure 8.
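The optimizer settings in Table 3 can be illustrated with a small, framework-free sketch of stochastic gradient descent with L2 regularization and learning rate attenuation. The exponential decay schedule, its rate, the L2 coefficient, and the toy quadratic objective are assumptions made for illustration; the study does not specify them.

```python
# Values taken from Table 3
CONFIG = {"epochs": 800, "batch_size": 32, "input_size": (640, 640), "initial_lr": 0.001}

def decayed_lr(initial_lr, epoch, decay_rate=0.999):
    """Exponential learning rate attenuation (assumed schedule and rate)."""
    return initial_lr * decay_rate ** epoch

def sgd_step(weights, grads, lr, l2=5e-4):
    """One SGD update with L2 regularization: w <- w - lr * (g + l2 * w)."""
    return [w - lr * (g + l2 * w) for w, g in zip(weights, grads)]

# Toy objective f(w) = (w - 3)^2 stands in for the detection loss
weights = [0.0]
for epoch in range(CONFIG["epochs"]):
    lr = decayed_lr(CONFIG["initial_lr"], epoch)
    grads = [2.0 * (w - 3.0) for w in weights]
    weights = sgd_step(weights, grads, lr)

print(decayed_lr(CONFIG["initial_lr"], CONFIG["epochs"]), weights)
```

The L2 term adds a pull toward zero proportional to each weight, which is how it discourages over-fitting; in practice an optimizer's weight decay option usually plays this role.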

Figure 8.

Loss function and iterative plot of best fitness.

In Figure 8, the training loss curves of the Yolov5s and improved Yolov5s models are broadly similar. Both converge rapidly within the first 15 iterations, and after 300 iterations the loss settles at around 0.1. By contrast, owing to its larger convolutional structure, Yolov5m performs better overall during the training phase, with faster convergence and lower loss values. The improved Yolov5s model also has an advantage in terms of average fitness. The precision (P) and recall (R) of the improved Yolov5s and Yolov5s are shown in Figure 9.

Figure 9.

Precision and recall of the improved Yolov5s and the Yolov5s.

In Figure 9, P reaches 0.9 after about 10 iterations and then remains at around 0.9. R also converges quickly and increases gradually, stabilizing at around 0.88 after approximately 150 iterations. This shows that the improved Yolov5s learns quickly and achieves high detection precision. To validate the superiority of the improved Yolov5s, it was compared with the unmodified Yolov5s and Yolov5m. Table 4 lists the specific results.

Table 4.

Yolov5s series metrics comparison.

Model	Time T (s)	Number of parameters	Floating point operations (GFLOPs)	P (%)	R (%)
Yolov5s	0.037	7125462	15.8	90.41	89.73
Improved Yolov5s	0.022	5963274	10.1	97.68	96.79
Yolov5m	0.071	25473168	51.6	98.44	98.36
Improved Yolov5m	0.039	14362019	27.5	92.78	92.12

From Table 4, the improved Yolov5s model clearly outperforms the original Yolov5s overall. Compared with Yolov5s, the improved Yolov5s reduces the number of parameters by about 16% (from 7,125,462 to 5,963,274) and shortens the processing time by 0.015 s. Its P is 7.27 percentage points higher than that of Yolov5s, and its R is 7.06 percentage points higher. The improved Yolov5s thus achieves higher detection accuracy than the original model, and its performance is close to that of the larger Yolov5m model. The detection accuracy of the improved Yolov5s algorithm (Method 1) is also compared with currently popular and advanced target detection methods: a target detection framework based on Detectron2 (Method 2), a target detection method based on CenterNet (Method 3), a target detection algorithm based on EfficientDet (Method 4), and a single-stage target detection algorithm based on RetinaNet (Method 5), as shown in Figure 10.
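The comparison between Yolov5s and the improved Yolov5s can be recomputed directly from the values in Table 4. The short sketch below does only that arithmetic; the dictionary layout and variable names are illustrative.

```python
# Rows copied from Table 4
table4 = {
    "Yolov5s": {"time_s": 0.037, "params": 7125462, "gflops": 15.8, "P": 90.41, "R": 89.73},
    "Improved Yolov5s": {"time_s": 0.022, "params": 5963274, "gflops": 10.1, "P": 97.68, "R": 96.79},
}

base = table4["Yolov5s"]
improved = table4["Improved Yolov5s"]

# Relative reduction in parameter count, in percent
param_reduction_pct = 100.0 * (base["params"] - improved["params"]) / base["params"]
# Absolute differences
time_saved_s = base["time_s"] - improved["time_s"]
p_gain_points = improved["P"] - base["P"]
r_gain_points = improved["R"] - base["R"]

print(f"parameter reduction: {param_reduction_pct:.1f}%")   # about 16.3%
print(f"time saved: {time_saved_s:.3f} s")                   # 0.015 s
print(f"P gain: {p_gain_points:.2f} percentage points")      # 7.27
print(f"R gain: {r_gain_points:.2f} percentage points")      # 7.06
```

The gains in P and R are differences between two percentages, so they are percentage points rather than relative improvements.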

Figure 10.

Comparison of detection accuracy of five methods.

In Figure 10, the detection accuracy of Method 1 reaches 93%, higher than that of the other four methods, indicating that the improved Yolov5s algorithm has high detection accuracy and stability. The performance comparison of different models is shown in Table 5. For this comparison, the data set is divided into 70% training, 15% validation, and 15% test sets, and the split is stratified so that each subset represents the class distribution of the entire data set.
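A 70%/15%/15% division that preserves the class distribution in every subset can be sketched as below. The class names and class sizes are hypothetical, and the function `stratified_split` is an illustrative helper rather than the procedure used in the study.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split sample indices into train/val/test so each class keeps the same proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * ratios[0])
        n_val = round(len(indices) * ratios[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # remainder goes to the test set
    return train, val, test

labels = ["safe"] * 600 + ["dangerous"] * 200   # hypothetical labels for 800 images
train, val, test = stratified_split(labels)

def dangerous_share(indices):
    return sum(labels[i] == "dangerous" for i in indices) / len(indices)

print(len(train), len(val), len(test))
print(dangerous_share(train), dangerous_share(val), dangerous_share(test))
```

Splitting each class separately and then concatenating the parts is what keeps the share of every class identical across the three subsets.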

Table 5.

Performance comparison of different models.

Models	Precision (%)	Recall rate (%)	F1 score	mAP (%)	AUC-ROC
Yolov5s	90.41	89.73	0.90	75.6	0.89
Improved Yolov5s	97.68	96.79	0.97	85.2	0.95
Yolov5m	98.44	98.36	0.98	88.5	0.96
Improved Yolov5m	92.78	92.12	0.92	80.3	0.91

Table 5 shows that the improved Yolov5s model outperforms the original Yolov5s model on all key indicators, with notable gains in F1 score and mAP, which reach 0.97 and 85.2%, respectively. This indicates that the improved model achieves a better balance between precision and recall and a clearly higher mean average precision in target detection. The improved Yolov5s model also reaches an AUC-ROC of 0.95, demonstrating strong performance in classification tasks. Compared with the Yolov5m model, there is a slight gap in precision and recall, but the F1 score and mAP are similar. The improved Yolov5s model therefore has fewer parameters and a faster processing speed while maintaining comparable performance. These results further support the validity and practicability of the improved model.
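The F1 scores in Table 5 follow from the precision and recall columns through the harmonic mean, F1 = 2PR / (P + R). The sketch below recomputes them, treating the first numeric column of Table 5 as precision P, as in Table 4; the dictionary name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# (precision %, recall %) pairs copied from Table 5
table5 = {
    "Yolov5s": (90.41, 89.73),
    "Improved Yolov5s": (97.68, 96.79),
    "Yolov5m": (98.44, 98.36),
    "Improved Yolov5m": (92.78, 92.12),
}

# F1 is reported on a 0-1 scale, so divide the percentage result by 100
f1_by_model = {model: round(f1_score(p, r) / 100.0, 2) for model, (p, r) in table5.items()}
for model, f1 in f1_by_model.items():
    print(model, f1)
```

Because the harmonic mean is dominated by the smaller of the two values, a high F1 score requires both precision and recall to be high, which is why it is read as a measure of their balance.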

HPE experimental analysis

In the training phase of HPE, the training videos contain many actions whose information changes rapidly within 1 second. The video frame rate in this experiment is 25 frames per second, that is, 25 images per second. The time-series parameter is therefore set to 25 frames, which are used to infer the action in the next frame. The training process is shown in Figure 11.
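The 25-frame time-series setting can be pictured as a sliding window over each video: every run of 25 consecutive frames is paired with the action of the frame that follows. The sketch below is a minimal illustration with integer stand-ins for per-frame keypoints and hypothetical action labels.

```python
def make_windows(frames, labels, window=25):
    """Pair each run of `window` consecutive frames with the label of the next frame."""
    samples = []
    for start in range(len(frames) - window):
        history = frames[start:start + window]
        target = labels[start + window]
        samples.append((history, target))
    return samples

fps = 25
frames = list(range(3 * fps))               # stand-ins for 3 s of per-frame keypoints
labels = ["stand"] * 50 + ["reach"] * 25    # hypothetical per-frame action labels
samples = make_windows(frames, labels, window=fps)

print(len(samples))            # number of (history, target) training pairs
print(samples[0][1], samples[-1][1])
```

With a window equal to the frame rate, each training sample covers exactly one second of motion.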

Figure 11.

Training phase loss function and accuracy.

In Figure 11, the loss function value of the training set levels off after the 25th iteration and ultimately stabilizes at around 0.19. The loss function value of the validation set stabilizes after the 75th iteration, with small fluctuations, at around 0.99. Overall, the model converges quickly and stably during the training phase. Next, the models before and after improvement were used to predict the final 20 sets of sample data, and the prediction results are shown in Figure 12.

Figure 12.

Human posture estimation.

In Figure 12, the prediction error of the model before improvement is relatively large, while the predictions of the improved model are basically consistent with the actual results. The improved model therefore has better application value and can accurately evaluate human posture. The accuracy of the HPE in this experiment (M1) is compared with that of currently popular HPE models: HPE based on a lightweight high-resolution network (M2), HPE based on an improved hourglass network (M3), 3D HPE based on self adjusting graph convolution Unset (M4), and 3D HPE based on an adaptive perceptron combination network (M5). The comparison of precision and recall is shown in Figure 13.

Figure 13.

Comparison of precision and recall rates of five models.

In Figure 13, M1 has the highest precision and recall, stabilizing at 97.68% and 96.79%, respectively, and it also converges in the fewest iterations. The precision and recall of M2, M3, M4, and M5 ultimately stabilize at over 80%. Therefore, the DBD method built on the improved Yolov5s and HPE in this experiment is easy to train and highly accurate, and is well suited to CBL safety management.

Model limitation

To evaluate the model more comprehensively for safety-critical applications such as laboratory safety management, the study also reports frame rates (FPS) for both single-image and batch inference and conducts ablation experiments, as shown in Table 6.

Table 6.

Model performance and ablation experiment results.

Model type	Yolov5s	Improved Yolov5s	Improved Yolov5s (no CBAM)	Improved Yolov5s (no convolution improvement)	Yolov5m	Improved Yolov5m
FPS (single image)	30	35	32	33	20	25
FPS (batch processing)	100	110	105	107	80	90
Precision (%)	90.41	97.68	96.54	97.12	98.44	92.78
Recall rate (%)	89.73	96.79	95.43	96.34	98.36	92.12
F1 score	0.90	0.97	0.96	0.97	0.98	0.92
mAP (%)	75.6	85.2	83.1	84.5	88.5	80.3
AUC-ROC	0.89	0.95	0.94	0.94	0.96	0.91

In Table 6, the improved Yolov5s model outperforms the original Yolov5s model in both single-image and batch inference speed, reaching 35 FPS and 110 FPS, respectively. This indicates that the improved model not only maintains high detection accuracy but also enhances computational efficiency. The complete improved Yolov5s model performs well on all key indicators, with precision, recall, and F1 score reaching 97.68%, 96.79%, and 0.97, respectively. The ablation results show that removing either the CBAM attention mechanism or the convolution improvement leads to a slight decline in performance, although both ablated variants still outperform the original Yolov5s model, indicating that both components contribute positively to model performance. The CBAM module matters most: removing it lowers precision by 1.14 percentage points, recall by 1.36 percentage points, the F1 score by 0.01, and mAP by 2.1 percentage points.
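The ablation figures can be checked by subtracting each ablated column of Table 6 from the complete improved Yolov5s column. The sketch below performs that subtraction; the dictionary keys and the helper `metric_drop` are illustrative names.

```python
# Columns copied from Table 6
full = {"precision": 97.68, "recall": 96.79, "f1": 0.97, "map": 85.2}
no_cbam = {"precision": 96.54, "recall": 95.43, "f1": 0.96, "map": 83.1}
no_conv = {"precision": 97.12, "recall": 96.34, "f1": 0.97, "map": 84.5}

def metric_drop(full_model, ablated_model):
    """Decrease in each metric when a component is removed."""
    return {name: round(full_model[name] - ablated_model[name], 2) for name in full_model}

drop_without_cbam = metric_drop(full, no_cbam)
drop_without_conv = metric_drop(full, no_conv)

print("without CBAM:", drop_without_cbam)
print("without convolution improvement:", drop_without_conv)
```

The drop is larger in every metric when CBAM is removed than when the convolution improvement is removed, which matches the reading that CBAM contributes more to the gain.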

Conclusion

In response to the various hidden dangers in CBL safety, this study proposed a DBD method that combines the improved Yolov5s with HPE and applied it to CBL safety management. The experiments showed that the precision of the improved Yolov5s reached 0.9 after about 10 iterations and remained at around 0.9, while the recall also converged quickly, rising gradually and stabilizing at around 0.88 after about 150 iterations. HPE achieved a high accuracy of 99% in this experiment. Combining the improved Yolov5s with HPE significantly improved the accuracy and effectiveness of DBD. The method can therefore detect and warn of dangerous behaviors quickly and accurately, which helps to reduce the probability of accidents and the risk of casualties and property damage, and it has practical application value for CBL safety management. Although this study has improved the Yolov5s model, its generalization ability still requires further enhancement for diverse scenarios and data distributions: the model may suffer performance degradation in new environments and situations, which calls for further research. Future research could combine other modalities, such as audio and text, with visual information to enhance the accuracy and reliability of risky behavior detection through multi-modal fusion.

Footnotes

ORCID iD

Yinan Guo

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Appendix

References

1. Lauer, Zhou, et al. Multi-animal pose estimation, identification and tracking with DeepLabCut. Nat Methods 2022; 19(4): 496–504.

2. Vehicle target detection algorithm based on YOLOv5. Frontiers in Computing and Intelligent Systems 2023; 3(1): 56–59.

3. Nanda. An integrated campus radio frequency identification system on clouds analysis for improved security. Acta Electron Malays 2022; 6(1): 04–06.

4. Ziegler, Sturman, Bohacek. Big behavior: challenges and opportunities in a new era of deep behavior profiling. Neuropsychopharmacology 2021; 46(1): 33–44.

5. Vikalwe Shakrani, Mathew Kanyangarara, Parowa, et al. A deep learning model for face recognition in presence of mask. Acta Inform Malays 2022; 6(2): 43–46.

6. Wen, Liu, et al. YOLOv5s-CA: a modified YOLOv5s network with coordinate attention for underwater target detection. Sensors 2023; 23(7): 3367.

7. Dai, Zhao, et al. GCD-YOLOv5: an armored target recognition algorithm in complex environments based on array lidar. IEEE Photon J 2022; 14(4): 1–11.

8. Tan, Yan, Jiang, et al. Approach for improving YOLOv5 network with application to remote sensing target detection. J Appl Remote Sens 2021; 15(3): 036512.

9. Meng, Fang. Hand target detection based on improved YOLOv5. Int J Wireless Mobile Comput 2023; 25(4): 353–361.

10. Shen, Zhang, Yan, et al. An improved UAV target detection algorithm based on ASFF-YOLOv5s. Math Biosci Eng 2023; 20(6): 10773–10789.

11. Zhang, Wang, et al. Research on mine vehicle tracking and detection technology based on YOLOv5. Systems Science & Control Engineering 2022; 10(1): 347–366.

12. Carpenter KLH, Hashemi, Campbell, et al. Digital behavioral phenotyping detects atypical pattern of facial expression in toddlers with autism. Autism Res 2021; 14(3): 488–499.

13. Pereira, Tabris, Matsliah, et al. SLEAP: a deep learning system for multi-animal pose tracking. Nat Methods 2022; 19(4): 486–495.

14. Lv, Hang, et al. Data-driven estimation of driver attention using calibration-free eye gaze and scene features. IEEE Trans Ind Electron 2021; 69(2): 1800–1808.

15. Chen, Fang, Shen, et al. Anatomy-aware 3D human pose estimation with bone-based pose decomposition. IEEE Trans Circ Syst Video Technol 2021; 32(1): 198–209.

16. Dong, Fang, Jiang, et al. Fast and robust multi-person 3D pose estimation and tracking from multiple views. IEEE Trans Pattern Anal Mach Intell 2021; 44(10): 6981–6992.

17. Zheng, Chen, et al. Deep learning-based human pose estimation: a survey. ACM Comput Surv 2023; 56(1): 1–37.

18. Fang, Luo, Zhao, et al. Spatial‐temporal semantics and interaction graph aggregation for multi‐agent perception and trajectory forecasting. CAAI Trans Intell Technol 2022; 7(4): 744–757.

19. Zhou, Gou, Chen, et al. Analysis of small target detection algorithm based on SSD and YOLOv5. Academic Journal of Computing & Information Science 2023; 6(6): 73–79.

20. Han, Yuan, Wang, et al. UAV dense small target detection algorithm based on YOLOv5s. J Zhejiang University (Engin Sci) 2023; 57(6): 1224–1233.

21. Huang. A comparative study of underwater marine products detection based on YOLOv5 and underwater image enhancement. Int Core J Eng 2021; 7(5): 213–221.

22. Schad, Fischer. Opportunities and risks in the use of drones for studying animal behavior. Methods Ecol Evol 2023; 14(8): 1864–1872.

23. Guo, Mustafaoglu, Koundal. Spam detection using bidirectional transformers and machine learning classifier algorithms. J Comput Cogn Eng 2022; 2(1): 5–9.