Small-target smoking detection algorithm based on improved YOLOv5

Abstract

The use of general target detection algorithms for small-target detection is computationally costly and has a high missed detection rate. A lightweight small-target detection model based on YOLOv5 is proposed to address this issue. First, a maximum pooling layer is introduced to reduce the number of calculations. Second, Shuffle_Conv is designed to replace the ordinary convolutional layer to reduce model parameters. To further compress the model, the Add fusion method is used in the C3 module, and the GAC3 layer is designed with GhostNet. Finally, Mosaic_9 is introduced to improve small-target detection without increasing the number of calculations. Compared with YOLOv5, the computation and parameters of the improved model are reduced by 84.9% and 39.1%, respectively, and the accuracy is improved by 2%, so its small-target detection effect is more pronounced than that of the original model.

Keywords

Object detection, YOLOv5, lightweight, small targets

1. Introduction

Smoking is harmful to health and poses a serious fire risk, which may result in fatalities and economic losses. The large number of smokers calls for proper management and control of non-smoking public places, which can be achieved using smoking detection algorithms. However, these algorithms fail to detect smoking when the target occupies a small area because they are unable to extract effective features of small targets. Traditional smoking detection methods include smoke sensor detection and manual detection, but both have drawbacks. For instance, the infrastructural requirements for using smoke sensors are high, and in large public places smoke sensors have a limited effect and struggle to function [1, 2]. Similarly, the detection time of traditional methods is considerably limited, as detection cannot be performed round the clock. Therefore, traditional detection methods are difficult to apply in practice.

As machine learning advances, new model architectures based on convolutional neural networks are constantly being created. A series of basic network models, such as AlexNet, VGG, ResNet, and DenseNet [3], have achieved desirable accuracy. However, the size and computational load of these models make fast detection difficult. Consequently, creating a lightweight but highly accurate model has become a crucial research focus for smoking detection. SqueezeNet [4] uses 1 $\times$ 1 convolution kernels and a squeeze layer $+$ expand layer structure to reduce the weight parameters. MobileNet [5] uses depthwise separable convolution to reduce computation, and ShuffleNet [6] integrates channel shuffle technology to improve the model performance.

Target detection algorithms have evolved from the initial two-stage algorithms, such as Faster R-CNN [7], to the later one-stage algorithms, such as SSD [8]. The YOLO series [9, 10, 11, 12], up to the development of YOLOv5 in 2020, has been widely known for its ultra-fast detection speed and has become synonymous with target detection algorithms. YOLOv5 was proposed shortly after YOLOv4; the two algorithms are similar, and YOLOv5, which has four versions, can be regarded as an improved version of YOLOv4. Owing to the difference in the number of residual components in the initial CSP module of the network structure, the YOLOv5 versions have different network depths and widths [13]. YOLOv5s is the lightest in the YOLOv5 series; thus, YOLOv5s was selected as the framework for detecting smoking behavior in this study. The original YOLOv5s requires 15.8 GFLOPs, showing that its number of parameters and calculations is still large [14] despite the algorithm’s relative advantages for target detection. To reduce the model weight and improve its accuracy, the following approach was used in this study. (1) A Conv $+$ BN $+$ ReLU layer followed by a maximum pooling layer was used to perform the initial downsampling operation on the input image. (2) The convolutional layer in the original model backbone was replaced by a combination of depthwise separable convolution and channel shuffle technology. (3) To reduce the weight of the C3 module, “Concat” was replaced with “Add” and then combined with GhostNet [15]. Because detecting smoking behavior is a small-target detection task, the detection ability of the model needed to be improved without increasing the number of computations. Thus, the data enhancement method of the model was changed to Mosaic_9 [16] to improve the model’s detection of small targets.

2. Related technologies

2.1 YOLOv5

In addition to its four basic frameworks, the YOLOv5 algorithm has been iterated through different versions, up to v6.2. Each version improves on the previous one in weight and accuracy [17]. From v4.0, the BottleneckCSP module is replaced with the C3 module to eliminate the first convolution of each bottleneck structure for higher inference speed. In v6.0, the focus layer is replaced by the CBL layer, and the C3 module of the backbone is reduced to 6. Further, the computational cost is reduced through several changes, such as using the SPPF layer instead of the SPP layer. YOLOv5v6.0 has the lowest computational cost and higher accuracy. After comprehensively comparing all versions of YOLOv5, we selected YOLOv5v6.0 as the benchmark experimental model in this study. The overall framework of the YOLOv5s model is divided into four sections (Fig. 1): input, backbone, neck, and prediction. Hereafter, “YOLOv5” represents “YOLOv5sv6.0.”

Figure 1.

YOLOv5sv6.0 framework.

YOLOv5 uses Mosaic [18] data enhancement at the input, randomly cropping, scaling, and stitching four images from the training set. Algorithms prior to YOLOv5 lack adaptive anchor boxes and adaptive image scaling functions. During training, YOLOv5 outputs the predicted box according to the anchor box parameters set in advance; the predicted box is then compared with the ground-truth box to iteratively update the model parameters. When a picture is input, commonly used target detection methods scale it to a fixed size. Owing to the different lengths and widths of the input images, the amount of padding also differs, and large black borders may affect the detection effect. The adaptive image scaling function solves this problem by adding only the minimum black border required, which improves the detection speed.
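The adaptive image scaling described above can be illustrated with a minimal sketch. It assumes a letterbox-style procedure in which the image is scaled to fit a square input and padded only up to the next multiple of the network stride; the function name `letterbox_shape` and the default values (640-pixel input, stride 32) are illustrative assumptions, not taken from the paper.

```python
def letterbox_shape(h, w, new_size=640, stride=32, auto=True):
    """Return (resized_h, resized_w, pad_h, pad_w) for an h x w image.

    auto=True  -> pad only up to the next multiple of `stride` (adaptive scaling)
    auto=False -> pad all the way to a new_size x new_size square (fixed scaling)
    """
    r = min(new_size / h, new_size / w)        # scale so the longer side fits
    new_h, new_w = round(h * r), round(w * r)  # resized image, aspect ratio kept
    pad_h, pad_w = new_size - new_h, new_size - new_w
    if auto:
        # keep only the remainder needed to reach a stride-aligned shape
        pad_h, pad_w = pad_h % stride, pad_w % stride
    return new_h, new_w, pad_h, pad_w


# A 1080 x 1920 frame: fixed scaling pads 280 rows, adaptive scaling only 24.
fixed = letterbox_shape(1080, 1920, auto=False)
adaptive = letterbox_shape(1080, 1920, auto=True)
```

In this example the adaptive variant yields a 384 $\times$ 640 network input instead of 640 $\times$ 640, which is where the saving in computation comes from.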

In the backbone section, the image is first passed through a CBL module with a 6 $\times$ 6 convolution kernel. The image of the input end 640 $\times$ 640 $\times$ 3 is changed to 320 $\times$ 320 $\times$ 12, and the feature map is changed to 320 $\times$ 320 $\times$ 32 after passing through 32 convolution kernels. In the C3 module, the feature map changes from 640 $\times$ 640 to 20 $\times$ 20 after passing through five 3 $\times$ 3 convolution kernels with a step size of 2, each of which halves the width and height. Unlike YOLOv4, YOLOv5 places SPPF in the backbone to fix the output size of the feature map.

In the neck section, YOLOv5 uses the FPN $+$ PAN structure for network fusion [19]. FPN performs top-down upsampling operations to enhance high-level semantic features. As FPN cannot extract positioning information, PAN performs bottom-up downsampling operations, playing a complementary role to FPN.

In the prediction section, YOLOv5 outputs feature maps at three scales. Typically, the 20 $\times$ 20 feature map is used to detect large objects, the 40 $\times$ 40 feature map is used to detect medium-sized objects, and the 80 $\times$ 80 feature map is used to detect small objects. GIOU_LOSS is used as the bounding-box loss function, and NMS is used to filter the target boxes.
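As a minimal sketch of the two operations named above, the following computes the generalized IoU (GIoU) of two axis-aligned boxes and applies greedy NMS. Boxes are assumed to be `(x1, y1, x2, y2)` tuples with positive area; the function names and the 0.5 threshold are illustrative assumptions rather than the settings used in the experiments.

```python
def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def intersection(a, b):
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return iw * ih


def iou(a, b):
    inter = intersection(a, b)
    return inter / (area(a) + area(b) - inter)


def giou(a, b):
    """IoU minus the fraction of the smallest enclosing box not covered by the union."""
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    enclosing = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (enclosing - union) / enclosing


def giou_loss(a, b):
    return 1.0 - giou(a, b)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Unlike plain IoU, GIoU remains informative for non-overlapping boxes: it becomes more negative as the boxes move apart, so the loss still provides a training signal.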

2.2 CBR_Maxpooling layer

The focus layer [20] was proposed for YOLOv5 to slice and stitch the feature maps. The focus layer acts as a special downsampling method, which reduces the dimension of the feature map and avoids over-fitting. Subsequently, v6.0 replaced the focus layer with the CBL layer. Although the computational cost of the model was reduced, the CBL layer neither reduced the dimension nor removed redundant information. Because pooling the feature map reduces both its dimension and the number of calculations, a maximum pooling layer was introduced, and the CBR_Maxpooling layer replaced the CBL layer to improve the model performance. The CBR_Maxpooling layer comprises convolution $+$ batch normalization $+$ ReLU activation function (CBR), with a maximum pooling layer added after the CBR operation.
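The downsampling effect of the maximum pooling step can be sketched on a single-channel feature map stored as a nested list. The 2 $\times$ 2 window and stride of 2 are assumptions made for illustration; the pooling parameters are not stated in the text.

```python
def max_pool2d(fmap, k=2, stride=2):
    """Max pooling over a 2-D list; each output value is the maximum of a k x k window."""
    h, w = len(fmap), len(fmap[0])
    return [
        [max(fmap[i + di][j + dj] for di in range(k) for dj in range(k))
         for j in range(0, w - k + 1, stride)]
        for i in range(0, h - k + 1, stride)
    ]


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool2d(feature_map)  # 4 x 4 input -> 2 x 2 output
```

With these assumed settings the width and height are halved, so every later layer processes a quarter of the spatial positions.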

2.3 Shuffle_Conv

Currently, several lightweight networks reduce the number of computations by replacing ordinary convolution layers with group convolution or depthwise separable convolution. However, because of the dispersion of channels, many convolution kernels are used for their respective channels, resulting in incomplete image feature information; moreover, no connection exists between channels, which not only increases the number of computations but also causes a loss of effective information. To address this problem, channel shuffling was introduced in ShuffleNet [21]. After group convolution, channel shuffling is used to flatten the features in the form of tensors to realize information interaction between channels, as shown in Fig. 2.

Figure 2.

Channel shuffle process.
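The reshape, transpose, and flatten steps of channel shuffling can be sketched on a plain list of channel labels. This is a simplified illustration with hypothetical names; in a real network the same index permutation is applied to the channel axis of a tensor.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups: reshape -> transpose -> flatten."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    # reshape: one row per group
    grid = [channels[g * per_group:(g + 1) * per_group] for g in range(groups)]
    # transpose and flatten: read the grid column by column
    return [grid[g][i] for i in range(per_group) for g in range(groups)]


shuffled = channel_shuffle([1, 2, 3, 4, 5, 6, 7, 8, 9], groups=3)
```

For nine channels in three groups the result is `[1, 4, 7, 2, 5, 8, 3, 6, 9]`, so each group of three consecutive output channels now contains one channel from every input group.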

As shown in Fig. 2, the channels are first divided into three groups and numbered 1 to 9. The one-dimensional channel sequence is reshaped into an array with one row per group, after which the array is transposed without increasing the computational cost. This interleaves the channels of different groups and thus realizes information interaction between them; the flatten operation then restores the original dimension, re-splicing the channels into a new feature map. Several lightweight networks use depthwise separable convolutions. The efficiency of a network model depends on its memory access cost (MAC) as well as its FLOPS. For instance, pointwise convolution is computationally costly. Let $c_{1}$ and $c_{2}$ be the numbers of input and output channels of the 1 $\times$ 1 convolution kernel, and let $w$ and $h$ be the width and height of the feature map. Then $\textit{FLOPS}=hwc_{1}c_{2}$ and $\textit{MAC}=hw(c_{1}+c_{2})+c_{1}c_{2}$ . Setting $B=\textit{FLOPS}$ gives the lower bound in Eq. (1). For fixed $B$ , MAC takes its minimum value if and only if $c_{1}=c_{2}$ , and the model efficiency is then the highest.

$\displaystyle\textit{MAC}\geqslant 2\sqrt{hwB}+\frac{B}{hw}$ (1)
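The bound in Eq. (1) follows from the inequality of arithmetic and geometric means; a short derivation using only the definitions above is sketched here. Since $B=hwc_{1}c_{2}$ , we have $c_{1}c_{2}=B/(hw)$ , and therefore

$\displaystyle hw(c_{1}+c_{2})\geqslant 2hw\sqrt{c_{1}c_{2}}=2hw\sqrt{\frac{B}{hw}}=2\sqrt{hwB}$

Adding $c_{1}c_{2}=B/(hw)$ to both sides gives $\textit{MAC}\geqslant 2\sqrt{hwB}+B/(hw)$ , with equality exactly when $c_{1}=c_{2}$ .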

ShuffleNet has two versions: ShuffleNetV1 and ShuffleNetV2. Depthwise convolution (DWConv) is an extreme case of group convolution (GConv). Excessive use of GConv increases the model capacity but consumes more MAC, as shown in Eq. (2), where $g$ is the number of groups and $B=\textit{FLOPS}=hwc_{1}c_{2}/g$ .

$\displaystyle\textit{MAC}=hw(c_{1}+c_{2})+\frac{c_{1}c_{2}}{g}=hwc_{1}+\frac{Bg}{c_{1}}+\frac{B}{hw}$ (2)
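Eq. (2) can be checked numerically. The sketch below holds the input shape and the FLOPS budget $B$ fixed and increases the number of groups $g$, enlarging $c_{2}$ so that $B$ stays constant; the feature-map size and channel counts are arbitrary illustrative values, not figures from the experiments.

```python
def gconv_cost(h, w, c1, c2, g=1):
    """FLOPS (B) and memory access cost (MAC) of a 1 x 1 group convolution."""
    flops = h * w * c1 * c2 // g
    mac = h * w * (c1 + c2) + c1 * c2 // g
    return flops, mac


def mac_closed_form(h, w, c1, flops, g):
    """Right-hand side of Eq. (2): hw*c1 + B*g/c1 + B/(hw)."""
    return h * w * c1 + flops * g / c1 + flops / (h * w)


h, w, c1 = 28, 28, 128
results = []
for g in (1, 2, 4):
    c2 = 128 * g                      # enlarging c2 keeps B constant as g grows
    flops, mac = gconv_cost(h, w, c1, c2, g)
    results.append((g, flops, mac))
```

With $B$ fixed, MAC grows from 217,088 at $g=1$ to 518,144 at $g=4$ in this example, which illustrates why heavy use of group convolution is costly in memory access.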

ShuffleNetV2 reduces the use of group convolution and replaces group convolution with ordinary convolution. As shown in Fig. 3, the feature map is processed using ordinary convolution $+$ depthwise convolution $+$ channel shuffling, and Shuffle_Conv is used to replace the convolution layer in the original model backbone.

Figure 3.

Ordinary convolution $+$ depthwise convolution $+$ channel shuffle process.

2.4 Lightweight C3

C3 is implemented on the structure of CSPNet. The CSP structure divides the feature map into two branches and fuses the feature information across stages. Fusion methods are of two main types, Concat and Add, both of which integrate feature map information. CSP [22] uses Concat to fuse the information of the output layer with the features extracted by convolution: channels are merged, which increases the number of feature dimensions. Add only increases the amount of information: the number of dimensions does not change, but the amount of information in each dimension increases. Therefore, Add can be regarded as a special form of Concat, where the number of calculations is half that of Concat. Accordingly, we replaced Concat in C3 with Add and designed AC3 to replace the C3 layer in the backbone part.
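The difference between the two fusion methods can be sketched with feature maps stored as lists of channels. The halving of computation is illustrated under one assumption of this sketch: the fused map is followed by a 1 $\times$ 1 convolution, whose multiplication count scales with the number of input channels. All names are hypothetical.

```python
def concat_channels(a, b):
    """Concat fusion: channel lists are stacked, so C + C = 2C channels."""
    return a + b


def add_channels(a, b):
    """Add fusion: element-wise sum, so the channel count stays at C."""
    return [[x + y for x, y in zip(ch_a, ch_b)] for ch_a, ch_b in zip(a, b)]


def pointwise_mults(num_pixels, c_in, c_out):
    """Multiplications of a 1 x 1 convolution applied after the fusion."""
    return num_pixels * c_in * c_out


branch_a = [[1, 2], [3, 4]]      # 2 channels, 2 pixels each
branch_b = [[10, 20], [30, 40]]
fused_cat = concat_channels(branch_a, branch_b)  # 4 channels
fused_add = add_channels(branch_a, branch_b)     # 2 channels

cost_cat = pointwise_mults(2, len(fused_cat), 8)
cost_add = pointwise_mults(2, len(fused_add), 8)
```

Under this assumption the layer after Add needs half the multiplications of the layer after Concat, because it sees C rather than 2C input channels.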

To avoid the accumulation of convolution layers, AC3 is combined with GhostNet, which was proposed at CVPR 2020 to generate numerous feature maps with limited computing resources. The usual approach to improving accuracy is to increase the number of convolution layers and parameters; although this improves accuracy, it is resource-intensive. In the conventional convolution operation, the calculation amount is $w^{\prime}\times h^{\prime}\times c\times k\times k\times n$ , where $c$ and $n$ are the numbers of input and output channels, $k\times k$ is the kernel size, and $w^{\prime}\times h^{\prime}$ is the size of the output feature map.

The concept of a ghost module is proposed in GhostNet, comprising conventional convolution, ghost feature map generation, and feature map splicing.

(1) After the conventional convolution operation, $m$ intrinsic feature maps are obtained as $Y^{\prime}=X*f^{\prime}$ , where $Y^{\prime}\in R^{h^{\prime}\times w^{\prime}\times m}$ , $X\in R^{c\times h\times w}$ , and $f^{\prime}\in R^{c\times k\times k\times m}$ . With $m=n/s$ , the number of calculations in this process is $h^{\prime}\times w^{\prime}\times m\times c\times k\times k$ , which is the first term in the denominator of Eq. (4).

(2) As shown in Eq. (3), the ghost feature maps $y_{ij}$ are obtained by applying cheap linear operations $\phi_{i,j}$ to each intrinsic feature map $y^{\prime}_{i}$ of $Y^{\prime}$ obtained through conventional convolution.

$\displaystyle y_{ij}=\phi_{i,j}(y^{\prime}_{i}),\quad\forall i=1,\ldots,m,\; j=1,\ldots,s$ (3)

(3) The intrinsic feature maps obtained in step (1) are kept through an identity mapping and concatenated with the ghost feature maps obtained in step (2) to obtain the final output. We compare the ghost module with ordinary convolution, with the kernel of the linear operations set as $d\times d$ (3 $\times$ 3 or 5 $\times$ 5 is recommended). The resulting speed-up ratio $r$ is given by Eq. (4).

$\displaystyle r=\frac{n\cdot h^{\prime}\cdot w^{\prime}\cdot c\cdot k\cdot k}{\frac{n}{s}\cdot h^{\prime}\cdot w^{\prime}\cdot c\cdot k\cdot k+(s-1)\cdot\frac{n}{s}\cdot h^{\prime}\cdot w^{\prime}\cdot d\cdot d}=\frac{c\cdot k\cdot k}{\frac{1}{s}\cdot c\cdot k\cdot k+\frac{s-1}{s}\cdot d\cdot d}\approx\frac{s\cdot c}{s+c-1}\approx s$ (4)

Because $d\times d$ has a magnitude similar to that of $k\times k$ and $s\ll c$ , the cost of ordinary convolution is approximately $s$ times that of the ghost module. As shown in Fig. 4, the convolution layer in the AC3 layer is replaced with GhostConv, the Bottleneck is replaced with GhostBottleneck, and GAC3 replaces C3.
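The speed-up ratio in Eq. (4) can be evaluated directly. The sketch below counts the multiplications of an ordinary convolution and of a ghost module under the same symbols ($n$, $c$, $k$, $d$, $s$, $h^{\prime}$, $w^{\prime}$); the concrete layer sizes are illustrative assumptions, not values from the proposed network.

```python
def conv_cost(n, h_out, w_out, c, k):
    """Ordinary convolution: n * h' * w' * c * k * k."""
    return n * h_out * w_out * c * k * k


def ghost_cost(n, h_out, w_out, c, k, s, d):
    """Ghost module: n/s intrinsic maps by convolution plus (s-1)*n/s cheap d x d maps."""
    m = n // s
    primary = m * h_out * w_out * c * k * k
    cheap = (s - 1) * m * h_out * w_out * d * d
    return primary + cheap


n, c, k, d, s = 256, 128, 3, 3, 2
ratio = conv_cost(n, 40, 40, c, k) / ghost_cost(n, 40, 40, c, k, s, d)
approx = s * c / (s + c - 1)  # closed form from Eq. (4), exact when d == k
```

For these values the ratio is about 1.98, close to $s=2$, which matches the approximation $r\approx s$ when $s\ll c$.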

Figure 4.
GAC3 framework.

3. YOLOv5 improved model

YOLOv5 is widely used today; for example, many engineering applications use it for target detection tasks [23]. However, from the perspective of current research and application, YOLOv5 can still be further optimized [24]. In Table 1, the From column gives the layer whose output is used as the input of the current layer, and Arguments lists the number of input and output channels, convolutional kernel size, step size, and other details for each layer. Table 1 shows that this study mainly optimizes the backbone part of YOLOv5.

Table 1
Improved overall structure of YOLOv5

Number	From	Params	Module	Arguments
0	-1	928	CBR_maxpool	[3,32]
1	-1	3968	Shuffle_Conv	[32,64,2]
2	-1	6136	GAC3	[64,64]
3	-1	14080	Shuffle_Conv	[64,128,2]
4	-1	38880	GAC3	[128,128]
5	-1	52736	Shuffle_Conv	[128,256,2]
6	-1	202656	GAC3	[256,256]
7	-1	203776	Shuffle_Conv	[256,512,2]
8	-1	249792	GAC3	[512,512]
9	-1	656896	SPPF	[512,512,5]
10	-1	131584	Conv	[512,256,1,1]
11	-1	0	Upsample	[None,2]
12	[-1,6]	0	Concat	[1]
13	-1	361984	C3	[512,256,1,False]
14	-1	33024	Conv	[256,128,1,1]
15	-1	0	Upsample	[None,2]
16	[-1,-4]	0	Concat	[1]
17	-1	90880	C3	[256,128,1,False]
18	-1	147712	Conv	[128,128,3,2]
19	[-1,14]	0	Concat	[1]
20	-1	296448	C3	[256,256,1,False]
21	-1	590336	Conv	[256,256,3,2]
22	[-1,10]	0	Concat	[1]
23	-1	1182720	C3	[512,512,False]
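Several parameter counts in Table 1 can be reproduced with a short calculation. The sketch assumes that each convolution has no bias and is followed by batch normalization with two learnable parameters per output channel, and that the CBR_maxpool convolution uses a 3 $\times$ 3 kernel; these are assumptions that happen to match the listed values, not details stated in the text.

```python
def conv_bn_params(c_in, c_out, k):
    """Weights of a bias-free k x k convolution plus BN scale and shift."""
    return c_in * c_out * k * k + 2 * c_out


# (layer number, module, c_in, c_out, kernel size, Params value listed in Table 1)
rows = [
    (0, "CBR_maxpool", 3, 32, 3, 928),      # 3 x 3 kernel is an assumption
    (10, "Conv", 512, 256, 1, 131584),
    (14, "Conv", 256, 128, 1, 33024),
    (18, "Conv", 128, 128, 3, 147712),
    (21, "Conv", 256, 256, 3, 590336),
]

checks = [(num, conv_bn_params(c_in, c_out, k) == listed)
          for num, _module, c_in, c_out, k, listed in rows]
```

The pooling, Upsample, and Concat operations add no parameters, which is consistent with the zero entries in the table.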

4. Experiments and results

4.1 Dataset production and experimental configuration

The dataset comprised images crawled from Baidu and Google, screenshots of actors smoking in movies, and mobile phone photographs of people smoking. The images were labeled to obtain the dataset for the experiment, with people who are not smoking used as the interference item. The batch size of all experiments was 32, and training ran for 200 epochs.

Figure 5.

Dataset analysis.

The dataset for the experiment had 4,880 images, and the ratio of the training set to the test set was 8:2, as shown in Fig. 5. The target of the experiment was smoking. The target distribution in the dataset was relatively concentrated, and the dataset was dominated by small targets.

The experimental configuration used in the experiment is presented in Table 2, and subsequent experiments were performed using the configuration.

Table 2

Experimental configuration

Experimental environment	Configuration
Operating system	Windows 10
CPU	Intel(R) Core(TM) i5-7300HQ
RAM	16 GB
GPU	GTX 1050 Ti
Video memory	4 GB
Framework	PyTorch 1.9
CUDA	10.1

4.2 Data preprocessing

Smoking detection is a small-target detection task. The difficulties in small-target detection include few available features, high positioning accuracy requirements, few existing datasets, and small-target aggregation [25]. Some researchers have added attention mechanisms and small-target detection layers to the model for optimization [26]. However, the parameters, computation, and complexity of the model all increase as a result. By using data augmentation, the model can be optimized without increasing computational costs. As shown in Table 3, the number of parameters and calculations of YOLOv5 increases after adding attention mechanisms, and the calculation cost increases significantly after adding a small-target detection layer. Modifying the data enhancement method of the model to Mosaic_9 does not increase the number of computations and leaves the number of parameters nearly unchanged.

Table 3
Addition of different detection layers to YOLOv5

Algorithm	Precision	Params	GFLOPS
YOLOv5	75.10%	7022326	15.8
YOLOv5_CBAM	75.20%	7107454	16.4
YOLOv5_SE	74.90%	7313278	16.7
YOLOv5_CA	74.60%	7266479	16.6
YOLOv5+ small target	72.16%	7683080	26.9
YOLOv5_M9	75.30%	7063542	15.8

When training neural networks, the sample data require enhancement to improve the generalization and robustness of the model [27]. The data can be enhanced by flipping, rotation, zooming, cropping, translation, or noise addition [28]. As noted in Section 2.1, YOLOv5 uses Mosaic for data enhancement. Mosaic randomly selects a reference point, and four images are combined into one large picture around the reference point, as illustrated in Fig. 6. The large picture is called the “canvas.” Mosaic divides the canvas into four areas and places one image in each area. No image may exceed the boundaries of the canvas: empty regions within the canvas are filled with padding, while parts exceeding the canvas are cropped. To improve the detection ability of the model for small targets, Mosaic was extended to Mosaic_9, which randomly crops and splices nine images simultaneously to enhance the diversity, robustness, and detection capability of the model.

Figure 6.

Mosaic_9 canvas order.
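The idea of splicing nine images on one canvas can be sketched in simplified form. The sketch assumes nine equally sized tiles placed on a regular 3 $\times$ 3 grid followed by a random crop; the actual Mosaic_9 placement order and per-image offsets may differ, and the function names, tile size, and output size are hypothetical.

```python
import random


def mosaic9_layout(tile=640, out=1280, seed=0):
    """Return nine tile rectangles on a 3 x 3 canvas and a random crop window."""
    rng = random.Random(seed)
    canvas = 3 * tile
    # tile rectangles (x1, y1, x2, y2), listed row by row
    tiles = [(col * tile, row * tile, (col + 1) * tile, (row + 1) * tile)
             for row in range(3) for col in range(3)]
    # random crop: parts of the canvas outside this window are cut away
    x0 = rng.randint(0, canvas - out)
    y0 = rng.randint(0, canvas - out)
    return tiles, (x0, y0, x0 + out, y0 + out)


def visible_tiles(tiles, crop):
    """Indices of the tiles that overlap the crop window."""
    cx1, cy1, cx2, cy2 = crop
    return [i for i, (x1, y1, x2, y2) in enumerate(tiles)
            if min(x2, cx2) > max(x1, cx1) and min(y2, cy2) > max(y1, cy1)]


tiles, crop = mosaic9_layout()
shown = visible_tiles(tiles, crop)
```

Because more source images contribute to each training sample than with four-image Mosaic, each object occupies a smaller share of the final picture, which is the property that benefits small-target training.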

4.3 Experiment and analysis of model improvement process

Table 4 compares the performance of the different improved models of YOLOv5. After the CBR maximum pooling layer was added to the model, the number of calculations fell from 15.8 to 6.1 GFLOPs, that is, to about 39% of the original. Using the maximum pooling layer for downsampling therefore has a significant effect on reducing the number of calculations. After Shuffle_Conv was further added to the model, the number of parameters fell to 5,725,654, about 81.5% of that of YOLOv5; the channel shuffle technology prevented the loss of effective features in the feature map, maintaining the model accuracy.

In Table 4, the numbers in parentheses refer to the number of GAC3 layers used by YOLOv5. When the C3 layers in both the backbone and head sections were replaced with GAC3 layers, the model was compressed to 5M, but the accuracy was also reduced by about 4%. As shown in Table 4, YOLOv5_CBR_Shuffle_GAC3(4) performed better than YOLOv5_CBR_Shuffle_GAC3(2) in model accuracy, number of parameters, number of computations, and model size, and it was considerably more accurate than YOLOv5_CBR_Shuffle_GAC3(8). A comparison of all improved models shows that the optimal improved model was YOLOv5_CBR_Shuffle_GAC3(4), in which the C3 layers in the backbone part of YOLOv5 are replaced by GAC3 layers. The model has a size of 7.5M, 4,280,718 parameters, and 2.4 GFLOPs. Compared with YOLOv5, the parameters of the improved model were reduced by about 39% (to about 61% of the original), the number of calculations was reduced by about 85% (to about 15% of the original), and the size was compressed by nearly 50%, with an accuracy increase of about 2 percentage points.

Table 4
Comparison of different improved frameworks of YOLOv5

Algorithm	Precision	Params	GFLOPs	Weight (M)
YOLOv5	75.10%	7022326	15.8	14.1
YOLOv5_CBR	76.30%	7019734	6.1	13.5
YOLOv5_CBR_Shuffle	75.26%	5725654	3.4	11.1
YOLOv5_CBR_Shuffle_GAC3(8)	71.34%	2711358	1.7	5.0
YOLOv5_CBR_Shuffle_GAC3(2)	74.04%	5245766	2.8	10.3
YOLOv5_CBR_Shuffle_GAC3(4)	77.58%	4280718	2.4	7.5
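The relative savings of the selected model can be recomputed from the entries of Table 4. The sketch below uses only the values listed in the table; results computed from these rounded entries may differ in the last digit from percentages quoted elsewhere in the text, presumably because of rounding.

```python
# (precision %, params, GFLOPs, weight in M) copied from Table 4
table4 = {
    "YOLOv5": (75.10, 7022326, 15.8, 14.1),
    "YOLOv5_CBR": (76.30, 7019734, 6.1, 13.5),
    "YOLOv5_CBR_Shuffle": (75.26, 5725654, 3.4, 11.1),
    "YOLOv5_CBR_Shuffle_GAC3(8)": (71.34, 2711358, 1.7, 5.0),
    "YOLOv5_CBR_Shuffle_GAC3(2)": (74.04, 5245766, 2.8, 10.3),
    "YOLOv5_CBR_Shuffle_GAC3(4)": (77.58, 4280718, 2.4, 7.5),
}


def reduction_pct(base, new):
    """Percentage by which `new` is smaller than `base`."""
    return 100.0 * (1.0 - new / base)


base = table4["YOLOv5"]
best = table4["YOLOv5_CBR_Shuffle_GAC3(4)"]

param_cut = reduction_pct(base[1], best[1])   # parameters
flops_cut = reduction_pct(base[2], best[2])   # GFLOPs
size_cut = reduction_pct(base[3], best[3])    # weight file size
acc_gain = best[0] - base[0]                  # percentage points
```

The same helper can be applied to any other row, for example to compare the GAC3(2) and GAC3(8) variants against the baseline.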

Figure 7.

Accuracy of improved model and YOLOv5.

Figure 7 illustrates the training results of YOLOv5 and YOLOv5_CBR_Shuffle_GAC3(4) in 200 epochs in the same experimental environment. Because YOLOv5_CBR_Shuffle_GAC3(4) uses YOLOv5 pre-training weights, the initial accuracy was low, as shown in Fig. 7. As the number of training rounds increased, the accuracy of the model increased and gradually became stable. The accuracy of the improved model also gradually approached 80%, whereas that of YOLOv5 was approximately 75%, and the training effect of the improved model was better than that of YOLOv5.

The prediction results of YOLOv5 and YOLOv5_CBR_Shuffle_GAC3(4) are presented in Fig. 8(a) and (b). When detecting a single target, both models have the same detection effect, and both detect smoking behavior.

Figure 8.

Close single-target detection effect.

Figure 9(a) and (b) show the results of YOLOv5 and YOLOv5_CBR_Shuffle_GAC3(4), respectively. Both models achieved a poor detection effect when detecting multiple long-range targets. Changing the data enhancement method of YOLOv5_CBR_Shuffle_GAC3(4) to Mosaic_9 significantly improved its multi-target detection ability (Fig. 9(c)). The model was thus considerably compressed while the desired detection effect was achieved. YOLOv5_CBR_Shuffle_GAC3(4)–Mosaic_9 is hereafter referred to as “YOLOv5#.”

Figure 9.

Remote multiple target detection effect.

4.4 YOLOv5# vs. other algorithms

To further measure the performance of the algorithm, YOLOv5# was compared with other one-stage target detection algorithms using the same experimental equipment. The results are presented in Table 5. All performance aspects of YOLOv5# were compared with those of the more mature YOLO series. The detection accuracy of YOLOv5# was approximately 8 and 9 percentage points higher than that of YOLOv3-tiny [29] and YOLOv4-tiny [30], respectively, and its number of parameters, number of calculations, and model size were all lower to varying degrees. Aside from the YOLO series algorithms, YOLOv5# achieved approximately 6 percentage points higher accuracy than SSD and held a clear advantage in the other aspects. Compared with the tiny version of the more recent YOLOv7 [31], YOLOv5# had approximately 20 percentage points higher accuracy under the same experimental conditions and outperformed YOLOv7-tiny in the other performance aspects.

The data enhancement method of YOLOv5# was changed to Mosaic_9, which optimized the detection effect without increasing the number of parameters, the number of calculations, or the model size (Tables 4 and 5), while the detection accuracy hardly changed (77.58% versus 77.57%). YOLOv5# not only considerably reduced the computational cost of the model but also achieved a 77.57% detection accuracy, indicating the effectiveness and economy of the model improvement.

Table 5
Comparison of different algorithms

Algorithm	Precision	Parameters	GFLOPs	Weight (M)
YOLOv3-tiny	69.78%	8.7M	12.9	16.6
YOLOv4-tiny	68.10%	6.4M	21.8	12.3
YOLOv5	75.10%	7M	15.8	14.1
YOLOv7-tiny	57.20%	6M	13.1	12
SSD	71.20%	41.18M	387.9	100
YOLOv5#	77.57%	4.3M	2.4	7.5

5. Conclusion

YOLOv5# was designed for practical smoking scenarios, in which smoking behavior appears as a small target. The popularity of lightweight networks has attracted research attention to reducing the computational cost of machine learning models. Thus, the CBR maximum pooling layer was used to reduce the feature dimension in YOLOv5#. The group convolution and channel shuffle techniques reduce the number of model parameters while maintaining the model accuracy. The lightweight C3 layer, in which Concat is replaced with Add, was then combined with GhostNet to further compress the model parameters and size. Experiments showed that the detection accuracy of YOLOv5# improved from 75% to 77%. Replacing Mosaic with Mosaic_9 improved the detection effect of the model without increasing the computational cost while enhancing its small-target detection ability. These results show that YOLOv5# considerably reduces the computational cost, maintains the detection accuracy of YOLOv5, and effectively improves the detection ability of the model and its value in practical applications.

Acknowledgments

Basic Research Business Fee Project (Excellent Innovation Team Project). Project Name: Deep learning behavior recognition fall detection research. Project Number: 2022CXTD04. Graduate Innovation Fund Program. Project Name: Research on Smoking Behavior Detection Based on YOLOv5s Algorithm. Project Number: XY2023024. Project Name: Zhangjiakou Cigarette Factory Co., Ltd.

References

1. Omaki, Shields, Buhs, Curtis, Kulak, Luna, et al. Working With Fire Departments to Adapt and Implement Evidence-Based Programs That Increase Uptake of Smoke Alarms: A Case-Series Report. J Burn Care Res. 2022 Nov 1; 43(6): 1271-6.
2. Liu, Yuan, Huang. A fire alarm judgment method using multiple smoke alarms based on Bayesian estimation. Fire Saf J. 2023 Apr 1; 136: 103733.
3. Kamath, Renuka. Deep learning based object detection for resource constrained devices: Systematic review, future trends and challenges ahead. Neurocomputing. 2023 Apr 28; 531: 34-60.
4. Iandola, Han, Moskewicz, Ashraf, Dally, Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv.org. 2016 [cited 2024 Jan 23]. Available from: https://arxiv.org/abs/1602.07360v4.
5. Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications [Internet]. arXiv 2017 [cited 2024 Jan 23]. Available from: http://arxiv.org/abs/1704.04861.
6. Zhang, Zhou, Lin, Sun. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In 2018 [cited 2024 Jan 23]. p. 6848-56. Available from: https://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html.
7. Ren, Girshick, Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans Pattern Anal Mach Intell. 2017; 39(6): 1137-49.
8. Liu, Anguelov, Erhan, Szegedy, Reed, et al. SSD: Single Shot MultiBox Detector. In: Leibe, Matas, Sebe, Welling, editors. Computer Vision – ECCV 2016. Cham: Springer International Publishing; 2016. p. 21-37. (Lecture Notes in Computer Science).
9. Redmon, Divvala, Girshick, Farhadi. You Only Look Once: Unified, Real-Time Object Detection. In 2016 [cited 2024 Jan 23]. p. 779-88. Available from: https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Redmon_You_Only_Look_CVPR_2016_paper.html.
10. Redmon, Farhadi. YOLO9000: Better, Faster, Stronger. In 2017 [cited 2024 Jan 23]. p. 7263-71. Available from: https://openaccess.thecvf.com/content_cvpr_2017/html/Redmon_YOLO9000_Better_Faster_CVPR_2017_paper.html.
11. Redmon, Farhadi. YOLOv3: An Incremental Improvement [Internet]. arXiv 2018 [cited 2024 Jan 23]. Available from: http://arxiv.org/abs/1804.02767.
12. Bochkovskiy, Wang, Liao HYM. YOLOv4: Optimal Speed and Accuracy of Object Detection [Internet]. arXiv 2020 [cited 2024 Jan 23]. Available from: http://arxiv.org/abs/2004.10934.
13. Zhang, Chen, Yang, Zhang. Real-time strawberry detection using deep neural networks on embedded system (rtsd-net): An edge AI application. Comput Electron Agric. 2022 Jan 1; 192: 106586.
14. Jeppesen, Jacobsen, Inceoglu, Toftegaard. A cloud detection algorithm for satellite imagery based on deep learning. Remote Sens Environ. 2019 Aug 1; 229: 247-59.
15. Han, Wang, Tian, Guo. GhostNet: More Features From Cheap Operations. In 2020 [cited 2024 Jan 23]. p. 1580-9. Available from: https://openaccess.thecvf.com/content_CVPR_2020/html/Han_GhostNet_More_Features_From_Cheap_Operations_CVPR_2020_paper.html.
16. Cardellicchio, Solimani, Dimauro, Petrozza, Summerer, Cellini, et al. Detection of tomato plant phenotyping traits using YOLOv5-based single stage detectors. Comput Electron Agric. 2023 Apr 1; 207: 107757.
17. Tingting, Huanyu, Junbao. Deep Learning Target Detection System Based on Airborne Image. Radio Eng [Internet]. 2019 [cited 2024 Jan 23]; Available from: http://en.cnki.com.cn/Article_en/CJFDTotal-WXDG201909002.htm.
18. Fang, Jia, Liu. Research on sunken & submerged oil detection and its behavior process under the action of breaking waves based on YOLO v4 algorithm. Mar Pollut Bull. 2022 Jun 1; 179: 113682.
19. Hua, Fan. Attention-based multi-channel feature fusion enhancement network to process low-light images. IET Image Process. 2022; 16(12): 3374-93.
20. Chen, Cui, Zhang. Research on Statistical Algorithm of Microalgae Growth Status Based on Computer Vision. IEEE Advanced Information Management, Communicates, Electronic and Automation Control Conference [Internet]. 2021 [cited 2024 Jan 23]. Available from: https://www.zhangqiaokeyan.com/academic-conference-foreign_meeting_thesis/0205118333252.html.
21. Yang. A Real-Time Apple Targets Detection Method for Picking Robot Based on Improved YOLOv5. Remote Sens [Internet]. 2021 [cited 2024 Jan 23]; 13. Available from: http://www.semanticscholar.org/paper/3e653a9d5a7d68893ad926dbda0e9623f2242289.
22. Liu, Gao. YOLO-Class: Detection and Classification of Aircraft Targets in Satellite Remote Sensing Images Based on YOLO-Extract. IEEE Access [Internet]. [cited 2024 Jan 23]; 11. Available from: http://ieeexplore.ieee.org/document/10271344/.
23. Niu. Improved YOLOv5 network-based object detection for anti-intrusion of gantry crane. 2021 [cited 2024 Jan 23]; Available from: doi: 10.1145/3483845.3483871.
24. Cao, Zeng, Feng, Wang, Yan, et al. Research on Airplane and Ship Detection of Aerial Remote Sensing Images Based on Convolutional Neural Network. Sensors. 2020 Jan; 20(17): 4696.
25. Zhu, Zhao, Zhang, Jin. Falling motion detection algorithm based on deep learning. IET Image Process. 2022 Sep; 16(11): 2845-53.
26. Andersen, Peimankar, Puthusserypady. A deep learning approach for real-time detection of atrial fibrillation. Expert Syst Appl. 2019 Jan 1; 115: 465-73.
27. Xue, Zhou, Long, Antani, Xue, et al. Selective synthetic augmentation with HistoGAN for improved histopathology image classification. Med Image Anal. 2021 Jan 1; 67: 101816.
28. Yun, Han, Chun, Choe, Yoo. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features. In 2019 [cited 2024 Jan 23]. p. 6023-32. Available from: https://openaccess.thecvf.com/content_ICCV_2019/html/Yun_CutMix_Regularization_Strategy_to_Train_Strong_Classifiers_With_Localizable_Features_ICCV_2019_paper.html.
29. Kumar, Yadav, Gupta, Verma, Ansari, Ahn. A Novel YOLOv3 Algorithm-Based Deep Learning Approach for Waste Segregation: Towards Smart Waste Management. Electronics. 2021 Jan; 10(1): 14.
30. Cui, Lou, Wang. LES-YOLO: A lightweight pinecone detection algorithm based on improved YOLOv4-Tiny network. Comput Electron Agric. 2023 Feb 1; 205: 107613.
31. Sun, Zhang, Wei, Zhou. A classification and location of surface defects method in hot rolled steel strips based on YOLOV7. Metalurgija. 2023 Apr 3; 62(2): 240-2.
