An improved YOLOv5 algorithm for underwater garbage recognition

Abstract

Underwater garbage recognition is difficult because of insufficient lighting, poor visibility, and complex background interference in underwater environments. To address these challenges, this paper proposes an improved underwater garbage recognition algorithm based on YOLOv5s. Integrating the CBAM (Convolutional Block Attention Module) attention mechanism into the C3 structure of YOLOv5s significantly enhances the ability to capture subtle features of underwater debris. In addition, GhostConv lightweight convolution layers are introduced into the model’s neck network, which accelerates computation while keeping feature extraction stable. Experiments show that the proposed algorithm achieves a recognition accuracy of 88.0% on a self-built underwater garbage dataset with an average detection time of just 6.8 ms. The improved model surpasses both YOLOv5s and similar algorithms in recognition accuracy and operational efficiency. This research improves the precision and real-time performance of underwater garbage monitoring and provides an effective solution for the automated monitoring of underwater environments.

Keywords

Underwater garbage recognition, CBAM attention mechanism, lightweight convolution layer, YOLOv5

Introduction

With the increase in marine and lake activities and the proliferation of plastic products, underwater garbage pollution has become increasingly severe. Items such as discarded plastic bags, bottles, and shoes often sink to the bottom, posing a serious threat to the health and biodiversity of marine and lake ecosystems.¹ Traditional manual monitoring methods are inefficient and have limited coverage, making it difficult to meet the needs of large-scale identification and cleanup of underwater debris. Therefore, using underwater robots for garbage removal is gradually becoming an efficient solution.² To ensure that underwater robots can accurately identify and locate garbage in complex environments, developing efficient and precise underwater garbage recognition algorithms is crucial.³ However, the unique characteristics of underwater environments, such as insufficient lighting, low visibility, and diverse types of garbage, present higher challenges to existing recognition technologies.

In recent years, deep learning-based vision algorithms have made remarkable progress in target detection and recognition.² Common object detection algorithms based on convolutional neural networks currently include the R-CNN series,⁴⁻⁶ as well as the YOLO series⁷ and SSD detection algorithms,⁸ which are based on classification and regression.

Liu et al.⁹ proposed the SSD algorithm, which uses multi-scale detection and an anchor mechanism for object detection. Fulton et al.¹⁰ trained deep learning models such as SSD, Faster R-CNN, and YOLOv2 for underwater environments and achieved high-precision detection in the “plastic” category. However, their dataset contained only three categories, which limits the models’ ability to generalize to other types of garbage. Lin et al.¹¹ adopted the RoIMix method, which fuses Regions of Interest (RoI) from multiple images to improve underwater target detection and effectively addresses low contrast, blur, and color distortion in underwater environments, but they did not consider the model’s parameter size or computational cost. Liu et al.¹² used WQT (Water Quality Transfer) and DG-YOLO (Domain Generalization YOLO) to improve detection in unknown underwater environments, but, possibly because their datasets were not sufficiently broad or diverse, generalization in complex environments remained limited. Ma et al.¹³ proposed the MLDet deep learning method, which copes well with complex underwater environments but relies heavily on high-end hardware. Although these studies have achieved significant results in underwater garbage recognition, recognition accuracy still suffers from blurred, low-contrast underwater images, and some models that improved detection accuracy through optimization struggle to maintain real-time performance.

This paper addresses the above problems by conducting research on an improved YOLOv5s algorithm for underwater garbage recognition. Through the MSR (Multi-Scale Retinex) image enhancement strategy derived from Retinex theory, the clarity and contrast of the images were effectively improved. The integration of attention mechanisms and lightweight convolution modules optimized the YOLOv5s architecture, enhancing the model’s ability to extract key regional features. Through iterative processes of training and testing, the model was continuously refined to ensure high recognition accuracy while enhancing its real-time detection capabilities of underwater garbage, thus providing a more efficient and reliable solution for underwater environmental protection.

Image acquisition and data processing

The underwater debris image dataset used in this study comes primarily from a public dataset on the CVMart platform (https://www.cvmart.net/dataSets/detail/344), supplemented with underwater garbage images captured in real-world aquatic environments, which allows a more comprehensive evaluation of the model. The dataset covers common garbage at various angles, distances, and degrees of occlusion and totals 5500 images, divided into training, validation, and test sets at a ratio of 7:2:1. Pure underwater images without garbage targets are retained as negative samples to reduce overfitting and to enhance the model’s robustness against interference. The LabelImg software is used to annotate the dataset, which is categorized into four classes: trash bag, trash bottle, boot trash, and other trash.
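The 7:2:1 split described above can be sketched as follows. This is a minimal illustration, not the authors’ script: the file names, the shuffling seed, and the helper `split_dataset` are hypothetical, and only the image count and the ratio come from the text.

```python
import random


def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle items and split them into train/val/test by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical file names standing in for the 5500 dataset images.
image_names = [f"img_{i:04d}.jpg" for i in range(5500)]
train, val, test = split_dataset(image_names)
print(len(train), len(val), len(test))  # 3850 1100 550
```

With 5500 images, the ratio yields 3850 training, 1100 validation, and 550 test images.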

Due to the complex underwater environment, some images suffer from severe haze. Therefore, the MSR algorithm¹⁴ was employed to preprocess the dataset. As shown in equation (1), MSR computes a weighted sum of Retinex outputs at multiple scales, which improves the contrast and brightness of underwater images and thereby enhances their clarity and detail visibility.

\begin{aligned} R_{ni}(x, y) &= \log S_{i}(x, y) - \log [F_{n}(x, y) * S_{i}(x, y)] \\ R_{mi}(x, y) &= \sum_{n=1}^{3} \omega_{n} R_{ni}(x, y) \end{aligned}

(1)

In equation (1), i = 1, 2, 3 indexes the three RGB color channels and n indexes the scales; $S_{i}(x, y)$ is the pixel value of the i-th color channel of the input image; $F_{n}(x, y)$ is the Gaussian surround function at the n-th scale, whose scale parameter takes the values 15, 80, and 200; $*$ denotes convolution; $\omega_{n}$ is the weight of the n-th scale, taken as 1/3; and $R_{mi}(x, y)$ is the MSR output for the i-th color channel.

The images enhanced using the MSR algorithm are clearer, with more prominent features and sharper outlines, which facilitates subsequent image processing and analysis tasks. The results of the image dataset processing are shown in Figure 1.

Figure 1.

Images before and after MSR processing. (a) Original image of underwater trash bag. (b) MSR-processed underwater trash bag image.
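The MSR computation in equation (1) can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors’ implementation: the offset `eps` that avoids log(0), the edge padding, the 3-sigma kernel truncation, and the final min-max stretch to 8-bit values are choices made here and are not specified in the paper. The helper names are hypothetical.

```python
import numpy as np


def gaussian_kernel(sigma):
    """1-D normalized Gaussian kernel truncated at 3 sigma."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def blur_axis(a, k, axis):
    """Convolve array a with kernel k along one axis, using edge padding."""
    r = len(k) // 2
    pad = [(0, 0)] * a.ndim
    pad[axis] = (r, r)
    padded = np.pad(a, pad, mode="edge")
    return np.apply_along_axis(
        lambda v: np.convolve(v, k, mode="valid"), axis, padded)


def msr(image, sigmas=(15, 80, 200), eps=1.0):
    """Multi-Scale Retinex per equation (1), applied to every color channel."""
    s = image.astype(np.float64) + eps  # assumption: offset avoids log(0)
    out = np.zeros_like(s)
    for sigma in sigmas:
        k = gaussian_kernel(sigma)
        surround = blur_axis(blur_axis(s, k, 0), k, 1)  # F_n * S_i
        out += (np.log(s) - np.log(surround)) / len(sigmas)  # weight 1/3
    return out


def stretch_to_uint8(r):
    """Assumption: map the MSR output linearly onto the range 0..255."""
    lo, hi = r.min(), r.max()
    if hi == lo:
        return np.zeros(r.shape, dtype=np.uint8)
    return np.round((r - lo) / (hi - lo) * 255.0).astype(np.uint8)


# Small synthetic H x W x 3 image standing in for an underwater photo.
rng = np.random.default_rng(0)
demo = rng.integers(0, 256, size=(12, 10, 3), dtype=np.uint8)
enhanced = stretch_to_uint8(msr(demo))
print(enhanced.shape, enhanced.dtype)
```

The separable pure-NumPy blur keeps the sketch dependency-free but is slow on full-size images; an optimized Gaussian filter from an image-processing library would normally replace `blur_axis`.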

Research methodology

Principle of the YOLOv5s algorithm

The YOLOv5 series is a set of one-stage object detection models built upon the YOLO concept, designed to achieve a balance between efficiency and accuracy. This series includes four core variants: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. These variants share a similar basic architecture but are adapted to different performance requirements by adjusting two key parameters: depth_multiple and width_multiple. Among them, YOLOv5s is the most lightweight model, featuring a shallower network depth and narrower feature map width, enabling it to run at extremely fast speeds, making it highly suitable for applications with strict speed requirements. YOLOv5s primarily consists of four parts, and the network framework is illustrated in Figure 2.

Figure 2.

Network framework of YOLOv5s.
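The effect of depth_multiple and width_multiple can be illustrated with a short sketch. The multiplier values and the base C3 repeat counts and channel widths below are taken from the publicly released YOLOv5 configuration files and are assumptions here, since the paper does not list them; the helper names are hypothetical.

```python
import math

# (depth_multiple, width_multiple) as published in the public YOLOv5 configs
# (assumption: the paper uses these standard values).
VARIANTS = {
    "yolov5s": (0.33, 0.50),
    "yolov5m": (0.67, 0.75),
    "yolov5l": (1.00, 1.00),
    "yolov5x": (1.33, 1.25),
}

# Base backbone C3 repeat counts and output channels (assumed from the
# public v6-style config).
BASE_REPEATS = (3, 6, 9, 3)
BASE_CHANNELS = (128, 256, 512, 1024)


def scale_depth(n, depth_multiple):
    """Scale the number of repeated blocks, keeping at least one."""
    return max(round(n * depth_multiple), 1) if n > 1 else n


def scale_width(channels, width_multiple, divisor=8):
    """Scale the channel count and round up to a multiple of divisor."""
    return math.ceil(channels * width_multiple / divisor) * divisor


for name, (depth, width) in VARIANTS.items():
    repeats = [scale_depth(n, depth) for n in BASE_REPEATS]
    channels = [scale_width(c, width) for c in BASE_CHANNELS]
    print(name, repeats, channels)
```

Under these assumptions, YOLOv5s ends up with C3 repeats of 1, 2, 3, 1 and channel widths of 64, 128, 256, 512, which is what makes it the shallowest and narrowest variant.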

Input network

The Mosaic data augmentation technique is employed to enhance the model’s generalization capability. The input scaling ratio and anchor box sizes are automatically adjusted according to the size of the input images, allowing the model to adapt to inputs of varying dimensions.

Backbone network

The backbone consists of CBS (convolution, batch normalization, and activation), C3, and SPPF (Spatial Pyramid Pooling-Fast) modules for feature extraction. The CBS layer encapsulates convolution, batch normalization, and an activation function to extract image features. The C3 layer splits the input features into branches and merges features from different levels through cross-layer connections, which strengthens the model’s representation power. In the backbone network, the C3 layer uses the CSP1_X structure, whereas in the neck network, the C3 layer without shortcuts adopts the CSP2_X structure. The SPPF layer pools the feature maps at several scales, reducing the computational load while improving the model’s feature extraction capability, as illustrated in Figure 3.

Figure 3.

Structure of the SPPF layer.

Neck network

The neck network processes and fuses the feature maps extracted by the backbone network. It is composed of convolutional layers, upsampling layers, and C3 layers with the CSP2_X structure. It adopts a top-down pathway followed by a bottom-up pathway for feature fusion, which not only shortens the propagation path of feature information but also integrates features from different layers, enriching the feature information available for prediction in the output network.

Output network

The output network contains at least three detection layers, which predict the positions and class probabilities of objects from the feature maps.

Optimization of the backbone network’s C3 layer

This study embeds the CBAM at the output end of the C3 structure used in the backbone network to extract features and positional information of regions of interest. This enables adaptive feature optimization of the information passing through the C3 structure, thereby constructing the CBAMC3 module to enhance the feature extraction performance of the C3 structure. The structure of the CBAMC3 module is illustrated in Figure 4.

Figure 4.

CBAMC3 structure diagram.

CBAM is a lightweight convolutional attention module comprising two sub-modules: CAM (Channel Attention Module) and SAM (Spatial Attention Module). The CAM enhances the extraction of the contours of features of interest; the channel attention is computed as shown in equation (2). The SAM enhances the extraction of the positional information of features of interest; the spatial attention is computed as shown in equation (3). Because CBAM adds few parameters and little computation, it is easy to plug into existing network frameworks.

\begin{aligned} M_{c}(F) &= \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))) \\ &= \sigma(W_{1}(W_{0}(F_{avg}^{c})) + W_{1}(W_{0}(F_{\max}^{c}))) \end{aligned}

(2)

In equation (2), $\sigma$ denotes the sigmoid function; $F_{avg}^{c}$ and $F_{\max}^{c}$ denote the channel descriptors obtained from the feature map $F$ by average pooling and max pooling, respectively, each with C entries, where C is the number of channels; and $W_{0}$ and $W_{1}$ are the weights of the shared MLP.

\begin{aligned} M_{s}(F) &= \sigma(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])) \\ &= \sigma(f^{7 \times 7}([F_{avg}^{s}; F_{\max}^{s}])) \end{aligned}

(3)
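Equations (2) and (3) can be sketched in PyTorch as follows. This is a minimal illustration and not the authors’ code: the reduction ratio of 16 in the shared MLP and the class names are assumptions, and the C3 block itself is not reproduced. Appending such a module to the output of a C3 block would correspond to the CBAMC3 idea described above.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention M_c(F) from equation (2)."""

    def __init__(self, channels, reduction=16):  # reduction is an assumption
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        # Shared MLP (W0 followed by W1), written as 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        avg_out = self.mlp(self.avg_pool(x))
        max_out = self.mlp(self.max_pool(x))
        return torch.sigmoid(avg_out + max_out)  # shape (B, C, 1, 1)


class SpatialAttention(nn.Module):
    """Spatial attention M_s(F) from equation (3)."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg_map = torch.mean(x, dim=1, keepdim=True)
        max_map = torch.max(x, dim=1, keepdim=True)[0]
        stacked = torch.cat([avg_map, max_map], dim=1)
        return torch.sigmoid(self.conv(stacked))  # shape (B, 1, H, W)


class CBAM(nn.Module):
    """Channel attention followed by spatial attention."""

    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.channel_att = ChannelAttention(channels, reduction)
        self.spatial_att = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.channel_att(x)
        return x * self.spatial_att(x)


features = torch.randn(2, 64, 20, 20)  # dummy feature map
refined = CBAM(64)(features)
print(refined.shape)  # torch.Size([2, 64, 20, 20])
```

The module keeps the feature map shape unchanged, which is why it can be attached to an existing block without altering the layers that follow.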

Optimization of the neck network’s convolutional layers

This study optimizes the standard convolutional layers in the neck network using the GhostConv lightweight building block, accelerating the feature extraction process of the model to meet the specific requirements for real-time performance and efficiency in underwater garbage detection scenarios. The core of the Ghost module is to generate more “ghost” feature maps from existing feature maps through linear operations, thereby improving the computational efficiency of the network. The structure of the GhostConv module is illustrated in Figure 5, consisting of convolution, linear transformations, and feature map concatenation.

Figure 5.

Structure of GhostConv.
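The structure in Figure 5 can be sketched in PyTorch as follows. This is a hedged illustration modelled on the publicly available YOLOv5 GhostConv block, not the authors’ exact code: half of the output channels come from an ordinary convolution and the other half from a cheap depthwise convolution, and the 5 × 5 depthwise kernel, the SiLU activation, and the helper names are assumptions.

```python
import torch
import torch.nn as nn


def conv_bn_act(c_in, c_out, k, stride=1, groups=1):
    """Convolution + batch normalization + SiLU (a CBS-style block)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride, k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )


class GhostConv(nn.Module):
    """Primary convolution plus cheap depthwise 'ghost' features."""

    def __init__(self, c_in, c_out, k=1, stride=1, cheap_k=5):
        super().__init__()
        c_mid = c_out // 2  # half of the output comes from the primary conv
        self.primary = conv_bn_act(c_in, c_mid, k, stride)
        # Depthwise convolution: one cheap linear operation per feature map.
        self.cheap = conv_bn_act(c_mid, c_mid, cheap_k, 1, groups=c_mid)

    def forward(self, x):
        y = self.primary(x)
        # Identity branch (y) concatenated with the ghost branch.
        return torch.cat([y, self.cheap(y)], dim=1)


def conv_weight_count(module):
    """Count convolution weights only (batch-norm parameters excluded)."""
    return sum(m.weight.numel() for m in module.modules()
               if isinstance(m, nn.Conv2d))


ghost = GhostConv(64, 128, k=3)
regular = conv_bn_act(64, 128, 3)
x = torch.randn(1, 64, 40, 40)
print(ghost(x).shape)  # torch.Size([1, 128, 40, 40])
print(conv_weight_count(regular), conv_weight_count(ghost))  # 73728 38464
```

In this configuration the regular convolution has roughly 1.9 times as many convolution weights as the GhostConv block, while both produce outputs of the same shape.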

GhostConv first convolves the input feature maps $F^{h \times w \times c}$ with $n/s$ convolution kernels of size $k \times k$, producing a reduced set of intrinsic feature maps. Each intrinsic feature map then passes through one identity mapping and $s - 1$ linear operations with kernels of size $d \times d$, and the resulting feature maps are concatenated to form the output feature maps $F^{h^{'} \times w^{'} \times n}$. Equations (4) and (5) compare the cost with that of a regular convolution: $r_{s}$ is the ratio of the computational cost of a regular convolution to that of GhostConv, $r_{c}$ is the corresponding ratio of parameter counts, and $s \ll c$. When $d$ is close to $k$, both ratios are approximately $s$; that is, a regular convolution costs about $s$ times as much as GhostConv in both computation and parameters.

r_{s} = \frac{n * h^{'} * w^{'} * c * k * k}{\frac{n}{s} * h^{'} * w^{'} * c * k * k + (s - 1) * \frac{n}{s} * h^{'} * w^{'} * d * d} \approx \frac{s * c}{s + c - 1} \approx s

(4)

r_{c} = \frac{n * c * k * k}{\frac{n}{s} * c * k * k + (s - 1) * \frac{n}{s} * d * d} \approx \frac{s * c}{s + c - 1} \approx s

(5)
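The approximation in equations (4) and (5) can be checked numerically. In the sketch below the factors $n$, $h^{'}$, and $w^{'}$ are cancelled, so the same expression serves for both $r_{s}$ and $r_{c}$; the function names and the sample values of $c$, $k$, $d$, and $s$ are illustrative assumptions.

```python
def ghost_ratio(c, k, d, s):
    """Exact ratio of regular-convolution cost to GhostConv cost.

    Dividing numerator and denominator of equations (4)/(5) by the common
    factors leaves s*c*k^2 / (c*k^2 + (s - 1)*d^2).
    """
    regular = s * c * k * k
    ghost = c * k * k + (s - 1) * d * d
    return regular / ghost


def ghost_ratio_approx(c, s):
    """Approximation s*c / (s + c - 1), exact when d equals k."""
    return s * c / (s + c - 1)


for c in (64, 256, 1024):
    exact = ghost_ratio(c, k=3, d=3, s=2)
    approx = ghost_ratio_approx(c, s=2)
    print(c, round(exact, 4), round(approx, 4))
```

As the number of input channels $c$ grows relative to $s$, the ratio approaches $s$, which is the sense in which a regular convolution is about $s$ times as expensive.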

CCGS-YOLOv5s network framework

The C3 layers in the backbone network are integrated with the CBAM structure, which enhances the features of regions of interest and improves the extraction of positional information. GhostConv is used in the convolutional layers of the neck network, reducing the model’s parameters and computational load while largely maintaining the original model’s detection accuracy. The CCGS-YOLOv5s algorithm is obtained by replacing the corresponding C3 layers and Conv layers in the YOLOv5s algorithm with CBAMC3 modules and GhostConv blocks, respectively. The overall framework is illustrated in Figure 6.

Figure 6.

CCGS-YOLOv5s network architecture.

Input images pass through the convolutional layers of the backbone to obtain different feature maps. These feature maps then undergo four rounds of processing through CBS and CBAMC3 layers to extract the target features of the regions of interest. The extracted features are fed into the SPPF (Spatial Pyramid Pooling-Fast) layer, which converts them into fixed-dimension feature vectors and performs feature fusion. The neck network upsamples and fuses the feature maps to produce three feature maps at different scales. Finally, the detection layers in the output network generate the final detection results.
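As a rough illustration of the three output feature maps, the sketch below computes their shapes for a 640 × 640 input and the four classes used in this study. The strides of 8, 16, and 32 and the three anchors per scale follow the standard YOLOv5 configuration and are assumptions here, as the paper does not state them; the function name is hypothetical.

```python
def head_shapes(img_size=640, num_classes=4,
                strides=(8, 16, 32), anchors_per_scale=3):
    """Return (grid_h, grid_w, channels) for each detection layer.

    Each anchor predicts 4 box coordinates, 1 objectness score,
    and num_classes class scores.
    """
    channels = anchors_per_scale * (5 + num_classes)
    return [(img_size // s, img_size // s, channels) for s in strides]


print(head_shapes())  # [(80, 80, 27), (40, 40, 27), (20, 20, 27)]
```

Under these assumptions the finest grid (80 × 80) handles small targets and the coarsest grid (20 × 20) handles large ones.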

Results and analysis

The models were trained on a server provided by the AutoDL cloud computing platform, with Python as the programming language. The server hardware configuration includes an NVIDIA RTX 3070 GPU with 8GB of VRAM; a 12-core Intel(R) Xeon(R) Platinum 8255C CPU running at 2.50 GHz; and 32GB of RAM. The experimental platform used for testing includes an NVIDIA RTX 3050 GPU with 8GB of VRAM and an AMD Ryzen 5 5600H CPU, as shown in Table 1.

Table 1.

Experimental hardware configuration.

	Server	Experimental platform
CPU	12-core Intel(R) Xeon(R) Platinum 8255C	AMD Ryzen 5 5600H
GPU	NVIDIA RTX 3070	NVIDIA RTX 3050
Memory	43GB	16GB
External storage	50GB SSD	512GB SSD
Operating system	Linux	Windows 11

The experimental results were obtained with input image dimensions of 640×640 pixels, 200 iterations, and a training batch size of 8 samples.

MSR image enhancement algorithm experiment

To validate the MSR image enhancement algorithm, we selected underwater images with poor lighting and low contrast. Without enhancement, the detection confidence levels were 87% and 67%, as shown in Figure 7(a). After MSR processing, the contrast and brightness improved visibly and the detection confidence levels rose to 91% and 70%, as shown in Figure 7(b). This indicates that MSR preprocessing improves the clarity of underwater debris images, increases the visual contrast between debris and the surrounding environment, and highlights the key features of target objects, which in turn improves the model’s ability to recognize underwater debris.

Figure 7.

Comparison of MSR enhancement algorithm experiment. (a) No MSR enhancement algorithm. (b) MSR enhancement algorithm.

FPS experiment comparison

The performance of CCGS-YOLOv5s and YOLOv5s was evaluated in terms of detection speed (FPS), as shown in Table 2. In terms of model size, CCGS-YOLOv5s is approximately 19.6% smaller than YOLOv5s. In a GPU environment, the FPS of CCGS-YOLOv5s is about 32.9% higher and the average detection time about 25.3% lower. Even in a CPU environment, the FPS of CCGS-YOLOv5s is about 16.3% higher and the average detection time about 13.8% lower. These results show that the CCGS-YOLOv5s algorithm not only compresses the model but also improves detection speed, particularly in a GPU environment, confirming the effectiveness and practicality of the optimization.

Table 2.

FPS comparison of two models.

	Model size (MB)	GPU detection speed (FPS)	CPU detection speed (FPS)	Average detection time on GPU (ms)	Average detection time on CPU (ms)
CCGS-YOLOv5s	32.5	146.1	27.8	6.8	36.1
YOLOv5s	40.4	109.9	23.9	9.1	41.9
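The relative changes quoted for Table 2 can be recomputed directly from the table values, and the average detection time can be cross-checked against the FPS figures using time (ms) = 1000 / FPS. The dictionary layout and function names below are illustrative; the numbers are copied from Table 2.

```python
TABLE2 = {
    "CCGS-YOLOv5s": {"size_mb": 32.5, "gpu_fps": 146.1, "cpu_fps": 27.8,
                     "gpu_ms": 6.8, "cpu_ms": 36.1},
    "YOLOv5s": {"size_mb": 40.4, "gpu_fps": 109.9, "cpu_fps": 23.9,
                "gpu_ms": 9.1, "cpu_ms": 41.9},
}


def relative_change(new, old):
    """Percentage change of new with respect to old."""
    return (new - old) / old * 100.0


def ms_per_frame(fps):
    """Average per-frame time in milliseconds implied by an FPS value."""
    return 1000.0 / fps


improved, baseline = TABLE2["CCGS-YOLOv5s"], TABLE2["YOLOv5s"]
for key in ("size_mb", "gpu_fps", "cpu_fps", "gpu_ms", "cpu_ms"):
    print(key, round(relative_change(improved[key], baseline[key]), 1))

print(round(ms_per_frame(improved["gpu_fps"]), 1))  # 6.8
print(round(ms_per_frame(baseline["gpu_fps"]), 1))  # 9.1
```

The GPU detection times in Table 2 agree with the reciprocal of the GPU FPS values to one decimal place.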

Detection performance experiment comparison

The detection performance of CCGS-YOLOv5s and YOLOv5s was evaluated using randomly selected underwater garbage images, as shown in Figure 8. In Experiment Group 1, the YOLOv5s model exhibited missed detections. In Experiment Groups 3 and 4, the YOLOv5s model produced false detections, incorrectly identifying irrelevant content in the images as plastic bottles. Across all four experiments, the CCGS-YOLOv5s algorithm showed higher detection confidence than the YOLOv5s model, indicating better recognition performance and robustness. However, Experiment Group 4 revealed that the improved model also produced false detections for targets that were not significantly different from the background.

Figure 8.

Comparison of detection performance. (a) Images; (b) YOLOv5s algorithm; (c) CCGS-YOLOv5s algorithm.

Ablation study

To verify the performance improvements of the YOLOv5s algorithm after optimization, we conducted comparative tests on models integrated with the CBAMC3 attention mechanism, GhostConv convolution, and the improved CCGS-YOLOv5s model, against the original model. The performance of each version of the model was evaluated through a series of ablation experiments. The experimental results are presented in Table 3.

Table 3.

Comparison of ablation study results.

Model	Model size (MB)	Detection speed (FPS)	Mean average precision (mAP/%)	Number of parameters
YOLOv5s	40.4	109.9	84.1	7.03 × 10⁶
YOLOv5s + CBAMC3	43.7	97.8	89.8	7.07 × 10⁶
YOLOv5s + GhostConv	26.4	149.2	82.6	6.95 × 10⁶
CCGS-YOLOv5s	32.5	146.1	88.1	6.99 × 10⁶

Table 3 shows that embedding the CBAM module in the C3 layers of the backbone network increases the model size by 8.2% and reduces the FPS by 11.0%, but raises the mAP from 84.1% to 89.8%, a relative improvement of 6.8%, indicating that CBAM effectively enhances the model’s ability to capture key features. Replacing the convolutional layers in the neck network with GhostConv lowers the mAP from 84.1% to 82.6%, a relative loss of 1.8%, but reduces the model size by 34.7% and increases the FPS by 35.8%, demonstrating that GhostConv provides efficient acceleration with a minor loss of precision. The CCGS-YOLOv5s model, which combines CBAM and GhostConv, raises the mAP to 88.1% (a relative improvement of 4.8%), reduces the model size by 19.6%, and increases the FPS by 32.9%, achieving both higher detection accuracy and stronger real-time detection capability than the original YOLOv5s.

Conclusion

This paper focuses on common underwater garbage such as trash bags, discarded shoes, abandoned bottles, and other types of underwater waste. Addressing the challenges posed by underwater environments, including poor lighting, low visibility, and complex background interferences, an improved YOLOv5 algorithm is proposed. The main conclusions are as follows:

(1) To address issues such as insufficient lighting, poor visibility, and interference from complex backgrounds in underwater environments, a multi-scale Retinex image enhancement algorithm was applied to preprocess the underwater debris images, effectively improving the image quality. In terms of the quantity of images, additional garbage pictures captured in actual water environments were added to enhance the assessment of the improved model’s robustness and detection accuracy.

(2) Based on the YOLOv5s architecture, the CBAM attention mechanism is embedded in the C3 layers of the backbone network to enhance its ability to extract features of image targets. Meanwhile, in the neck network, the ordinary convolutional blocks are replaced with the GhostConv lightweight convolutional module, which reduces the model’s parameters and computational costs and improves the recognition speed of the model to meet the demands of real-time recognition.

In summary, the CCGS-YOLOv5s model proposed in this paper demonstrates the vast potential of deep learning in specific applications, opening new avenues for subsequent underwater environment monitoring and protection. However, while lightweight convolutional structures reduce the model’s parameter count, they may also decrease recognition accuracy for small or multiple targets, leading to significant deviations in detection positioning. Further research is needed to optimize the model’s detection performance.

Statements and declarations

Footnotes

Conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Natural Science Foundation of Jiangxi Province (20224BAB202033) and Research Project on Teaching Reform in Higher Education Institutions of Jiangxi Province (JXJG-22-70-7).

ORCID iD

Bo Guo

References

1. Chen, Fang, Zheng, et al. Microplastic pollution in wild commercial nekton from the South China Sea and Indian Ocean, and its implication to human health. Mar Environ Res 2021; 167: 105295. DOI: 10.1016/j.marenvres.2021.105295.

2. Shih, Chen, et al. Towards underwater sustainability using ROV equipped with deep learning system. In: 2020 International Automatic Control Conference (CACS), Hsinchu, Taiwan, 4–7 November 2020. DOI: 10.1109/CACS50047.2020.9289788.

3. Xue, Huang, Wei, et al. An efficient deep-sea debris detection method using deep neural networks. IEEE J Sel Top Appl Earth Obs Rem Sens 2021; 14: 12348–12360.

4. Girshick, Donahue, Darrell, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, USA, 23–28 June 2014.

5. Ren, Girshick, et al. Faster R-CNN: towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst 2015; 28.

6. Zhang, Ren, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Machine Intell 2015; 37(9): 1904–1916.

7. Redmon, Divvala, Girshick, et al. You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.

8. Jiang, Wang. Underwater object detection based on improved single shot MultiBox detector. In: Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, Sanya, China, 24–26 December 2020.

9. Liu, Anguelov, Erhan, et al. SSD: Single Shot MultiBox Detector. CoRR 2015, abs/1512.02325.

10. Fulton, Hong, Islam, et al. Robotic detection of marine litter using deep visual detection models. 2018. DOI: 10.48550/arXiv.1804.01079.

11. Lin, Zhong, Liu, et al. RoIMix: proposal-fusion among multiple images for underwater object detection. 2019. DOI: 10.48550/arXiv.1911.03029.

12. Liu, Song, Ding. WQT and DG-YOLO: towards domain generalization in underwater object detection. 2020.

13. Wei, et al. MLDet: towards efficient and accurate deep learning method for Marine Litter Detection. Ocean Coast Manag 2023; 243: 106765. DOI: 10.1016/j.ocecoaman.2023.106765.

14. Liu, Wang, et al. YOLOX: exceeding YOLO series in 2021. 2021. DOI: 10.48550/arXiv.2107.08430.