Abstract
Underwater image enhancement has long been an active topic in underwater vision research. However, because of the complicated underwater environment, raw underwater images commonly suffer from problems such as color distortion and low brightness. To address this, we propose a generative adversarial network that integrates multiple attention mechanisms to enhance underwater images. In the generator, we introduce multi-layer dense connections and CSAM modules: the former capture more detailed features and reuse earlier features, while the latter improve the utilization of the feature maps. We also improve the quality of the generated images by combining a VGG19 content loss with a SmoothL1 loss. Finally, we verify the effectiveness of the proposed model through qualitative and quantitative experiments and compare the results with several recent models. The results show that the proposed method outperforms the existing methods it was compared with.
Introduction
In recent years, underwater imaging has played an important role in deep-sea exploration, underwater archaeology, and marine life monitoring. Because of the complex underwater environment, raw images rarely meet the requirements of image processing. Since light propagates differently in water than in air, light in water attenuates and scatters, which degrades underwater images. Color distortion, underexposure, and blur are the three major forms of this degradation [1]. First, the color of underwater images is often distorted by factors such as imaging depth, lighting conditions, type of water body, and light wavelength [2]. For example, red light attenuates the fastest underwater and disappears completely at a depth of 10 meters [3], so images taken more than 10 meters below the water surface show obvious color distortion. Second, water absorbs light energy and causes underexposure, especially when images are taken underwater with unidirectional light sources. Because of the light-absorbing properties of water, objects too far from the lens are almost indistinguishable [4]. Third, the blurring of underwater images can be attributed to two factors. The numerous suspended particles between the camera and the object scatter light and deflect its direction of propagation. In addition, the contrast of underwater images is affected by the suspended particles between the lens and the object or by light reflected in the water [5]. Finally, the angle between the camera lens and the object also affects the final image quality when shooting underwater.
In the past, physics-based methods and restoration methods were used to enhance underwater images. Image enhancement methods without prior parameters [6] focus on adjusting the pixel values of the image to obtain satisfactory results. Image restoration methods [7–9] use atmospheric degradation models to enhance underwater images. However, the many complicated physical and optical factors underwater make these traditional methods difficult to apply. Because data sets suitable for underwater image enhancement are scarce, these methods generalize poorly across different underwater images, and the enhanced images of certain scenes tend to be over-enhanced or under-enhanced. In recent years, deep learning has made great progress in computer vision and image processing. Several existing models based on deep convolutional neural networks (CNNs) and generative adversarial networks (GANs) achieve state-of-the-art performance in improving perceptual image quality when trained on large paired or unpaired data sets [10–12]. For underwater images in particular, many GAN-based models [13–15] and CNN-based models [16, 17] have made encouraging progress in color enhancement, defogging and contrast adjustment. However, deep learning-based methods still need to be improved for complex scenes in which the depth information of the scene cannot be accurately estimated.
To address these challenges, we propose a generative adversarial network that fuses multiple attention mechanisms for underwater image enhancement. By assuming that there is a non-linear mapping between the distorted image (input) and the enhanced image (output), we formulate the problem as style transfer from above-water images to underwater images. We then propose a GAN model based on the CSAM (Channel attention & Self-Attention Module) and conduct adversarial training on the EUVP dataset to learn this mapping. The major contributions of this paper are summarized as follows:
(A) We propose an underwater image enhancement model, FMAGAN, which integrates multiple attention mechanisms. By combining CSAM with multi-layer dense connections, the global content, color, local texture and style of the image can be better exploited for style transfer, color projection correction and image detail restoration.
(B) We use a VGG19 content loss and a SmoothL1 loss to improve the quality of the images produced by the generator.
(C) We perform ablation experiments to verify the effectiveness of the CSAM module. In addition, we carry out extensive qualitative and quantitative experiments to evaluate the model and compare it with several advanced models. The results show that FMAGAN enhances underwater images well and that it scores better than the compared models on each evaluation metric.
This paper is organized as follows: Section 2 introduces related work; Section 3 gives a detailed description of our technical approach; Section 4 provides results and a discussion of these results; lastly, Section 5 presents summary and future prospects.
Related Work
Single underwater image enhancement
Image enhancement is widely used in production and daily life. Its goal is to emphasize the features of interest in an image and suppress the uninteresting ones, thereby improving image quality. Classical image enhancement methods use hand-crafted filters to remove uninteresting features and improve contrast/dynamic range to highlight the features of interest [18]. Other approaches use a physics-based atmospheric defogging model, which restores the true underwater color by estimating the transmitted light and ambient light in the scene [19]. In recent work, Akkaynak et al. [20] proposed a revised model based on the atmospheric defogging model and obtained accurate underwater prior parameters by accurately estimating the depth of underwater images; this model exhibited the best underwater image processing performance. Nevertheless, these methods require numerous images for accurate depth estimation and optical water body measurements as priors.
Over the last decade, with the continuous development of deep learning and the emergence of new data sets, single image enhancement has made great progress. Current CNN-based models provide state-of-the-art performance for image enhancement. For example, Li et al. [16] proposed a CNN-based underwater image enhancement model known as UWCNN, which can reconstruct a clear latent underwater image. Li et al. [21] constructed an underwater image enhancement benchmark data set (UIEB); they also proposed a CNN-based underwater image enhancement algorithm and trained it on UIEB. These models perform better than filter-based methods.
In addition, GAN-based models [22] have developed rapidly and achieved encouraging results in image style transformation and image-to-image translation [11]. A generative adversarial network forms a dynamic "game process": during training, the generator tries to produce pictures realistic enough to deceive the discriminator, while the discriminator tries to separate the generated pictures from the real ones. In recent work, Guo et al. [23] introduced a multi-scale dense block (MSDB) algorithm called DenseGAN. The algorithm combines multi-scale dense blocks with residual learning to improve network performance, and it uses a number of effective loss functions to enhance underwater images. Li et al. [24] proposed a GAN-based color correction method for underwater images. Inspired by image-to-image translation, the model uses a cyclic structure including a feedforward network and a feedback network to learn the mapping function between the source domain (in water) and the target domain (in air). Fabbri et al. [13] proposed the Underwater Generative Adversarial Network (UGAN) to improve the quality of underwater images. UGAN adopts the Wasserstein GAN with Gradient Penalty (WGAN-GP) [25], which enforces a soft Lipschitz constraint through a penalty on the gradient norm instead of clipping the weights to a fixed range. At the same time, the conditional GAN [26] constrains the generator to produce samples that conform to a given pattern or belong to a specific category. This is particularly useful for learning a pixel-to-pixel (Pix2Pix) mapping [11] between an arbitrary input domain (e.g., distorted images) and a desired output domain (e.g., enhanced images).
Attention mechanism
Initially, the attention mechanism was used only in natural language processing tasks such as machine translation, recommendation systems and text recognition, where it helped achieve satisfactory results. Researchers therefore began to introduce the attention mechanism into image processing. In this regard, researchers at Google proposed the Transformer [27], which abandons CNN and RNN structures entirely. Since then, more and more attention mechanisms have been used in image processing.
For instance, Mejjati et al. [28] suggested that the attention mechanism should be trained jointly with the generator and discriminator to obtain a better transfer effect. Chen et al. [29] proposed a model called Attention-GAN, which adds an independent attention network to the generator to produce attention maps, thereby improving the quality of the generated images. Tang et al. [30] proposed a multi-channel attention selection method to generate new composite images by adjusting the scene image and the semantic map; this method achieved good results in cross-view image translation. Wang et al. [31] proposed a residual attention network that stacks multiple attention modules, in which the attention-aware features from different modules change adaptively as the layers go deeper. Zhang et al. [32] proposed a self-attention generative adversarial network, which provides attention-driven long-range dependency modeling for image generation tasks. Woo et al. [33] proposed a convolutional block attention module called CBAM. CBAM uses a channel attention module and a spatial attention module to infer attention maps in turn and multiplies the attention maps by the input feature map to adaptively refine the features.
Methodology
Model structure of FMAGAN
We made the following assumptions: the source domain X consists of unprocessed underwater images, and the desired domain Y consists of enhanced underwater images. Our task is therefore to learn the mapping F: X→Y. The first step is to let the model learn the style characteristics of the image (color, content, style, etc.). We can then obtain the enhanced image by transferring the style features of the waterless image to the underwater image. Because generative adversarial networks perform well in style transformation, we take a GAN as the basic framework of the model. The generator learns this mapping by constantly playing a game with the discriminator. As shown in Fig. 1, we designed the generator network following the principle of U-Net [34]. We denote the down-sampling layers by d_i (i = 1, 2, 3, 4, 5) and the up-sampling layers by u_i (i = 1, 2, 3, 4). The generator is an encoder-decoder network (d1 - d5, u1 - u4) with connections between mirrored layers, namely (d1, u4), (d2, u3), (d3, u2) and (d4, u1). We also used dense connections in this mirrored encoder-decoder network. In Fig. 1-b, the input of u1 comes from (d3, d4, d5), the input of u2 comes from (u1, d3, d2), and so on. The reason for this design is that deep layers can easily lose the features learned by shallow layers; through the dense connections, the up-sampling layers can learn more features. Moreover, this approach has proved very effective for image-to-image translation and image quality enhancement [11]. We also added the CSAM module to the U-shaped structure to further strengthen feature extraction and learning. In FMAGAN, the input of the network is 256×256×3, and the encoder (d1 - d5) learns only 256 feature maps of size 8×8. The decoder (u1 - u4) learns to generate a 256×256×3 (enhanced) image as output from these feature maps and the densely connected inputs.
Additionally, 2D convolutions with 4 × 4 filters were applied to each layer, followed by a nonlinear activation function (Leaky-ReLU) [35] and Batch Normalization (BN) [36]. In Fig. 1, we annotated the feature map size and other model parameters in each layer.

The network architecture and network parameters of the FMAGAN model.
For the discriminator, we adopted the Markovian PatchGAN [37] architecture. The PatchGAN mechanism optimized the generator network structure. It favored a structure similar to U-Net, allowing more low-level information to be exchanged. At the same time, PatchGAN gave the training of the generator a gentle gradient, as a Res-block does, which slowed down gradient vanishing to a certain extent and made the training more effective. As shown in Fig. 1, four convolutional layers were used to transform a 256 × 256 × 6 input (real image and generated image) into a 16 × 16 × 1 output. In each layer, 3 × 3 convolutional filters were used with a stride of 2.
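The layer sizes quoted above can be checked with the standard convolution output-size formula. The sketch below is illustrative only: the stride of 2 for the encoder and the padding of 1 for both networks are assumptions (the text does not state them), and `conv_out` and `trace_sizes` are hypothetical helper names rather than part of the FMAGAN code.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a 2D convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_sizes(size, layers, kernel, stride=2, padding=1):
    """Feature-map sizes after each of `layers` identical strided convolutions."""
    sizes = []
    for _ in range(layers):
        size = conv_out(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


# Encoder d1-d5: 4x4 filters; stride 2 and padding 1 are assumed here.
encoder_sizes = trace_sizes(256, layers=5, kernel=4)
# Discriminator: four 3x3 convolutions with stride 2; padding 1 is assumed.
discriminator_sizes = trace_sizes(256, layers=4, kernel=3)

print(encoder_sizes)        # [128, 64, 32, 16, 8]
print(discriminator_sizes)  # [128, 64, 32, 16]
```

Under these assumptions, five halvings take the 256 × 256 input down to the 8 × 8 bottleneck, and four halvings give the 16 × 16 patch output of the discriminator.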
We added the CSAM module to the generator's network structure. As shown in Fig. 3, we developed the CSAM module for underwater image enhancement based on the design idea of the convolutional block attention module. As shown in Fig. 2, previous underwater image enhancement models did not perform well on large-scale underwater images with the ocean as the background. In enhancement algorithms based on a physical model, the depth estimate of the image is often inaccurate, so the color of the underwater image cannot be accurately corrected. In learning-based enhancement models, the high-dimensional features learned by the model are biased toward the seawater itself when seawater occupies a large area of the image. The most direct manifestation is that the enhanced image is bluish or some pixels are garbled. In response, we designed the CSAM attention module and added it to the low-level feature maps to improve the performance of the network.

CSAM network structure.

Unprocessed images with the ocean as the background in the EUVP data set.

Channel Attention Module network structure.

Self-Attention Module network structure.
The standard GAN model adopts a zero-sum game, and its corresponding objective function is:
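For reference, the standard GAN objective is usually written in the following form. This is the textbook formulation rather than a transcription of formula (1); the symbols x (a real sample drawn from p_data) and z (the generator input drawn from p_z) are notation assumed here.

```latex
\min_{G}\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```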
In formula (1), D represents the discriminator model, and G represents the generator model. For the discriminator,
Similarly, the generator tries to minimize L_cGAN and the discriminator tries to maximize L_cGAN. We formulated an objective function for training the model on EUVP's paired data set. This function guides the generator to improve perceptual image quality, so that the generated image is closer to the respective ground truth in terms of global appearance and high-level feature representation. Specifically, we trained the model using the following objective function:
In formula (4), L_real represents the mean square error for the clear underwater images that are input to the discriminator, and L_fake represents the mean square error for the generated images that are input to the discriminator. D* represents the loss of the discriminator. We multiplied the value of D* by 10 to stabilize the training.
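The discriminator loss described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that clear images are scored against a target of 1 and generated images against a target of 0, and that the two error terms are averaged before the tenfold scaling. The function names are hypothetical, and plain Python lists stand in for the flattened 16 × 16 PatchGAN output.

```python
def mse(values, target):
    """Mean square error between a flat list of scores and a constant target."""
    return sum((v - target) ** 2 for v in values) / len(values)


def discriminator_loss(real_scores, fake_scores, scale=10.0):
    """Scaled discriminator loss; averaging the two terms is an assumption."""
    l_real = mse(real_scores, 1.0)  # clear images should score close to 1
    l_fake = mse(fake_scores, 0.0)  # generated images should score close to 0
    return scale * 0.5 * (l_real + l_fake)


# Toy patch scores standing in for a flattened discriminator output.
real_scores = [0.9, 0.8, 1.0, 0.7]
fake_scores = [0.1, 0.3, 0.0, 0.2]
loss = discriminator_loss(real_scores, fake_scores)
print(round(loss, 4))
```

A perfect discriminator (all real scores 1, all fake scores 0) gives a loss of zero, and the scale factor only changes the magnitude of the gradient, not the location of the minimum.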
We used PyTorch to implement the FMAGAN model and trained it on paired sets of underwater images of size 256×256. One NVIDIA GeForce RTX 2080Ti graphics card was used for training; the model was trained for 120 iterations with a batch size of 8. We performed ablation experiments, qualitative analysis and quantitative analysis on the model.
Dataset
We trained and evaluated our model on the EUVP data set [38]. The EUVP data are divided into a paired data set, an unpaired data set, and a test data set. The paired data set was used as the input of the model. It is divided into two folders, TrainA and TrainB: the distorted underwater images are stored in TrainA, and the clear underwater images are stored in TrainB. TrainA and TrainB each contain about 12,000 underwater pictures. Figs. 6 and 7 show some sample images from TrainA and TrainB. The input image size was 256×256 and the enhanced image size was also 256×256. We took the images in TrainA as the source domain X (the model input) and those in TrainB as the desired domain Y (the expected model output). Then, we trained FMAGAN to learn the mapping F: X→Y. The test data set is organized in the same way as the paired data set and contains 515 test pictures.
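Collecting the image pairs for this layout might look like the sketch below. It assumes that a distorted image in TrainA and its clear counterpart in TrainB share the same file name, which the text does not state; `pair_images` is a hypothetical helper, and the demonstration builds a throwaway directory with empty placeholder files instead of real images.

```python
import tempfile
from pathlib import Path


def pair_images(root, source_dir="TrainA", target_dir="TrainB"):
    """Return (distorted, clear) path pairs matched by file name.

    Assumes both folders sit directly under `root` and that matching
    images carry identical file names; unmatched files are skipped.
    """
    root = Path(root)
    targets = {p.name: p for p in (root / target_dir).iterdir() if p.is_file()}
    pairs = []
    for src in sorted((root / source_dir).iterdir()):
        if src.is_file() and src.name in targets:
            pairs.append((src, targets[src.name]))
    return pairs


# Demonstration on a temporary folder with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for folder, names in (("TrainA", ["a.jpg", "b.jpg"]), ("TrainB", ["a.jpg"])):
        (Path(tmp) / folder).mkdir()
        for name in names:
            (Path(tmp) / folder / name).touch()
    demo_pairs = [(s.name, t.parent.name) for s, t in pair_images(tmp)]

print(demo_pairs)  # [('a.jpg', 'TrainB')]
```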

Underwater distortion images in TrainA.

Clear underwater images in TrainB.
We conducted more experiments to verify the performance of the CSAM module. In the ablation experiment, we used FMAGAN as the baseline. We used test data set for testing, and each model was trained for 120 iterations. Considering that the ablation studies were all variants of the proposed methods, we proposed the following methods for testing:
1) We removed a CSAM module in FMAGAN, and then named this method -CS1.
2) We removed all CSAM modules in FMAGAN, and then named this method -CS2.
3) We removed the channel attention of all CSAM modules in FMAGAN, and then named this method -C.
4) We removed the self-attention of all CSAM modules in FMAGAN, and then named this method -S.
We conducted a quantitative analysis of the above four variant models and the baseline model, and marked the best result for each metric. The evaluation results are given in Table 1. FMAGAN achieves the best results on both the PSNR and SSIM criteria. It should be noted that introducing self-attention greatly increases the computational cost of the model, although it undeniably also improves performance. To balance training speed and model performance, we added two CSAM modules to FMAGAN, which achieved the best results on the various indicators. Because the training time increases significantly with each additional CSAM module, we did not insert more CSAM modules into FMAGAN. Finally, comparing the four variants shows that the CSAM module significantly improves the model.
Ablation study for attention modules
First, we conduct a qualitative analysis of the images enhanced by FMAGAN, covering color restoration and detail processing. As shown in Fig. 8-a, most of the enhanced images are restored to the colors of the real environment. In addition, as shown in Fig. 8-b, the green hue in the underwater image is corrected and the global contrast is enhanced. These are the main features of the enhanced underwater images. We also make a more detailed comparison, selecting one enhancement algorithm based on a physical model and one learning-based algorithm to show the enhancement effect of FMAGAN more specifically. As shown in Fig. 9, we select the physics-based UDCP algorithm for comparison. Our method effectively corrects the color cast of objects and scenes, whereas the green background remains in the image enhanced by UDCP. As shown in Fig. 10, we select underwater images with a large blue background to illustrate that objects are easily disturbed by environmental colors during image style transformation. Here we select the learning-based Cycle-GAN algorithm for comparison. The results indicate that our method accurately restores the correct color of the object, while the image enhanced by Cycle-GAN is affected by the color of the scene, which biases the object toward a blue tone.

The colors of underwater images enhanced by FMAGAN are more realistic.

The underwater enhanced image of UDCP based on the physical model is greenish.

CycleGAN’s underwater enhancement map based on style transfer is biased towards the color of the environment itself.
Next, we qualitatively compare the enhancement results of FMAGAN with several state-of-the-art models. We consider five learning-based models: (1) the GAN-based underwater color correction model (UGAN) [13]; (2) the GAN-based underwater color correction model with gradient penalty (UGAN-P) [13]; (3) the underwater image enhancement convolutional neural network model based on underwater scene prior (UWCNN) [16]; (4) Water-Net [21]; (5) Cycle-GAN [12]. They are trained on the paired EUVP data set with the same settings as FMAGAN. We also compare two physics-based models: the dark channel prior method for underwater images (UDCP) [39] and fusion-based underwater image enhancement (FUSION) [40]. All models are evaluated qualitatively on a common test set of 515 underwater images. The comparison of several samples is shown in Fig. 11.

Qualitative performance comparison of FMAGAN with five learning-based models (UGAN, UGAN-P, UWCNN, Water-Net, Cycle-GAN) and two physics-based models (UDCP and FUSION).
As shown in Fig. 11, among the learning-based models, the underwater images enhanced by UGAN and UGAN-P are still greenish, and those enhanced by UWCNN are poor in color and contrast restoration. Although Water-Net and Cycle-GAN show enhancement effects similar to those of FMAGAN, FMAGAN is more realistic in color restoration and is slightly better than these two models in the quantitative analysis. Among the physics-based models, images enhanced by FUSION often appear overexposed because of high contrast, and some of them also show large areas of red shadow. The images enhanced by UDCP suffer from inaccurate color reproduction and a greenish cast. In addition, although UGAN and UGAN-P achieve better enhancement effects on underwater images, the images they produce are over-saturated in color and easily affected by the colors of the surrounding environment. In general, FMAGAN performs better in enhancing underwater images without using prior parameters.
We use peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [41], and underwater image quality measure (UIQM) [42] to analyze the enhanced images quantitatively. PSNR is the most common and most widely used objective measure of image quality: the higher the PSNR, the better the image quality. Nonetheless, this objective measure can be unreliable, because PSNR scores do not always agree with human subjective perception of image quality; an image with a higher PSNR may look worse than one with a lower PSNR. SSIM is also a widely used index of image quality and measures the similarity of images in terms of brightness, contrast and structure. SSIM takes values between 0 and 1, and the larger the value, the less distortion. It is based on the assumption that humans extract structural information when looking at an image, so SSIM conforms better to human subjective judgment of image quality. Furthermore, we use UIQM as a supplement to evaluate the quality of the enhanced images. UIQM is a no-reference index of underwater image quality inspired by the human visual system; it evaluates underwater images in terms of color, sharpness and contrast. The larger the UIQM value, the better the color balance, sharpness and contrast of the image. Using these indexes together gives a more comprehensive evaluation of the enhanced images. PSNR approximates the reconstruction quality of the generated image x relative to its ground truth y based on their mean square error (MSE), as shown below:
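In its usual form, PSNR(x, y) = 10 · log10(MAX² / MSE(x, y)), where MAX is the largest possible pixel value. The sketch below is a minimal illustration that assumes 8-bit images (MAX = 255) flattened into plain Python lists; `mse` and `psnr` are illustrative names, not functions from the paper's code.

```python
import math


def mse(x, y):
    """Mean square error between two equally sized flat pixel sequences."""
    if len(x) != len(y):
        raise ValueError("images must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def psnr(x, y, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    error = mse(x, y)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)


# Toy 4-pixel "images": generated output and its ground truth.
generated = [52, 55, 61, 59]
ground_truth = [50, 55, 60, 62]
print(round(psnr(generated, ground_truth), 2))
```

Because the MSE sits in the denominator, a smaller pixel-wise error gives a larger PSNR, which is why higher values indicate output closer to the ground truth.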
SSIM compares image patches based on three attributes (brightness, contrast, and structure). It is defined as:
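The usual form of SSIM for two image patches x and y is given below. The notation is assumed here rather than taken from the paper: μ denotes the patch mean (brightness), σ² the variance (contrast), σ_xy the covariance (structure), and C_1 and C_2 are small constants that stabilize the division.

```latex
\mathrm{SSIM}(x, y) =
  \frac{(2\mu_x \mu_y + C_1)\,(2\sigma_{xy} + C_2)}
       {(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)}
```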
Given the degradation mechanism and imaging characteristics of underwater images, UIQM adopts Underwater Image Colorfulness Measure (UICM), Underwater Image Sharpness Measure (UISM) and Underwater Image Contrast Measure (UIConM) as the evaluation basis and expresses itself as a linear combination of the foregoing three indexes. It is defined as:
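Written out, the linear combination described above takes the following form, with c_1, c_2 and c_3 the weights of the three component measures:

```latex
\mathrm{UIQM} = c_1 \cdot \mathrm{UICM} + c_2 \cdot \mathrm{UISM} + c_3 \cdot \mathrm{UIConM}
```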
Among them, the three parameters were determined through multiple linear regression. In this paper, we used the parameter settings in [42], namely c1 = 0.0282, c2 = 0.2953, and c3 = 3.5753.
In Tables 2 and 3, we provide the average PSNR, SSIM, and UIQM values of FMAGAN on the test set and compare them with the same models used in the qualitative evaluation. For PSNR and SSIM, we compare the underwater images enhanced by FMAGAN with the clear underwater images in the test set. In Table 2, FMAGAN achieves the highest values for both PSNR and SSIM, which shows that the images enhanced by FMAGAN are closest to the clear underwater images and that FMAGAN is effective in enhancing underwater images. For UIQM, we compare the underwater images enhanced by FMAGAN with the distorted underwater images in the test set. In Table 3, FMAGAN also achieves the best UIQM value. The mean UIQM of the distorted underwater images in the test set is also provided; the UIQM value of FMAGAN is higher, which indicates that image quality improves after FMAGAN enhancement.
Quantitative comparison of the average PSNR and SSIM values on the test images of the EUVP data set
Comparison of the average UIQM value of the images after enhancement of each model on the EUVP test set
In this paper, we proposed a generative adversarial network that integrates multiple attention mechanisms to enhance underwater images. The method enhances underwater images by learning the style features of undistorted images and transferring these features to underwater images. The module that integrates multiple attention mechanisms (CSAM) improves network performance, and the combination of multiple loss functions yields better-quality enhanced images. We carried out a series of experiments to verify the effectiveness of the model, and the results showed that the proposed model was superior to several existing underwater enhancement models. In addition, we conducted ablation studies to show the contribution of each component, which further validated the effectiveness of the model.
We found that the self-attention structure greatly increased the computational complexity of the model. In the future, we hope to use a lighter structure to improve the running speed of the model. We are also trying to apply the model to unpaired training.
Acknowledgments
This work was supported by the Real-time Underwater Specific Target Autonomous Recognition Project (2019750001).
The author would like to thank the supervisor for his careful guidance and thank all participants who made constructive comments on the paper.
Disclosures
The authors declare no conflicts of interest.
