Abstract
A hazy image is characterized by atmospheric conditions that reduce the image’s clarity and contrast, thereby making the scene less visible. This degradation in image quality can hinder the performance of advanced computer vision tasks such as object detection and identifying open spaces, which need to perform with high accuracy in important real-world applications such as security surveillance and autonomous driving. In the recent past, the use of deep learning in image processing tasks has shown a remarkable improvement in performance; in particular, Convolutional Neural Networks (CNNs) have outperformed other types of neural networks in image-related tasks. In this paper, we propose the addition of Channel Attention and Pixel Attention layers to four state-of-the-art CNNs used for the task of image dehazing, namely GMAN, U-Net, 123-CEDH and DMPHN. We show that the addition of these layers yields a non-trivial improvement in the quality of the dehazed images, which we demonstrate qualitatively with examples and quantitatively by obtaining PSNR and SSIM scores of 28.63 and 0.959 respectively. Through the experiments, we show that the addition of the mentioned attention layers to the GMAN architecture yields the best results.
Introduction
A hazy image can be subjectively characterized as an image in which visibility is reduced due to a loss of color and contrast. Hazy images result from various factors, including smoke, dust, fog, water droplets, and other substances that cause light to scatter as it passes through them. Dehazing is the process of restoring the original color and contrast of an image as it would appear under normal conditions. It is an important task because the presence of haze can significantly impair the performance of various computer vision tasks, such as detection and identification of objects, semantic segmentation, low-light image enhancement, and free space detection. The accuracy of these tasks is of paramount importance in applications such as security surveillance, autonomous driving, and agricultural monitoring. Therefore, there has been a surge in the amount of work done on reconstructing clear images from the original hazy ones.
The task of image dehazing is exceptionally challenging due to the inherent difficulty in obtaining real-world data. Consider a foggy scene: once a hazy image is captured, retaking the ground truth image under normal conditions becomes nearly impossible. This is due to potential moving objects and external factors. Moreover, collecting hazy images of various distributions, such as those caused by smoke or fog, adds further complexity to the task. Current state-of-the-art techniques for image dehazing need to be further improved to perform computer vision tasks more accurately.
A mathematical formulation of the image dehazing task is given by the atmospheric scattering model [1, 2], shown in Equation 1:
$$I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr) \tag{1}$$
In this equation, I(x) represents the hazy image captured by the camera, while A and t(x) denote the global atmospheric light and the transmission map, respectively. J(x) represents the scene radiance that needs to be restored.
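As an illustration of the atmospheric scattering model, the following minimal sketch synthesizes a hazy image from a clear one. The function name `apply_haze`, the nested-list image layout, and the use of a scalar atmospheric light are assumptions made for this example and are not taken from the cited works.

```python
def apply_haze(clear, transmission, airlight):
    """Synthesize a hazy image via I(x) = J(x) t(x) + A (1 - t(x)).

    clear:        H x W x 3 nested lists with values in [0, 1]  (J)
    transmission: H x W nested lists with values in (0, 1]      (t)
    airlight:     scalar global atmospheric light               (A, assumed scalar)
    """
    hazy = []
    for row_j, row_t in zip(clear, transmission):
        hazy_row = []
        for pixel, t in zip(row_j, row_t):
            # Each colour channel is attenuated by t and mixed with the airlight.
            hazy_row.append([c * t + airlight * (1.0 - t) for c in pixel])
        hazy.append(hazy_row)
    return hazy


# Toy 1 x 2 image: the second pixel has a lower transmission (denser haze).
clear = [[[0.2, 0.4, 0.6], [0.2, 0.4, 0.6]]]
transmission = [[1.0, 0.5]]
hazy = apply_haze(clear, transmission, airlight=1.0)
```

With a transmission of 1 the pixel is unchanged, while lower transmission pulls every channel towards the atmospheric light, which is the loss of colour and contrast described above.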
Traditional methods to dehaze an image involved making certain observations about images in general and extending these observations, as assumptions, to all images. For example, [3] observed that in most clear images, there exist one or more patches in which several pixels have values of zero or close to zero in at least one colour channel. This is known as the Dark Channel Prior (DCP). DCP was used to predict the transmission map to obtain a clear image; however, several artifacts were also produced in the clear image, likely due to the lack of precision in the estimation of the transmission map. [4] and [5] tried to improve the DCP using boundary constraints and larger variation respectively. This solved the problem of artifacts, but new problems arose, such as distortion of colours in sky regions. The above rule-based methods offered a clear processing flow and interpretability but often fell short in complex real-world scenarios. For these reasons, machine learning and deep learning have seen a rapid rise in popularity for developing state-of-the-art models to dehaze images.
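The dark channel described above can be sketched as a minimum over colour channels followed by a minimum over a local patch. The function name `dark_channel`, the patch size of 3, and the nested-list layout are assumptions chosen for this illustration.

```python
def dark_channel(image, patch=3):
    """Dark channel: min over RGB, then min over a patch x patch neighbourhood.

    image is an H x W x 3 nested list; borders use the part of the patch
    that lies inside the image (an assumption made for this sketch).
    """
    height, width = len(image), len(image[0])
    per_pixel_min = [[min(px) for px in row] for row in image]
    radius = patch // 2
    dark = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            dark[y][x] = min(per_pixel_min[j][i] for j in ys for i in xs)
    return dark


# A "clear-like" patch with one pixel that is dark in the green channel,
# and a uniformly bright "hazy-like" patch.
clear_like = [[[0.9, 0.5, 0.7] for _ in range(3)] for _ in range(3)]
clear_like[1][1] = [0.9, 0.0, 0.7]
hazy_like = [[[0.8, 0.8, 0.8] for _ in range(3)] for _ in range(3)]

dark_clear = dark_channel(clear_like)
dark_hazy = dark_channel(hazy_like)
```

In this toy case the dark channel of the clear-like patch is zero everywhere, whereas the hazy-like patch keeps a high dark channel, which is the cue the prior exploits when estimating transmission.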
The term machine learning indicates that a machine learns automatically from data by identifying and analysing patterns, and improves its performance as the amount of data it sees increases, without having to be programmed explicitly. Deep learning is a subset of machine learning that constructs a network of computational units, known as a neural network, loosely inspired by the functioning of the human brain. It is capable of learning from extremely large amounts of data, can approximate highly complex functions, and can handle many formats of data on the input and output side. Some types and components of neural networks that are referred to in this paper are defined in the following paragraph.
Generative Adversarial Networks (GANs) are a type of neural network that generates an output from scratch. The quality of the generated outputs is improved by using a second model to compare the generated output to a real output iteratively until that model can no longer tell the difference. Convolutional Neural Networks (CNNs) are a type of neural network that is highly capable of learning features of the input, for example, the shape of an object in a particular area of the input image. CNNs have been shown to outperform other types of neural networks in tasks involving image, speech, and audio signal inputs. Attention mechanisms consist of one or more layers in a neural network that ensure all parts of the input are given the required amount of emphasis. For example, in a scenic image of a playground, a deep learning model may fail to capture some children playing in a corner of the image; attention mechanisms are helpful in such cases.
Deep learning approaches to image dehazing may involve developing models that estimate the transmission map [3, 6–9], using Generative Adversarial Networks (GANs) to generate the dehazed image from scratch [10–17], and using end-to-end CNNs to extract features from individual parts of the images for better representation [18–24]. The current state-of-the-art models are all CNN based models that are capable of dehazing an image almost perfectly in many situations. In this paper, we provide ways to improve upon these models through the use of attention layers to produce clearer images in a wider array of situations and obtain better Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) scores (described in detail in later sections).
In this section, we provide an overview of the current deep learning approaches to image dehazing.
Estimation of t(x)
The Atmospheric Scattering Model (ASM) breaks down the process of dehazing into three components, namely, the estimation of the transmission map, the prediction of atmospheric light, and the recovery of the haze-free image. The Multi-Scale Convolutional Neural Network (MSCNN) [6] introduces a three-step approach to address the ASM: (1) use a Convolutional Neural Network (CNN) to predict the transmission matrix t(x); (2) employ statistical rules to estimate the atmospheric light A; and (3) calculate the clear image J(x) by jointly considering t(x) and A.
A convolutional model of multiple scales is utilized to estimate the transmission matrix in the MSCNN model, and it is further optimized using an L2 loss. Furthermore, the atmospheric light A can be determined by selecting the darkest 0.1% of pixels in t(x) and taking, among them, the pixel with the largest luminosity in I(x) [3]. Consequently, the dehazed image J(x) can be obtained using Equation 2, which rearranges Equation 1 for J(x):
$$J(x) = \frac{I(x) - A}{t(x)} + A \tag{2}$$
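The last two steps can be sketched as follows for a single-channel image. This is an illustrative sketch, not the MSCNN implementation: the function names, the `fraction` default, and in particular the lower bound `t_min` on the transmission (a common safeguard against division by very small values) are assumptions introduced here.

```python
def estimate_airlight(hazy_gray, transmission, fraction=0.001):
    """Pick the brightest hazy pixel among the lowest-transmission candidates."""
    pixels = [
        (t, v)
        for row_t, row_v in zip(transmission, hazy_gray)
        for t, v in zip(row_t, row_v)
    ]
    pixels.sort(key=lambda tv: tv[0])            # lowest transmission first
    count = max(1, int(len(pixels) * fraction))  # keep at least one candidate
    return max(v for _, v in pixels[:count])


def recover_scene(hazy_gray, transmission, airlight, t_min=0.1):
    """Invert the scattering model: J = (I - A) / max(t, t_min) + A.

    t_min is an assumed safeguard and is not part of the model itself.
    """
    return [
        [(v - airlight) / max(t, t_min) + airlight for v, t in zip(row_v, row_t)]
        for row_v, row_t in zip(hazy_gray, transmission)
    ]


# Hazy values produced from J = [0.2, 0.4] with t = [0.5, 0.25] and A = 1.0.
hazy_gray = [[0.6, 0.85]]
transmission = [[0.5, 0.25]]
airlight_guess = estimate_airlight(hazy_gray, transmission)
restored = recover_scene(hazy_gray, transmission, airlight=1.0)
```

With the true atmospheric light the original scene values are recovered; the statistical estimate `airlight_guess` is only an approximation, which is one reason errors in t(x) and A propagate into the restored image.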
Various research papers may employ different statistical priors to estimate the atmospheric light (A); however, their dehazing strategies bear similarities to MSCNN. For instance, ABC-Net [7] obtains the highest value from each luminosity channel using the max pooling operation. In contrast, SID-JPM [8] filters the three channels of the RGB input image with a minimum filter kernel and takes the highest value from each channel as the predicted value of A. LATPN (Learning Aggregated Transmission Propagation Network) [9] combines the minimum and maximum filters in predicting A. These methods typically do not necessitate explicit annotations for atmospheric light but do require pairs of hazy images and their corresponding transmission matrices for their operation.
Generative adversarial networks (GANs) have played a significant role in advancing dehazing research. For neural networks that take a supervised approach to image dehazing, an adversarial loss component is also considered. This adversarial loss [10] can be conceptually divided into two components: first, the generator is trained to produce images that the discriminator cannot tell apart from real images; second, the discriminator is trained to differentiate between the generated images and the real images as effectively as possible. For image dehazing, the adversarial loss aims to ensure that the generated images look as real as possible and cannot be distinguished from real images, which proves advantageous for estimating the dehazed image J(x) and the transmission matrix t(x) [11].
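The two components of the adversarial loss can be sketched with binary cross-entropy on the discriminator's scores. The function names are hypothetical, and the generator term uses the commonly used non-saturating form, which is an assumption rather than the exact loss of any cited dehazing network.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Discriminator term: score real images near 1 and generated ones near 0.

    d_real / d_fake are lists of discriminator outputs in (0, 1).
    """
    real_term = -sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake, eps=1e-12):
    """Generator term (non-saturating form): push D's score on fakes towards 1."""
    return -sum(math.log(p + eps) for p in d_fake) / len(d_fake)


# A confident discriminator has a low loss; a fooled one has a higher loss.
confident = discriminator_loss(d_real=[0.9], d_fake=[0.1])
fooled = discriminator_loss(d_real=[0.5], d_fake=[0.5])
# The generator's loss drops as its outputs are scored as more realistic.
weak_generator = generator_loss([0.1])
strong_generator = generator_loss([0.9])
```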
Drawing inspiration from the patchGAN [12], which excels in preserving high-frequency details, approaches like DH-GAN [13], RI-GAN [14], and DehazingGAN [15] adopt the use of image patches as the discriminator’s output. Several studies delve into mechanisms with several discriminators, as shown in EPDN (Enhanced pix2pix Dehazing Network) [16] and PGC-UNet [17]. In this setup, discriminator D1 guides the generator in fine-scale details, while discriminator D2 assists in producing a globally realistic output on a coarser scale.
End-to-End CNN
This approach involves neural networks that take a hazy image as input and produce a clear image as output. The layers in this type of neural network are convolutional layers or other standard components of convolutional neural networks. The FFA-Net (Feature Fusion Attention Network) architecture [18] comprises three integral components. Firstly, there is a novel Feature Attention (FA) module that blends Channel Attention (CA) with Pixel Attention (PA). This fusion accounts for the varying importance of features in different channels and acknowledges the unbalanced distribution of haze across distinct image pixels. FA, by treating features and pixels differentially, enhances the adaptability of CNNs in handling diverse information types, thereby expanding their representational capabilities. Secondly, a fundamental block structure is incorporated to filter out less significant information. Thirdly, an attention-based feature fusion structure enables the adaptive learning of feature weights from the FA module. The FFA-Net architecture can be seen in Fig. 1.

FFA-Net architecture [18].
The GridDehazeNet architecture [19] consists of the following modules: pre-processing, backbone, and post-processing. This pre-processing module is capable of capturing the varying and relevant features in the input image better than hand-picking the pre-processing method as it is a trainable module. The backbone module uses attention blocks across multiple scales in a grid-type network, effectively mitigating the bottlenecks one may face in standard methods of a multi-scale approach. Lastly, the post-processing module aids in reducing artifacts in the final output.
The architecture based on U-Net, as described in [20], integrates Squeeze-and-Excitation (SE) blocks into the skip connections to emphasize attention for each channel. Additionally, it incorporates parallelized dilated convolution blocks within the bottleneck, enabling the model to grasp local and global context. This combined approach results in a more comprehensive representation of image features. The U-Net architecture can be seen in Fig. 2.

U-Net architecture [20].
The Multi-Scale Boosted Dehazing Network with Dense Feature Fusion [21] is derived from the U-Net architecture and introduces two fundamental principles, boosting and error feedback, specifically tailored for addressing challenges of image dehazing. Incorporating the Strengthen-Operate-Subtract boosting strategy within the decoder allows this method to construct a decoder to produce the clear image step-by-step. The U-Net architecture may not effectively be able to preserve spatial information and hence, a dense feature fusion component is designed. This module leverages a back-projection feedback scheme to simultaneously rectify spatial information that is absent in sharp features and harness non-adjacent feature information.
While numerous network architectures have been devised for the restoration of dehazed images, a specific focus on the recovery of individual color channels, ensuring their high quality, has been somewhat overlooked. Addressing this concern, the 123-CEDH (Color Enhancement Dehazing) network [22] introduces an innovative network structure. This structure includes a shared feature encoder based on DenseNet, the output of which diverges into three separate decoders, producing estimations of the three color channels (RGB) within the image. Furthermore, a refinement block enhances the produced image through the coordinated treatment of these channels. To guarantee the retrieval of meaningful and superior-quality color channels, the loss function is augmented with regularization terms. These include a structural similarity index term and a color contrast enhancing term. The 123-CEDH architecture can be seen in Fig. 3.

123-CEDH architecture [22].
While CNN-based end-to-end deep learning methods have demonstrated their effectiveness in image dehazing, they often struggle with non-homogeneous dehazing challenges. Additionally, widely used multi-scale approaches tend to consume significant runtime resources and memory. In light of these issues, the authors of [23] introduce a faster solution named the Deep Multi-patch Hierarchical Network (DMPHN). DMPHN is designed to work on hazy images where the distribution of haze is not uniform by considering multiple image patches extracted from different parts of the hazy image, all while employing a reduced number of network parameters. This method exhibits robustness across diverse environmental conditions, accommodating varying densities of haze. Furthermore, the model is extremely lightweight. The DMPHN architecture can be seen in Fig. 4.

DMPHN architecture [23].
The Generic Model-Agnostic convolutional Neural Network (GMAN) [24] is a versatile model designed without specific considerations for image dehazing, notably avoiding reliance on the atmospheric scattering model. GMAN operates by taking a hazy image as input and generating its clear counterpart as output. Its architecture adopts the well-known encoder-decoder structure, comprising three main components: an encoder, a residual layer, and a decoder. The input image is transformed and downsampled by the encoder, which is then passed through the residual layer where pertinent features are extracted. These features guide the decoder in producing an approximation of the wanted output, incorporating upsampling and transformation. The combination of downsampling and upsampling helps preserve essential features while discarding unnecessary ones, contributing to the network’s robust generalization performance. Residual learning is applied at both local and global levels within GMAN. Local residual layers are formed using residual blocks in the middle and immediately after downsampling stages. This approach leverages the well-documented ease of training with residual blocks. Moreover, residual learning is integrated into the model’s overall architecture. The input image and the output from the final convolution layer are combined, creating a global residual block. This block aims to make up for any loss in information in the intermediary steps. The GMAN architecture can be seen in Fig. 5.

GMAN architecture [24].
In the current state-of-the-art models, we find that these models either struggle to remove the haze completely or produce colour-inaccurate dehazed images in situations where the haze tends to be more focused in a particular region or in situations where there is a distinct sharp object being clouded by the haze. While many of these models have used attention layers, there has been more emphasis on the rest of the model architecture. In the following sections, we describe how using certain attention blocks in existing state-of-the-art models can significantly improve their performance.
Dataset
In accordance with the atmospheric scattering model, the degree of haze is determined by the transmission map denoted as t(x) and the intensity of atmospheric light represented as A. Ensuring the proper configuration of the degree of haze and the intensity of atmospheric light is critical when constructing a dataset of synthetically generated hazy images. In this paper, we have employed the Outdoor Training Set (OTS) and the Indoor Training Set (ITS) available in RESIDE [25] to train the model. The OTS comprises a substantial collection of 313,950 hazy images with varying degrees of haze corresponding to 8,970 clear outdoor images. The training dataset consists of 278,950 hazy images (corresponding to 7,970 clear images). Similarly, the ITS contains 13,390 synthetic hazy images, created in a similar manner to the process employed for the OTS, using 1,339 clear indoor images as the source data.
Preprocessing
The image is read as a colour image (i.e. with three channels, R, G and B) and is resized to 412×548 pixels. This is done because the size of the input to the deep learning model is fixed here. We find that this size strikes the right balance between maintaining the quality of the image and enabling the model to run with a minimal number of parameters without compromising on performance. Preprocessing of the image is essential as it allows the model to take any hazy image as input without any restrictions on its size.
Model architectures
In an image where there is a lot of information to capture, some key parts may be ignored. In the task of image dehazing, it is critical that each part of the image is completely dehazed and is restored to the appropriate colour contrasts and brightness. Attention mechanisms can be used here to improve the performance of some state-of-the-art models. In particular, we propose the use of the attention mechanism drawn from FFA-Net [18] which uses a separate channel attention and pixel attention. Many image dehazing networks typically treat channel-wise and pixel-wise features equivalently, which can be limiting when handling images with non-uniform haze distribution and weighted channel-wise features. The Feature Attention module encompasses both channel attention and pixel attention, introducing greater flexibility in handling diverse types of information. Channel attention addresses the variances in weighted information across different channel features. On the other hand, pixel attention comes into play when the haze distribution varies across different image pixels, prompting the network to focus more on informative features, such as densely hazy pixels and high-frequency image regions.
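To make the two attention blocks concrete, the following is a minimal pure-Python sketch of channel attention followed by pixel attention on a C x H x W feature map. It follows the general pattern described above (a per-channel gate from globally pooled features, then a per-pixel gate shared across channels); the function names, the hand-picked weights, the hidden width of 1, and the use of 1x1 convolutions written as small matrix products are assumptions for illustration, not the FFA-Net code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, vec):
    """A 1x1 convolution at one location: out = W @ vec + b."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def channel_attention(feat, w1, b1, w2, b2):
    """feat: C x H x W. One gate per channel from globally averaged features."""
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    gates = [sigmoid(g) for g in linear(w2, b2, relu(linear(w1, b1, pooled)))]
    return [[[v * g for v in row] for row in ch] for ch, g in zip(feat, gates)]


def pixel_attention(feat, w1, b1, w2, b2):
    """feat: C x H x W. One gate per spatial location, shared across channels."""
    channels, height, width = len(feat), len(feat[0]), len(feat[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for y in range(height):
        for x in range(width):
            vec = [feat[c][y][x] for c in range(channels)]
            gate = sigmoid(linear(w2, b2, relu(linear(w1, b1, vec)))[0])
            for c in range(channels):
                out[c][y][x] = vec[c] * gate
    return out


def capa(feat, ca_params, pa_params):
    """Channel attention followed by pixel attention."""
    return pixel_attention(channel_attention(feat, *ca_params), *pa_params)


# Toy feature map with C = 2, H = W = 2 and hand-picked (untrained) weights.
feat = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
ca_params = ([[0.5, -0.5]], [0.0], [[1.0], [-1.0]], [0.0, 0.0])
pa_params = ([[1.0, 1.0]], [0.0], [[0.5]], [-1.0])
out = capa(feat, ca_params, pa_params)
```

Because both gates are sigmoid outputs in (0, 1), the block rescales features rather than replacing them: channels and pixels judged more informative keep more of their magnitude. In a trained network the weights would be learned jointly with the rest of the model.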
Below, we describe the architecture of the following four state-of-the-art models and the changes we make to it to improve its performance, which is done through the use of channel attention and pixel attention blocks. The rationale behind selecting these four architectures is that they are the current state-of-the-art architectures and they further handle different problems in image dehazing such as colour enhancement and speed of convergence. We refer to the attention blocks as CAPA (Channel Attention Pixel Attention).
GMAN + CAPA
This architectural design combines the attention mechanisms borrowed from FFA-Net [18] with the encoder-decoder framework originating from the GMAN model [24]. GMAN consists of an encoder-decoder architecture wherein the encoder has a downsampling factor of 2, the same as the upsampling factor of the decoder. The initial two layers are 64-channel convolution blocks, and the downsampling part reduces the image to a 56×56×128 volume. The residual layer consists of four residual blocks, which represent the transition from encoding to decoding. Following this is an upsampling layer which constructs a 64×64×256 volume, whose output is added to the input image to obtain the dehazed image. The attention layers are incorporated after the encoder stage to add context to the information being passed on to the decoder.
U-Net + CAPA
This architectural configuration merges the attention mechanisms drawn from FFA-Net [18] with the encoder-decoder structure inspired by U-Net [20]. The U-Net architecture can be divided into three fundamental segments: contraction, bottleneck, and expansion. In the contraction phase, a series of multi-scale blocks is formed, with every block comprising two convolutional layers and a max pooling layer. After every block, the dimensions of the feature map are halved and the number of feature maps is doubled. These encodings are directed to a parallelized dilated module within the bottleneck segment, designed to grasp local and global features. The results of the dilated convolutions are concatenated and propagated to the expansion phase. This phase mirrors the contraction phase except that the downsampling blocks are replaced by upsampling blocks. In this section of the model, channel attention and pixel attention blocks are introduced to assign weight to the input feature maps based on their respective significance.
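The halving and doubling rule of the contraction phase can be traced with a small shape calculation. The helper name `contraction_shapes`, the 256×256 input, the 64 starting feature maps, and the three blocks are hypothetical values chosen only for this example.

```python
def contraction_shapes(height, width, channels, num_blocks):
    """Feature-map shape after each contraction block (halve H and W, double C)."""
    shapes = [(height, width, channels)]
    for _ in range(num_blocks):
        height, width, channels = height // 2, width // 2, channels * 2
        shapes.append((height, width, channels))
    return shapes


# Assumed input resolution and base width, for illustration only.
shapes = contraction_shapes(256, 256, 64, num_blocks=3)
for h, w, c in shapes:
    print(f"{h} x {w} x {c}")
```

The expansion phase would traverse the same list in reverse, which is why skip connections between blocks of matching resolution are possible.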
123-CEDH + CAPA
This architecture combines the attention mechanisms from FFA-Net [18] with the structural concept found in the 123-CEDH network [22], which involves a single encoder and three distinct decoders. The encoder functions as a general-purpose feature extractor, while the purpose of the decoders is to reconstruct the three channels of the image from the encoding. The concept of building multiple decoders from the output of a single encoder has proven to be potent in various vision-related tasks, including denoising, surface normalization, unsupervised 2D segmentation, and others [26]. A common encoder ensures that spatial and color information from each channel is carried forward to all the decoders. Subsequently, each of the three colours can be accurately reconstructed through the specialized decoders. A refinement block is added at the end to capture information across the three channels. The channel attention and pixel attention layers are introduced after the encoder layers, and their output is directed to each of the three decoders.
DMPHN + CAPA
This architectural approach integrates the attention layers from FFA-Net [18] with the encoder-decoder design found in DMPHN [23]. DMPHN operates as a multi-level architecture, where each level comprises an encoder-decoder pair, and each level handles a different number of patches progressing from the top to the bottom levels. The topmost level focuses solely on a single patch for each input image. A vertical division of the image into two patches occurs at the next level. At the bottommost level, horizontal division occurs thus giving rise to four patches. A bottom-up flow of information occurs in the model, with patches from the lowest level being processed by upper encoders to generate corresponding feature maps. Notably, the DMPHN architecture is exceptionally lightweight, with a checkpoint size of only 21.7 MB. This attribute facilitates much faster training and inference while maintaining performance levels or experiencing minimal degradation. In this architecture, channel attention and pixel attention layers are introduced following the concatenation of each encoder block.
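The patch hierarchy described above (one, two, then four patches) can be sketched as follows. The helper name `split_patches` is hypothetical, and the choice to cut the image into top and bottom halves first and then cut each half into left and right parts is an assumption about the orientation of the two divisions.

```python
def split_patches(image, level):
    """Return the patches for one level of a 1-2-4 patch hierarchy.

    level 1: the whole image; level 2: two halves; level 3: four quarters.
    image is an H x W nested list.
    """
    height, width = len(image), len(image[0])
    if level == 1:
        return [image]
    top, bottom = image[: height // 2], image[height // 2:]
    if level == 2:
        return [top, bottom]
    if level == 3:
        patches = []
        for half in (top, bottom):
            patches.append([row[: width // 2] for row in half])
            patches.append([row[width // 2:] for row in half])
        return patches
    raise ValueError("level must be 1, 2 or 3")


# 4 x 4 toy image whose entries encode their own position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
level2 = split_patches(image, 2)
level3 = split_patches(image, 3)
```

Each level would then run its own encoder-decoder on its patches, with features from the finer levels passed upwards as described in the text.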
Experimentation and results
In this section, we provide results to show how adding the right attention blocks to existing state-of-the-art CNN architectures can lead to an improvement in the quality of dehazed images. The proposed models are described in the previous section (GMAN + CAPA, U-Net + CAPA, 123-CEDH + CAPA, DMPHN + CAPA) and they are evaluated against their original CNN counterparts (GMAN, U-Net, 123-CEDH, DMPHN). We evaluate the models qualitatively on random images from the test images of the outdoor dataset from the RESIDE dataset and quantitatively on the Outdoor Training Set of the RESIDE dataset. First, we provide examples of dehazed images from all the considered models for the same hazy input and compare their results. Next, we take a closer look at some examples of dehazed images from our models and compare them only with their CNN counterparts. Finally, we evaluate the models using quantitative metrics.
Figures 6–8 depict the dehazed images from four state-of-the-art CNN architectures (GMAN, U-Net, 123-CEDH, DMPHN) and the four models we propose (GMAN + CAPA, U-Net + CAPA, 123-CEDH + CAPA, DMPHN + CAPA) from the same input. Figure 6 is of a simple scenery with a river and a building. In each case, it can clearly be seen that our model dehazes the image better than its CNN counterpart. Further, the dehazed image closest to the ground truth would be our model in 6(g) which is the 123-CEDH + CAPA architecture. Figure 7 is of a traffic filled road with several sharp features such as cars and cyclists. Once again, we can clearly see that our models outperform their CNN counterparts. The GMAN model (7(b)) produces a slightly darker image which is fixed by adding the CAPA blocks (7(c)).

Comparison of the dehazed images from various dehazing models discussed in this paper against the same input. (a) is the input hazy image and (j) is the ground truth image.

Comparison of the dehazed images from various dehazing models discussed in this paper against the same input. (a) is the input hazy image and (j) is the ground truth image.
Figure 8 is an example of a darker image, possibly taken at the time of sunset. It shows a road, which is representative of the input seen by autonomous driving vehicles. The existing state-of-the-art models struggle to remove the haze completely or end up with images that are too dark. Our proposed models produce better dehazed images with a significant reduction of the above issues.

Comparison of the dehazed images from various dehazing models discussed in this paper against the same input. (a) is the input hazy image and (j) is the ground truth image.
A closer comparison between our proposed models and their CNN counterparts is made in Figs. 9–12. It can be clearly seen that adding a relevant attention layer to state-of-the-art architectures can increase the quality of the dehazed image. Without the attention layer, the images produced either contain some haze, do not reflect the original colours appropriately, are slightly blurred, etc.

Comparison of results between GMAN [24] and our model (GMAN + CAPA).

Comparison of results between U-Net [20] and our model (U-Net + CAPA).

Comparison of results between 123-CEDH [22] and our model (123-CEDH + CAPA).

Comparison of results between DMPHN [23] and our model (DMPHN + CAPA).
To quantitatively compare our models to existing ones, we use two metrics on the Outdoor Training Set from the RESIDE dataset, which consists of input hazy images and the corresponding ground truth clear images. The first metric is the Peak Signal to Noise Ratio (PSNR). PSNR is computed from the ratio between the square of the maximum signal value in the clear image and the mean squared error between the dehazed image and the clear image, expressed on a logarithmic (decibel) scale. A higher PSNR value indicates a better-quality dehazed image. The equation to calculate the PSNR score between two images is given in Equation 3.
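A minimal sketch of the PSNR computation, using the standard definition PSNR = 10 log10(MAX^2 / MSE), is given below. The function names and the default `max_value` of 1.0 (images scaled to [0, 1]) are assumptions for this example; single-channel nested lists are used to keep it short.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized H x W images."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)


def psnr(dehazed, clear, max_value=1.0):
    """PSNR in decibels; max_value is the peak signal value (assumed 1.0 here)."""
    error = mse(dehazed, clear)
    if error == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / error)


clear = [[0.5, 0.5], [0.5, 0.5]]
dehazed = [[0.6, 0.5], [0.5, 0.4]]
score = psnr(dehazed, clear)  # MSE is 0.005, so the score is about 23.01 dB
```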
The second metric is the Structural Similarity Index Measure (SSIM). SSIM is calculated by comparing the dehazed image and the clear image on three components: luminance (measured by the average intensity of the pixels of the image), contrast, and structure. Structural similarity reflects the idea that pixels have strong inter-dependencies, especially when they are spatially close. SSIM differs from PSNR in that PSNR measures the absolute error between the dehazed image and the clear image, whereas SSIM measures the perceived change in structural information. The higher the SSIM, the better the quality of the dehazed image. The equation to calculate the SSIM score is given in Equation 5.
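The following sketch computes a simplified SSIM over the whole image as a single window, using the standard form with stabilising constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2, where L is the dynamic range. Practical SSIM implementations average the index over local (often Gaussian-weighted) windows, so this global version, the function name `global_ssim`, and the use of population variances are simplifying assumptions.

```python
def global_ssim(a, b, max_value=1.0):
    """Single-window SSIM between two equally sized H x W images."""
    x = [v for row in a for v in row]
    y = [v for row in b for v in row]
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((p - mu_x) * (q - mu_y) for p, q in zip(x, y)) / n
    c1, c2 = (0.01 * max_value) ** 2, (0.03 * max_value) ** 2
    # Luminance term times the combined contrast/structure term.
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


clear = [[0.1, 0.9], [0.4, 0.6]]
brighter = [[0.3, 1.1], [0.6, 0.8]]  # same structure, shifted luminance
same = global_ssim(clear, clear)
shifted = global_ssim(clear, brighter)
```

Identical images give an index of 1, while the brightness-shifted copy scores lower even though its structure is unchanged, because the luminance term penalises the difference in mean intensity.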
It can be seen in Table 1 that for each state-of-the-art CNN model, our modifications improve the PSNR score. The variation in PSNR and SSIM values is minimal between GMAN [24] and GMAN + CAPA as well as U-Net [20] and U-Net + CAPA. However, for 123-CEDH [22] and DMPHN [23], the PSNR and SSIM scores significantly improve when the attention blocks are added.
PSNR and SSIM values on the Outdoor Training Set of the RESIDE dataset by various models described in this paper
These improvements in the quality of dehazed images can play an important role in downstream computer vision applications such as autonomous driving and agricultural monitoring. Autonomous driving can benefit from clearer images to improve safety standards, whereas better agricultural monitoring can help increase the revenue of farmers.
While the introduced attention layers improve the performance as seen above, they significantly increase the training time and inference time. Further, in indoor settings, the models tend to lose the contrast in colour between different parts of the image.
In this paper, we introduce the concept of image dehazing along with an overview of the state-of-the-art CNN architectures used to dehaze an image. We then show how two attention blocks from the FFA-Net architecture, namely the pixel attention block and the channel attention block, can be used to improve the performance of state-of-the-art models such as GMAN, U-Net, 123-CEDH and DMPHN, and we provide experimental results to support this. Our experiments provide visual examples of how the quality of the images dehazed by our models is better than that of existing models, as well as quantitative metrics, i.e. PSNR and SSIM scores, which indicate an improvement over existing models.
This paper opens up a large spectrum for future work. Using attention layers to enhance the performance of other types of deep learning models such as GANs can be tried. Further, training on larger and more robust datasets such as NH-Haze or NTIRE with larger models could improve the performance.
