SelectiveFusion: Channel attention-driven infrared-visible image fusion network

Abstract

When infrared and visible images are captured, inconsistent illumination degrades the visual quality of the fused image. In this paper, we propose a novel method for infrared and visible image fusion, termed SelectiveFusion, which addresses this issue. The method consists of three key elements: an encoder, a fusion strategy, and a decoder. First, the source images are fed into the encoder to extract multi-scale deep features. Subsequently, a new fusion strategy merges the features at each scale. Within this strategy, we develop a selective channel attention fusion module that selectively weights the channels of the input features from the infrared and visible images. Finally, the fused features are reconstructed by a nested decoder. Additionally, we formulate a novel loss function to guide the training of the fusion network. Experiments on publicly available datasets, with quantitative and qualitative comparisons against existing methods, demonstrate the effectiveness and versatility of SelectiveFusion. Our code is publicly available at [https://github.com/ISCLab-Bistu/SelectiveFusion].

Keywords

infrared and visible image; image fusion; channel attention

1 Introduction

Due to the limitations of hardware imaging equipment, images obtained from single-modal sensors are unable to present complete and detailed scene information. For example, in poor lighting conditions, images acquired by visible light sensors struggle to effectively differentiate objects from the background.¹ Image fusion technology emerged to address this issue.² It refers to the complementary integration of images of the same scene captured by sensors of different modalities, resulting in a fused image.³ The fused image exhibits superior scene representation and visual perception capabilities, making it suitable for various subsequent visual tasks.⁴

Over the past few decades, researchers have proposed a plethora of algorithms for the fusion of infrared and visible images. These algorithms can be broadly categorized into two types: traditional methods⁵ and deep learning-based methods.⁶ Traditional image fusion methods require activity level measurements in either the spatial or transform domain, and manual design of fusion rules⁷ to achieve image fusion. A typical approach involves applying mathematical transformations to convert source images into a transformed domain,⁸ followed by the designing of fusion rules within this domain. Examples of such methods include fusion frameworks based on multi-scale transforms, sparse representation, and saliency-based fusion frameworks.⁹

While traditional image fusion algorithms have achieved decent results, they still suffer from issues such as feature extraction methods lacking generality and the complexity of manual fusion strategy design.¹⁰ As a result, traditional image fusion methods are facing developmental bottlenecks.¹¹ In recent years, the thriving development of deep learning has led to the emergence of numerous deep learning-based algorithms in the field of image fusion. These algorithms leverage the powerful feature extraction and representation capabilities inherent in deep learning, giving them a significant advantage over traditional methods.¹² As research on deep learning-based image fusion methods has progressed, they can currently be classified into four main categories: those based on autoencoders, convolutional neural networks, generative adversarial networks, and transformer-based image fusion frameworks.¹³ Presently, deep learning-based fusion algorithms have become the dominant trend in the field of image fusion. Introducing these fusion techniques into an end-to-end framework¹⁴ can further optimize the training process and yield superior image fusion results.

While algorithms based on deep learning have shown significant advancements compared to traditional methods, we have observed that existing deep learning-based image fusion methods still have limitations when dealing with the issue of uneven input image quality, which is common in practical applications. Especially during the acquisition of infrared and visible images, significant illumination imbalances arise due to varying lighting conditions,¹⁵ such as daytime, nighttime, or adverse weather. This poses a significant challenge for effectively fusing high-quality visible images and low-quality infrared images. Existing methods often struggle to distinguish and effectively utilize the different quality characteristics of the two modalities, potentially leading to fusion results being compromised by low-quality images and failing to fully leverage the advantages of high-quality images.¹⁶ Therefore, selective feature fusion in situations with uneven input image quality is a critical problem that urgently needs to be addressed in the field of image fusion.

To address this problem, we propose SelectiveFusion, an end-to-end method that performs multi-scale feature fusion and quality-based selective fusion to make the most of complementary information. Our contributions are:

(1) An end-to-end image fusion framework comprising an encoder, fusion network, and decoder. It leverages global and local features, with the encoder extracting multi-scale features, the fusion network performing selective fusion at the feature level, and the decoder reconstructing the image. This adaptively learns the fusion strategy for superior complementary feature integration.
(2) A selective channel fusion module, SCFusion. It employs a learnable attention mechanism to adaptively weight channels based on visible and infrared quality differences. By explicitly modeling quality, it enhances high-quality modality contributions and suppresses low-quality noise, effectively extracting and fusing complementary features and improving robustness in uneven illumination.
(3) A loss function based on complementary information. It guides the model to preserve both overall quality and complementary texture details. This is achieved by explicitly encouraging the network to capture the unique details visible in each modality where the other is less informative, yielding richer textures and clearer details than traditional losses.
(4) A comprehensive evaluation on benchmark datasets, which shows that SelectiveFusion is competitive with existing methods across multiple metrics and verifies the effectiveness of the framework.

2 Related work

In this section, we will categorize image fusion into two types: traditional methods and deep learning-based image fusion methods. We will provide detailed explanations of each in the following subsections.

2.1 Traditional methods

In the field of image fusion, image transformation and feature fusion are key aspects addressed by traditional algorithms. Traditional algorithms can be broadly categorized as follows:

Transform Domain Methods: These methods perform image fusion by transforming images into specific domains and then applying fusion rules.¹⁷ Examples include transform methods such as Discrete Cosine Transform (DCT), Discrete Wavelet Transform (DWT), and Curvelet Transform (CVT).

Scale Transform Fusion: This category approaches fusion by decomposing source images into multi-scale representations using pyramid techniques and performing fusion based on rules.

Subspace-based Methods: These techniques aim to reduce redundancy and extract key information by projecting images into lower-dimensional spaces. Examples include Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Non-negative Matrix Factorization (NMF).

Sparse Representation Fusion Methods: These methods achieve fusion by sparsely encoding image blocks using learned dictionaries and reconstructing the fused image. Sparse Representation (SR) is a common method in this category.

Saliency-based Methods: These methods focus on extracting salient regions from source images using saliency models and reconstructing the fusion image based on these salient features.

However, traditional methods face challenges such as limited adaptability to complex scenes and sensitivity to parameter settings. This limits the versatility and flexibility of traditional fusion approaches and makes their design increasingly complex and difficult.¹⁸

2.2 Deep learning-based image fusion methods

With the widespread application and rapid development of deep learning in computer vision tasks, some limitations of traditional methods have been addressed. Deep learning-based methods can leverage large-scale data for training, which enhances the model's generalization ability and yields excellent performance on images of different scales. Currently, the most widely used approach is the image fusion framework based on auto-encoders. Other deep learning approaches have also been extensively applied in the field of image fusion, including convolutional neural networks, generative adversarial networks, and transformer-based image fusion frameworks. Inspired by the success of transformers in computer vision, transformer-based fusion methods leverage self-attention mechanisms to model long-range dependencies and facilitate effective feature integration. The rapid progress in the field of deep learning has also led to a sharp increase in the performance of image fusion.¹⁹

For instance, Li et al. proposed an encoder-decoder model named RFN-Nest.²⁰ It employs an encoder network to preserve features and a decoder to reconstruct the fusion result. Additionally, a two-stage training strategy is employed to achieve better fusion results for different source images. Prior to this, Li et al. proposed DenseFuse,²¹ a novel deep learning architecture for image fusion. Ma et al. proposed FusionGAN,²² a generative adversarial network for fusing infrared and visible light images. This method retains most of the infrared pixel-level information and the gradient features of the visible light image. Ma et al. also proposed a fusion network called U2Fusion,²³ which can be applied to various fusion tasks. It uses the Elastic Weight Consolidation algorithm so that a single model can address different fusion tasks without forgetting previously learned ones. Furthermore, transformer-based approaches like SwinFusion,²⁴ YDTR,²⁵ MixFuse²⁶ and CrossFuse²⁷ have been applied in the field of image fusion, leveraging the popularity of transformer technology in computer vision. There are also novel fusion algorithms such as MFEIF,²⁸ which designs an edge-guided attention mechanism on multiscale features to attenuate noise while restoring details. STDFusionNet²⁹ extracts salient target features to assist in image fusion, whereas LRRNet,³⁰ based on low-rank representation, forms the foundation of a highly lightweight fusion network, with matrix multiplication as the core solution. TextFusion³¹ introduces a novel paradigm for controllable image fusion, leveraging coarse-to-fine association and an affine unit to guide multi-modal feature fusion.

3 Proposed method

This section will provide a detailed introduction to the fusion network framework we proposed. It covers the network model's architecture, the encoder-decoder structure, the SCFusion fusion module, the two-stage training approach, and the loss function design.

3.1 The architecture of the fusion network

The architecture of the fusion network is shown in Figure 1 below. Given the input source images, the framework proceeds through three parts: feature extraction, feature fusion, and feature reconstruction, ultimately producing the fused image. The feature extraction component comprises a multiscale feature extraction network consisting of four encoder blocks. It conducts shallow-to-deep level feature extraction on the input source images. The feature fusion segment is composed of four identical SCFusion fusion modules. SCFusion employs the same fusion approach across the four different scales obtained from feature extraction, resulting in richer and more accurate fusion information. The feature reconstruction component consists of three layers of decoder blocks, remapping the fused features back into the image space to generate a new composite image. This image comprehensively utilizes information from both the infrared and visible light images, leading to a synthesized image with enhanced visual effects and information integration.

Figure 1.

The framework of the proposed SelectiveFusion, consisting of feature-extracting encoder block, fusion-oriented SCFusion block, and feature-reconstructing decoder block.

The specific implementation process involves the following steps. First, the visible and infrared images are fed into the feature extraction network, and the features of the image pair are extracted using the pre-trained weights of the first-stage model. The features of the visible and infrared images at the same scale are then fed into the SCFusion fusion module for feature fusion. Finally, the fused features are passed through a bottom-up feature reconstruction network to generate the reconstructed image. The weights of the feature reconstruction network are also pre-trained in the first stage.

3.2 Encoder-decoder architecture

The entire network model consists of three components: feature extraction, feature fusion, and feature reconstruction. First, we describe how the encoder and decoder parts for feature extraction and reconstruction are implemented, as shown in Figure 2 below. These two parts are trained simultaneously in the first stage, and their weights remain unchanged in the overall architecture when the feature fusion network is introduced. In this first stage, the encoder and decoder are trained to extract features at multiple scales and to reconstruct images from multi-scale deep features.

Figure 2.

First-stage training procedure for the encoder-decoder architecture.

The Encoder part processes the input visible and infrared images through a 1 × 1 stem convolution, introducing an initial feature representation for the multi-scale encoder network. It then undergoes three down-sampling operations composed of dense blocks and max-pooling, obtaining four different scales of features. Each dense block consists of a 3 × 3 convolution and a 1 × 1 convolution. The output channels of the first convolution are half of the input channels, while the second convolution restores them.
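The scale bookkeeping described above can be illustrated with a minimal NumPy sketch. It only mimics the three max-pooling steps that turn one feature map into four scales; the stem and dense-block convolutions are omitted. The 2 × 2 pooling window, the channel count, and the function names are assumptions made for illustration, not details stated in the text.

```python
import numpy as np

def max_pool_2x2(x: np.ndarray) -> np.ndarray:
    """2x2 max pooling with stride 2 on an (H, W, C) array (H and W even)."""
    h, w, c = x.shape
    # Group pixels into non-overlapping 2x2 windows, then take the window max.
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def feature_pyramid(x: np.ndarray, num_downsamples: int = 3) -> list:
    """Return the input plus one pooled copy per down-sampling step."""
    scales = [x]
    for _ in range(num_downsamples):
        scales.append(max_pool_2x2(scales[-1]))
    return scales

# Hypothetical 16-channel feature map for a 256 x 256 input.
features = feature_pyramid(np.random.rand(256, 256, 16))
print([f.shape[:2] for f in features])  # [(256, 256), (128, 128), (64, 64), (32, 32)]
```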

The structure of each block in the Decoder part is identical to that in the Encoder part. However, the bottom-up fusion is inspired by the design of UNet++.³² Each deep-level feature is upsampled to the same scale as the previous layer and fused through feature concatenation. Low-level features primarily focus on the basic edge and texture information in the image, while high-level features pay more attention to abstract shape and structural semantic information. By progressively fusing low-level and high-level features, the fusion model gains the ability to perceive and understand features at different scales, enabling global perception and discrimination of multi-scale images. Additionally, blocks at the same level are fused with each preceding block through skip connections. This allows the model to learn global feature information at the same scale, fuse semantic information from different layers, and enhance the model's perception and feature representation capabilities. Finally, a 1 × 1 convolution is applied to produce a more information-rich and visually appealing reconstructed image.
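The up-sample-and-concatenate step of the decoder can be sketched as follows. This is a shape-level illustration only: nearest-neighbour interpolation, the channel counts, and the function names are assumptions, and the convolutions that follow each concatenation are left out.

```python
import numpy as np

def upsample_nearest_2x(x: np.ndarray) -> np.ndarray:
    """Nearest-neighbour 2x up-sampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def merge_scales(shallow: np.ndarray, deep: np.ndarray) -> np.ndarray:
    """Up-sample the deeper map and concatenate it with the shallower one."""
    up = upsample_nearest_2x(deep)
    assert up.shape[:2] == shallow.shape[:2], "spatial sizes must match"
    return np.concatenate([shallow, up], axis=-1)

shallow = np.random.rand(64, 64, 8)   # features at one scale (hypothetical sizes)
deep = np.random.rand(32, 32, 16)     # features one scale deeper
print(merge_scales(shallow, deep).shape)  # (64, 64, 24)
```

In a nested decoder, the same merge would also take the outputs of earlier blocks at the same level as additional inputs to the concatenation.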

The first-stage training model consists of both the encoder and decoder parts, whose structures correspond exactly to those in the overall SelectiveFusion architecture described earlier. We first train this part to obtain its weights, which are then used in the second stage to train the SCFusion block. Finally, during inference, we use the encoder trained in the first stage, the fusion module trained in the second stage, and the decoder trained in the first stage to fuse infrared and visible images.

The Encoder and Decoder parts are trained simultaneously in the first training stage. This design is aimed at achieving better feature extraction and reconstruction capabilities. The COCO2014 dataset is used for training. After extracting features at four different scales in the encoder part, a feature loss function is computed using the encoded features of the input images. In the decoder part, after completing feature reconstruction, a structural loss function is calculated using the input encoded features. These two loss functions are detailed in the section on loss functions, and together they make up the entire loss function of the first-stage network. Once the weights of the first-stage network are obtained, they are used only to assist the training of the second-stage fusion module, which is described later. During the training of the fusion module, the weights of the first-stage framework remain unchanged.

3.3 SCFusion block

The SCFusion fusion module is predicated on the principle of merging and reweighting feature channels within the fused feature map, thereby enabling effective fusion through the discrimination and weighting of feature maps originating from distinct modalities. Furthermore, the residual pathway preserves global features of the fused image, resulting in an enriched representation of critical fusion features. The SCFusion module is primarily realized through three operators: Fuse, Select, and Residual. The specific details are illustrated in Figure 3 below.

Figure 3.

The architecture of SCFusion block.

Fuse: To effectively integrate feature representations from the visible and infrared modalities, we adopt the fundamental idea of a gating mechanism to control the information flow between the visible and infrared branches. During the Fuse stage, we perform preliminary fusion and information refinement of the features to prepare for the adaptive feature selection in the Select stage. Specifically, we first conduct element-wise addition of the features extracted from the two modalities by the feature extraction module to obtain the initial fused features.

\begin{aligned} F = F_{vi} + F_{ir} \end{aligned}

(1)

where $F_{vi}$ denotes the encoded features from the visible modality, $F_{ir}$ represents the encoded features from the infrared modality, and $F$ is the fused feature map from the two modalities; all three have the dimensions $H \times W \times C$. To capture the global information of the fused feature map $F$ along the channel dimension and to provide input for the subsequent channel attention mechanism, we further apply global average pooling to $F$. Specifically, for each channel of $F$, we compute the spatial average, generating a channel descriptor vector that summarizes the global spatial context of each channel.

\begin{aligned} S_{c} = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} F_{c}(i, j) \end{aligned}

(2)

where $S$ denotes the channel-wise feature vector obtained through the global average pooling operation, with dimensions $1 \times 1 \times C$. The element $S_{c}$ represents the $c$-th component of $S$, which corresponds to the global average value of the $c$-th channel of the fused feature map $F$. Here, $i \in \{1, 2, \dots, H\}$ and $j \in \{1, 2, \dots, W\}$ index the spatial dimensions of $F$. The term $F_{c}(i, j)$ denotes the element at spatial location $(i, j)$ in the $c$-th channel of $F$, with $i$, $j$, and $c$ indexing the height, width, and channel dimensions, respectively.

Through the global average pooling operation defined in Equation (2), the 3D fused feature map F is compressed into a 1D channel-wise feature vector S. This vector captures the global response magnitude of the fused feature map along each channel and can be regarded as a global channel descriptor, serving as the input to the subsequent channel attention selection module.
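The Fuse stage of Equations (1) and (2) can be written in a few lines of NumPy. This is a minimal sketch under the assumption of a single unbatched (H, W, C) feature map per modality; the function and variable names are illustrative.

```python
import numpy as np

def fuse(f_vi: np.ndarray, f_ir: np.ndarray):
    """Eq. (1): element-wise sum. Eq. (2): per-channel spatial mean."""
    fused = f_vi + f_ir                    # F, shape (H, W, C)
    descriptor = fused.mean(axis=(0, 1))   # S, shape (C,)
    return fused, descriptor

f_vi = np.random.rand(32, 32, 8)   # hypothetical visible features
f_ir = np.random.rand(32, 32, 8)   # hypothetical infrared features
fused, s = fuse(f_vi, f_ir)
print(fused.shape, s.shape)  # (32, 32, 8) (8,)
```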

Select: The Select operation maps the feature vectors of visible light and infrared light separately using the softmax function:

\begin{aligned} V_{att} = \frac{e^{V_{j}}}{\sum_{k = 1}^{c} e^{V_{k}}}, \quad I_{att} = \frac{e^{I_{j}}}{\sum_{k = 1}^{c} e^{I_{k}}} \end{aligned}

(3)

where $V_{att}$ and $I_{att}$ represent the channel-attention weights for visible light and infrared light, respectively, $V_{j}$ and $I_{j}$ represent the values of the $j$-th channel of the corresponding feature vectors, and $c$ represents the number of feature channels. We then weight the infrared and visible feature maps, as pre-processed before the Fuse operation, with the obtained attention weights. This yields the fused feature map after applying both the Fuse and Select operators:

\begin{aligned} F^{'} = F_{vi} \cdot V_{att} + F_{ir} \cdot I_{att} \end{aligned}

(4)
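A minimal NumPy sketch of the Select step follows. It applies the softmax of Equation (3) along the channel axis of each modality's vector, exactly as the equation is written, and then forms the weighted sum of Equation (4). The input vectors `v` and `i` stand in for the per-modality feature vectors and are assumptions, as are the function names.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def select(f_vi, f_ir, v, i):
    """Eq. (3): channel softmax per modality. Eq. (4): weighted sum."""
    v_att = softmax(v)   # (C,) weights for the visible features
    i_att = softmax(i)   # (C,) weights for the infrared features
    # Broadcasting applies one weight per channel at every spatial location.
    return f_vi * v_att + f_ir * i_att

f_vi = np.random.rand(32, 32, 8)
f_ir = np.random.rand(32, 32, 8)
out = select(f_vi, f_ir, np.random.randn(8), np.random.randn(8))
print(out.shape)  # (32, 32, 8)
```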

Residual: After concatenating the input feature maps from visible and infrared images, a 3 × 3 convolution is applied to reduce the dimensionality. This result is then element-wise summed with the fusion output from Fuse and Select. The residual branch effectively combines the original features of the input feature map at that scale with global features.

The final output of SCFusion is composed of the fusion features from the Fuse-Select branch and the features from the Residual branch, added together:

\begin{aligned} F_{Output} = F^{'} + F_{Residual} \end{aligned}

(5)
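The Residual branch and the sum in Equation (5) can be sketched as below. The sketch assumes zero padding, no bias, and no activation for the 3 × 3 convolution, none of which is specified in the text; the helper names are illustrative.

```python
import numpy as np

def conv3x3(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Zero-padded 3x3 convolution. x: (H, W, Cin); kernel: (3, 3, Cin, Cout)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, kernel.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            patch = padded[dy:dy + h, dx:dx + w, :]   # input shifted by one tap
            out += np.einsum("hwc,co->hwo", patch, kernel[dy, dx])
    return out

def scfusion_output(f_vi, f_ir, f_select, kernel):
    """Concatenate the inputs, reduce 2C -> C channels, add to F' (Eq. (5))."""
    stacked = np.concatenate([f_vi, f_ir], axis=-1)   # (H, W, 2C)
    residual = conv3x3(stacked, kernel)               # (H, W, C)
    return f_select + residual

c = 4
kernel = np.random.randn(3, 3, 2 * c, c) * 0.1   # hypothetical learned weights
f_vi, f_ir = np.random.rand(16, 16, c), np.random.rand(16, 16, c)
print(scfusion_output(f_vi, f_ir, f_vi + f_ir, kernel).shape)  # (16, 16, 4)
```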

The entire process of the SCFusion module consists of the Fuse-Select branch and the Residual branch. In the Fuse-Select branch, the visible and infrared features each undergo a preliminary 3 × 3 convolution for individual feature processing. They are then fused by summation and compressed into a channel feature vector of shape $(b, c)$. A fully connected layer further reduces this vector to a compact feature vector of shape $(b, c / 2)$, which guides the subsequent attention weights for visible and infrared light while improving computational efficiency. The fused feature vector thus passes through two consecutive fully connected layers, the first reducing the channel dimension to $c / 2$ and the second restoring it to $c$, to obtain the respective feature vectors for the two modalities. Through this reduction followed by expansion, the fully connected layers learn selectable channel features for both visible and infrared light. The obtained channel weights are then used to weight the visible and infrared feature maps, resulting in a feature-weighted fused feature map. Meanwhile, in the Residual branch, the input feature map is processed with a 3 × 3 convolution and then added to the features learned by the Fuse-Select branch. This preserves the initial feature information while also weighting the fused features from different modalities.
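The reduce-then-expand step on the channel descriptor can be sketched as a shape walk-through. The sketch assumes one shared reduction layer followed by a separate expansion layer per modality, a ReLU after the reduction, and no biases; these choices and the random weights are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_vectors(s, w_reduce, w_vi, w_ir):
    """Map a (b, c) descriptor to two (b, c) vectors through a c/2 bottleneck."""
    z = np.maximum(s @ w_reduce, 0.0)   # (b, c/2); the ReLU is an assumption
    return z @ w_vi, z @ w_ir           # one (b, c) vector per modality

b, c = 2, 8
s = rng.standard_normal((b, c))               # pooled descriptor of the fused features
w_reduce = rng.standard_normal((c, c // 2))   # hypothetical reduction weights
w_vi = rng.standard_normal((c // 2, c))       # hypothetical expansion, visible
w_ir = rng.standard_normal((c // 2, c))       # hypothetical expansion, infrared
v, i = channel_vectors(s, w_reduce, w_vi, w_ir)
print(v.shape, i.shape)  # (2, 8) (2, 8)
```

The two output vectors are what the softmax of the Select step would then turn into channel weights.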

3.4 Two-stage training strategy

The loss function used in the first stage of network training is identical to that used for the full fusion framework. The features obtained after feature extraction are used to compute the feature loss, while the output after feature reconstruction is used to compute the structural loss. Since the first-stage network takes only one image as input, the loss is calculated against the feature encoding of that input image.

During the training of the entire fusion framework, there are two inputs, visible light and infrared light, but the network architecture for feature extraction and feature reconstruction remains exactly the same. At this stage, the weights acquired during the first stage are kept unchanged, which allows features to be extracted from both the visible and infrared images. Unlike in the first stage, the extracted features are fed into the fusion module. When calculating the loss, the output of the fusion module is used for the feature loss. The fused output is then passed through the pre-trained feature reconstruction network to obtain the final fused image feature, which is used for the structural loss together with the feature encoding of the input visible image.

As Figure 4 shows, using the same loss function, we obtain distinct sets of weight parameters in the two stages. The first-stage training yields the weight parameters of the feature extraction and feature reconstruction components, whereas the second-stage training yields the weight parameters specific to the fusion module.

Figure 4.

The schematic diagram of the two-stage training strategy.
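The freezing logic of the two-stage strategy can be shown with a toy, framework-free sketch. The `Module` class, the placeholder gradients, and the learning rate are all hypothetical; a real implementation would compute gradients from the losses of Section 3.5 and would typically rely on a deep learning framework's own mechanism for freezing parameters.

```python
from dataclasses import dataclass

@dataclass
class Module:
    name: str
    weights: list
    trainable: bool = True

def sgd_step(modules, grads, lr):
    """Apply one gradient step to the trainable modules only."""
    for m in modules:
        if m.trainable:
            m.weights = [w - lr * g for w, g in zip(m.weights, grads[m.name])]

def run_stage(modules, trainable_names, grads, lr=0.5):
    """Mark the named modules trainable, freeze the rest, and take one step."""
    for m in modules:
        m.trainable = m.name in trainable_names
    sgd_step(modules, grads, lr)

encoder = Module("encoder", [1.0, 2.0])
decoder = Module("decoder", [3.0])
fusion = Module("fusion", [0.5])
modules = [encoder, decoder, fusion]
dummy_grads = {"encoder": [1.0, 1.0], "decoder": [1.0], "fusion": [1.0]}

# Stage 1: the encoder and decoder learn to extract and reconstruct features.
run_stage(modules, {"encoder", "decoder"}, dummy_grads)
# Stage 2: the encoder and decoder stay frozen; only the fusion module is updated.
run_stage(modules, {"fusion"}, dummy_grads)
print(encoder.weights, decoder.weights, fusion.weights)  # [0.5, 1.5] [2.5] [0.0]
```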

3.5 Loss function

Due to the absence of ground truth in the infrared and visible image fusion algorithm,³³ the impact of the loss function on fusion performance is crucial.³⁴ In order to preserve both texture and structural information in the fused image, the loss function of the proposed framework consists of a feature loss function during the fusion process and a structural loss function after fusion output.

The feature loss function penalizes, at each scale, the squared error between the fused features and a weighted sum of the encoded infrared and visible features, together with the squared error between the fused features and the maximum of the encoded infrared and visible features. Additionally, the fused features at different scales are optimized with respective weights. The specific implementation is as follows:

\begin{aligned} L_{feat} = \sum_{s = 1}^{S} \omega_{1} \left( \left\| \varphi_{f}^{s} - \left( \omega_{I1} \varphi_{I1}^{s} + \omega_{I2} \varphi_{I2}^{s} \right) \right\|_{F}^{2} + \left\| \varphi_{f}^{s} - \max \left( \varphi_{I1}^{s}, \varphi_{I2}^{s} \right) \right\|_{F}^{2} \right) \end{aligned}

(6)

where $s$ represents the scale of feature extraction, $f$ denotes the fused image, and $I1$ and $I2$ are the input infrared and visible source images, respectively. Accordingly, $\varphi_{f}^{s}$ represents the feature map after fusion, and $\varphi_{I1}^{s}$ and $\varphi_{I2}^{s}$ are the encoded feature maps of the source images. The feature extraction network produces features at four different scales. $\omega_{1}$ is a weight vector whose elements are the weights applied to the fused features at each scale. $\omega_{I1}$ and $\omega_{I2}$ denote the weighting parameters for the infrared and visible inputs, respectively. The term $\max$ denotes the element-wise maximum over the two encoded feature maps.
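Equation (6) can be sketched in NumPy as follows. The sketch reads the max term as an element-wise maximum of the two encoded feature maps and uses equal modality weights of 0.5 as placeholder defaults; both are assumptions, as are the function and argument names.

```python
import numpy as np

def feature_loss(phi_f, phi_1, phi_2, scale_weights, w_1=0.5, w_2=0.5):
    """Eq. (6): per-scale squared Frobenius distances, weighted and summed.

    phi_f, phi_1, phi_2: lists of same-shaped arrays, one entry per scale.
    scale_weights: one weight per scale (the elements of omega_1).
    """
    total = 0.0
    for f, a, b, w in zip(phi_f, phi_1, phi_2, scale_weights):
        blend = np.sum((f - (w_1 * a + w_2 * b)) ** 2)   # distance to weighted sum
        peak = np.sum((f - np.maximum(a, b)) ** 2)       # distance to the maximum
        total += w * (blend + peak)
    return float(total)

# Two hypothetical scales with random features.
shapes = [(32, 32, 8), (16, 16, 16)]
phi_1 = [np.random.rand(*s) for s in shapes]
phi_2 = [np.random.rand(*s) for s in shapes]
phi_f = [np.maximum(a, b) for a, b in zip(phi_1, phi_2)]
print(feature_loss(phi_f, phi_1, phi_2, scale_weights=[1.0, 1.0]))
```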

This loss design guides the model to generate fusion results that are as similar as possible to the source images. Because the MSE loss amplifies larger error terms, the model pays more attention during training to pixels that deviate significantly from the source images, thus preserving as much detailed information as possible. However, considering that our fusion framework incorporates diverse feature information at multiple scales, we further design the loss function from a complementary perspective, introducing additional constraints that guide the fused features to carry more information.³⁵ The complementary loss is formulated as follows:

\begin{aligned} L_{complementary} = \sum_{s = 1}^{S} \left\| \varphi_{I1}^{s} - \varphi_{I2}^{s} \right\|_{1} \end{aligned}

(7)

where $s$ represents the index of the scale, $S$ denotes the total number of scales, and $\varphi_{I1}^{s}$ and $\varphi_{I2}^{s}$ represent the infrared and visible features at that scale, respectively. Through the application of complementary loss, we perform a norm regularization on the encoding of input features, emphasizing the complementarity of infrared and visible light inputs across different scales. For instance, certain details are more easily observable in visible light, while others are more prominent in infrared. By introducing the complementary loss, the model is encouraged to specifically focus on the differences between infrared and visible light inputs in particular regions when merging features.³⁶ This facilitates a more effective utilization of complementary information between the two input sources, thereby enhancing overall performance.
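Equation (7) amounts to a sum of L1 distances between the two modalities' encoded features. A minimal NumPy sketch, with illustrative names and toy inputs:

```python
import numpy as np

def complementary_loss(phi_ir, phi_vi):
    """Eq. (7): L1 distance between infrared and visible features, summed over scales."""
    return float(sum(np.abs(a - b).sum() for a, b in zip(phi_ir, phi_vi)))

# Two hypothetical scales.
phi_ir = [np.ones((2, 2)), np.ones((1, 1))]
phi_vi = [np.zeros((2, 2)), np.zeros((1, 1))]
print(complementary_loss(phi_ir, phi_vi))  # 5.0
```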

The structural loss function applies a multi-scale structural similarity (MS-SSIM) loss between the output of the fusion network, obtained after feature extraction, feature fusion, and feature reconstruction, and the visible light image. The specific implementation is as follows:

\begin{aligned} L_{struct} = 1 - \prod_{i} \left( CS_{i}(f, vi)^{\omega_{i}} \cdot SSIM_{i}(f, vi)^{\beta_{i}} \right) \end{aligned}

(8)

where $CS_{i}$ represents the luminance and contrast similarity at the $i$-th scale and $SSIM_{i}$ represents the structural similarity index at that scale. $f$ and $vi$ represent the encoding of the fused image and the source visible light image, respectively. $\omega_{i}$ and $\beta_{i}$ are weighting coefficients: $\omega_{i}$ regulates the contribution of the luminance and contrast similarity at each scale $i$, while $\beta_{i}$ regulates the contribution of the structural similarity at that scale. The symbol $\prod$ represents a product over all scales.

By introducing the MS-SSIM Loss, differences between the fused image and the visible image are optimized in terms of luminance, contrast, and structure. Weighting the information at different scales allows for a more comprehensive assessment of the quality of the fused image. This enables the output image to better retain the characteristics of the original image, making the fused image perceptually more similar to the visible image. This leads to an improvement in the quality of image fusion and the preservation of more details, while reducing the potential introduction of visually noticeable artifacts or false details.
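The way Equation (8) combines per-scale terms can be sketched as below, assuming the weighting coefficients act as exponents, as in the standard MS-SSIM formulation. The per-scale CS and SSIM values are taken as given inputs; computing them from images is outside this sketch, and the sample numbers are made up.

```python
import numpy as np

def structural_loss(cs, ssim, omega, beta):
    """Eq. (8): one minus the product over scales of CS_i^omega_i * SSIM_i^beta_i."""
    cs = np.asarray(cs, dtype=float)
    ssim = np.asarray(ssim, dtype=float)
    terms = cs ** np.asarray(omega, dtype=float) * ssim ** np.asarray(beta, dtype=float)
    return float(1.0 - np.prod(terms))

# Hypothetical similarity values at three scales with unit exponents.
print(structural_loss([0.9, 0.95, 0.98], [0.85, 0.9, 0.97], [1, 1, 1], [1, 1, 1]))
```

A perfect match at every scale gives a product of one and therefore a loss of zero.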

The loss function for the entire network model training is defined as:

\begin{aligned} L_{fusion} = L_{feat} + \alpha L_{struct} + \beta L_{complementary} \end{aligned}

(9)

where the feature loss $L_{feat}$ serves as the base loss term for preserving source image information at the feature level, and $\alpha$ and $\beta$ are two key weighting parameters. $\alpha$ balances the structural loss $L_{struct}$ against the feature-related terms $L_{feat}$ and $L_{complementary}$, adjusting the emphasis of the fusion result on perceptual structural similarity versus source feature preservation. $\beta$ regulates the contribution of the complementary loss $L_{complementary}$ to the total loss, controlling the model's focus on unique inter-modal complementary information while overall feature similarity is still pursued.
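As a small worked example of Equation (9), the three terms combine linearly. The loss values and the settings of α and β below are made-up numbers chosen only to make the arithmetic easy to follow.

```python
def fusion_loss(l_feat, l_struct, l_comp, alpha, beta):
    """Eq. (9): feature loss plus weighted structural and complementary losses."""
    return l_feat + alpha * l_struct + beta * l_comp

# Hypothetical values: 2.0 + 2.0 * 0.5 + 0.25 * 4.0 = 4.0
print(fusion_loss(l_feat=2.0, l_struct=0.5, l_comp=4.0, alpha=2.0, beta=0.25))  # 4.0
```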

4 Experiments

In this section, we experimentally validate the proposed fusion method. We will provide detailed explanations regarding dataset selection, evaluation metrics, experimental configurations, ablation experiments on the proposed network, as well as performance comparisons with other fusion methods. Through both quantitative and qualitative perspectives, we aim to thoroughly validate the effectiveness of the fusion approach. Finally, we explore the application of fused images in the domain of semantic segmentation.

4.1 Dataset

For the first training stage, we used the COCO2014 dataset,³⁷ which includes 82,783 training images and 40,504 validation images, to train our feature extraction network and reconstruction network. For the second training stage, we selected the KAIST³⁸ dataset to train our fusion module. The KAIST dataset encompasses diverse typical traffic scenarios, including daytime and nighttime campus, street, and rural scenes. It contains 95,328 pairs of images, from which we chose 80,000 pairs of infrared and visible images. We split these 80,000 pairs into training and validation sets with a ratio of 9:1, using 72,000 pairs for training and 8,000 pairs for validation. We resized the images to 256 × 256 and converted them to grayscale for training our fusion network.
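The 9:1 split described above can be sketched as follows. This is only an illustration: the paper states the ratio but not how pairs were assigned, so the seeded random shuffle, the function name `split_pairs`, and the use of integer identifiers for image pairs are all assumptions. Resizing and grayscale conversion are not shown.

```python
import random


def split_pairs(pair_ids, train_ratio=0.9, seed=0):
    """Shuffle pair identifiers and split them into training and validation lists."""
    ids = list(pair_ids)
    random.Random(seed).shuffle(ids)  # seeded shuffle (an assumption, for repeatability)
    n_train = int(round(len(ids) * train_ratio))
    return ids[:n_train], ids[n_train:]


# 80,000 hypothetical pair identifiers stand in for the selected KAIST pairs.
train_ids, val_ids = split_pairs(range(80_000))
print(len(train_ids), len(val_ids))  # 72000 8000
```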

For the test set, we selected three representative datasets: TNO,³⁹ RoadScene,⁴⁰ and VOT-2020.⁴¹ From each of these datasets, we selected 50 pairs of infrared and visible images for fusion testing. The TNO dataset is one of the most common datasets in the field of image fusion, containing infrared and visible light images in military-related scenarios. RoadScene includes vehicles, pedestrians, and traffic signs, allowing us to evaluate the fusion performance of the model in complex scenarios. The VOT-2020 dataset is a benchmark for short-term tracking of visual objects in RGB. By evaluating our proposed model on these three test sets, which represent different scenarios and lighting conditions, we can comprehensively assess the performance of our model.

4.2 Evaluation metrics

We selected the following six metrics to evaluate the performance of image fusion:

Entropy (EN): Higher entropy indicates richer information content in the fused image.

Mutual Information (MI): Higher MI implies that more information from the source images has been transferred to the fused image.

Spatial Frequency (SF): Higher SF indicates that the fused image contains richer edge and texture details.

Standard Deviation (SD): SD reflects the contrast and intensity distribution of the fused image. The human visual system is often attracted to regions with higher contrast, so a higher SD indicates better contrast.

Visual Information Fidelity (VIF): A higher VIF indicates that the fusion result is more consistent with human visual perception.

Multi-scale Structural Similarity Index Measure (MS-SSIM): MS-SSIM models the information loss and distortion during the fusion process, reflecting the structural similarity between the fused image and the source images. A higher value therefore indicates less information loss and distortion.
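To make three of these metrics concrete, the sketch below computes EN, SF, and SD for a single 8-bit grayscale image with NumPy, using their common textbook definitions. It is not the authors' evaluation code: the function names are hypothetical, and normalization conventions (in particular for SF) vary between toolboxes. MI, VIF, and MS-SSIM also need the source images and are omitted.

```python
import numpy as np


def entropy(img: np.ndarray) -> float:
    """Shannon entropy in bits of an 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]  # skip empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())


def spatial_frequency(img: np.ndarray) -> float:
    """sqrt(RF^2 + CF^2) from mean squared horizontal and vertical differences."""
    f = img.astype(np.float64)
    rf2 = np.mean((f[:, 1:] - f[:, :-1]) ** 2)  # row (horizontal) frequency squared
    cf2 = np.mean((f[1:, :] - f[:-1, :]) ** 2)  # column (vertical) frequency squared
    return float(np.sqrt(rf2 + cf2))


def standard_deviation(img: np.ndarray) -> float:
    """Population standard deviation of the pixel intensities."""
    return float(img.astype(np.float64).std())


# A 4 x 4 checkerboard of 0 and 255 serves as a toy "fused image".
board = (np.indices((4, 4)).sum(axis=0) % 2 * 255).astype(np.uint8)
print(entropy(board), spatial_frequency(board), standard_deviation(board))
```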

4.3 Comparative methods

We compared our method with twelve mainstream approaches: CrossFuse, DenseFuse, FusionGan, LRRNet, MFEIF, MixFuse, RFN-Nest, STDFusionNet, SwinFusion, TextFusion, U2Fusion, and YDTR. All competing methods have publicly available code. We ran inference on the test sets using the original parameters given in the respective papers and the model weights released by the authors.

4.4 Experimental configurations

Because the training set contains a large number of image pairs, we trained the network for 2 epochs. The batch size and learning rate were set to 4 and $1 \times 10^{-4}$, respectively.

In the feature extraction network, we performed three downsampling operations on the extracted features to obtain feature maps at four different scales. These maps were then fed into the feature fusion module for training. During training, we assigned weights $ω_{1}$ to the fusion output. At the 1×, 2×, 4×, and 8× scales, $ω_{1}$ was set to 1, 10, 100, and 1000, respectively, so that larger loss weights were applied to deeper features.

Furthermore, when calculating the Mean Squared Error Loss between the fusion input and the encoded features of visible and infrared light, we assigned weights $ω_{I 1}$ and $ω_{I 2}$ to the visible and infrared light, respectively. Since the SCFusion module has two attention vectors, we set their weights equal during training, with $ω_{I 1}$ and $ω_{I 2}$ both set to 3.0.

The weighting parameters $α$ and $β$ of the loss function were set to 700 and 1.5, respectively, to balance structural preservation, detail extraction, and the utilization of inter-modal complementary information in the fused image. The values of $ω_{I 1}$, $ω_{I 2}$, $α$ and $β$ were determined through the ablation experiments described in Section 4.5. The specific parameter settings for our model are presented in Table 1.

Table 1.
Settings of loss function parameter in the training framework.

$α$	$β$	scale	$ω_{1}$	$ω_{I 1}$	$ω_{I 2}$
700	1.5	1	1	3.0	3.0
		2	10
		3	100
		4	100

The proposed fusion network was implemented in Python 3.7 with PyTorch 1.12.0 and run on an RTX 3090 GPU, on a system with an Intel Core i7 processor and 32 GB of RAM.

4.5 Ablation experiment

Given that the fusion framework proposed in this paper can be viewed as a two-stage framework from a training perspective, this section will investigate how to adjust the weight parameter settings to enhance the performance of the fusion module, thereby improving the overall performance of the fusion framework.

Since the first stage trained the encoder-decoder network and its weights were then fixed, the ablation experiments focused on the SCFusion fusion module at four different scales. To determine the optimal weight parameters in our proposed loss function, $L_{fusion} = L_{feat} + α L_{struct} + β L_{complementary}$, we adopted a step-by-step ablation approach. Here, $α$ balances the structural loss against the feature loss, while $β$ controls the contribution of the complementary loss. First, to investigate the impact of the structural loss weight $α$, we fixed the complementary loss weight at $β = 1$ and ran experiments with $α$ set to 0 and to several non-zero values.

Setting $α$ to 0 means that the network is trained solely using the feature loss function from the fusion module. This slightly reduces the fusion module's ability to incorporate fine-grained information. However, the model is able to reduce discrepancies from the source images, resulting in fused images that are more similar to the source images. Since the structural loss weighted by $α$ primarily extracts fine-grained information from the visible-light image, when $α = 0$ the model fuses information from visible and infrared light in a more balanced way.

When $α$ is non-zero, the model has a stronger ability to acquire detailed information from the visible-light image, which qualitatively aligns better with human visual discernment. Each of the two settings (zero and non-zero $α$) has its own merits and demerits. Our parameter settings and metric results can be found in Table 2. Through ablation experiments on the TNO dataset, we found that $α = 700$ yielded the best or tied-best value on all six metrics. Therefore, we set $α$ to 700 in the subsequent experiments.

Table 2.
The average values of the objective metrics obtained with different values of $α$ while keeping $ω_{I 1} = 3.0, ω_{I 2} = 3.0$ constant on the TNO dataset.

$α$	EN	MI	SF	SD	VIF	MS_SSIM
0	6.73	2.232	8.981	33.189	0.628	0.936
100	6.835	2.316	8.905	36.861	0.695	0.916
300	6.984	2.852	9.556	41.789	0.806	0.916
500	6.922	3.569	9.799	45.239	0.917	0.926
700	7.002	3.944	9.993	47.101	0.955	0.936
900	6.978	3.346	9.683	45.543	0.892	0.916
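The selection of $α$ from Table 2 can be checked with a short script that counts, for each setting, the metrics on which it attains the column maximum. The numbers are copied from Table 2; the helper name `count_best` and the rule of counting ties for every tied setting are assumptions made for this illustration.

```python
# Rows of Table 2: alpha -> (EN, MI, SF, SD, VIF, MS_SSIM); higher is better.
TABLE_2 = {
    0:   (6.730, 2.232, 8.981, 33.189, 0.628, 0.936),
    100: (6.835, 2.316, 8.905, 36.861, 0.695, 0.916),
    300: (6.984, 2.852, 9.556, 41.789, 0.806, 0.916),
    500: (6.922, 3.569, 9.799, 45.239, 0.917, 0.926),
    700: (7.002, 3.944, 9.993, 47.101, 0.955, 0.936),
    900: (6.978, 3.346, 9.683, 45.543, 0.892, 0.916),
}


def count_best(table):
    """Count, per setting, the metrics on which it reaches the column maximum (ties count)."""
    n_metrics = len(next(iter(table.values())))
    wins = {key: 0 for key in table}
    for m in range(n_metrics):
        best = max(row[m] for row in table.values())
        for key, row in table.items():
            if row[m] == best:
                wins[key] += 1
    return wins


wins = count_best(TABLE_2)
best_alpha = max(wins, key=wins.get)
print(wins, best_alpha)
```

With these values, $α = 700$ reaches the maximum on all six columns, sharing the MS_SSIM maximum with $α = 0$.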

The second aspect we study through ablation experiments is the setting of $ω_{I 1}$ and $ω_{I 2}$ in the feature loss function: $L_{feat} = \sum_{s = 1}^{S} ω_{1} (∥ φ_{f}^{s} - (ω_{I 1} φ_{I 1}^{s} + ω_{I 2} φ_{I 2}^{s}) ∥_{F}^{2} + ∥ φ_{f}^{s} - \max (φ_{I 1}^{s}, φ_{I 2}^{s}) ∥_{F}^{2})$, where $ω_{I 1}$ and $ω_{I 2}$ control the weights of visible and infrared light in the fusion process, respectively. Since we had fixed $β$ when determining the optimal value of $α$, and to keep the experiments independent and limit the complexity arising from parameter interactions, we also kept $β$ fixed and used the optimal $α$ value from the previous step when investigating $ω_{I 1}$ and $ω_{I 2}$. Because our proposed SCFusion module performs attention selection on both visible and infrared images, the two weights work better when set equally. Therefore, we only need to determine the optimal shared value experimentally. We varied both parameters over the range [0.5, 6.0].
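The feature loss above can be sketched in NumPy as follows. This is an illustrative reading of the formula as written (sums of squared differences, i.e. squared Frobenius norms), not the authors' training code; the function name, argument names, and the toy feature shapes are hypothetical. An implementation based on a mean squared error would differ from this one by a per-scale normalization constant.

```python
import numpy as np


def feature_loss(fused, vis, ir, scale_weights, w_vis=3.0, w_ir=3.0):
    """Sum over scales of the two squared Frobenius-norm terms of L_feat.

    fused, vis, ir: lists of arrays, one per scale, with matching shapes.
    scale_weights: one weight per scale (the omega_1 values).
    """
    total = 0.0
    for f, v, i, w in zip(fused, vis, ir, scale_weights):
        weighted_term = np.sum((f - (w_vis * v + w_ir * i)) ** 2)
        max_term = np.sum((f - np.maximum(v, i)) ** 2)  # element-wise maximum
        total += w * (weighted_term + max_term)
    return float(total)


# Toy features at four scales; channel counts and sizes are made up.
rng = np.random.default_rng(0)
shapes = [(8, 32, 32), (16, 16, 16), (32, 8, 8), (64, 4, 4)]
vis = [rng.random(s) for s in shapes]
ir = [rng.random(s) for s in shapes]
fused = [np.maximum(v, i) for v, i in zip(vis, ir)]
# Per-scale weights as stated in the text of Section 4.4.
print(feature_loss(fused, vis, ir, scale_weights=(1, 10, 100, 1000)))
```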

According to Table 3, our fusion algorithm performs better when the structural loss function is incorporated, i.e., when $α$ is non-zero. Although the structural loss function strengthens the model's fusion capability for visible light, SCFusion's selective attention to both visible and infrared light allows the algorithm to score higher across various metrics when a non-zero $α$ is used during training.

Table 3.

The average values of the objective metrics obtained with different parameters ($α$, $ω_{I 1}$, $ω_{I 2}$) on TNO.

$α$	$ω_{I 1}$	$ω_{I 2}$	EN	MI	SF	SD	VIF	MS_SSIM
0	0.5	0.5	6.746	2.232	8.942	33.233	0.628	0.933
	2.0	3.0	6.808	2.499	9.526	36.598	0.673	0.928
	2.0	4.0	6.861	2.827	9.955	39.594	0.734	0.929
	3.0	3.0	6.73	2.232	8.981	33.189	0.628	0.936
	4.0	6.0	6.813	2.416	9.344	36.578	0.669	0.927
	5.0	6.0	6.774	2.347	9.243	34.623	0.642	0.92
	6.0	3.0	6.641	2.291	7.768	30.139	0.661	0.959
	6.0	6.0	6.734	2.23	8.993	33.199	0.628	0.937
700	2.0	4.0	6.99	3.745	10.201	46.465	0.955	0.933
	3.0	3.0	7.002	3.944	9.993	47.101	0.955	0.936
	3.0	5.0	6.994	3.768	10.13	46.858	0.951	0.933
	4.0	6.0	6.943	2.92	9.546	42.389	0.827	0.928
	5.0	0.5	6.923	2.82	9.183	43.001	0.796	0.923
	5.0	6.0	6.934	2.921	9.328	42.231	0.82	0.92
	6.0	0.5	6.93	3.735	9.638	45.161	0.812	0.897
	6.0	3.0	6.916	2.591	8.34	40.09	0.731	0.914
	6.0	6.0	6.891	2.502	8.817	39.374	0.73	0.913

In terms of testing the feature loss function, our ablation experiments have revealed that equal weighting or slightly greater weighting towards visible light yields better fusion performance when compared to other weight settings. This implies a higher priority for visible light in our fusion algorithm. The experimental result aligns with our initial hypothesis, as the human eye is highly sensitive to visible light. Visible light carries more abundant environmental information, which is crucial for the fusion algorithm. On the other hand, infrared light is relatively weaker in terms of color and detail. Assigning a higher weight to visible light enhances the richness of information in the image, which is in line with the original design concept of our attention fusion module.

Following the determination of the optimal settings for the structural loss weight $α$ and the modal weights $ω_{I 1}$ , $ω_{I 2}$ in the feature loss, the final step to complete the tuning of all weight parameters in the loss function is to investigate the impact of the complementary loss weight $β$ . The parameter $β$ directly regulates the contribution of the complementary loss term to the training process, thereby influencing how the model focuses on and learns the complementary features of the source images during fusion. Based on the optimal $α$ and $ω_{I 1}$ , $ω_{I 2}$ values determined in the preceding two steps, we now conduct an ablation study on $β$ to find its optimal value, evaluating its performance across a range from 0 to 3 with a step size of 0.5.
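The step-by-step procedure (tune $α$, then the modal weights, then $β$, each with the others held fixed) is a coordinate-wise search. The sketch below shows the idea under stated assumptions: `toy_score` is a hypothetical stand-in for training the fusion module and evaluating it on TNO, and a single shared `omega` value replaces the pair $ω_{I 1}$, $ω_{I 2}$, although Table 3 also includes unequal pairs.

```python
from typing import Callable, Dict, Sequence

Params = Dict[str, float]


def coordinate_search(score: Callable[[Params], float],
                      grids: Dict[str, Sequence[float]],
                      init: Params) -> Params:
    """Tune one parameter at a time, keeping the others at their current best values."""
    best = dict(init)
    for name, values in grids.items():  # dicts preserve insertion order
        best[name] = max(values, key=lambda v: score({**best, name: v}))
    return best


def toy_score(p: Params) -> float:
    """Hypothetical stand-in for 'train the fusion module, then evaluate on TNO'."""
    return -(((p["alpha"] - 700) / 100) ** 2
             + (p["omega"] - 3.0) ** 2
             + (p["beta"] - 1.5) ** 2)


grids = {
    "alpha": [0, 100, 300, 500, 700, 900],    # values listed in Table 2
    "omega": [0.5, 2.0, 3.0, 4.0, 5.0, 6.0],  # shared modal weight within [0.5, 6.0]
    "beta": [0.5 * k for k in range(7)],      # 0 to 3 in steps of 0.5
}
best = coordinate_search(toy_score, grids,
                         init={"alpha": 0, "omega": 3.0, "beta": 1.0})
print(best)
```

A coordinate-wise search needs far fewer training runs than a full grid, at the cost of possibly missing interactions between parameters.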

Table 4 presents the results of our ablation study on the weight $β$ of the complementary loss term. First, comparing $β = 0$ with $β = 1.0$ shows that when the complementary loss term is not used ($β = 0$), the fusion metrics are markedly lower (for example, MI drops from 3.944 to 2.232 and VIF from 0.955 to 0.628). This indicates that the complementary loss term is important for guiding the model to learn and preserve complementary information from the source images, thereby improving fusion performance. Among the non-zero values, performance generally improves as $β$ increases from 0, peaks at $β = 1.5$, and declines for larger values. Compared with the initial setting of $β = 1.0$, $β = 1.5$ yields better EN, MI, SD, and VIF. Although SF and MS_SSIM decrease slightly, $β = 1.5$ offers the best overall balance across the six metrics. Therefore, we chose 1.5 as the final weight of the complementary loss term.

Table 4.

The average values of the objective metrics obtained with different values of $β$ while keeping $α = 700, ω_{I 1} = 3.0, ω_{I 2} = 3.0$ constant on the TNO dataset.

$β$	EN	MI	SF	SD	VIF	MS_SSIM
0	6.731	2.232	8.981	33.189	0.628	0.936
0.5	6.835	2.316	8.905	36.861	0.695	0.916
1.0	7.002	3.944	9.993	47.101	0.955	0.938
1.5	7.015	3.958	9.982	47.155	0.959	0.937
2.0	6.951	3.802	9.853	46.507	0.941	0.931
2.5	6.904	3.205	9.508	45.012	0.883	0.912
3.0	6.887	3.189	9.456	44.809	0.878	0.907

4.6 Results analysis on TNO

We achieved promising results on the TNO test set. SelectiveFusion employs a selective attention mechanism to suppress background noise and selectively enhance fusion features. As Table 5 shows, our model achieves the highest SD and VIF among all compared methods. SwinFusion, which leverages a CNN for local information extraction and a Swin Transformer for global information extraction, also yields excellent quantitative results.

Table 5.
Quantitative results on infrared and visible images collected from TNO.

Method	EN	MI	SF	SD	VIF	MS_SSIM
CrossFuse	6.755	3.351	9.148	40.124	0.789	0.897
DenseFuse	6.174	2.135	6.048	22.546	0.534	0.871
FusionGan	6.363	2.249	5.797	26.068	0.413	0.656
LRRNet	6.406	1.64	9.603	24.467	0.023	0
MFEIF	6.539	2.618	6.699	30.548	0.621	0.913
MixFuse	6.852	3.567	10.056	42.589	0.887	0.934
RFN-Nest	6.841	2.027	6.135	35.27	0.562	0.89
STDFusionNet	6.195	2.498	6.484	24.492	0.536	0.843
SwinFusion	6.681	3.208	10.283	37.841	0.708	0.938
TextFusion	6.911	3.782	9.011	42.115	0.921	0.913
U2Fusion	6.294	2.017	5.719	24.897	0.503	0.869
YDTR	6.227	2.416	6.925	24.056	0.546	0.867
Proposed	7.015	3.958	9.982	47.155	0.959	0.937

The qualitative comparison on TNO in Figure 5 focuses on the pedestrian regions of the scene. In the infrared image, the outlines and approximate body postures of two pedestrians can be discerned. In the visible light image, however, dim lighting makes it difficult to obtain clear information about either individual. Most fusion methods produce results closer to the infrared source image and concentrate on the outlines of the two pedestrians, which leaves the fused result relatively blurry. Moreover, other algorithms fail to preserve details such as the backpack of the pedestrian on the right, and often exhibit white artifacts. In contrast, the results of SelectiveFusion align better with human visual habits: the details and texture of the backpack are clearly visible in SelectiveFusion's fusion result.

Figure 5.

The experimental results on the TNO dataset. (a) Infrared; (b) Visible; (c) CrossFuse; (d) DenseFuse; (e) FusionGan; (f) LRRNet; (g) MFEIF; (h) MixFuse; (i) RFN-Nest; (j) STDFusionNet; (k) SwinFusion; (l) TextFusion; (m) U2Fusion; (n) YDTR; (o) Proposed.

4.7 Results analysis on RoadScene

Table 6 provides specific insights into our model's performance on the RoadScene test set. Our model achieves the first rank in quantitative metrics, specifically MI, SD, and VIF, and secures the second position in EN and SF. Simultaneously, SwinFusion, RFN-Nest, MFEIF and MixFuse also demonstrate their strengths in performance metrics on this test set.

Table 6.
Quantitative results on infrared and visible images collected from RoadScene.

Method	EN	MI	SF	SD	VIF	MS_SSIM
CrossFuse	6.958	3.485	9.88	46.905	0.672	1.072
DenseFuse	6.916	3.116	7.626	35.359	0.628	1.181
FusionGan	7.032	2.949	7.881	39.319	0.405	0.906
LRRNet	6.801	2.277	10.198	31.194	0.017	0
MFEIF	7.146	3.384	8.663	43.141	0.693	1.203
MixFuse	6.971	3.491	9.95	47.012	0.678	1.06
RFN-Nest	7.351	2.976	6.677	47.998	0.573	1.171
STDFusionNet	6.709	3.269	6.417	33.497	0.578	1.094
SwinFusion	6.964	3.488	10.92	46.981	0.675	1.178
TextFusion	6.96	3.429	10.202	46.99	0.676	1.039
U2Fusion	7.1	3.09	8.44	42.619	0.601	1.18
YDTR	6.965	3.226	9.183	38.18	0.633	1.162
Proposed	7.272	3.776	10.302	48.074	0.905	1.075

For a qualitative analysis, we use a representative image from the test set that primarily depicts roadside buildings with complex structures. The illumination from nearby and distant streetlights produces an uneven brightness distribution in the image. The fusion results are shown in Figure 6. Methods such as FusionGan, LRRNet, and STDFusionNet fail to distinctly differentiate the transitions between light and dark areas. SwinFusion, MFEIF and MixFuse outperform the other algorithms in fusion quality. However, although these algorithms distinguish light and dark areas, they exhibit some light distortion in the fusion results, which may not entirely align with human visual perception. MixFuse integrates more infrared information, resulting in slightly blurred edges. RFN-Nest demonstrates commendable fusion results on this particular image. In contrast, SelectiveFusion provides superior details in the window illuminated by the streetlight in the upper left corner, and the light transition on the illuminated wall in the lower right corner appears more natural.

Figure 6.

The experimental results on the RoadScene dataset. (a) Infrared; (b) Visible; (c) CrossFuse; (d) DenseFuse; (e) FusionGan; (f) LRRNet; (g) MFEIF; (h) MixFuse; (i) RFN-Nest; (j) STDFusionNet; (k) SwinFusion; (l) TextFusion; (m) U2Fusion; (n) YDTR; (o) Proposed.

4.8 Further analysis on VOT-2020

In the more challenging VOT-2020 test set scenarios, our model quantitatively continues to exhibit excellent performance. Table 7 illustrates that it achieves the top rank on four key metrics (MI, SD, VIF, and MS-SSIM) and is ranked second on EN.

As shown in Figure 7, high exposure has led to the complete loss of the car's contour information in the visible light image. SwinFusion and STDFusionNet lose nearly all detailed information. FusionGan and RFN-Nest capture more infrared information but may obscure the texture details of pedestrians. LRRNet, which is based on low-rank representation, is designed for effective image information extraction. Nevertheless, its fused output appears somewhat underexposed under high-exposure conditions. While MixFuse excels in information transmission, it needs further improvement in detail retention and overall visual quality. In contrast, SelectiveFusion's fusion result not only retains pedestrian information from the visible light image but also preserves the contour information of the car from the infrared image.

Table 7.
Quantitative results on infrared and visible images collected from VOT-2020.

Method	EN	MI	SF	SD	VIF	MS_SSIM
CrossFuse	6.307	3.579	7.324	40.977	0.676	0.911
DenseFuse	6.384	2.263	6.901	25.515	0.586	0.965
FusionGan	6.508	2.286	6.547	29.452	0.421	0.714
LRRNet	6.548	1.775	10.216	26.417	0.02	0.002
MFEIF	6.681	2.602	7.802	34.151	0.679	1.002
MixFuse	6.309	4.577	7.326	35.975	0.574	0.942
RFN-Nest	6.745	4.505	6.165	42.092	0.797	0.979
STDFusionNet	6.308	2.578	7.325	26.976	0.575	0.912
SwinFusion	6.877	3.298	11.813	40.821	0.755	0.996
TextFusion	6.511	2.581	6.322	27.979	0.578	0.914
U2Fusion	6.553	2.115	6.486	29.168	0.551	0.969
YDTR	6.454	2.495	8.415	29.01	0.623	0.962
Proposed	6.785	5.226	8.946	42.334	0.971	1.005

Figure 7.

The experimental results on the VOT-2020 dataset. (a) Infrared; (b) Visible; (c) CrossFuse; (d) DenseFuse; (e) FusionGan; (f) LRRNet; (g) MFEIF; (h) MixFuse; (i) RFN-Nest; (j) STDFusionNet; (k) SwinFusion; (l) TextFusion; (m) U2Fusion; (n) YDTR; (o) Proposed.

4.9 Application to semantic segmentation

We also investigated the benefit of fused images for semantic segmentation. Specifically, we selected 1000 pairs of infrared and visible images from the MSRS dataset.⁴² The pairs were fused with our SelectiveFusion network, yielding three distinct sets (infrared, visible, and fused) that served as separate training datasets for the lightweight semantic segmentation network BiSeNet v2.⁴³ Testing was then performed on 360 images, and the quantitative and visual results are presented in Table 8 and Figure 8, respectively. BiSeNet v2 trained on the fused results successfully segments pedestrians and vehicles that cannot be accurately identified in single-modal images. The adaptive selection fusion introduced by SelectiveFusion ensures the meaningful integration of information from low-quality source images. Consequently, the segmentation performance for pedestrians and vehicles surpasses that of single-modal images.
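For readers unfamiliar with the segmentation metric, the sketch below computes per-class intersection-over-union (IoU) for one pair of label maps with NumPy; the mean over classes gives mIoU. The function name and the class-id mapping (0 background, 1 car, 2 person) are assumptions for illustration. Benchmark scores are usually accumulated over the whole test set rather than per image, and the paper does not state which convention was used.

```python
import numpy as np


def per_class_iou(pred, target, num_classes):
    """Return IoU for each class id in [0, num_classes); NaN if a class is absent from both maps."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        inter = np.logical_and(p, t).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return np.array(ious, dtype=np.float64)


# Toy 2 x 2 label maps with hypothetical ids: 0 background, 1 car, 2 person.
pred = [[0, 1], [1, 2]]
target = [[0, 1], [2, 2]]
ious = per_class_iou(pred, target, num_classes=3)
print(ious, np.nanmean(ious))
```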

Figure 8.

Segmentation results for infrared, visible and fused images on the MSRS dataset.

5 Conclusion

In this paper, we propose a fusion model named SelectiveFusion, which follows a two-stage training framework. We introduce a novel fusion module, Selective Channel Fusion (SCFusion), designed for the selective weighting of channels in the two modalities. We also introduce a new loss function that constrains training from the perspective of complementary information. Through extensive experiments against representative algorithms, evaluated on six metrics, we demonstrate the advantages of SelectiveFusion. It adapts to diverse lighting conditions and fusion environments across the TNO, RoadScene, and VOT-2020 test sets, and remains competitive across different scenes and illumination scenarios.

Future research is expected to explore how deep learning can integrate the attributes of infrared and visible light more seamlessly, with particular attention to the challenges posed by significant illumination imbalances across diverse lighting conditions. Such work should help refine and enhance image fusion methodologies.

Table 8.

Segmentation performance (mIoU) of infrared, visible and fused images on the MSRS dataset.

	Background	Car	Person
Infrared Image	98.77%	88.17%	71.45%
Visible Image	98.63%	89.45%	63.26%
Fused Image	98.89%	90.23%	72.35%

Footnotes

Acknowledgements

This research was funded by the Program of Promoting the Development of University—Diligence Talents (Beijing Information Science and Technology University, grant no. 5112111145).

ORCID iDs

Xufan Miao

Mingxin Yu

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Zhang. TBEFN: a two-branch exposure-fusion network for low-light image enhancement. IEEE Trans Multimed 2020; 23: 4093–4105.
2. Liang, et al. Infrared and visible image fusion via detail preserving adversarial learning. Inf Fusion 2020; 54: 85–98.
3. Zhang. Multi-focus image fusion based on sparse feature matrix decomposition and morphological filtering. Opt Commun 2015; 342: 1–11.
4. Zhu, Zhang, et al. Learning local-global multi-graph descriptors for RGB-T object tracking. IEEE Trans Circuits Syst Video Technol 2018; 29: 2913–2926.
5. Bin, Chao, Guoyu. Efficient image fusion with approximate sparse representation. Int J Wavelets Multiresolution Inf Process 2016; 14: 1650024.
6. Kittler. Deep decomposition network for image processing: A case study for visible and infrared image fusion. arXiv preprint arXiv:2102.10526. 2021.
7. Kang. Image fusion with guided filtering. IEEE Trans Image Process 2013; 22: 2864–2875.
8. Zhou, Zhang, et al. Infrared and visible image fusion based on target extraction in the nonsubsampled contourlet transform domain. J Appl Remote Sens 2017; 11: 015011–015011.
9. Zhang, Liu, Sun, et al. IFCNN: a general image fusion framework based on convolutional neural network. Inf Fusion 2020; 54: 99–118.
10. Tang, Yuan, Zhang, et al. PIAFusion: a progressive infrared and visible image fusion network based on illumination aware. Inf Fusion 2022; 83: 79–92.
11. Multi-focus image fusion using dictionary learning and low-rank representation. In: Image and Graphics: 9th International Conference, ICIG 2017, Shanghai, China, September 13–15, 2017, Revised Selected Papers, Part I 9, 2017, pp.675–686: Springer International Publishing.
12. Huang, Liu, Van Der Maaten, et al. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp.4700–4708.
13. Tang, Yuan. Image fusion in the loop of high-level vision tasks: a semantic-aware real-time infrared and visible image fusion network. Inf Fusion 2022; 82: 28–42.
14. Kittler. Infrared and visible image fusion using a deep learning framework. In: 2018 24th international conference on pattern recognition (ICPR), 2018, pp.2705–2710: IEEE.
15. Liang, et al. RGB-T object tracking: benchmark and baseline. Pattern Recognit 2019; 96: 106977.
16. Xia, et al. RGB-T image saliency detection via collaborative graph learning. IEEE Trans Multimed 2019; 22: 160–173.
17. Liu, Chen, Ward, et al. Image fusion with convolutional sparse representation. IEEE Signal Process Lett 2016; 23: 1882–1886.
18. Infrared and visible image fusion methods and applications: a survey. Inf Fusion 2019; 45: 153–178.
19. Zhang. SDNet: a versatile squeeze-and-decomposition network for real-time image fusion. Int J Comput Vision 2021; 129: 2761–2785.
20. Kittler. RFN-Nest: an end-to-end residual fusion network for infrared and visible images. Inf Fusion 2021; 73: 72–86.
21. Densefuse: a fusion approach to infrared and visible images. IEEE Trans Image Process 2018; 28: 2614–2623.
22. Liang, et al. FusionGAN: a generative adversarial network for infrared and visible image fusion. Inf Fusion 2019; 48: 11–26.
23. Jiang, et al. U2Fusion: a unified unsupervised image fusion network. IEEE Trans Pattern Anal Mach Intell 2020; 44: 502–518.
24. Tang, Fan, et al. Swinfusion: cross-domain long-range learning for general image fusion via swin transformer. IEEE/CAA Journal of Automatica Sinica 2022; 9: 1200–1217.
25. Tang, Liu. YDTR: infrared and visible image fusion via Y-shape dynamic transformer. IEEE Trans Multimed 2022; 25: 5413–5428.
26. Song, Liu, et al. Mixfuse: an iterative mix-attention transformer for multi-modal image fusion. Expert Syst Appl 2025; 261: 125427.
27. Crossfuse: a novel cross attention mechanism based infrared and visible image fusion approach. Inf Fusion 2024; 103: 102147.
28. Liu, Fan, Jiang, et al. Learning a deep multi-scale feature ensemble and an edge-attention guidance for image fusion. IEEE Trans Circuits Syst Video Technol 2021; 32: 105–119.
29. Tang, et al. STDFusionnet: an infrared and visible image fusion network based on salient target detection. IEEE Trans Instrum Meas 2021; 70: 1–13.
30. et al. LRRNet: a novel representation learning guided fusion network for infrared and visible images. IEEE Trans Pattern Anal Mach Intell 2023; 45: 11040–11052.
31. Cheng, et al. Textfusion: unveiling the power of textual semantics for controllable image fusion. Inf Fusion 2025; 117: 102790.
32. Zhou, Rahman Siddiquee, Tajbakhsh, et al. Unet++: a nested u-net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, 2018, pp.3–11: Springer International Publishing.
33. Zhang, Xiao, et al. Rethinking the image fusion: a fast unified image fusion network based on proportional maintenance of gradient and intensity. Proc AAAI Conf Artif Intell 2020; 34: 12797–12804.
34. Durrani. Nestfuse: an infrared and visible image fusion architecture based on nest connection and spatial/channel attention models. IEEE Trans Instrum Meas 2020; 69: 9645–9656.
35. Meng, et al. Cap-yolo: channel attention based pruning yolo for coal mine real-time intelligent monitoring. Sensors 2022; 22: 4331.
36. Wang, et al. Selective kernel networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp.510–519.
37. Lin, Maire, Belongie, et al. Microsoft coco: common objects in context. In: Computer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6–12, 2014, proceedings, part v 13, 2014, pp.740–755: Springer International Publishing.
38. Hwang, Park, Kim, et al. Multispectral pedestrian detection: benchmark dataset and baseline. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp.1037–1045.
39. Toet. The TNO multiband image data collection. Data Brief 2017; 15: 249.
40. et al. Fusiondn: a unified densely connected network for image fusion. Proc AAAI Conf Artif Intell 2020; 34: 12484–12491.
41. Kristan, Matas, Leonardis, et al. The eighth visual object tracking VOT2020 challenge results. In: Proc. 16th Eur. Conf. Comput. Vis. Workshop, 2020.
42. Watanabe, Karasawa, et al. MFNet: towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp.5108–5115: IEEE.
43. Gao, Wang, et al. Bisenet v2: bilateral network with guided aggregation for real-time semantic segmentation. Int J Comput Vision 2021; 129: 3051–3068.
