Infrared and visible image fusion using two-layer generative adversarial network

Abstract

Infrared (IR) images can distinguish targets from their backgrounds based on differences in thermal radiation, whereas visible images provide texture details with high spatial resolution. Fusing IR and visible images combines these advantages and benefits applications such as target detection and recognition. This paper proposes a two-layer generative adversarial network (GAN) to fuse these two types of images. In the first layer, the network generates fused images using two GANs: one takes the IR image as input and the visible image as ground truth, and the other takes the visible image as input and the IR image as ground truth. In the second layer, the network feeds one of the two fused images generated in the first layer to a GAN as input and uses the other as ground truth to generate the final fused image. We verify our method on the TNO and INO data sets, comparing it with ten other methods on eight objective evaluation metrics. The results demonstrate that our method achieves better performance than the state of the art in preserving both texture details and thermal information.

Keywords

IR and visible images, image fusion, generative adversarial network, deep learning

1 Introduction

The fusion of IR and visible images integrates the inherent properties of both images and merges their vital information into a single fused image. As shown in Fig. 1 (a) and (b), visible images offer high resolution and rich texture information, while IR images, captured by IR sensors, normally have high contrast and are less affected by weather [1]. Fusing the advantages of both IR and visible images has been successful for target detection in military and civilian applications [2, 3]. In the past decades, based on various fusion strategies and theories [4], researchers have proposed many fusion algorithms, including traditional methods based on multi-scale transform [5, 6], sparse representation [7, 8], neural networks [9, 10], subspace [11, 12], and saliency [13, 14], as well as hybrid models [15, 16] and others [17, 18]. Although it is not difficult to establish pixel correlations between the source images and the fused image, the imaging principles of the two sources differ, so there are no correlations between the pixels of the source images themselves.

Fig. 1

(a) IR image; (b) Visible image; (c) Fused image by ‘FusionGAN’; (d) Fused image by our method, showing clear texture of the trees and the building.

In recent years, research on deep learning (DL) has become increasingly extensive, especially in image processing. DL-based image fusion methods can extract deep features automatically, making the whole fusion process easier. Liu et al. [19, 20] proposed methods that use convolutional neural networks (CNN) to fuse IR and visible images. These methods achieve good results, but the image generation process is difficult to control and some information is lost along the way. Other researchers proposed further DL algorithms for IR and visible image fusion [21, 22], based on a densely connected CNN and a CSR-based framework, respectively, and additional DL-based fusion methods have also been reported [23–26]. However, these DL-based methods are not end-to-end models, and they need to be trained on image datasets in advance. Whether the network weights are obtained by training or provided by mature feature extraction models, other transforms or operations are still needed to complete the final fusion process [27, 28]. To address this issue, Ma et al. [29] presented a novel IR and visible image fusion algorithm based on GAN, ‘FusionGAN’. It sets up an adversarial game between retaining the information of the IR image and that of the visible image [29]. The network model of ‘FusionGAN’ is effective and relatively simple, and its fusion performance can be further improved by optimizing the network structure or loss functions. Since ‘FusionGAN’ has no ground truth for deciding whether the generated data is real or fake, the model takes the visible image as the ground truth, which leads to fused images with more texture details but less thermal radiation information.

Unlike ‘FusionGAN’, in this paper we propose a two-layer GAN for this task. In the first layer, we feed the IR image and the visible image to separate generators (G) to generate two fused images, each using the other source image as ground truth. Then, in the second layer, we feed one of the two images generated by the first layer to G to generate the final fused image, and feed the other fused image to the discriminator (D) as the ground truth image. The entire structure of the network is illustrated in Fig. 2.

Fig. 2

Framework of the two-layer GAN.

Our experiments demonstrate that the proposed method produces fused images with improved performance over the state of the art by preserving both the texture details of the visible image and the thermal radiation information of the IR image. Figure 1 shows a preview of fusion samples from ‘FusionGAN’ and our method.

The main contributions of this paper are threefold. First, we propose a two-layer GAN and design the loss functions of G and D. Second, our model is an end-to-end model that automatically generates fused images without human participation. Third, we conduct experiments on public IR and visible data sets and compare our results qualitatively and quantitatively with the most advanced methods. In the results of our proposed model, hidden targets are clearly visible and the detailed texture information is rich.

The remainder of this paper is organized as follows. Section 2 reviews traditional and DL-based fusion methods. Section 3 presents the proposed approach, including the network structure, loss functions and training flow of our model. Section 4 gives our experimental results, including qualitative illustration, quantitative analysis and comparison with the state of the art. Section 5 concludes the paper and proposes future work.

2 Related works

In this section, first we review the conventional and DL-based methods for IR and visible image fusion, and then introduce the network structure and characteristics of GAN.

2.1 Conventional methods

The conventional methods mainly include multi-scale decomposition (MSD)-based methods [30, 31], sparse representation (SR)-based methods [32, 33], neural network-based methods [34, 35], and hybrid model-based methods [15, 36]. MSD-based methods decompose the source images into components at different scales, each component representing a sub-image at one scale. They then fuse the sub-images at each scale according to certain rules and finally apply the inverse multi-scale transform to obtain the final result. SR-based methods aim to build an over-complete dictionary from a large number of high-quality natural images and then use the dictionary to represent the source images. Most neural network-based methods adopt the pulse-coupled neural network (PCNN) or its variants. PCNN-based methods exploit the biological characteristics of PCNN to extract local details, and obtain a better final image when the gradient and phase information are considered in advance.

Many hybrid model-based methods combine the advantages of the above methods and try to avoid their respective disadvantages, such as blurred boundaries and missing details. The hybrid MSD method [37], the hybrid MSD and SR method [38], the hybrid multi-wavelet and PCNN method [39], and the hybrid MSD neural network and SR method [40] have made great contributions to this research. Zhao et al. [30] decomposed the image with multi-wavelets and reconstructed the fused image using the pulse number, which effectively improves the entropy, standard deviation, and quality measure. Ying [40] decomposed the image with the shift-invariant dual-tree complex shearlet transform and then fused the image with an SR rule, which excelled in both objective evaluation criteria and visual quality.

2.2 Deep learning method

In recent years, CNNs have achieved great success in many computer vision applications. Prabhakar et al. [41] proposed a DL method for fusing static multi-exposure images, which opened a novel way of performing information fusion with CNNs. For IR and visible image fusion, Liu et al. [42] proposed a Siamese convolutional network to obtain the weight map and used an image pyramid to handle the multi-scale problem. In addition, another fusion approach, “DenseFuse”, was proposed by Li et al. [43], using dense blocks to retain more information from the middle layers. DL-based methods have clearly made a breakthrough in IR and visible image fusion. However, approaches using CNNs must meet a critical precondition: the ground truth should be available in advance. On this premise, CNN techniques for IR and visible image fusion construct a deep model to determine the degree of fusion of every patch in the source images, and then calculate a weight map for generating the final image [42]. Li et al. [39] introduced DenseNet into the CNN to make full use of each convolutional layer and achieved good results. However, aspects such as the network architecture can still be improved. Recently, Ma et al. [29] proposed an innovative GAN-based method that formulates fusion as an adversarial game between keeping the thermal radiation information and keeping the texture detail information. Instead of pre-training, the method uses IR and visible image patches to train the network.

2.3 Generative adversarial networks

GAN [44] is based on a minimax game and provides a simple and effective way of estimating a target distribution and generating new samples. As illustrated in Fig. 3, a GAN includes two adversarial components: the G model and the D model. G generates a data distribution from input noise (P_z), and D estimates whether a given sample is real data or was produced by G.
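For reference, the minimax game of the original GAN [44] is commonly written as the value function below. The symbol P_data for the real data distribution and x for a real sample are notation assumed here; the text above only names the noise prior P_z.

$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim P_{data}} [\log D(x)] + \mathbb{E}_{z \sim P_{z}} [\log (1 - D(G(z)))]$

D is trained to raise this value by scoring real samples high and generated samples low, while G is trained to lower it by making D(G(z)) large.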

Fig. 3

The schematic diagram of the GAN.

The advantages and disadvantages of each type of method are listed in Table 1. Compared with the traditional and DL-based fusion methods, GAN has advantages such as not needing Markov chains or unrolled approximate inference networks for training and sample generation [44]. In this paper we exploit these advantages and propose a two-layer GAN for fusing IR and visible images.

Table 1

Comparison of advantages and disadvantages of each method

Method	Advantages	Disadvantages
Conventional	No pre-training needed	Blurred boundaries and missing details
CNN	Automatic extraction of image features	Needs pre-training; needs a large number of samples; not an end-to-end model
FusionGAN	Only involves back propagation in the network; an end-to-end model	Needs training; result loses IR image information
Ours	Simple network structure; an end-to-end model; result has both visible and IR information	Needs training

3 Our fusion framework

A major difficulty in this task is that there is no ground truth, so ‘FusionGAN’ selects the visible image as the ground truth, which causes many features of the IR image to be lost in the fused image during the adversarial process. To overcome this problem, we propose a two-layer GAN method. In the first layer of the network, we feed the IR image and the visible image to separate Gs to generate two fused images, each using the other source image as ground truth. Then, in the second layer, we feed one of the two images generated by the first layer to G to generate the final fused image, and feed the other to D as the ground truth. The structure of the two-layer GAN is illustrated in Fig. 2. It is worth mentioning that, during the experiments, we also swapped the inputs of the second layer, i.e. we took the second fusion result of the first layer as the input of G and the first fusion result as the ground truth. The final fusion results were quite similar, thanks to the symmetry of the proposed structure.

3.1 The structure of G

Our model has three Gs that have the same structure as shown in Fig. 4.

Fig. 4

The network of the generator.

G is a simple five-layer CNN with 3×3 filters in the first four layers and a 1×1 filter in the last layer. The stride in each layer is set to 1, and no padding is applied in the convolutions. To keep the details of the source image, only convolutional layers are used, without downsampling, which also keeps the input and output images the same size [45]. In addition, to avoid the vanishing gradient problem, we follow the rules of the deep convolutional GAN [46] for batch normalization and activation functions. To overcome the sensitivity to data initialization, we use batch normalization in the first four layers, which makes our model more stable and helps the gradients back-propagate to every layer effectively. For the activation functions, we use leaky ReLU in the first four layers and tanh in the last layer.

3.2 The structure of D

Our model has three Ds of the same structure but different input images and ground truths, as illustrated in Fig. 2. The D’s structure is shown below in Fig. 5.

Fig. 5

The network structure of the discriminator.

D is a simple five-layer CNN with 3×3 filters in the first four layers, where the stride is set to 2 without padding. This differs from the generator network because the discriminator is a classifier: it first extracts feature maps from the input image and then classifies them. Setting the stride to 2 therefore plays the same role as a pooling layer. In order not to introduce noise into our model, we do not pad the input image. We use batch normalization and the leaky ReLU activation function in the first four layers. The last layer is a linear layer for classification.
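The effect of stride-2, unpadded 3×3 convolutions on the feature-map size can be checked with a few lines of arithmetic. The sketch below uses the standard convolution output-size formula; the 128-pixel input side is an assumption borrowed from the training patch size, and the function names are illustrative only.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial size after one convolution: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_map_sizes(size: int, layers: list) -> list:
    """Apply (kernel, stride) layers in order and record the size after each one."""
    sizes = []
    for kernel, stride in layers:
        size = conv_output_size(size, kernel, stride)
        sizes.append(size)
    return sizes


# Four 3x3 convolutions with stride 2 and no padding, as described for D.
# The 128-pixel input side is an assumption, not a value stated for D.
d_layers = [(3, 2)] * 4
print(feature_map_sizes(128, d_layers))  # [63, 31, 15, 7]
```

Under these assumptions each layer roughly halves the side length, which is the pooling-like behaviour the paragraph above refers to.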

The core of a GAN is the adversarial loss, which establishes the adversarial relationship between G and D. Below we introduce the adversarial loss of our model in detail.

3.3 The loss function of generator

Since the network has three Gs and three Ds, we introduce their loss functions in turn. Motivated by ‘FusionGAN’ [29], we split each loss into two components: the adversarial loss and the content loss for G, and the positive loss and the negative loss for D.

a) The first generator

The input of the first G is the IR image, and its output is the fused image I_IRV. The loss function of the first G consists of two terms: $G L_{1} = L_{1}^{A} + L_{1}^{C}$ (1) where $G L_{1}$ denotes the total loss of the first G. The first term $L_{1}^{A}$ represents the adversarial loss, defined as: $L_{1}^{A} = \frac{1}{N} \sum_{n = 1}^{N} {(D (I_{IRV}) - ℓ_{1})}^{2}$ (2)

where N represents the number of fused images and D (I_IRV) denotes the discrimination result, which reflects how well the fused images are distinguished from the visible images. ℓ₁ is the value that the discriminator is expected to assign to the fake data produced by the generator; in our method it is set to 0.8.

The second term $L_{1}^{C}$ represents the content loss. It is built on pixel intensities, since these represent the thermal radiation information. Because the edges of regions in the image are particularly important for target detection, we also add a weighted edge term to the formula. The edge point set of the fused image is $I_{IRV}^{'}$ , and the corresponding edge point set in the IR image I_IR is $I_{IR}^{'}$ . The content loss is then defined as: $L_{1}^{C} = \frac{1}{S} \| I_{IRV} - I_{IR} \|_{F}^{2} + λ \frac{1}{N} \sum_{i = 0}^{N} (I_{IRV}^{'} - I_{IR}^{'})^{2}$ (3)

where S represents the area of the input images, || · ||_F stands for the matrix Frobenius norm, and λ is a weight, which we set to 2 in our method.
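As a rough illustration of Eqs. (2) and (3), the two terms of the first generator loss can be sketched in NumPy as follows. This is a minimal sketch under assumptions, not the authors' implementation: the edge point sets are represented by a boolean mask `edge_mask` (how the edges are detected is not specified in the text), the edge sum is taken as a mean over the masked pixels, and the function names are hypothetical.

```python
import numpy as np


def adversarial_loss(d_scores: np.ndarray, label: float = 0.8) -> float:
    """Eq. (2): mean squared distance between D's scores on fused images and the label."""
    return float(np.mean((d_scores - label) ** 2))


def content_loss_ir(fused: np.ndarray, ir: np.ndarray, edge_mask: np.ndarray,
                    lam: float = 2.0) -> float:
    """Eq. (3): intensity term over the whole image plus a weighted edge term."""
    area = fused.size  # S, the number of pixels
    intensity = np.sum((fused - ir) ** 2) / area  # squared Frobenius norm divided by S
    # Assumption: edge points are given as a boolean mask of the same shape.
    diff_on_edges = (fused - ir)[edge_mask]
    edge = np.mean(diff_on_edges ** 2) if diff_on_edges.size else 0.0
    return float(intensity + lam * edge)


# Toy example: a uniform 0.5 offset, with the first row marked as edge pixels.
fused = np.full((4, 4), 0.5)
ir = np.zeros((4, 4))
mask = np.zeros((4, 4), dtype=bool)
mask[0, :] = True
total = adversarial_loss(np.array([0.8, 0.6])) + content_loss_ir(fused, ir, mask)
print(total)
```

In the toy example the intensity term is 0.25 and the edge term is 0.25, so the content loss is 0.25 + 2 × 0.25 = 0.75 before the adversarial term is added.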

b) The second generator

The input of the second G is the visible image, and G generates the image I_VIR. Similar to the first G, the loss function of the second G consists of two terms: $G L_{2} = L_{2}^{A} + L_{2}^{C}$ (4) $L_{2}^{A} = \frac{1}{N} \sum_{n = 1}^{N} {(D (I_{VIR}) - ℓ_{2})}^{2}$ (5) $L_{2}^{C} = \frac{1}{S} \| \nabla I_{VIR} - \nabla I_{V} \|_{F}^{2}$ (6)

where I_V represents the visible image and ∇ denotes the gradient operator. The remaining symbols in these three formulas have the same meaning as for the first G.
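The gradient-based content loss of Eq. (6) can be sketched in the same way. The text does not say how ∇ is discretized, so the sketch below assumes central finite differences via `numpy.gradient` and takes the Frobenius norm over the stacked vertical and horizontal derivatives; both choices, and the function names, are assumptions for illustration.

```python
import numpy as np


def image_gradient(img: np.ndarray) -> np.ndarray:
    """Stack vertical and horizontal finite differences (one possible gradient operator)."""
    gy, gx = np.gradient(img.astype(np.float64))
    return np.stack([gy, gx])


def content_loss_visible(fused: np.ndarray, visible: np.ndarray) -> float:
    """Eq. (6): squared Frobenius norm of the gradient difference, divided by the area S."""
    grad_diff = image_gradient(fused) - image_gradient(visible)
    return float(np.sum(grad_diff ** 2) / fused.size)


# A horizontal ramp with unit slope against a flat image gives a loss of 1.0.
ramp = np.tile(np.arange(4.0), (4, 1))
flat = np.zeros((4, 4))
print(content_loss_visible(ramp, flat))  # 1.0
```

Because only gradients are compared, a constant brightness offset between the two images leaves this loss at zero, which is why this term carries texture rather than intensity information.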

c) The third generator

The input of the third G is the fused image I_IRV generated by the first G, and its output is the final image I_ff. The loss function of the third G consists of two terms: $G L_{3} = L_{3}^{A} + L_{3}^{C}$ (7) $L_{3}^{A} = \frac{1}{N} \sum_{n = 1}^{N} {(D (I_{ff}) - ℓ_{3})}^{2}$ (8) $L_{3}^{C} = \frac{1}{S} \left( \| I_{ff} - I_{IRV} \|_{F}^{2} + λ \frac{1}{N} \sum_{i = 0}^{N} (I_{ff}^{'} - I_{IRV}^{'})^{2} + \| ζ \nabla I_{ff} - \nabla I_{VIR} \|_{F}^{2} \right)$ (9)

The symbols in these three formulas have the same meaning as above. ζ is a positive parameter that controls the balance between the terms; in our method it is set to 100. The larger ζ is, the more information of the second fused image is kept.

3.4 The loss function of discriminator

The loss function of D contains two terms: the deviation of the ground truth image from its expected label and the deviation of the fused image from its expected label. The three Ds share the same loss function: $D L = \frac{1}{N} \sum_{n = 1}^{N} {(D (I_{gt}) - α)}^{2} + \frac{1}{N} \sum_{n = 1}^{N} {(D (I_{f}) - β)}^{2}$ (10)

where I_gt represents the ground truth and I_f represents the fused image, and α and β are the target labels for I_gt and I_f, respectively. We set α to 0.8, since we regard the ground truth as the real image and therefore want its label close to 1. Conversely, we set β to 0.2, since we regard the fused image as the fake image and therefore want its label close to 0. This setting balances the loss function. We want D to distinguish the fake data from the ground truth, so D is optimized to minimize D (I_f). A smaller $D L$ means that the fused image retains more details of the ground truth. Through the adversarial game of the fusion process, the final image gathers comprehensive information from the input images.
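Eq. (10) is a least-squares loss with soft labels, and a minimal plain-Python sketch of it is given below. The function name and the use of plain lists of discriminator scores are assumptions made for illustration.

```python
def discriminator_loss(scores_gt, scores_fused, alpha=0.8, beta=0.2):
    """Eq. (10): squared deviation of D's scores from the labels alpha (ground truth) and beta (fused)."""
    positive = sum((s - alpha) ** 2 for s in scores_gt) / len(scores_gt)
    negative = sum((s - beta) ** 2 for s in scores_fused) / len(scores_fused)
    return positive + negative


# D scoring exactly at the soft labels gives zero loss.
print(discriminator_loss([0.8, 0.8], [0.2, 0.2]))  # 0.0
# Hard 1/0 decisions are penalized slightly because the labels are 0.8 and 0.2.
print(discriminator_loss([1.0], [0.0]))
```

With these labels, a discriminator that outputs hard 1 and 0 decisions still incurs a loss of about 0.08, so the soft labels discourage overconfident scores.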

4 Experiments and results

In our experiments, we selected 41 pairs of IR and visible images from the TNO database [47], which includes intensified visual, near-IR, and long-wave IR (thermal) night-time imagery of different military-relevant scenarios. All images are already pre-aligned. In most of these image pairs, the IR image has evident pixel intensity and the visible image has abundant details. We divided the 41 pairs of images into two parts: 31 pairs for training and 10 pairs for testing. Since 31 pairs of images are not enough to train a good model, we crop each image with a stride of 12 into patches of the same size, 128×128. As a result, we obtained 35,283 pairs of image patches as the training data set.
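The sliding-window cropping described above can be sketched as follows. The patch size (128) and stride (12) come from the text; the function name and the 140×152 image size in the usage line are hypothetical and only serve to show how the number of patches grows with image size.

```python
def patch_origins(height: int, width: int, patch: int = 128, stride: int = 12):
    """Top-left corners of all patch x patch windows that fit when sliding with the given stride."""
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]


# Hypothetical 140x152 image: 2 vertical positions times 3 horizontal positions.
origins = patch_origins(140, 152)
print(len(origins), origins[:3])  # 6 [(0, 0), (0, 12), (0, 24)]
```

Each origin (r, c) selects the window `image[r:r + 128, c:c + 128]`, and the same origins would be applied to the IR and visible image of a pair so that the patches stay aligned.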

4.1 Details of training

During training, we set the iteration number k to 10 and the step number to 2, and cropped the training images into 128×128 batches without overlapping. While the iterations are not finished, we feed the image batches to G, which generates fused image batches; otherwise, the training ends. We then feed the IR batches and the fused image batches to D, which yields the loss of G and the loss of D. At the end of the iterations, G generates the fused image. This procedure trains on one image patch and is repeated for all image patches. Figure 6 shows the GAN’s training process.

Fig. 6

Training process of GAN.

In the training process, we train three pairs of G and D. The first pair takes visible images as input and IR images as ground truth. After training, the first network can generate fused images that contain more thermal radiation information. The second pair takes IR images as input and visible images as ground truth. After training, it can generate fused images that contain more texture details. The third pair takes the images generated by the first network as input and the images generated by the second network as ground truth. During this process we use the method in [48] to optimize the discriminator. After training, the final fused images retain salient details from both IR and visible images.

4.2 Objective evaluation metrics

The fusion task is difficult to evaluate objectively because there is no standard ground truth. Researchers therefore apply several fusion metrics together to make an overall evaluation [49]. In this paper, we used eight objective metrics to assess the results. Entropy (EN) denotes the amount of information in the final result. Mutual information (MI) evaluates the mutual information between the source images and the result. Structural similarity (SSIM) assesses the average structural similarity between the input images and the final image. Spatial frequency (SF) [50] assesses the spatial frequency of the result. Standard deviation (SD) [51] assesses the contrast of the result, which influences visual attention. The sum of the correlations of differences (SCD) [52] is an independent index for judging the amount of information transmitted from the source images to the fused image. The feature mutual information (FMI) [53] measures the mutual information between image features. QABF [54] is a local measure used to estimate how much significant information is retained in the fused image. For all eight metrics, a larger value means a better result.
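Three of these metrics (EN, SD and SF) depend only on the fused image and are simple to compute. The NumPy sketch below uses common textbook definitions and is not taken from the evaluation code behind the tables; the exact normalization of SF varies between implementations, so absolute values need not match those reported here. The 8-bit grey-level assumption and the function names are also assumptions.

```python
import numpy as np


def entropy(img: np.ndarray, levels: int = 256) -> float:
    """EN: Shannon entropy in bits of the grey-level histogram (integer image assumed)."""
    hist = np.bincount(img.ravel(), minlength=levels).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]  # skip empty bins so that log2 is defined
    return float(-np.sum(p * np.log2(p)))


def standard_deviation(img: np.ndarray) -> float:
    """SD: spread of pixel intensities around their mean, a proxy for contrast."""
    return float(np.std(img.astype(np.float64)))


def spatial_frequency(img: np.ndarray) -> float:
    """SF: root of the summed squared row and column frequencies (mean of squared differences)."""
    img = img.astype(np.float64)  # avoid unsigned wrap-around in the differences
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))  # horizontal changes
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))  # vertical changes
    return float(np.sqrt(rf ** 2 + cf ** 2))


# 2x2 checkerboard: two equally likely grey levels give exactly 1 bit of entropy.
board = np.array([[0, 255], [255, 0]], dtype=np.uint8)
print(entropy(board), standard_deviation(board), spatial_frequency(board))
```

The remaining metrics (MI, SSIM, SCD, FMI and QABF) additionally compare the fused image with both source images and are therefore not covered by this sketch.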

4.3 Subjective evaluation

To illustrate and compare the effects clearly, we obtained the fusion results of the compared methods on ten image pairs, as shown in Figs. 7 and 8.

Fig. 7

The results for the first 5 groups. (a) IR image, (b) VIS Image, (c) NSST-PAPCNN, (d) NSCT, (e) CVT, (f) CSR, (g) DTCWT, (h) CBF, (i) LATLRR, (j) WLS, (k) CNN, (l) FGAN, (m) OURS. For a clearer comparison, a small region in each image is marked in red and enlarged.

Fig. 8

The results for the last 5 groups. (a) IR image, (b) VIS Image, (c) NSST-PAPCNN, (d) NSCT, (e) CVT, (f) CSR, (g) DTCWT, (h) CBF, (i) LATLRR, (j) WLS, (k) CNN, (l) FGAN, (m) OURS. For a clearer comparison, a small region in each image is marked in red and enlarged.

The compared methods in our experiment include NSST-PAPCNN [55], nonsubsampled contourlet transform (NSCT) [56], curvelet transform (CVT) [57], convolutional sparse representation (CSR) [58], dual-tree complex wavelet transform (DTCWT) [59], cross bilateral filter (CBF) [60], latent low-rank representation (LATLRR) [61], weighted least square (WLS) [62], CNN-based fusion [63], and the GAN-based method ‘FusionGAN’ [29]. We obtained the results using the code released by the respective authors.

The first two rows in Figs. 7 and 8 are the IR images and visible images, and the last row shows the final results of our method. Overall, the results show that all the methods can fuse the visible image and the IR image well to some extent. However, compared with the other methods, ‘FusionGAN’ and ours make the target areas (such as buildings, people and cars) more prominent in the fused images, which is conducive to automatic target detection and localization. This can be attributed to the fact that ‘FusionGAN’ and our method retain more IR information, while the other compared methods focus more on exploiting the texture details in the visible images.

Comparing our method with ‘FusionGAN’, we can see that our final images contain slightly richer details and are better suited to visual perception, as shown in the red boxes in Figs. 7 and 8. For example, the soldier in the third column of Fig. 7 is presented more clearly by our method than by ‘FusionGAN’. In the second column of Fig. 8, the hand of the umbrella bearer is fused more appropriately and more clearly by our method than by ‘FusionGAN’. In the third column of Fig. 8, the corner of the roof highlighted in the box is fuzzy in the ‘FusionGAN’ result, while our result is sharper. This demonstrates our method’s excellent performance in simultaneously retaining IR image information and visible image details.

For a quantitative comparison, we evaluated all methods with the eight metrics mentioned above. The results are plotted in Fig. 9, and the averages over the ten fused results for the eight metrics are listed in Table 2.

Fig. 9

Quantitative values of eight metrics for TNO data set.

Table 2

The average values of ten fused images from TNO for the eight metrics

Methods	EN	MI	QABF	FMI	SSIM	SD	SF	SCD
NSST_PAPCNN	7.062735	14.12547	0.349003	0.893138	0.6545	41.70837	0.487205	1.44104
NSCT	6.67294	13.34588	0.442946	0.899258	0.678554	30.47823	0.559955	1.56964
CVT	6.684161	13.36832	0.4151	0.895826	0.664334	30.46991	0.515316	1.55008
CSR	7.308307	14.61661	0.441369	0.867851	0.607427	49.43395	0.502694	1.06805
DTCWT	6.648422	13.29684	0.410618	0.894699	0.664047	30.05842	0.562504	1.54710
CBF	6.96852	13.93704	0.354673	0.857701	0.547844	37.91702	0.511339	1.27858
LATLRR	6.618689	13.23738	0.408749	0.891143	0.730806	30.45902	0.583359	1.60302
WLS	6.887689	13.77538	0.413503	0.88595	0.674388	38.43261	0.527757	1.69113
CNN	7.195607	14.39121	0.44262	0.89764	0.662346	48.1034	0.498179	1.63320
FusionGAN	6.49529	12.99058	0.21183	0.880279	0.639393	29.35145	0.630014	1.32043
Ours	7.321885	14.25377	0.443773	0.871807	0.726497	49.78707	0.630342	1.56132

In Table 2, the best values for each metric are presented in bold face. They show that our method achieves the best performance in EN, QABF, SD, and SF. For the other metrics, the performance of our method is not far from the best. High EN and SD values indicate that our fused images have higher contrast and more abundant information, while a high QABF means that our fused images are superior in conspicuousness. A high SF indicates that our method generates images that contain more texture details. However, our method has a slightly lower SCD; improving it by further optimizing the network structure and loss function is left for future work.

4.4 Results on INO database

To further verify our algorithm, we tested our method and the other ten comparison methods on the INO database, selecting ten pairs of visible and IR images from each of four videos for qualitative and quantitative comparison. Fig. 10 shows the fusion results for the four selected videos, and Table 3 lists the average of ten fused results for the eight metrics.

Fig. 10

INO fusion results. (a) IR image, (b) VIS Image, (c) NSST-PAPCNN, (d) NSCT, (e) CVT, (f) CSR, (g) DTCWT, (h) CBF, (i) LATLRR, (j) WLS, (k) CNN, (l) FGAN, (m) OURS.

Table 3

The average values of 4 fused images from INO for the 8 metrics

Methods	EN	MI	QABF	FMI	SSIM	SD	SF	SCD
NSST_PAPCNN	7.275915	14.04202	0.386598	0.898114	0.366433	34.97864	0.716401632	1.25178864
NSCT	7.363805	14.6824	0.465309	0.921793	0.427563	49.45925	0.575094	1.556713232
CVT	7.120348	14.64961	0.457866	0.911596	0.452442	40.86526	0.152349	1.108957883
CSR	7.0702	13.17622	0.462721	0.918571	0.414732	29.86514	0.67367	1.348763543
DTCWT	7.067095	13.11118	0.467739	0.928704	0.420836	29.7101	0.701462	1.351161322
CBF	6.962802	13.09559	0.452955	0.895397	0.491911	26.38268	0.705848	1.443358454
LATLRR	7.095435	13.12766	0.479159	0.918198	0.432656	30.14905	0.665809	1.369914188
WLS	7.359635	14.38894	0.450735	0.920404	0.444697	29.06411	0.539127	1.573150592
CNN	7.17201	14.04793	0.451882	0.895658	0.44378	38.26709	0.57353	1.628561598
FusionGAN	6.924062	12.85112	0.356739	0.918253	0.457345	26.49029	0.426476	1.379237387
Proposed	7.48461	14.00285	0.496812	0.883055	0.592627	52.95247	0.730252	1.276013388

As can be seen, the other ten methods all retain texture information well but lack thermal radiation information. Only our method preserves the distribution of thermal radiation in the IR images, such as the pixel intensity of human body areas, car tires and trees. Table 3 shows the quantitative comparison results for the eight metrics; our method again achieves the best EN, SSIM, QABF, SD, and SF, and the average value of these evaluation metrics is the largest among the compared methods. This also demonstrates the robustness of our method. In addition, Table 4 compares the run times of the different fusion methods. CNN, ‘FusionGAN’ and our method are pre-trained models that run on the GPU, while all the other methods run on the CPU. Each value denotes the mean run time of a method on a dataset, and our method achieves efficiency comparable to that of the other ten methods.

Table 4

Run time comparison of ten methods on the datasets from the TNO and the INO database. (unit: second)

Methods	TNO	INO
NSST_PAPCNN	3.5×10^-1	3.1×10^-1
NSCT	2.78×10^-1	2.14×10^-1
CVT	1.24	5.43×10^-1
CSR	2.68×10^-1	2.24×10^-1
DTCWT	4.18×10^-1	3.25×10^-1
CBF	3.12×10^-1	2.18×10^-1
LATLRR	2.88×10^-1	1.48×10^-1
WLS	3.18×10^-1	1.88×10^-1
CNN	2.15	1.98
FusionGAN	2.543×10^-1	3.2×10^-2
Proposed	2.3×10^-1	2.89×10^-2

5 Conclusions

Inspired by ‘FusionGAN’, we propose a two-layer GAN for IR and visible image fusion. The proposed method performs very well in simultaneously retaining IR image information and visible image details. Our experiments demonstrate that, compared with ‘FusionGAN’ and other existing approaches, our fusion results highlight salient information in the images, such as potential targets, more clearly. This is important for target detection and other computer vision applications. The quantitative comparisons also indicate that our method not only achieves better visual effects but also retains more details from the source images.

In the future, we will optimize the structure and the loss functions of our framework so that the fused results contain more texture details and target radiation information.

Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grant 61572392 and by the National and Local Funds for New Networks and Measurement and Control Laboratories under Grant GSYSJ2017001.

References

1. Wang Zhishe, Xu Jiawei, Jiang Xiaolin and Yan Xiaomei, Infrared and visible image fusion via hybrid decomposition of NSCT and morphological sequential toggle operator, Optik, 2020.

2. Zhao Cheng, Huang Yongdong and Qiu Shi, Infrared and visible image fusion algorithm based on saliency detection and adaptive double-channel spiking cortical model, Infrared Physics & Technology, 2019.

3. Jin Xin, Jiang Qian, Yao Shaowen, et al., Infrared and visual image fusion method based on discrete cosine transform and local spatial frequency in discrete stationary wavelet transform domain, Infrared Physics & Technology 88 (2018).

4. Ma, Ma and Li, Infrared and visible image fusion methods and applications: A survey, Inf. Fusion 45 (2018), 153–178.

5. Guo, Chen, Li, et al., Weighted sparse representation multi-scale transform fusion algorithm for high dynamic range imaging with a low-light dual-channel camera, Optics Express, 2019.

6. Jun, Chen, et al., Infrared and visible image fusion based on target-enhanced multiscale transform decomposition, Information Sciences 508 (2020), 64–78.

7. Yubin, Mei, Hao, et al., Multi-exposure image fusion based on tensor decomposition and convolution sparse representation, Opto-Electronic Engineering, 2019.

8. Xinxiang L.I., Longbo, Lei, et al., Image fusion method based on convolutional sparse representation and morphological component analysis, Intelligent Computer and Applications, 2019.

9. Mustafa H.T., Yang and Zareapoor, Multi-scale convolutional neural network for multi-focus image fusion, Image and Vision Computing 85(May) (2019), 26–35.

10. Yang, Nie, Huang, et al., Multi-level features convolutional neural network for multi-focus image fusion, IEEE Transactions on Computational Imaging (2019), 1–1.

11. Kong, Lei and Zhao, Adaptive fusion method of visible light and infrared images based on non-subsampled shearlet transform and fast non-negative matrix factorization, Infrared Phys. Technol. 67 (2014), 161–172.

12. WANG Wei-zhe and DAI Ye-yong, Improvement of the edge fusion algorithm for subspace of remote sensing image, Science & Technology of West China, 2015.

13. Zihui, Yuxing, Jianlin, et al., Image Fusion of Infrared and Visible Images Based on Saliency Map, Infrared Technology, 2019.

14. Zhai-Sheng, Dong-Ming, Ren-Can, et al., Infrared and visible image fusion using residual network and visual saliency detection, Journal of Yunnan University (Natural Sciences Edition) (2019).

15. Latreche, Saadi, Kious, et al., A novel hybrid image fusion method based on integer lifting wavelet and discrete cosine transformer for visual sensor networks, Multimedia Tools and Applications 78(8) (2019), 10865–10887.

16. Ma, Zhou, Wang and Zong, Infrared and visible image fusion based on visual saliency map and weighted least square optimization, Infrared Phys. Technol. 82 (2017), 8–17.

17. Wei, Tan, Huixin, et al., Infrared and visible image perceptive fusion through multi-level Gaussian curvature filtering image decomposition, Applied Optics, 2019.

18. Farahnakian, Poikonen, Laurinen, et al., Visible and Infrared Image Fusion Framework based on RetinaNet for Marine Environment, 22th International Conference on Information Fusion (FUSION), IEEE, 2020.

19. Liu, Chen, Peng and Wang, Multi-focus image fusion with a deep convolutional neural network, Inf. Fusion 36 (2017), 191–207.

20. Liu, Chen, Cheng, Peng and Wang, Infrared and visible image fusion with convolutional neural networks, Int. J. Wavelets Multiresolution Inf. Process. 16 (2018), 1850018.

21. Liu, Chen, Ward R.K. and Wang Z.J.

, Image fusion with convolutional sparse representation, IEEE Signal Process. Lett. 23(12) (2016), 1882–1886.

22.

Huang

, Liu

, van der Maaten

and Weinberger

K.Q.

, Ieee, Densely connected convolutional networks, in: 30th IEEE/CVF Conference on Comuter Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition (2017), 2261–2269.

23.

Singh

and Kaur

, Fusion of medical images using deep belief networks[J], Cluster Computing 23(2) (2020).

24.

Singh, Kaur, Singh. Remote sensing image fusion using fuzzy logic and gyrator transform[J], Remote Sensing Letters, 2018.

25.

Kaur

and Singh

, Multi-modality medical image fusion technique using multi-objective differential evolution based deep neural networks[J], Journal of Ambient Intelligence and Humanized Computing (2020), 1–11.

26.

Singh

, Deepak Garg and Husanbir Singh Pannu, Efficient Landsat image fusion using fuzzy and stationary discrete wavelet transform[J], Image Science Journal, 2017.

27.

Liu

, Chen

, Cheng

, Peng

and Wang

, Infrared and visible image fusion with convolutional neural networks, International Journal of Wavelets Multiresolution and Information Processing 16(3) (2018).

28.

Hui Li, Xiao-Jun Wu and Josef Kittler, Infrared and visible image fusion using a deep learning framework. arXiv preprint arXiv:1804.06992, 2018.

29.

Jiayi

, Wei

, Pengwei

, Chang

and Junjun

, FusionGAN: A generative adversarial network for infrared and image fusion, Inf. Fusion 48 (2018), 11–26.

30.

, Ma

, Yong

, et al., Fast multi-scale structural patch decomposition for multi-exposure image fusion, IEEE Transactions on Image Processing 2020, PP (99):1–1.

31.

, Yang

and Hu

, Performance comparison of different multi-resolution transforms for image fusion, Inf. Fusion 12(2) (2011), 74–84.

32.

Liu

, Chen

, Wang

, et al., Deep learning for pixel-level image fusion: Recent advances and future prospects, Information Fusion 42 (2018), 158–173.

33.

, Yin

and Fang

, Group-sparse representation with dictionary learning for medical image denoising and fusion, IEEE Trans. Biomed. Eng. 59(12) (2012), 3450–3459.

34.

Rout

, Nahak

, Priyadarshinee

, et al., A Deep Learning Approach for SAR Image Fusion. 2019 2nd International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT). IEEE, 2020.

35.

Xiang

, Yan

and Gao

, A fusion algorithm for infrared and visible images based on adaptive dual-channel unit-linking pcnn in nsct domain, Infrared Phys. Technol. 69 (2015), 53–61.

36.

, Zhou

, Wang

and Zong

, Infrared and visible image fusion based on visual saliency map and weighted least square optimization, Infrared Phys. Technol. 82 (2017), 8–17.

37.

and Yang

, Hybrid multiresolution method for multisensor multimodal image fusion, IEEE Sens. J 10 (2010), 1519–1526.

38.

Liu

, Yin

, Fang

and Chai

, A novel fusion scheme for and infrared images based on compressive sensing, Opt. Commun. 335 (2015), 168–177.

39.

Wang

and Gong

, A multi-faceted adaptive image fusion algorithm using a multi-wavelet-based matching measure in the PCNN domain, Appl. Soft Comput. 61 (2017), 1113–1124.

40.

Yin

, Duan

, Liu

and Liang

, A novel infrared and image fusion algorithm based on shift-invariant dual-tree complex shearlet transform and sparse representation, Neurocomputing 226 (2017), 182–191.

41.

Prabhakar

K.R.

and Babu

R.V.

, Ghosting-free multi-exposure image fusion in gradient domain, IEEE International Conference on Acoustics, IEEE, 2016.

42.

Liu

, Chen

, Cheng

, Peng

and Wang

, Infrared and visible image fusion with convolutional neural networks, Int. J. Wavelets Multiresolution Inf. Process 16(3) (2018), 1850018.

43.

and Wu

X.-J.

, DenseFuse: A Fusion Approach to Infrared and Images, ITIP 28 (2019), 2614–2623.

44.

Goodfellow

I.J.

, Pouget-Abadie

, Mirza

, Bing

, Warde-Farley

, Ozair

, Courville

and Bengio

, Generative Adversarial Nets. Proc. 27th Int. Conf. Neural Inf. Process. Syst, 2 (2014), 2672–2680.

45.

and Koltun

, Multi-scale context aggregation by dilated convolutions, arXiv:1511.07122v1 (2015).

46.

Radford

, Metz

and Chintala

, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv:1511.06434v1 (2015).

47.

Toet

and Franken

E.M.

, Perceptual evaluation of different image fusion schemes, Displays 24(1) (2003), 25–37. https://figshare.com/articles/TNO_Image_Fusion_Dataset/1008029.

48.

Kingma

D.P.

and Ba

, Adam: a method for stochastic optimization, arXiv:1412.6980v1 (2014).

49.

Liu

, Blasch

, Xue

, Zhao

, Laganiere

and Wu

, Objective assessment of multiresolution image fusion algorithms for context enhancement in night vision: A comparative study, IEEE Trans. Pattern Anal. Mach. Intell 34 (2011), 94–109.

50.

Eskicioglu

A.M.

and Fisher

P.S.

, Image quality measures and their performance, IEEE Trans. Commun. 43 (1995), 2959–2965.

51.

Xing Su-xia, Chen Tian-hua and Li Jing-xian, Image fusion based on regional energy and standard deviation, International Conference on Signal Processing Systems. IEEE, 2010.

52.

Aslantas

and Bendes

, A new image quality metric for image fusion: The sum of the correlations of differences, AEU Int. J. Electron. Commun. 69 (2015), 1890–1896.

53.

Haghighat

and Razian

, Masoud Fast-FMI: Non-reference image fusion metric. In Proceedings of the 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT), Astana, Kazakhstan, 15–17 Octorber 2014; pp. 1–3.

54.

Piella

and Heijmans

, A new quality metric for image fusion[C]. International Conference on Image Processing. IEEE, 2003.

55.

Ming Yin, Xiaoning Liu, Yu Liu* and Xun Chen, Medical Image Fusion With Parameter-Adaptive Pulse Coupled Neural Network in Nonsubsampled Shearlet Transform Domain, IEEE Transactions on Instrumentation and Measurement, in press, 2018.

56.

Zhang

and Guo

B.-L.

, Multifocus image fusion using the nonsubsampled contourlet transform, SIGPR 89 (2009), 1334–1346.

57.

Nencini

, Garzelli

, Baronti

and Alparone

, Remote sensing image fusion using the curvelet transform, Inf. Fusion 8 (2007), 143–156.

58.

Yu Liu, Xun Chen, Rabab Ward and Z. Jane Wang, Image fusion with convolutional sparse representation, IEEE Signal Processing Letters, 23(12) (2016), 1882–1886.

59.

Lewis

J.J.

, O’Callaghan

R.J.

, Nikolov

S.G.

, Bull

D.R.

and Canagarajah

, Pixel-and region-based image fusion with complex wavelets, Inf. Fusion 8 (2007), 119–130.

60.

Shreyamsha Kumar

B.K.

, Image fusion based on pixel significance using cross bilateral filter Signal, Image and Video Processing (2013), pp. 1–12. (doi:10.1007/s11760-013-0556-9)

61.

, Wu

X.J.

and Kittler

, Infrared and image fusion using a novel deep decomposition method. 2018.

62.

, Zhou

, Wang

, et al., Infrared and image fusion based on visual saliency map and weighted least square optimization, Infrared Physics & Technology 82 (2017), 8–17.

63.

Yu Liu, Xun Chen, Juan Cheng, Hu Peng and Zengfu Wang, Infrared and image fusion with convolutional neural networks,018:, International Journal of Wavelets, Multiresolution and Information Processing 16(3) (1850), 1–20.