Abstract
Infrared and visible image fusion refers to the technology that merges the visual details of visible images and thermal feature information of infrared images; it has been extensively adopted in numerous image processing fields. In this study, a dual-tree complex wavelet transform (DTCWT) and convolutional sparse representation (CSR)-based image fusion method was proposed. In the proposed method, the infrared images and visible images were first decomposed by dual-tree complex wavelet transform to characterize their high-frequency bands and low-frequency band. Subsequently, the high-frequency bands were enhanced by guided filtering (GF), while the low-frequency band was merged through convolutional sparse representation and choose-max strategy. Lastly, the fused images were reconstructed by inverse DTCWT. In the experiment, the objective and subjective comparisons with other typical methods proved the advantage of the proposed method. To be specific, the results achieved using the proposed method were more consistent with the human vision system and contained more texture detail information.
Introduction
Infrared and visible image fusion is one of the critical tasks in image processing [1]. It has been extensively adopted in numerous applications (e.g., surveillance, modern military, and object detection) [2]. The source images used for fusion originate from different sensors. Infrared sensors capture thermal radiation; they are less susceptible to lighting conditions and perform well at night, but they lose textural details. Visible sensors capture reflected light and normally provide sufficient texture information and high-resolution images [3, 4]. However, in certain scenarios (e.g., low brightness, heavy fog, and smog), the visible imaging sensor rarely captures effective information. Infrared and visible image fusion aims to combine the complementary information from the two sensors, so that the fused image contains the vital thermal feature information and the texture details simultaneously.
Fueled by advances in image processing and analysis theory, a considerable number of image fusion methods have been developed over the past decades. These methods are commonly classified into spatial domain methods and transform domain methods [5, 6]. Spatial domain methods operate directly on the pixels of the source images under different strategies. They are simple and time-efficient, but they may lose some information because they take little account of the differences between infrared and visible images [7]. Transform domain methods, which map the source images from the spatial domain into another domain via a specific transformation, are the most commonly used approach to image fusion and achieve effective fused results. They are split into several categories [8], namely, multi-scale transform (MST) [9–11], sparse representation (SR) [12–19], etc. MST methods commonly employed in image fusion cover pyramid-based methods (e.g., ratio of low-pass pyramid (RP) [20] and Laplacian pyramid (LP) [21, 22]), curvelet transform (CVT) [23], contourlet transform based methods (e.g., nonsubsampled contourlet transform (NSCT) [24, 25]), and wavelet transform based methods (e.g., discrete wavelet transform (DWT) [26] and dual-tree complex wavelet transform (DTCWT) [27]). There also exist variants of sparse representation-based methods, covering the convolutional form (CSR) [28] and the adaptive form (ASR) [15], etc. However, these methods usually apply only the “averaging rule” and the “variance rule” to fuse base layers or detail layers, so the information of the source images cannot be fully transferred into the fused images, which decreases the contrast of the fused images.
In recent years, many image fusion methods have been proposed based on the above technologies; nevertheless, such methods may lose detail information or consume too much time. To remedy these defects, this study proposes a hybrid method based on DTCWT and CSR. Two things are noteworthy about the principle of fusion [12]: (1) representing the image/signal effectively and appropriately; (2) combining the complementary information originating from source images captured by multiple sensors into a joint fused image. In the proposed fusion framework, DTCWT is first adopted to decompose the source images into high-frequency components and a low-frequency component. Subsequently, convolutional sparse representation (CSR) is employed to represent the low-frequency component for better detail preservation, and the low-frequency component is fused following the choose-max rule. Next, the high-frequency bands are enhanced by guided filtering (GF), so that the detailed information of the high-frequency components is preserved better than with other methods. Finally, the fused image is reconstructed via the inverse DTCWT. The whole process is split into three steps, namely, decomposition, fusion, and reconstruction. The main contributions of this work are presented below:
1) A hybrid method is proposed for infrared and visible image fusion, capable of producing high-quality fused images. The experimental results demonstrate that this method outperforms other typical approaches in visual perception and objective measurement.
2) The proposed method integrates DTCWT, CSR, and GF and remedies some defects of other methods. It preserves detail information well and is efficient in computation.
3) Effective strategies are presented for fusing the low-frequency band and the high-frequency bands. Owing to the characteristics of CSR and GF, the method tackles two problems, namely, image representation and complementary information preservation.
The rest of this paper is structured as follows. In section 2, we introduce the related background and techniques briefly. In section 3, the scheme of image fusion is proposed and explained in detail. In section 4, we conduct comparative experiments, and the results show the superiority of our approach. Finally, section 5 summarizes this whole paper.
Related works
In this section, we introduce the background of our proposed method and the technologies it uses. We also review recent methods and their characteristics. To overcome the drawbacks of these methods, three important technologies are applied.
Background and literature review
In recent years, transform domain based algorithms have become popular in image fusion. For example, MST and SR based methods have been extensively adopted. Nencini et al. utilized the curvelet transform (CVT) [29] to fuse remote sensing images; this method can generate fused images exhibiting abundant details and edge information. However, CVT conducts down-sampling and up-sampling, leading to distortion and spectral aliasing in the final fused image. Ma et al. proposed an infrared and visible image fusion method based on gradient transfer and total variation minimization (GTF) [3, 30], which enhances the applicability of high-precision registration and generates fused images exhibiting abundant texture details. However, this method is only markedly effective on images exhibiting high brightness. Recently, fusion algorithms based on SR have also found wide application in image processing; they exploit sliding window technology to reduce block artifacts. However, they are limited in their ability to preserve texture detail information. To address this drawback, CSR was derived from SR. The result of CSR is single-valued and optimized over the entire image rather than over local image patches, so it preserves more details than SR. However, CSR is relatively time-consuming when applied to the whole image.
To overcome the above disadvantages, our method applies DTCWT, GF, and CSR. To acquire more texture detail information, CSR is adopted to represent the low-frequency bands of the images. Meanwhile, since the result of CSR is single-valued, it can optimize the low-frequency component as a whole. To enhance computational efficiency, an appropriate MST method is employed. The dual-tree complex wavelet transform uses separate filter banks [8], which makes the image processing more efficient. Moreover, it exhibits directional selectivity and preserves more texture details. By combining DTCWT with CSR, the proposed method considerably reduces the computational cost of CSR, since CSR is applied only to the low-frequency band. The specific principles of these three technologies are described in detail as follows.
Dual-tree complex wavelet transform
Compared with DWT, DTCWT has excellent directional selectivity and high efficiency [9, 31]. In addition, DTCWT is an over-complete wavelet transform that is consistent with the human visual system [27]. As an enhancement of the traditional wavelet transform, it has been widely used in the field of image processing.
DTCWT adopts two separate real DWTs with a 90° phase difference. One DWT provides the real part of the transform, and the other provides the imaginary part. Each tree acts on rows and columns to form the dual-tree structure. Regardless of the decomposition level, the overall redundancy of the 2-D DTCWT is 4:1. Each level of decomposition produces two low-pass sub-bands and six high-pass sub-bands. Fig. 1 shows the impulse responses of the 2-D DTCWT filters for the six directional high-pass sub-bands (-75°, -45°, -15°, 15°, 45°, 75°). Consequently, DTCWT has three more directions than DWT, which improves the image quality, the accuracy of decomposition and reconstruction, and the ability to retain image details. For more details, refer to the comprehensive study published by I.W. Selesnick [32].

Fig. 1. The impulse responses of the 2-D DTCWT. (a) The six directional sub-bands of the real part: -75°, -45°, -15°, 15°, 45°, 75°. (b) The six directional sub-bands of the imaginary part: -75°, -45°, -15°, 15°, 45°, 75°.
Guided image filtering (GF) is an enhancement technology proposed to implement edge preservation [33]. The guided filter is defined as a local linear model [34], as shown in (1)-(4).
The coefficients $a_k$ and $b_k$ are indicated as follows:
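For reference, the standard guided filter formulation of He et al. can be written as below. This is a sketch under the assumption that the description above follows that formulation; the symbols $I$ (guidance image), $p$ (input image), $q$ (output image), $\omega_k$ (local window centered at pixel $k$, containing $|\omega|$ pixels), $\mu_k$ and $\sigma_k^2$ (mean and variance of $I$ in $\omega_k$), and $\bar{p}_k$ (mean of $p$ in $\omega_k$) are assumed names, and the ordering may differ from the original numbering.

```latex
\begin{align}
q_i &= a_k I_i + b_k, \quad \forall i \in \omega_k \\
a_k &= \frac{\frac{1}{|\omega|}\sum_{i \in \omega_k} I_i p_i - \mu_k \bar{p}_k}{\sigma_k^2 + \varepsilon} \\
b_k &= \bar{p}_k - a_k \mu_k \\
q_i &= \bar{a}_i I_i + \bar{b}_i, \quad
\bar{a}_i = \frac{1}{|\omega|}\sum_{k \in \omega_i} a_k, \quad
\bar{b}_i = \frac{1}{|\omega|}\sum_{k \in \omega_i} b_k
\end{align}
```

Here $\varepsilon$ is the regularization parameter: a larger $\varepsilon$ pushes $a_k$ toward zero and therefore smooths more strongly.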
SR has been applied in many image processing tasks and exploits the sparsity of the training signals. Its principle is to represent a signal over a dictionary by a sparse vector with only a few nonzero elements; the dictionary can be learned by the K-SVD algorithm [35]. For a set of training signals
Sparse representation (SR) operates on local image patches and therefore cannot optimize an entire image. Because of this patch-based manner, SR is also limited in detail preservation.
Consequently, convolutional sparse representation (CSR) is considered as an alternative representation structure. The image $S$ is modeled as the sum over a series of convolutions of coefficient maps $X_m$ and the corresponding dictionary filters $d_m$. The sparse coefficient maps $X_m$ can be obtained by solving the following CSR model [28, 38]:
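A common way to write this model is the convolutional basis pursuit denoising problem shown below. It is given as a sketch, assuming the text follows that standard form; the number of filters $M$ and the regularization weight $\lambda$ are assumed symbols.

```latex
\begin{equation}
\arg\min_{\{X_m\}} \; \frac{1}{2} \left\| \sum_{m=1}^{M} d_m * X_m - S \right\|_2^2
+ \lambda \sum_{m=1}^{M} \left\| X_m \right\|_1
\end{equation}
```

The first term keeps the sum of convolutions close to the image $S$, and the $\ell_1$ term enforces sparsity of the coefficient maps; $*$ denotes 2-D convolution.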
In this paper, CSR is regarded as an efficient form that solves the existing problem of SR: because its result is single-valued, it optimizes the whole image. Moreover, since the convolutional form operates on the entire image, CSR can preserve more texture details during image fusion.
In this section, we present the structure and rules of the proposed image fusion method in detail. For the low-frequency bands and the high-frequency bands, we apply the CSR and GF strategies, respectively. The schematic diagram of our method is shown in Fig. 2. The essential steps and the fusion framework are introduced in the following subsections.

Fig. 2. The schematic diagram of our model.
DTCWT decomposition is applied to each input image separately to acquire its wavelet sub-images [31]. In this paper, the input images $S1$ and $S2$ are decomposed into low-frequency bands $S1_{low}$ and $S2_{low}$ and high-frequency bands $S1_{high}$ and $S2_{high}$. The decomposition level varies from 1 to 4, and the principle of decomposition is presented in section 2.
Fusion of high-frequency bands
The high-frequency bands $S1_{high}$ and $S2_{high}$ are processed by guided image filtering to obtain $S1g_{high}$ and $S2g_{high}$, respectively. The high-frequency band $S1_{high}$ is substituted into equation (1) as the input image $P_i$ to obtain $S1g_{high}$, and $S2g_{high}$ is obtained in the same way. The detailed calculation process is shown in Section 2.2. This enhancement implements edge preservation in the fused image. In this paper, numerous experiments demonstrate that the best performance is achieved when ε (the blur degree of the guided filtering) and r (the filter size) are $10^{-7}$ and 3, respectively. Then we adopt the “choose-max” strategy to obtain the fused high-frequency bands $S_{high}$.
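The high-frequency rule described above can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB code: it assumes real-valued sub-band arrays, uses each band as its own guidance image, treats r as the window radius, and takes "choose-max" to mean keeping the coefficient with the larger magnitude. The function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def box_mean(x, r):
    """Mean over a (2r+1) x (2r+1) window, replicating edge values at borders."""
    padded = np.pad(x, r, mode="edge")
    windows = sliding_window_view(padded, (2 * r + 1, 2 * r + 1))
    return windows.mean(axis=(-2, -1))


def guided_filter(guide, src, r, eps):
    """Standard guided filter based on the local linear model q = a * guide + b."""
    mean_g = box_mean(guide, r)
    mean_s = box_mean(src, r)
    var_g = box_mean(guide * guide, r) - mean_g ** 2
    cov_gs = box_mean(guide * src, r) - mean_g * mean_s
    a = cov_gs / (var_g + eps)   # eps is the regularization ("blur degree")
    b = mean_s - a * mean_g
    # Average the coefficients of all windows covering each pixel.
    return box_mean(a, r) * guide + box_mean(b, r)


def choose_max(band1, band2):
    """Per coefficient, keep the one with the larger magnitude."""
    return np.where(np.abs(band1) >= np.abs(band2), band1, band2)


def fuse_high(band1, band2, r=3, eps=1e-7):
    # Assumption: each band is filtered with itself as the guidance image.
    g1 = guided_filter(band1, band1, r, eps)
    g2 = guided_filter(band2, band2, r, eps)
    return choose_max(g1, g2)
```

With a very small eps, as selected in the text, the self-guided filter stays close to the input wherever the local variance is much larger than eps, so edges are preserved while flat regions are smoothed.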
Fusion of low-frequency bands
The low-frequency bands $S1_{low}$ and $S2_{low}$ are represented by CSR separately. The sparse coefficient maps $C_{k,m}$ can be obtained by solving the following CSR model:
Besides, the window-based averaging strategy for $A_k(x, y)$ makes this method preserve more structure information, so that we can obtain the optimized activity level map:
When $r'$ is too small, it may be incapable of representing the characteristics of the image; when $r'$ is too large, the complexity increases greatly. Therefore, it is important to employ a suitable $r'$ for image fusion. In this paper, numerous experiments verify that the method works best when $r'$ equals 3. Then, the fused coefficient maps are acquired by implementing the choose-max strategy as below:
The final fusion result of the low-frequency band $S_{low}$ is reconstructed from the learned dictionary and the fused convolutional sparse coefficient maps:
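The low-frequency steps described above (activity level map, window-based averaging, choose-max, reconstruction) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the activity level is taken as the $\ell_1$ norm of the coefficient vector at each pixel, the window radius plays the role of $r'$, and the reconstruction uses circular convolution via the FFT. The coefficient maps and filters are assumed to come from a separate CSR solver that is not shown, and all function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def box_mean(x, r):
    """Mean over a (2r+1) x (2r+1) window, replicating edge values at borders."""
    padded = np.pad(x, r, mode="edge")
    windows = sliding_window_view(padded, (2 * r + 1, 2 * r + 1))
    return windows.mean(axis=(-2, -1))


def activity_map(coeffs, r):
    """coeffs has shape (M, H, W): the M coefficient maps of one source band."""
    l1 = np.abs(coeffs).sum(axis=0)   # assumed activity level: l1 norm per pixel
    return box_mean(l1, r)            # window-based averaging with radius r


def fuse_coefficients(coeffs1, coeffs2, r=3):
    """Choose-max: at each pixel keep the coefficient vector of the more active source."""
    keep_first = activity_map(coeffs1, r) >= activity_map(coeffs2, r)
    return np.where(keep_first[None, :, :], coeffs1, coeffs2)


def reconstruct(coeffs, filters):
    """Sum over m of d_m * C_m, using circular convolution computed with the FFT."""
    h, w = coeffs.shape[1:]
    out = np.zeros((h, w))
    for c_m, d_m in zip(coeffs, filters):
        d_hat = np.fft.fft2(d_m, s=(h, w))   # zero-pad the filter to the band size
        out += np.real(np.fft.ifft2(d_hat * np.fft.fft2(c_m)))
    return out
```

A fused band would then be obtained as `reconstruct(fuse_coefficients(c1, c2, r=3), filters)`, where `c1` and `c2` are the coefficient maps of the two low-frequency bands.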
In this paper, we apply the corresponding inverse DTCWT to reconstruct the fused image. The inverse DTCWT is straightforward to implement because it is the reverse of the DTCWT decomposition. The experimental results and metrics of the fused images are shown in the next section.
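The three-step flow (decomposition, fusion, reconstruction) can be illustrated with the skeleton below. It is only a structural sketch: a box-filter two-scale split stands in for DTCWT and its inverse, and a window-averaged magnitude stands in for the CSR activity level, so the numbers it produces are not those of the proposed method. All names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def box_mean(x, r):
    """Mean over a (2r+1) x (2r+1) window, replicating edge values at borders."""
    padded = np.pad(x, r, mode="edge")
    windows = sliding_window_view(padded, (2 * r + 1, 2 * r + 1))
    return windows.mean(axis=(-2, -1))


def decompose(img, r=2):
    # Stand-in for DTCWT: split into a smooth low band and a residual high band.
    low = box_mean(img, r)
    return low, img - low


def reconstruct(low, high):
    # Inverse of the stand-in split; plays the role of the inverse DTCWT.
    return low + high


def fuse(infrared, visible):
    low1, high1 = decompose(infrared)
    low2, high2 = decompose(visible)
    # Low band: choose-max on a window-averaged magnitude (placeholder for the CSR rule).
    keep_first = box_mean(np.abs(low1), 1) >= box_mean(np.abs(low2), 1)
    low = np.where(keep_first, low1, low2)
    # High band: choose-max on coefficient magnitude.
    high = np.where(np.abs(high1) >= np.abs(high2), high1, high2)
    return reconstruct(low, high)
```

Because the stand-in split is exactly invertible, fusing an image with itself returns the same image, which is a quick sanity check for any decomposition plugged into this skeleton.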
Experiments
To verify the superiority of our method, we conduct internal and external comparison experiments. In the internal experiments, we choose the optimal parameters by comparing the fused results under different parameters. The external experiments verify that our results are better than those of eight other methods.
Experimental settings
We select multiple pairs of infrared and visible images to verify the efficacy and benefit of the proposed method, as shown in Fig. 3. All source images are from public datasets: https://figshare.com/articles/TN_Image_Fusion_Dataset/1008029 and http://www.imagefusion.org/. Four metrics are adopted to evaluate the performance of image fusion. These experiments are conducted in MATLAB on a computer with a 2.5 GHz CPU and 8 GB RAM. The dictionary used in CSR is learned from natural image patches by the K-SVD method [39]. At the beginning of the experiment, we set N (the decomposition level) = 4, ε (the blur degree of the guided filtering) = $10^{-6}$, and r (the guided filter size) = 7 as the initial values. Since these are not the optimal values, we conduct internal comparative experiments later to select the most appropriate ones. We study the effect of the decomposition level in terms of image fusion efficiency and the four metrics.

Fig. 3. Four pairs of source images. The first row shows the infrared images, and the second row shows the visible images (“Tank”: a-infrared, b-visible; “Leaf”: c-infrared, d-visible; “People”: e-infrared, f-visible; “Street”: g-infrared, h-visible).
In the external experiments, our method is compared with eight representative image fusion methods, namely, ratio of low-pass pyramid (RP) [20], dual-tree complex wavelet transform (DTCWT) [27], curvelet transform (CVT) [23], guided filtering fusion (GF) [33], adaptive sparse representation (ASR) [15], gradient transfer fusion (GTF) [3], convolutional sparse representation (CSR) [28], and DTCWT-SR.
For all comparison methods, the parameters are set to the default values given in the original papers. Besides, for DTCWT-SR, the size of the dictionary used in SR is the same as that used for CSR in our paper. RP is one of the earliest classic methods. ASR and CSR are based on sparse representation. DTCWT is a popular MST method. GTF and GF are recent, well-performing methods. The DTCWT-SR method outperforms plenty of typical fusion methods in recent surveys. Thus, it is necessary to compare it with the proposed method on the source images to verify the superiority of our approach. We carry out subjective and objective comparisons between the eight comparative methods and our method.
To evaluate the fusion performance of different methods objectively, many assessment metrics have been proposed in recent years. No single metric can appraise the quality of fused images accurately, so it is essential to employ several metrics simultaneously [40]. In this paper, four metrics are adopted, namely, entropy (EN) [19], mutual information (MI) [41], standard deviation (SD) [19], and visual information fidelity (VIF) [42]. For each metric, a larger value indicates a better fusion result.
1. Entropy (EN). The EN is the average amount of information contained in the fused image. The larger the value is, the greater the amount of information it contains. It is calculated as follows:
where $i$ indexes the gray levels of an image (256 levels in this paper), and $p_i$ denotes the normalized histogram value of gray level $i$.
2. Mutual information (MI). The quality metric MI is employed to measure the similarity between the fused image and the input image in the distribution of the gray scale. It thereby measures how much of the information originating from the input images is preserved in the fused image. It is defined as follows:
3. Standard deviation (SD). The SD reflects the degree of dispersion of the image grayscale relative to the average grayscale.
4. Visual information fidelity (VIF). The VIF predicts image quality in accordance with the principle of the human vision system. It is calculated as follows:
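The first three metrics can be computed with a few lines of NumPy, as sketched below using their common definitions; these are assumed to match the definitions used here, and the function names are hypothetical. Images are assumed to be 8-bit grayscale arrays. VIF is omitted because it requires a multi-scale statistical model that does not fit in a short example.

```python
import numpy as np


def entropy(img, levels=256):
    """EN = -sum_i p_i * log2(p_i) over the normalized gray-level histogram."""
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]                      # empty bins contribute nothing
    return float(-(p * np.log2(p)).sum())


def standard_deviation(img):
    """SD: dispersion of the gray levels around the mean gray level."""
    return float(np.std(np.asarray(img, dtype=np.float64)))


def mutual_information(src, fused, levels=256):
    """MI between one source image and the fused image, from the joint histogram."""
    joint, _, _ = np.histogram2d(
        src.ravel(), fused.ravel(),
        bins=levels, range=[[0, levels], [0, levels]],
    )
    p_sf = joint / joint.sum()
    p_s = p_sf.sum(axis=1, keepdims=True)   # marginal of the source
    p_f = p_sf.sum(axis=0, keepdims=True)   # marginal of the fused image
    nz = p_sf > 0
    return float((p_sf[nz] * np.log2(p_sf[nz] / (p_s @ p_f)[nz])).sum())


def fusion_mi(infrared, visible, fused):
    """Fusion MI: information transferred from both sources to the fused image."""
    return mutual_information(infrared, fused) + mutual_information(visible, fused)
```

For example, an image whose pixels are half 0 and half 255 has an entropy of exactly 1 bit, and a constant image has an entropy of 0.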
In this subsection, we study the parameters of our method (level N, radius r, and regularization ε). To investigate the effect of the decomposition level on image fusion, we conduct a comparative experiment. Fig. 3 shows the four sets of source infrared and visible images. The level of DTCWT decomposition is set from 1 to 4, and the fused images at the different levels are shown in Fig. 4. There is little difference between levels in subjective visual perception. Table 1 lists the objective assessments under the different decomposition levels; for each of the four image pairs, the maximum value of each objective metric (SD, EN, MI, VIF) is indicated in bold. The metrics tend to be larger when N is 1 or 2. Table 2 shows the computational efficiency at the different levels. As N increases, the running time is significantly reduced, i.e., a larger decomposition level leads to higher computational efficiency. However, N has little impact on the four metrics of the fused image. Considering both the computational cost and the quality metrics, 4 is the most appropriate value of N for DTCWT decomposition in this work.

Fig. 4. The fused images under different decomposition levels (a-“Tank”, b-“Leaf”, c-“People”, d-“Street”). From top to bottom: N=1, N=2, N=3, N=4.
Table 1. Quantitative assessment under different levels
Table 2. The running time (s) under different levels
The parameters r and ε are the key values that influence the image fusion result. To investigate their effect, we conduct another set of comparative experiments. The values of the metrics as r changes from 1 to 7 are shown in Fig. 5. The values of EN and MI are larger when r is 2, whereas the values of SD and VIF are larger when r is 3. However, when r is 2, the value of SD is much lower. Therefore, r is set to 3. Fig. 6 shows the change of the objective metrics with different ε, which is varied from $10^{-1}$ to $10^{-10}$. Over this range, the curves rise and eventually tend to stabilize. Thus, ε is set to $10^{-7}$ in our work.

Fig. 5. The change of objective metrics with different r values.

Fig. 6. The change of objective metrics with different ε values.
According to the above internal experiments, the parameters of our method are empirically set to N (the decomposition level) = 4, ε (the blur degree of the guided filtering) = $10^{-7}$, and r (the filter size) = 3. Fig. 3 shows the infrared and visible images of four scenes, namely, “Tank”, “Leaf”, “People”, and “Street”. We choose two representative scenes for detailed comparison. Figs. 7 and 8 show the fusion results of the eight compared methods (RP, DTCWT, CVT, GF, ASR, GTF, CSR, DTCWT-SR) and the proposed method for subjective evaluation. All the parameters used in the comparative methods are set to the optimal values. For clearer comparison, the region in the green box is enlarged twofold and placed in the white box. Tables 3 and 4 show the objective metrics of the different methods. Four metrics are employed to quantitatively evaluate the fused images, namely, EN, MI, SD, and VIF, and the optimal values are indicated in bold. The proposed method has larger values than the other eight methods on almost all objective metrics. Since EN measures the amount of information contained in a fused image, the images generated by our method contain a greater amount of information. The larger value of MI shows that our method preserves more of the original information from the source images. Besides, the larger value of VIF indicates that the images fused by our method provide a better visual effect and are more consistent with the human vision system. The computational efficiency of the different methods is also compared; Table 5 shows the computational efficiency of fusing the four sets of images.

Fig. 7. Source images and fused images of “People” by different methods.

Fig. 8. Source images and fused images of “Street” by different methods.
Table 3. Quantitative assessment of “People” with different image fusion methods
Table 4. Quantitative assessment of “Street” with different image fusion methods
Table 5. The running time (s) of different methods
Fig. 7(a)-(b) shows the infrared and visible images, and Fig. 7(c)-(k) shows the fused images obtained by the different algorithms. In the junction area marked by the green box, the edges of the rectangle are very clear and the details are well preserved in our fused image. For the fist area marked by the green box in Fig. 7(c)-(k), the result of the RP based method is very blurry over the entire area and contains some noise. The DTCWT, CVT, GF, ASR, GTF, and CSR based fusion methods produce serious edge artifacts. Our result is clearer in the area of the finger and at the edge of the “fist”, making it easier to distinguish the fist from the background. Overall, the image obtained by our fusion method contains complementary information from the visible and infrared images, and the details of the people are well preserved. Table 3 shows the objective metrics of “People” with the different image fusion approaches. All metrics of our method are larger; in particular, the value of SD is much larger than that of the other methods. Consequently, the fused image contains richer information.
The fused images of “Street” obtained with RP, DTCWT, CVT, GF, ASR, GTF, CSR, DTCWT-SR, and our method are shown in Fig. 8(c)-(k). The result of RP is not ideal because some noise is introduced. The fused image produced by the GTF method loses much of the brightness and details. The fused image produced by our method is closer to the visible image in some parts and includes more information from the visible image, which is consistent with the human visual system. The image generated by our method also preserves edge information better. For example, the edge details of the two traffic lights are better preserved by our method in “Street”. In the area of the pair of pedestrians marked by the green box, the proposed method preserves the edges and details of the pedestrians better. Besides, the lines on the road are also clearer. Table 4 shows the objective metrics of “Street” with the different image fusion approaches. All evaluation indexes of our method are much larger than those of the other methods. These objective metrics demonstrate that our proposed method performs better than the other methods.
Experimental results show that our method is also superior to the other eight methods in the subjective and objective evaluation of the scenes “Tank” and “Leaf”. The details and edge information of the fused images are well preserved. For the sake of space, these fusion results are not displayed one by one.
Table 5 shows that ASR has the lowest computational efficiency and RP has the highest. Although the proposed method adopts CSR to represent the signal, it takes less running time than CSR owing to the high computational efficiency of DTCWT. At the expense of a small reduction in efficiency, our method is superior to the others in both objective and subjective assessment.
GF can enhance the edge information of images to make them clearer. Tables 3 and 4 show the values of the objective evaluation, with the maximum values indicated in bold. Higher values of EN and SD mean that the fused images contain richer information. MI evaluates the amount of information transferred from the source images to the fused image. According to the experimental results, our proposed method performs well in detail preservation. Furthermore, VIF reflects how well the fused image agrees with human perception. The other methods produce many ringing artifacts around salient features, and the detail information is not distinct enough in some of their fused images. In contrast, the proposed method fuses all four image pairs well: more details are preserved in the fused images, and the generated images are consistent with the human visual system. Therefore, better objective and subjective fusion performance is achieved.
In this paper, an effective hybrid method for infrared and visible image fusion is presented and evaluated. It is vital for image fusion to choose a suitable tool for decomposition and an appropriate fusion strategy. The main steps are as follows. First, the source images are decomposed into their low-frequency band and high-frequency bands by DTCWT. The former contains structure information, while the latter contain more texture details. Second, for the low-frequency bands, CSR is adopted to represent the structure of the source images, and the low-frequency bands of the infrared and visible images are then fused via the “choose-max” rule. For the high-frequency bands, we employ guided filtering to enhance their information, which preserves more texture details, and we then fuse the high-frequency bands of the source images via the “choose-max” rule. Finally, the inverse DTCWT is used to reconstruct the fused image.
The proposed method focuses on combining infrared and visible images into fused images that retain more useful details. We evaluate the proposed method qualitatively and quantitatively, and its performance is better than that of the compared methods. Above all, we introduce a hybrid structure containing DTCWT and CSR and utilize GF to enhance the images. The experimental results show that the proposed method achieves nearly the best results.
Acknowledgments
The authors would sincerely like to thank the editor and anonymous reviewers for their detailed review. The authors also sincerely thank Daguang Shao and Professor Xiaomin Yang for their meaningful suggestions. The authors also thank Yu Liu and his research team for providing some code on a public platform.
