Abstract
The rapid development of computation, communication, and control has given rise to cyber-physical systems (CPS). For full-time urban surveillance or military reconnaissance in complex environments, infrared and visible imaging sensors typically need to be integrated into the CPS, and an effective and stable image fusion algorithm is needed so that the CPS can provide images with rich information. This paper therefore introduces an image fusion algorithm for CPS. Compared with traditional multi-scale and multi-direction decomposition (MSMD) based algorithms, the proposed MSMD based algorithm is more efficient. Firstly, edge-preserving base layers and detailed layers are obtained by multi-scale decomposition. Secondly, multi-direction decomposition is applied to the base layers rather than to the detailed layers as in traditional methods. Then, the series of fused detailed layers and the multi-directional fused base layers are obtained by a patch-based max-choosing rule. After the inverse multi-direction transform is applied to the multi-directional fused base layers, the final result is reconstructed by superposing the fused base and detailed layers. Experiments show that our algorithm outperforms state-of-the-art algorithms.
Keywords
Introduction
In full-time surveillance or reconnaissance systems, capturing more information at night or in a complicated environment is an important issue [1]. Generally, visible images cannot display an object in complicated conditions such as bushes, dense fog, or low light, but they show a clear background and details. In contrast, infrared images present the object via its thermal radiation but are unable to exhibit a clear background and details. Consequently, an image fusion algorithm is commonly used to merge information from the infrared and visible imaging sensors in CPS [2]. As Fig. 1 shows, image fusion algorithms blend the complementary information from the sensors [3], which makes the fused image more suitable for perception. In this paper, a stable and efficient MSMD based image fusion algorithm is proposed for surveillance or reconnaissance systems.

Image fusion of infrared imaging sensor and visible imaging sensor for CPS.
Currently, there are three kinds of image fusion algorithms: pixel-level, decision-level, and feature-level based algorithms. The pixel-level based algorithms are favored for their efficiency and simplicity. In pixel-level based algorithms, spatial [4, 5] and transform [6, 7] domain-based image fusion algorithms are the most popular. In addition, there are also many optimization-based algorithms.
For fusing information from infrared and visible imaging sensors, spatial domain based algorithms tend to lose details in the fused images. The key issue in these algorithms is to obtain appropriate weight maps for the source images. To address this problem, many algorithms such as weighted average [4] and principal component analysis (PCA) [8] have been proposed. However, the spectral information of infrared and visible images differs greatly [9], which makes the weights obtained by these algorithms inappropriate. As a result, some details and texture are lost in the fused image. Consequently, spatial domain based algorithms are unable to produce satisfactory fusion results from infrared and visible images.
Although transform domain based algorithms retain more details than spatial domain based algorithms, they can produce artifacts in the fused images. The most popular transform domain based algorithms are the multi-scale decomposition (MSD) based algorithms [10–12]. Using multi-scale decomposition, these algorithms fuse the source images at different scales to obtain a better fused image. In the past few years, many MSD based algorithms have been presented, such as the multi-scale contrast-based model [13], the gradient pyramid (GP) [14], and the multi-scale morphological focus measure [15]. However, inappropriate scale decomposition results in artifacts in the fused images. Moreover, some details and texture are still lost in the fused images obtained by these algorithms. To obtain more details, an MSMD based algorithm, the curvelet transform based algorithm (CVT) [16], was proposed. CVT extracts components at different scales and directions, which facilitates the fusion of detail and edge information. However, CVT is not shift-invariant because of its down-sampling and up-sampling, which results in spectral aliasing and distortion in the fused image. Therefore, the non-subsampled contourlet transform based algorithm (NSCT) was introduced to overcome the drawbacks of CVT by removing the down-sampling [10]. However, because the down-sampling is removed, the multi-direction decomposition of the detailed layers becomes time-consuming.
Recently, an optimization-based algorithm was proposed that fuses infrared and visible images by gradient transfer and total variation minimization (GTF) [17]. This optimization model merges the details through a regularization term. However, only gradient information is used in the regularization term, so the fused images contain much less detail than those of MSD based algorithms. Accordingly, GTF performs unstably on source images with rich details. In addition, GTF is time-consuming because an optimization problem must be solved.
Overall, some defects still need to be addressed in pixel-level algorithms that blend information from infrared and visible imaging sensors:
Fused images acquired by spatial domain based algorithms lose many details because the spectra of infrared and visible images differ.
Artifacts are often introduced in fused images by the inappropriate scale decomposition of MSD based algorithms.
MSMD based algorithms are either time-consuming or suffer from spectral aliasing.
Optimization-based algorithms are time-consuming, and GTF performs unstably on different types of images.
Considering the problems mentioned above, an effective and stable MSMD based algorithm is designed for CPS to fuse infrared and visible images. Owing to the MSMD processing, the fused images of our algorithm are rich in details and texture. Furthermore, the multi-direction decomposition is applied to the base layer instead of the detailed layers, which makes the computational time of our algorithm much lower than that of traditional MSMD based algorithms. The main contributions of this paper are as follows:
To integrate the equipment of urban surveillance and military reconnaissance into CPS, an effective and stable image fusion algorithm is proposed for infrared and visible imaging sensors.
To better separate details, edges, and low-frequency information, an MSMD composed of the rolling guided filter (RGF) and the non-subsampled directional filter bank (NSDFB) is applied to the source images. To obtain a better MSD, RGF is applied to decompose the original images; RGF is a scale-aware and edge-preserving filter [18] that preserves edges while removing small structures. To extract the edges in the base layer, the NSDFB is applied to the base layer; NSDFB decomposes the base layer into multi-directional components and is shift-invariant [19].
To save computation time, a fast MSMD based algorithm is introduced. Compared with traditional MSMD based algorithms such as CVT and NSCT, the proposed algorithm saves time by applying the multi-directional decomposition only to the base layer instead of to the series of detailed layers.
The rest of this paper is organized as follows. Section 1 is the introduction. Section 2 briefly describes RGF and NSDFB. Section 3 explains the specific steps of the proposed algorithm. Experimental results and discussion are presented in Section 4, and Section 5 concludes the paper.
Rolling guided filter
The rolling guided filter proposed by Zhang et al. [18] removes small details while preserving large-scale edges. Compared with other edge-preserving filters such as the weighted least squares filter, the rolling guided filter converges quickly and needs only a few iterations to obtain the filtered image. The algorithm is easy to understand and implement, and its process is shown in Fig. 2. Step 1 is Gaussian filtering (GF), and Step 2 is joint bilateral filtering (JBF) with T iterations. I is the input image of GF and JBF. $G_i$ represents the result of the i-th iteration of JBF, where $G_0$ is the result of the Gaussian filter in Step 1. As shown in the schematic diagram, the entire filtering process can be divided into two stages.

The illustration of RGF.
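First, small structures are removed by GF, as described in Equation (1). Given an input image $I$, $G$ represents the output image, and $t$ and $s$ represent the center pixel and a neighborhood pixel in the Gaussian kernel, respectively.
$$G(t) = \frac{1}{K_t} \sum_{s \in N(t)} \exp\left(-\frac{\|t - s\|^2}{2\sigma_s^2}\right) I(s) \qquad (1)$$
The term $K_t = \sum_{s \in N(t)} \exp\left(-\|t - s\|^2 / (2\sigma_s^2)\right)$ is for normalization, and $N(t)$ is the set of neighborhood pixels of $t$.
Second, the edges are recovered by the joint bilateral filter. As the output of RGF, $G_T(t)$ is obtained by T iterations of joint bilateral filtering (JBF). The JBF is defined as follows:
$$G_{i+1}(t) = \frac{1}{K_t} \sum_{s \in N(t)} \exp\left(-\frac{\|t - s\|^2}{2\sigma_s^2} - \frac{\|G_i(t) - G_i(s)\|^2}{2\sigma_r^2}\right) I_{in}(s) \qquad (2)$$
where
$$K_t = \sum_{s \in N(t)} \exp\left(-\frac{\|t - s\|^2}{2\sigma_s^2} - \frac{\|G_i(t) - G_i(s)\|^2}{2\sigma_r^2}\right)$$
is for normalization, and $I_{in}$ indicates the input image. $G_i$ is the output of the i-th JBF and the guidance image of the (i + 1)-th JBF. $\sigma_s$ and $\sigma_r$ are the standard deviations of the domain and range Gaussian kernels, which control the spatial and range weights, respectively.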
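The two stages above can be sketched in a few lines of dependency-free Python. This is only an illustrative sketch, not the authors' implementation: the function names, the replicated-border handling, the kernel radius of about $2\sigma_s$, and the default parameter values are assumptions made here. The Gaussian stage is realized as a joint bilateral pass with a constant guidance image, which makes the range term vanish.

```python
import math


def _clamp(v, lo, hi):
    return max(lo, min(hi, v))


def joint_bilateral(image, guide, sigma_s, sigma_r, radius):
    """One joint bilateral pass over a 2-D list of floats.

    Spatial weights come from pixel distance, range weights from the
    guidance image, and the averaged values from `image`.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            num = den = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    # Assumption: replicate border pixels outside the image.
                    yy = _clamp(y + dy, 0, h - 1)
                    xx = _clamp(x + dx, 0, w - 1)
                    spatial = (dx * dx + dy * dy) / (2.0 * sigma_s ** 2)
                    diff = guide[y][x] - guide[yy][xx]
                    rng = (diff * diff) / (2.0 * sigma_r ** 2)
                    weight = math.exp(-spatial - rng)
                    num += weight * image[yy][xx]
                    den += weight
            out[y][x] = num / den  # den >= 1 because the center weight is 1
    return out


def rolling_guidance_filter(image, sigma_s=3.0, sigma_r=0.24,
                            iterations=4, radius=None):
    """Gaussian stage followed by `iterations` joint bilateral passes."""
    if radius is None:
        radius = max(1, int(round(2 * sigma_s)))  # assumed kernel support
    h, w = len(image), len(image[0])
    # Stage 1: with a constant guide the range term is zero, so this pass
    # is a plain Gaussian filter of the input.
    constant_guide = [[0.0] * w for _ in range(h)]
    guide = joint_bilateral(image, constant_guide, sigma_s, sigma_r, radius)
    # Stage 2: the previous output guides the next pass, and the values are
    # always taken from the original input image.
    for _ in range(iterations):
        guide = joint_bilateral(image, guide, sigma_s, sigma_r, radius)
    return guide
```

The nested loops keep the sketch readable but are slow on real images; a practical implementation would rely on a vectorized or library routine.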
Inspired by the NSCT, multi-direction decomposition is utilized in our algorithm. However, there is a difference between the proposed algorithm and NSCT: the multi-directional decomposition is applied to the base layer instead of the detailed layers. In NSCT, the non-subsampled Laplacian pyramid (NSLP) is used to decompose images into different scales. However, the NSLP is not edge-preserving, so both details and edges are decomposed into the detailed layers. Consequently, the edges and details cannot be well separated. In contrast, in our algorithm the large-scale edges of the source images are retained in the base layer by RGF filtering, and the NSDFB is then applied to the base layer to extract these large-scale edges.
NSDFB is a modified version of the directional filter bank (DFB) obtained by replacing the down-sampling and up-sampling with quincunx up-sampling [19], so NSDFB is shift-invariant. NSDFB decomposes the 2-D frequency plane of an image into multi-directional bandpass sub-bands, as shown in Fig. 3. If the index of directional decomposition is k, the image is decomposed into $2^k$ directional sub-bands. The decomposition can be described as follows:
$$\{D_{s,d}\}_{d=1}^{2^k} = DF(D_s, k)$$
where DF (•) represents the filtering process of NSDFB, $D_s$ is the s-th layer, and $D_{s,d}$ is the component in the d-th direction of the s-th layer.

Non-subsampled directional filter bank with directional index k = 2.
Figure 4 illustrates the process of the proposed algorithm, which consists of four steps. The details of each step are explained in Sections 3.1-3.2; a brief overview is given here.

The framework of the proposed algorithm.
The MSD of source images: a base layer and a series of detailed layers are separated from each source image by RGF.
The multi-direction decomposition of base layers: each base layer is decomposed into multi-directional base layers by the NSDFB, where each directional base layer is a different component of the base layer.
Fusion of detailed layers and multi-directional base layers: a patch-based max-choosing strategy is applied to the multi-directional base layers and the series of detailed layers to obtain the multi-directional fused base layers and the fused detailed layers.
Obtaining the fused image: after the fused base layer is acquired by applying the inverse NSDFB transform to the multi-directional fused base layers, the fused detailed layers are added to the fused base layer to obtain the final fused image.
In the first step, RGF is applied to decompose the original images into different scales. Suppose the number of MSD layers is L, and R and V represent the infrared and visible source images. To obtain well-defined multi-scale layers, the decomposition is done in two stages. Firstly, RGF is utilized to acquire the base layers and images with different degrees of blur:
$$R_i = RGF(R_{i-1}, \sigma_s, \sigma_r, T), \quad V_i = RGF(V_{i-1}, \sigma_s, \sigma_r, T), \quad i = 1, 2, \ldots, L$$
where RGF (•) represents the filtering of RGF. $R_i$ represents the i-th filtering result of the infrared source image, $R_0 = R$, and $R_L$ is the base layer. Similarly, $V_i$ represents the i-th filtering result of the visible source image, $V_0 = V$, and $V_L$ is the base layer. $\sigma_s$ and $\sigma_r$ denote the standard deviations of the domain and range Gaussian kernels, which are the same as $\sigma_s$ and $\sigma_r$ in Equation (2). T is the number of JBF iterations in Equation (2).
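Secondly, the detailed layers at different scales are obtained as the differences between adjacent blurred images:
$$D_{R,i} = R_{i-1} - R_i, \quad D_{V,i} = V_{i-1} - V_i, \quad i = 1, 2, \ldots, L$$
where $D_{R,i}$ and $D_{V,i}$ denote the i-th detailed layers of the infrared and visible images, respectively.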
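A useful property of this two-stage decomposition is that the differences telescope: the base layer plus all detailed layers returns the source image, whichever smoothing filter is used. The sketch below illustrates this under stated assumptions. The smoother is passed in as a function, and the `box_blur` stand-in is NOT the RGF; it is used only to keep the example short and self-contained. All function names are hypothetical.

```python
def box_blur(image, radius):
    """Stand-in smoother (not the RGF): mean over a (2r+1)^2 window
    with replicated borders."""
    h, w = len(image), len(image[0])
    area = (2 * radius + 1) ** 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += image[yy][xx]
            out[y][x] = total / area
    return out


def decompose(image, levels, smooth):
    """Return (base, details) where details[i] is the difference between
    the i-th and (i+1)-th smoothed images and base is the last one."""
    current = image
    details = []
    for _ in range(levels):
        smoothed = smooth(current)
        details.append([[a - b for a, b in zip(row_a, row_b)]
                        for row_a, row_b in zip(current, smoothed)])
        current = smoothed
    return current, details


def reconstruct(base, details):
    """Superpose the base layer and every detailed layer."""
    out = [row[:] for row in base]
    for layer in details:
        for y, row in enumerate(layer):
            for x, value in enumerate(row):
                out[y][x] += value
    return out
```

For example, `decompose(img, 3, lambda im: box_blur(im, 1))` yields one base layer and three detailed layers, and `reconstruct` recovers `img` up to floating-point rounding.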
In this step, NSDFB is applied to the base layers to extract the edges and details in different directions, since the base layer retains many edges and details owing to the scale-aware and edge-preserving characteristics of RGF. For the base layers $R_L$ and $V_L$, the multi-directional decomposition can be described as follows:
$$\{B_{R,d}\}_{d=1}^{2^k} = DF(R_L, k), \quad \{B_{V,d}\}_{d=1}^{2^k} = DF(V_L, k)$$
where DF (•) represents the processing of NSDFB, and $B_{R,d}$ and $B_{V,d}$ denote the d-th directional components of the infrared and visible base layers, respectively.
In this step, the fused detailed layers and the multi-directional fused base layers are obtained by a patch-based max-choose rule. To merge more details into the fused image, a max-choose rule is applied to fuse the detailed layers and the multi-directional base layers. However, infrared images always contain some noise and irrelevant details, which are undesirable in the final fused image. Therefore, a patch-based max-choose rule replaces the naive max-choose rule to remove noise while keeping details. The fusion rule is composed of the following three stages:
The initial decision map is obtained via patch-based max-choose as follows:
The initial weight map is acquired by:
The multi-directional fused base layers and the fused detailed layers are acquired by the following equation:
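To make the idea of a patch-based decision concrete, the following sketch compares the two layers patch by patch rather than pixel by pixel. It is a simplified illustration and not the paper's exact rule: the activity measure (sum of absolute values over a square patch), the tie-breaking toward the infrared layer, and the function names are all assumptions, and the intermediate weight-map stage is omitted entirely.

```python
def patch_activity(layer, radius):
    """Sum of absolute coefficients in the (2r+1)^2 patch around each pixel.
    The patch is truncated at the image border (an assumption)."""
    h, w = len(layer), len(layer[0])
    activity = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += abs(layer[yy][xx])
            activity[y][x] = total
    return activity


def fuse_max_patch(layer_ir, layer_vis, radius=1):
    """Pick, per pixel, the layer whose surrounding patch is more active.

    Returns (fused, decision); decision is 1 where the infrared layer is
    chosen and 0 where the visible layer is chosen.
    """
    act_ir = patch_activity(layer_ir, radius)
    act_vis = patch_activity(layer_vis, radius)
    decision = [[1 if a >= b else 0 for a, b in zip(row_a, row_b)]
                for row_a, row_b in zip(act_ir, act_vis)]
    fused = [[ir if d else vis for d, ir, vis in zip(row_d, row_i, row_v)]
             for row_d, row_i, row_v in zip(decision, layer_ir, layer_vis)]
    return fused, decision
```

With an isolated spike in the infrared layer and consistent texture in the visible layer, the patch comparison keeps the visible value at the spike location, whereas a per-pixel maximum would keep the spike. This is the noise-suppression behavior the patch-based rule aims for.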
To acquire the final fused image, the fused base layer is first obtained by the inverse NSDFB transform, as Equation (16) shows. Then, the reconstruction result is obtained via superposition of the fused base and detailed layers, as described in Equation (17).
$$B_f = IDF\left(\{B_{f,d}\}_{d=1}^{2^k}, k\right) \qquad (16)$$
$$F = B_f + \sum_{i=1}^{L} D_{f,i} \qquad (17)$$
where IDF (•) denotes the inverse transform of NSDFB, $B_{f,d}$ is the d-th multi-directional fused base layer, and $D_{f,i}$ is the i-th fused detailed layer. $B_f$ and k indicate the fused base layer and the index of directional decomposition, respectively. F represents the final fused image.
Experiment setting
There are several major parameters in the proposed algorithm: the decomposition level L, the directional index k of NSDFB, the $\sigma_s$ and $\sigma_r$ of RGF, and the iteration number T of the joint bilateral filter in RGF. In this paper, we set L = 3, $\sigma_s$ = 3, $\sigma_r$ = 0.24, T = 4, and k = 2. Besides, nine other algorithms are compared with the proposed algorithm. These include five MSD based algorithms, DTCWT [7], RP [20], LP [21], DWT [11], and MSVD [22]; two MSMD based algorithms, CVT [16] and NSCT [10]; and two recent algorithms, GFF [5] and GTF [17]. The settings of these nine algorithms are consistent with the corresponding papers. Urban and military surveillance images are adopted in the experiments. All source images are from [23, 24].
In addition, several image quality metrics are selected to compare the algorithms objectively, including EN, MI, Qab/f, and SD [25]. EN and MI are information theory based metrics. EN represents the amount of information in an image [26], and MI indicates the amount of information from the source images that is preserved in the fused image [27]. Qab/f measures the amount of edge information transferred from the source images to the fused image [28]. SD, a statistics based metric, measures the amount of details and texture [29]. For all four metrics, a larger value indicates a better fusion result.
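As a rough guide to how three of these metrics are commonly computed, the sketch below gives minimal histogram-based versions of EN, MI, and SD. The exact formulations in [25–29] may differ; in particular, taking the fusion MI as the sum of the mutual information between each source image and the fused image is a common convention assumed here. Images are assumed to be 2-D lists of integer grey levels, and the function names are hypothetical.

```python
import math
from collections import Counter


def _flatten(image):
    return [p for row in image for p in row]


def entropy(image):
    """EN: Shannon entropy in bits of the grey-level histogram."""
    pixels = _flatten(image)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(pixels).values())


def standard_deviation(image):
    """SD: population standard deviation of the pixel values."""
    pixels = _flatten(image)
    mean = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))


def mutual_information(image_a, image_b):
    """Mutual information in bits from the joint grey-level histogram."""
    pa, pb = _flatten(image_a), _flatten(image_b)
    n = len(pa)
    count_a, count_b = Counter(pa), Counter(pb)
    joint = Counter(zip(pa, pb))
    # p(a,b) / (p(a) p(b)) simplifies to c * n / (count_a * count_b).
    return sum((c / n) * math.log2(c * n / (count_a[a] * count_b[b]))
               for (a, b), c in joint.items())


def fusion_mi(infrared, visible, fused):
    """MI metric, assumed here to be MI(infrared, fused) + MI(visible, fused)."""
    return mutual_information(infrared, fused) + mutual_information(visible, fused)
```

Qab/f is left out because it depends on gradient-based edge strength and orientation models whose parameters are not given in the text.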
Comparative analysis
In this subsection, our algorithm is compared with the nine other algorithms mentioned in Section 4.1. The performance of these ten algorithms is analyzed on various images.
Comparison with other algorithms
Five images, Dataset-1, shown in Fig. 5 are analyzed in detail, including three urban surveillance images shown in Fig. 5(a-c) and two military reconnaissance images shown in Fig. 5(d-e). For each image, the subjective visual effect is analyzed first, followed by the objective assessment.

Dataset-1 utilized in experiments.
The first set of fused images in Dataset-1 is arranged in Fig. 6, and a detail in each sub-figure is magnified at the bottom-left corner. The fused images of CVT, LP, DTCWT, and NSCT have some artifacts around the traffic light, and some texture of the road is lost in these fused images. The second light of the traffic light has disappeared in the fused image obtained by GTF, as can be seen in the magnified area. In the fused image acquired by RP, distortion appears around the traffic light. Compared with GFF and our algorithm, the fused image of WT is low in brightness and contrast. As for MSVD, aliasing is produced at the edge of the traffic light. In contrast, the fused image obtained by the proposed algorithm has clear and natural edges, and its brightness is higher than that of all other fused images. To compare the algorithms objectively, the quantitative assessment is presented in Table 1. The best results are obtained by our algorithm. Its fused image achieves the highest scores in EN and MI, which means that it retains more information from the source images than the other algorithms. In addition, rich details and texture are indicated by high values of Qab/f and SD, for which our algorithm also achieves the highest scores.

Experimental results of different algorithms for the first image in Dataset-1.
Quantitative analysis of Fig. 6 under different algorithms
Figure 7(c-l) shows the fused images obtained by different algorithms for another pair of source images. As in Fig. 6, the edges of the person in the fused images obtained by CVT, DTCWT, and NSCT have obvious artifacts. The contrast is low in the fused images obtained by GTF and WT, and some details are lost in these images. In the fused image acquired by RP, the edge of the person is blurred, and some noise appears around the person. In the fused image of MSVD, pixel blocks appear around the edge of the person. The fused images of GFF and LP are similar to that of the proposed algorithm, but their contrast and brightness are lower. As shown in Table 2, the EN, MI, and SD of the proposed algorithm are the highest among all algorithms, and its Qab/f is lower only than those of LP and GFF. However, the MI and SD of GFF and LP are much lower than those of the proposed algorithm. Therefore, our algorithm performs best on this pair of source images.

Experimental results of different algorithms for the second image in Dataset-1.
Quantitative analysis of Fig. 7 under different algorithms
In Fig. 8, some information is lost in the fused images of CVT and GTF, and the edge of the object is unnatural. In the magnified areas of the fused images obtained by DTCWT and RP, artifacts appear around the object. As in Figs. 6(i) and 7(i), there are many pixel blocks around the edge of the object in the fused image acquired by MSVD. For the fused images obtained by GFF, LP, WT, NSCT, and the proposed algorithm, it is difficult to tell the difference in visual effect, but the brightness of WT and NSCT is lower than that of GFF, LP, and the proposed algorithm. Table 3 presents the quantitative results. Although the Qab/f of our algorithm is lower than that of GFF, the two values are close. In contrast, the MI and SD of GFF are much lower than those of the proposed algorithm. This shows that more information is preserved in the fused image by our algorithm.

Experimental results of different algorithms for the third image in Dataset-1.
Quantitative analysis of Fig. 8 under different algorithms
Moreover, two pairs of military reconnaissance images are adopted to verify the effect of the proposed algorithm in military reconnaissance. The first pair is shown in Fig. 9(a, b), and Fig. 9(c-l) shows the fusion results of different algorithms. In the visible image, we can only see that an object is held by the middle person, whereas in the infrared image a concealed weapon appears inside the shirt of the right person. Therefore, more information can be obtained by fusing the original images. However, the concealed weapon has low brightness in the fused images of CVT, DTCWT, LP, RP, MSVD, WT, and NSCT. Moreover, so much noise appears in the fused image acquired by RP that the concealed weapon cannot be recognized. The brightness of the concealed weapon is high in the fused image of GTF, but many details are lost; the magnified area also shows that some information in Fig. 9(b) disappears in the fused image of GTF. In contrast, rich information is preserved in the fused image of our algorithm, and the brightness is high. Compared with GFF, the fused image of our algorithm has a clearer edge at the neckband of the left person. Table 4 lists the quantitative assessments. The best result for each metric is achieved by the proposed algorithm, which agrees with the visual effect.

Experimental results of different algorithms for the fourth image in Dataset-1.
Quantitative analysis of Fig. 9 under different algorithms
To verify the fusion of details, a pair of images with rich details and texture is adopted in this experiment. Figure 10(a) and (b) are the original images, and Fig. 10(c-l) shows the fused images of different algorithms. Noise still appears in the fused image obtained by RP. There are some pixel blocks around the edge of the tank in the magnified area of MSVD. Compared with the fused image of the proposed algorithm, the fused images obtained by CVT, DTCWT, GFF, GTF, LP, WT, and NSCT lose many details, especially that of GTF. In addition, the edge information in the magnified area of the proposed algorithm is richer than in the others. The quantitative assessment is presented in Table 5. All the metrics of the proposed algorithm are the highest, which agrees with the visual effects.

Experimental results of different algorithms for the fifth image in Dataset-1.
Quantitative analysis of Fig. 10 under different algorithms
To analyze the performance of different algorithms on various images, another five pairs of source images, shown in Fig. 11, are used in this experiment. The fused images of these source images are presented in Fig. 12. The fused images of Fig. 11(a) are listed in the first column of Fig. 12. The contrast of the fused image of our algorithm is higher than that of the other algorithms, and the edge between the trees and the sky is clearer. In the second column, the brightness of the proposed algorithm is the highest. In the third column, obvious artifacts can be found in the fused image obtained by RP, and the brightness of the object is low in all the fused images except those of GTF and the proposed algorithm. However, the fused image obtained by GTF has few details from the visible image. In the fourth column of Fig. 12, the fused images obtained by GTF, LP, RP, MSVD, and WT contain fewer details of the trees than those of CVT, DTCWT, GFF, NSCT, and the proposed algorithm. Compared with GFF and the proposed algorithm, the fused images obtained by CVT, DTCWT, and NSCT are darker. The fused images of the last pair in Fig. 11 are shown in the last column of Fig. 12. The object has almost disappeared in the fused images acquired by RP, MSVD, and WT, and the island is lost in the fused image obtained by GTF. In contrast, the island and the object are clear in the fused image of our algorithm.

Dataset-2 utilized in experiments.

Experimental results of different algorithms for all images in Dataset-2.
Quantitative assessments are also conducted in this experiment. Figure 13 shows the means of EN, MI, Qab/f, and SD over the ten fused images obtained from Dataset-1 and Dataset-2. Although its Qab/f is lower than that of GFF, the proposed algorithm achieves the best results in EN, MI, and SD. This means that the proposed algorithm has the best performance across different images among these algorithms. To compare the stability of each algorithm, the ratio of standard deviation to mean is listed in Table 6. The ratios for EN and MI of our algorithm are the smallest among all algorithms, which means that our algorithm achieves the most stable performance in EN and MI. Although the ratios for Qab/f and SD of the proposed algorithm are not the best, they rank among the top. Therefore, the proposed algorithm is better in terms of stability than the other algorithms.
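The stability measure described here, the ratio of standard deviation to mean, is the coefficient of variation of a metric over the test images. A minimal sketch follows; whether the population or the sample standard deviation was used is not stated in the text, so the population form below is an assumption, as is the function name.

```python
import math


def coefficient_of_variation(scores):
    """Ratio of standard deviation to mean for one metric of one algorithm
    across several fused images. A smaller value means steadier scores."""
    n = len(scores)
    mean = sum(scores) / n
    # Assumption: population variance (divide by n, not n - 1).
    variance = sum((s - mean) ** 2 for s in scores) / n
    return math.sqrt(variance) / mean
```

For the made-up scores `[2, 4, 4, 4, 5, 5, 7, 9]`, the mean is 5 and the population standard deviation is 2, so the ratio is 0.4.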

The mean of each metric acquired by different algorithms.
Ratio of variance to mean
In addition, a comparison of the computational time over all these images is presented in Table 7. The experiments in this paper are implemented on a computer with 16 GB RAM and an Intel i7-4790@3.69GHz. The computational time is measured as the average time over 20 repeated experiments. Although the computational time of the proposed algorithm is higher than that of LP, RP, GFF, DTCWT, MSVD, and WT, it is much lower than that of GTF and traditional MSMD based algorithms such as CVT and NSCT.
Average computational time of different algorithms
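The timing protocol, averaging the running time over 20 repeated runs, can be reproduced with a small helper such as the one below. This is a generic sketch rather than the measurement code used for Table 7; the helper name and its signature are hypothetical.

```python
import time


def average_runtime(func, *args, repeats=20):
    """Mean wall-clock time in seconds of func(*args) over `repeats` runs."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        total += time.perf_counter() - start
    return total / repeats
```

A call such as `average_runtime(fuse, infrared, visible)` would time a hypothetical `fuse` function on one image pair; averaging over repeats reduces the effect of run-to-run fluctuations.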
An image fusion algorithm for infrared and visible imaging sensors is proposed for CPS in this paper. First, the details and large-scale edges of the source images are extracted by RGF and NSDFB. Then, a patch-based max-choosing rule is utilized to fuse the detailed layers and the multi-directional base layers. Comparative experiments show that our algorithm achieves the best mean EN, MI, and SD among the compared algorithms. In addition, our algorithm is more stable than the others, which makes it more suitable for CPS.
However, much work remains for further research on this algorithm. Firstly, further research is needed on the fusion rule, which is designed to merge information from infrared and visible imaging sensors. Besides, the proposed algorithm should be further studied for fusing remote sensing images.
Acknowledgments
This research is funded by the National Natural Science Foundation of China (No. 61771378) and the Science Foundation of Sichuan Science and Technology Department (No. 2018GZ0718).
