An image super-resolution method for better cognition of images in cognition computing system

Abstract

Data cognition plays an important role in cognitive computing. Cognition of low-resolution (LR) images is a long-standing problem because LR images carry insufficient information about objects. For better cognition of LR images in cognitive computing systems, this paper proposes a multi-resolution residual network (MRRN) to improve image resolution. In MRRN, a multi-resolution feature learning (MRFL) strategy is introduced to achieve satisfying performance at low computational cost. Inspired by image pyramids, a feature pyramid is designed to implement multi-resolution feature learning in the building unit of the proposed MRRN. Specifically, multi-resolution residual units (MRRUs), each consisting of a feature pyramid decomposition stage and a feature reconstruction stage, are introduced as the building units of the proposed network. To obtain informative features, transferred skip links (TSLs) transfer fine-grain residual features from the pyramid decomposition stage to the reconstruction stage. Ablation experiments demonstrate the effectiveness of MRFL and TSL, and tests on standard benchmarks indicate the superiority of the proposed MRRN over other state-of-the-art methods.

Keywords

Artificial intelligence; Deep learning; Convolutional neural networks; Computer vision; Image super-resolution

1 Introduction

A key to cognitive computing is understanding the input data. Visual data, such as images and videos, account for the vast majority of the information humans receive [4], and they are also an important part of the input data of cognitive computing systems. As reported in [6, 38], images with higher resolution enable recognition algorithms to obtain better results, which means that higher resolution contributes to better cognition of images. This is because high-resolution (HR) images provide far more significant structures and texture details than low-resolution (LR) ones. Although HR images facilitate machine cognition and subsequent analysis, their acquisition is often limited by factors such as expensive hardware. Therefore, to help cognitive computing systems obtain better cognition of LR images at low cost, this paper proposes an image super-resolution (SR) method based on deep learning to improve image resolution.

Image super-resolution (SR) is an effective and promising software alternative for obtaining an HR image from an LR one. Existing SR methods include interpolation-based [46], reconstruction-based [39], and learning-based [9, 42] methods. Interpolation-based SR methods are efficient, but blurred edges are inevitable in their results. Reconstruction-based methods are time-consuming, and their performance depends heavily on parameter settings. Learning-based SR methods fall into two types: conventional [30] and deep [5, 29] learning-based methods. Conventional learning-based methods, e.g., those based on sparse representation [41], compressed sensing [28], and random forests [31], have limited representation ability. Benefiting from their strong representation ability, deep learning-based SR methods have recently improved SR performance significantly [9, 48].

However, most existing deep learning-based SR methods hardly consider multi-resolution feature learning. Li et al. [22] proposed a multi-scale residual network (MSRN) to capture multi-scale information with convolutional layers of different kernel sizes. Nevertheless, the features in MSRN still have a single resolution. Guo et al. [11] combined the wavelet transform and a convolutional neural network for image SR, but the wavelet transform was only adopted to transform images, and the transformed images were then used to supervise network training. Inspired by multi-resolution analysis in classical image processing, such as pyramid decomposition, we suggest that LR features can represent the coarse information of features and that HR residual features can represent their fine-grain information.

To explore multi-resolution feature learning for image SR, we design a feature pyramid and propose a multi-resolution residual network (MRRN). The multi-resolution feature learning strategy enables the network to obtain results competitive with conventional single-resolution feature learning at lower computational cost. Specifically, we propose a multi-resolution residual unit (MRRU) as the building unit of our MRRN, in which a feature pyramid is devised to learn multi-resolution features. In an MRRU, multi-resolution features are first obtained by residual feature extracting groups (RFEGs) in the feature pyramid decomposition stage. Then, HR features are reconstructed in the reconstruction stage by feature refining groups (FRGs). The purpose of multi-resolution features is to reduce computational cost and to learn the relationship among different resolutions, which helps to obtain an advanced mapping representation from LR to HR. To recover informative HR features, transferred skip links (TSLs) transfer hierarchical features from each level of the decomposition stage to the corresponding level of the reconstruction stage. The contributions of this paper are summarized as follows:

An effective image multi-resolution residual network (MRRN) is proposed to improve the image resolution for better cognition of LR images in cognitive computing systems.

Imitating image pyramid decomposition and reconstruction, feature pyramid decomposition and reconstruction are designed in the building unit (i.e., MRRU) of the proposed network.

Combining multi-resolution feature learning (MRFL) and transferred skip links (TSLs), the proposed network obtains satisfying performance at much lower computational cost.

The rest of this paper is organized as follows. Section 2 describes previous work related to our method. Section 3 describes the baseline and the proposed MRRN. Section 4 presents the experimental results. Finally, Section 5 concludes the paper.

2 Related work

2.1 Deep learning based super-resolution

Dong et al. [9] proposed SRCNN, the first deep convolutional neural network for image SR, which contained only three layers. The three layers corresponded to LR feature extraction, non-linear mapping, and image reconstruction in sparse representation-based methods, respectively. Subsequently, some advanced methods focused on designing deeper or wider networks to improve SR performance [16, 47]. Kim et al. [16] presented a global residual learning strategy and a gradient clipping strategy in their VDSR. They also proposed a deeply-recursive convolutional network (DRCN) to reduce the parameters of deep SR networks [17]. However, these methods shared a common deficiency: the input LR image was interpolated to the HR size, so the model took more computing time than the same model with an LR-size input. To make networks output an HR image from an LR-size input, Dong et al. [8] utilized deconvolution to up-sample feature maps in FSRCNN, and Shi et al. [32] proposed a pixel-shuffle layer for up-sampling feature maps. Recently, further state-of-the-art methods have been proposed for image SR, including DRRN [35], MemNet [36], SRResNet [21], and SRMD [45].

Although there are lots of SR networks, the proposed network differs from them in two aspects. First, a multi-resolution feature learning strategy is utilized in the building unit, while most of the other SR networks only use features of one resolution.

Second, a modified residual block is introduced in our multi-resolution residual unit, which can directly back-propagate gradient of deep layers to shallow layers when the number of input channels is different from that of output channels in residual blocks.

2.2 Residual learning

There are two types of residual learning: global residual learning [16] and local residual learning [12]. Global residual learning was introduced by VDSR [16] to learn the residual image of the up-sampled LR image, while local residual learning was introduced by ResNet [12] to learn residual features. ResNet also proposed a classic building block, the residual block. Adopting the residual block, Ledig et al. [21] proposed SRResNet and SRGAN for image SR. Because the batch normalization in the original residual block changes the diversity of feature values, which is not suitable for image SR [24], Lim et al. [24] modified the vanilla residual block and proposed EDSR for image SR. Combining local and global residual learning in a recurrent unit, DRRN could train a very deep network for image SR. Different from these methods, the proposed network learns residual features from lower-resolution features and combines features of two different resolutions to learn the residual for higher-resolution features.

2.3 Multi-resolution networks

There are two categories of multi-resolution networks: sequential multi-resolution networks [2, 27] and parallel multi-resolution networks [34]. In the former, the multi-resolution features are computed sequentially, while in the latter they are computed in parallel. Encoder-decoder networks [2] can be viewed as the earliest sequential multi-resolution networks, since the resolution of the feature maps is reduced in the encoding stage and increased in the decoding stage. Encoder-decoder networks have been applied to various vision tasks, such as segmentation [2]. To improve their performance, Newell et al. [27] proposed an hourglass network that combines features of the same resolution in the encoder and decoder through skip connections. In cascading pyramid networks [7], multi-resolution features are cascaded to obtain the highest-resolution features. To maintain high-resolution features throughout the whole network, a high-resolution network utilized parallel multi-resolution features. However, this parallel multi-resolution feature learning requires substantial computational resources because of its numerous features and convolutional layers.

Although the multi-resolution residual unit also has an hourglass and pyramid architecture, it differs from those networks in three aspects. First, those networks were proposed for high-level vision tasks, such as segmentation and human pose estimation, while the proposed method targets image SR, a low-level vision task. Second, the hourglass and pyramid architecture is adopted inside each multi-resolution residual unit, whereas the above networks adopt this structure for the entire network. Third, residual groups are used at each resolution stage in MRRUs, while the hourglass network adopts vanilla convolutional layers.

3 The proposed method

3.1 Preliminary

As illustrated in Fig. 1, the baseline of the proposed network is an EDSR-like network. The convolutional layer after the last residual block in the original EDSR is removed in the baseline, and the pixel-shuffle layer in the original EDSR is replaced by deconvolutional layers. There are D × (2G + 1) × R residual blocks in the baseline. The residual blocks can be formulated by the following equation:

Fig. 1

The architecture of the baseline.

$x_{out} = F_{res}^{id}(x_{in}) = x_{in} + f_{2}(\delta(f_{1}(x_{in})))$ (1) where x_in and x_out are the input and output of a residual block, respectively, f₁ (·) and f₂ (·) denote the convolutional operations of the first and second layers, respectively, and δ (·) represents the non-linear activation function.

Similar to the baseline, as shown in Fig. 2, the proposed network is constructed by D units (i.e., MRRUs) and each unit consists of 2G + 1 residual groups, where each group includes R residual blocks.
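The block count stated for the baseline can be checked with a few lines of Python. This is only an illustrative sketch: the function name `count_residual_blocks` is hypothetical, and the example values D = 5, G = 2, R = 2 are the settings reported for the final network. The count follows the D × (2G + 1) × R expression and therefore does not include any residual blocks placed in the TSLs.

```python
def count_residual_blocks(d: int, g: int, r: int) -> int:
    """Number of residual blocks in D units of (2G + 1) groups with R blocks each.

    Hypothetical helper for illustration; it mirrors D x (2G + 1) x R only.
    """
    groups_per_unit = 2 * g + 1  # G + 1 extracting groups plus G refining groups
    return d * groups_per_unit * r


if __name__ == "__main__":
    # Settings reported for the final network: D = 5, G = 2, R = 2.
    print(count_residual_blocks(5, 2, 2))  # 50
```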

Fig. 2

The architecture of our MRRN.

3.2 Network architecture

As shown in Fig. 2, the proposed MRRN consists of three parts: a feature extractor module (FEM), a non-linear mapping module (NLMM), and an image reconstruction module (REM). First, shallow LR features are extracted by the FEM from the input LR image. Then, the shallow features are fed into the NLMM, which consists of several MRRUs. Finally, the deep features output by the NLMM are used to generate the HR output. The details of the three steps are described as follows.

Feature Extraction. The FEM consists of a single convolutional layer, which extracts features from the LR input image. The shallow features are obtained by $x_{e} = F_{e}(x)$ (2) where F_e (·) and x denote the function of the convolutional layer and the input LR image, respectively.

Nonlinear Mapping. The NLMM is composed of several stacked MRRUs, as shown in Fig. 2. We denote the function of the i-th MRRU as F_i (·) and suppose that the number of MRRUs in the NLMM is D. Thus, the output of the NLMM can be formulated as $x_{D} = F_{D}(x_{D-1}) = F_{D}(\dots F_{1}(x_{0}) \dots),$ (3) where x_i represents the output of the i-th MRRU (and hence the input of the (i + 1)-th MRRU), and x₀ = x_e is the input of the NLMM.

Image Reconstruction. As illustrated in Fig. 2, the sum of the output and the input of NLMM is fed into REM to reconstruct HR images, where REM includes a deconvolutional layer and a convolutional layer. Hence, the image reconstruction is formulated as follows $y = F_{re} (F_{deconv} (x_{D} + x_{e}))$ (4) where y is the SR image. F_re (·) denotes the function of the last convolutional layer. F_deconv (·) represents the deconvolutional layer.

We adopt the L₁ loss function in the training stage because networks trained with the L₁ loss converge faster than those trained with the L₂ loss. Given a training set $\mathcal{D} = \{x^{(i)}, \hat{y}^{(i)}\}_{i=1}^{N}$, where N is the number of training examples and $\hat{y}^{(i)}$ is the HR target paired with the LR input $x^{(i)}$, the loss function of our MRRN can be expressed as $L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \| \hat{y}^{(i)} - N_{net}(x^{(i)}, \theta) \|_{1},$ (5) where N_net (·) and θ represent the function and the parameters of our MRRN, respectively.
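As a rough illustration of Eq. (5), the following sketch computes the L₁ training loss for a small batch. It is not the authors' training code: the helper name `l1_loss` is hypothetical, and each image is assumed to be given as a flat sequence of pixel values.

```python
from typing import Sequence


def l1_loss(targets: Sequence[Sequence[float]],
            outputs: Sequence[Sequence[float]]) -> float:
    """Average over the N examples of the L1 norm of (target - output).

    Assumption: every image is a flat sequence of pixel values, and
    targets[i] is the HR target paired with the network output outputs[i].
    """
    if not targets or len(targets) != len(outputs):
        raise ValueError("targets and outputs must be non-empty and equally long")
    total = 0.0
    for target, output in zip(targets, outputs):
        if len(target) != len(output):
            raise ValueError("each target must match its output in size")
        # L1 norm of the difference: sum of absolute pixel errors.
        total += sum(abs(t - o) for t, o in zip(target, output))
    return total / len(targets)


if __name__ == "__main__":
    hr_targets = [[1.0, 2.0], [0.0, 0.0]]
    sr_outputs = [[0.5, 2.5], [1.0, 0.0]]
    print(l1_loss(hr_targets, sr_outputs))  # 1.0
```

Deep learning frameworks commonly average over pixels as well, which only rescales this loss by a constant when all patches have the same size.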

3.3 Multi-resolution residual units

Inspired by classic multi-resolution image processing techniques, such as image pyramid methods, we argue that features of different resolutions also carry information of different levels. For example, after pooling, the resolution of features is reduced and some information is lost. However, the resolution-reduced features can still represent the basic information of the original features; they are akin to the low-frequency LR images in image pyramids. The residual features between the original features and the resolution-reduced features can represent fine-grain information, and they are akin to the sub-band images in image pyramids. Based on this inspiration, a feature pyramid is devised in the multi-resolution residual unit (MRRU), as shown in Fig. 3-(a). To show the feature pyramid clearly, an illustration of the features in an MRRU is presented in Fig. 3-(b). Each MRRU has two stages, feature pyramid decomposition and feature reconstruction, which are detailed as follows.

Fig. 3

(a) is the architecture of MRRU. (b) is the illustration of feature maps in MRRU.

Feature pyramid decomposition. By analogy with image pyramid decomposition, we call the process of obtaining multi-resolution features by RFEGs feature pyramid decomposition. As shown in Fig. 3-(a), G + 1 residual feature extracting groups (RFEGs) are first used to build the feature pyramid. The resolution of the features at a given pyramid level is twice that of the next level. Max-pooling is adopted as the down-sampling operation to obtain the features of the higher (coarser) level in the feature pyramid. Therefore, given the input x_i-1 of the i-th MRRU, the multi-resolution features in the feature pyramid can be obtained by $\begin{aligned} z_{1} &= F_{rfeg}^{1}(x_{i-1}), \\ z_{g} &= F_{rfeg}^{g}(F_{mp}(z_{g-1})), \quad g = 2, \dots, G+1, \end{aligned}$ (6) where F_mp (·) denotes the max-pooling operation, z_g denotes the features output by the g-th RFEG, and $F_{rfeg}^{g}(\cdot)$ represents the function of the g-th RFEG.

As shown in Fig. 3-(a), in each RFEG, R residual blocks are stacked to extract informative features at the current level. Therefore, for g ≥ 2, the RFEG can be formulated as $z_{g} = F_{res,R}^{id}(\dots(F_{res,1}^{id}(F_{mp}(z_{g-1})))\dots),$ (7) where $F_{res,r}^{id}(\cdot)$ denotes the function of the r-th residual block in the current RFEG. The detail of $F_{res}^{id}(\cdot)$ is given in Eq. (1). Corresponding to an image decomposition level in the image pyramid, z_g is the feature decomposition level in the feature pyramid. Then, z_g is fed into two branches (see Fig. 3-(b)): the transferred skip link (TSL) and the branch to the next RFEG. The TSL is used to learn the fine-grain residual features $z_{g}^{r}$ from z_g, which correspond to the sub-band image of the g-th level in image pyramids. The TSL not only transfers $z_{g}^{r}$ to the subsequent feature refining group (FRG) for the reconstruction of HR features but also helps gradient back-propagation. The TSL is built with T residual blocks, so the fine-grain residual features of z_g can be obtained by $z_{g}^{r} = F_{res,T}^{id}(\dots(F_{res,1}^{id}(z_{g}))\dots)$ (8) where $F_{res,t}^{id}(\cdot)$ denotes the function of the t-th residual block in the TSL. As for the branch to the next RFEG, the output of the current level is down-sampled by the max-pooling operation. The decomposition process then iterates according to Eqs. (6) and (8) until the output of the last RFEG (i.e., z_G+1) is obtained.
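The spatial side of the decomposition in Eq. (6) can be sketched in plain Python. This is a simplified illustration under stated assumptions: the max-pooling uses a 2 × 2 window with stride 2 (the text only states that each level has half the resolution of the previous one), the residual blocks of each RFEG are replaced by the identity, and the names `max_pool_2x2` and `decompose` are hypothetical.

```python
from typing import List

Feature = List[List[float]]  # a single-channel feature map (assumption)


def max_pool_2x2(fmap: Feature) -> Feature:
    """2 x 2 max-pooling with stride 2; a trailing odd row or column is dropped."""
    h, w = len(fmap), len(fmap[0])
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, w - 1, 2)]
        for i in range(0, h - 1, 2)
    ]


def decompose(x: Feature, g_total: int) -> List[Feature]:
    """Return [z_1, ..., z_{G+1}] with every RFEG replaced by the identity."""
    levels = [x]  # z_1 is computed from the unit input without pooling
    for _ in range(g_total):
        # z_g is computed from the max-pooled z_{g-1}
        levels.append(max_pool_2x2(levels[-1]))
    return levels


if __name__ == "__main__":
    x = [[float(8 * i + j) for j in range(8)] for i in range(8)]
    pyramid = decompose(x, g_total=2)
    print([(len(z), len(z[0])) for z in pyramid])  # [(8, 8), (4, 4), (2, 2)]
```

With G = 2, an 8 × 8 input yields three levels of size 8 × 8, 4 × 4, and 2 × 2, which is the pyramid structure that the TSLs and FRGs later operate on.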

HR feature reconstruction. After the input features of the MRRU are decomposed by the G + 1 RFEGs, G feature refining groups (FRGs) reconstruct the HR features, which corresponds to the image reconstruction in image pyramid methods. The HR feature reconstruction process can be formulated as $\begin{aligned} h_{0} &= z_{G+1}, \\ h_{g} &= F_{frg}^{g}([F_{up}(h_{g-1}), z_{G+1-g}^{r}]), \quad g = 1, \dots, G, \end{aligned}$ (9) where h_g is the output of the g-th FRG, $z_{G+1-g}^{r}$ obtained by Eq. (8) is the fine-grain residual feature from the TSL, and $F_{frg}^{g}(\cdot)$ denotes the function of the g-th FRG. [·] and F_up (·) represent the concatenation and the bi-linear up-sampling operation, respectively. In Eq. (9), h_g-1 and $z_{G+1-g}^{r}$ correspond to the coarse LR image and the detail image of the (G + 1 - g)-th level in the image pyramid, respectively. Similar to the decomposition process, the reconstruction process iterates until the output of the current MRRU, i.e., h_G, is obtained.
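To make the index pairing in Eq. (9) concrete, the sketch below traces only tensor shapes through the reconstruction stage. It is an illustration under assumptions that the text implies but does not state in this form: bi-linear up-sampling doubles height and width, the TSL preserves the shape of z_g, every FRG outputs C channels, and the spatial size is divisible by 2^G. The function name `trace_reconstruction` is hypothetical.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)


def trace_reconstruction(c: int, h: int, w: int,
                         g_total: int) -> List[Tuple[int, int, Shape, Shape]]:
    """Return (g, skip level, concatenated shape, FRG output shape) per FRG."""
    scale = 2 ** g_total
    if h % scale or w % scale:
        raise ValueError("height and width must be divisible by 2**G")
    # Shapes of z_1 ... z_{G+1} from the decomposition stage.
    z = {lvl: (c, h // 2 ** (lvl - 1), w // 2 ** (lvl - 1))
         for lvl in range(1, g_total + 2)}
    current = z[g_total + 1]  # h_0 = z_{G+1}
    steps = []
    for g in range(1, g_total + 1):
        skip_level = g_total + 1 - g           # TSL feature z^r_{G+1-g}
        up = (current[0], current[1] * 2, current[2] * 2)
        skip = z[skip_level]
        if up[1:] != skip[1:]:
            raise ValueError("up-sampled and transferred features differ in size")
        concat = (up[0] + skip[0], up[1], up[2])  # channel-wise concatenation
        current = (c, up[1], up[2])               # FRG maps back to C channels
        steps.append((g, skip_level, concat, current))
    return steps


if __name__ == "__main__":
    for step in trace_reconstruction(64, 48, 48, g_total=2):
        print(step)
```

For a 64-channel 48 × 48 input and G = 2, the first FRG fuses the up-sampled 12 × 12 features with the level-2 TSL features at 24 × 24, and the second FRG restores the original 48 × 48 resolution.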

Similar to the RFEG, each FRG also consists of R residual blocks. However, as shown in Fig. 3-(a), the input of the first residual block is the concatenation of the up-sampled features from the preceding residual group and the fine-grain residual features from the TSL. Therefore, we modify the vanilla residual block with identity mapping [12] to handle the case in which the number of input channels differs from the number of output channels. The modified residual block is formulated as $\begin{aligned} x_{out} &= F_{mres}(x_{in}) = r(x_{in}) + H(x_{in}), \\ r(x_{in}) &= \frac{1}{k}\left[\sum_{i=1}^{k} x_{in}^{(i)}, \sum_{i=k+1}^{2k} x_{in}^{(i)}, \dots, \sum_{i=k(C_{out}-1)+1}^{kC_{out}} x_{in}^{(i)}\right], \end{aligned}$ (10) where F_mres (·) represents the function of the modified residual block, and H (·) denotes the function of two convolutional layers with an activation function between them. r (·) is a feature shrinking function that shrinks the features by a factor of k along the channel dimension, which is similar to average pooling along the channel dimension. Here k = C_in/C_out, where C_in and C_out are the numbers of input and output channels, respectively, and $x_{in}^{(i)}$ denotes the i-th channel of x_in. As shown in Fig. 3-(a), the modified residual block is followed by R - 1 original residual blocks. Therefore, the FRG can be formulated as $\begin{aligned} h_{g} &= F_{res,R-1}^{id}(\dots(F_{res,1}^{id}(h_{g}^{tmp}))\dots), \\ h_{g}^{tmp} &= F_{mres}([F_{up}(h_{g-1}), z_{G+1-g}^{r}]), \end{aligned}$ (11) where $h_{g}^{tmp}$, obtained by Eq. (10), is the output of the modified residual block.
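The feature shrinking function r (·) of Eq. (10) amounts to averaging consecutive groups of k channels. The following sketch shows this on small lists; it is an illustration only, with the hypothetical name `shrink_channels`, and it assumes that each channel is stored as a flat list of values and that C_in is an integer multiple of C_out.

```python
from typing import List

Channel = List[float]  # one channel, flattened to a list of values (assumption)


def shrink_channels(x_in: List[Channel], c_out: int) -> List[Channel]:
    """Average consecutive groups of k = C_in / C_out channels."""
    c_in = len(x_in)
    if c_out <= 0 or c_in % c_out != 0:
        raise ValueError("C_in must be a positive multiple of C_out")
    k = c_in // c_out
    shrunk = []
    for j in range(c_out):
        group = x_in[j * k:(j + 1) * k]  # channels (j*k + 1) ... (j + 1)*k, 1-based
        # Element-wise mean over the k channels of the group.
        shrunk.append([sum(values) / k for values in zip(*group)])
    return shrunk


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # C_in = 4
    print(shrink_channels(x, c_out=2))  # [[2.0, 3.0], [6.0, 7.0]]
```

In the FRG, the concatenated input would have twice as many channels as the output if both concatenated features have the same channel count, which corresponds to k = 2 in this sketch.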

3.4 Architecture details

The architecture details of the proposed network are shown in Table 1. The numbers of MRRUs and FRGs are set to 5 and 2 (i.e., D = 5, G = 2), respectively. Two residual blocks (i.e., R = 2) are placed in each group. A single residual block is adopted in each TSL, i.e., T = 1. Except for the first convolutional layer in the modified residual blocks, the numbers of input and output channels of the convolutional layers in the residual blocks are set to 64. Their kernel size, stride, and padding are set to 3, 1, and 1, respectively. In the REM, a different deconvolutional layer is used for each of the three scale factors.

Table 1
The architecture details of the proposed network. Deconv/Conv (C_in, C_out, k, s, p) represents the deconvolutional/convolutional layer with C_in input channels and C_out output channels. k, s, and p denote the kernel size, stride, and padding, respectively. I-ResBlock denotes the original residual block with identity mapping. M-ResBlock denotes the modified residual block

Module	Group	Level	Layers
FEM	–	–	Conv (3, 64, 3, 1, 1)
NLMM, 1st MRRU	RFEG	1	I-ResBlock, I-ResBlock
	TSL	1	I-ResBlock
	⋮	⋮	⋮
	RFEG	G	I-ResBlock, I-ResBlock
	TSL	G	I-ResBlock
	RFEG	G+1	I-ResBlock, I-ResBlock
	FRG	1	M-ResBlock, I-ResBlock
	⋮	⋮	⋮
	FRG	G	M-ResBlock, I-ResBlock
⋮
NLMM, D-th MRRU	⋮	⋮	⋮
REM	–	–	Deconv (64, 64, 6, 2, 2) for ×2, or Deconv (64, 64, 7, 3, 2) for ×3, or Deconv (64, 64, 8, 4, 2) for ×4
REM	–	–	Conv (64, 3, 3, 1, 1)
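The three deconvolution settings in Table 1 can be checked against the usual output-size relation for a transposed convolution. The sketch below assumes no dilation and no output padding, which the paper does not state explicitly; the names `deconv_output_size` and `REM_DECONV` are hypothetical.

```python
def deconv_output_size(size_in: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a transposed convolution (dilation 1, no output padding)."""
    return (size_in - 1) * stride - 2 * padding + kernel


# (kernel, stride, padding) of the REM deconvolution for each scale factor.
REM_DECONV = {2: (6, 2, 2), 3: (7, 3, 2), 4: (8, 4, 2)}

if __name__ == "__main__":
    lr_size = 48  # side length of a training patch
    for scale, (k, s, p) in REM_DECONV.items():
        print(scale, deconv_output_size(lr_size, k, s, p))
```

Each setting satisfies k - s - 2p = 0, so under these assumptions the output size is exactly the input size multiplied by the scale factor.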

4 Experiments and results

4.1 Dataset and training setting

We use DIV2K [1] as the training set and five standard benchmarks as test sets: Set5 [3], Set14 [43], B100 [25], Urban100 [13], and Manga109 [26]. Both the training set and the test sets are commonly used in the SR community. As in [47], three types of degradation are adopted for testing: bicubic down-sampling (denoted as BI), blurring followed by bicubic down-sampling (denoted as BD), and bicubic down-sampling followed by additive Gaussian noise (denoted as DN). We apply data augmentation to the training set, including random cropping, flipping, and rotation. In the training stage, a mini-batch contains 16 patches, each cropped to 48 × 48. Adam [18] is used to optimize the network with an initial learning rate of 10⁻⁴. MRRN is trained for 10⁶ iterations in total, and the learning rate is halved every 2 × 10⁵ iterations. All experiments in this paper are implemented in PyTorch and run on desktops with GTX 1080 Ti / RTX 2080 Ti GPUs.
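The step-wise learning-rate schedule described above can be written as a small function. This is a sketch rather than the authors' training script: the name `learning_rate` is hypothetical, and it assumes that the halving takes effect exactly at iterations 2 × 10⁵, 4 × 10⁵, and so on, counting iterations from zero.

```python
def learning_rate(iteration: int,
                  base_lr: float = 1e-4,
                  halve_every: int = 200_000) -> float:
    """Learning rate at a given iteration (0-based), halved every `halve_every` steps."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    return base_lr * 0.5 ** (iteration // halve_every)


if __name__ == "__main__":
    for it in (0, 199_999, 200_000, 999_999):
        print(it, learning_rate(it))
```

Under this assumption the rate is halved four times during the 10⁶ training iterations, so the final iterations run at 1/16 of the initial rate.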

4.2 Investigation of G, R, and T

In this subsection, we investigate the basic hyper-parameters of MRRN: the number of FRGs (denoted as G), the number of residual blocks in each residual group (denoted as R), and the number of residual blocks in each TSL (denoted as T). To reduce computing time and resources, the basic number of feature maps is set to 32 (i.e., M = 32), which is the same as the setting for the ablation study in Section 4.3. When we investigate one of the hyper-parameters, the other two are fixed. As shown in Table 2, a larger G, R, or T leads to higher performance because the network becomes deeper and has more parameters. However, a larger G, R, or T also consumes much more computing time and resources. Besides, the performance of the network is more sensitive to R than to G and T. Considering the trade-off between performance and computational cost, we set G = 2, R = 2, and T = 1 for the final network.

Table 2
The performance and parameters analysis for different hyper-parameters

Variable	R (T = 0, G = 2)			T (R = 2, G = 2)			G (R = 2, T = 0)		
Value	1	2	3	0	1	2	1	2	3
Params.	622 K	1.08 M	1.55 M	1.08 M	1.27 M	1.45 M	668 K	1.08 M	1.50 M
PSNR	31.36	31.65	31.76	31.65	31.67	31.71	31.61	31.65	31.66
SSIM	0.8821	0.8868	0.8888	0.8868	0.8868	0.8886	0.8861	0.8868	0.8873

4.3 Ablation study

To validate the effectiveness of multi-resolution feature learning (MRFL) and transferred skip links (TSLs), we analyze the performance and computational costs of four networks: the baseline, MRRN without MRFL, MRRN without TSL, and the full MRRN. The basic number of feature maps of the convolutional layers is set to 32 in this experiment. The results are shown in Table 3. First, the network lacking TSLs obtains much worse results than MRRN and the baseline. This is because, without TSLs, the intermediate levels of the feature pyramid cannot be transferred to the subsequent FRGs to reconstruct HR features, which results in information loss in HR feature reconstruction and a sharp drop in performance. This comparison shows that TSLs are indispensable in the proposed network. Then, comparing MRRN with the baseline and with MRRN without MRFL, we find that their performance and parameter counts are very close. However, the multiply-add operations (MADDs) of the proposed network are far fewer than those of the baseline and the network without MRFL. This is because MRFL reduces the resolution of features and thereby saves a large amount of computation. Overall, both MRFL and TSL are necessary to realize the feature pyramid in the proposed network: TSL facilitates performance improvements, while MRFL helps reduce computational costs.

Table 3
Investigation of the effectiveness of MRFL and TSL. MADDs are calculated under the task of reconstructing a 1280 × 720 HR RGB image

Methods	Baseline (w/o TSL, MRFL)	MRRN w/o TSL	MRRN w/o MRFL	MRRN (full)
PSNR	32.05	30.45	32.07	32.03
MADDs	259.99 G	143.41 G	331.69 G	188.23 G
Params.	993 K	993 K	1.26 M	1.26 M
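The effect of feature resolution on MADDs can be illustrated with the usual count for a single convolutional layer. This sketch is not the tool used to produce Table 3: it assumes one multiply-add per kernel tap, input channel, and output element, ignores biases, and uses the hypothetical names `conv_madds` and `relative_saving`. The second helper simply restates the MADDs reported in Table 3 as relative savings.

```python
def conv_madds(height: int, width: int, c_in: int, c_out: int, kernel: int) -> int:
    """Multiply-adds of one convolution whose output is c_out x height x width."""
    return height * width * c_in * c_out * kernel * kernel


def relative_saving(madds: float, reference: float) -> float:
    """Fraction of MADDs saved relative to a reference network."""
    return 1.0 - madds / reference


if __name__ == "__main__":
    full = conv_madds(48, 48, 32, 32, 3)
    half = conv_madds(24, 24, 32, 32, 3)
    print(full // half)  # halving the resolution quarters the cost: 4

    # MADDs (in G) from Table 3: full MRRN vs. baseline and vs. MRRN w/o MRFL.
    print(round(relative_saving(188.23, 259.99), 3))
    print(round(relative_saving(188.23, 331.69), 3))
```

Under this count, a layer that runs on features of half the resolution costs one quarter of the MADDs, which is consistent with the full MRRN needing roughly 28% fewer MADDs than the baseline and roughly 43% fewer than the network without MRFL in Table 3.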

4.4 Investigation of different MRFL strategies

To investigate different MRFL strategies, we compare the proposed network with a variant in which the positions of the max-pooling operation and the bi-linear interpolation are swapped. In other words, in the MRRUs of the variant, the features are first up-sampled by interpolation and then pooled by max-pooling. The PSNR curves of these networks on the validation set are shown in Fig. 4, where ‘down-up’ and ‘up-down’ refer to the proposed network and the variant, respectively, and ‘baseline’ is the network described in Section 3.1. A quantitative analysis on a test set is presented in Table 4. Although the baseline and the variant obtain slightly higher results than the proposed MRRN, they take more computing time. For example, the training time of the variant is 4.77 times that of MRRN. This is mainly because the resolution of the middle-level features is increased in the variant, which requires more computation and naturally takes more time. The performance improvement brought by the variant is small compared with its much higher computational cost. Therefore, the proposed network is more practical in realistic scenarios than the variant.

Fig. 4

Performance of different MRFL strategies on Set5 for ×4 SR.

Table 4

The performance, training time, and test time for ×4 SR. The training time is the average time taken to train one epoch on DIV2K. The test time is the average time per image on Urban100

Method	PSNR	SSIM	Training time	Test time	Params.
Baseline	25.36	0.7610	1’25(min)	0.0548(s)	1.08(M)
Up-down	25.38	0.7624	5’10(min)	0.0699(s)	1.08(M)
Down-up	25.30	0.7579	1’05(min)	0.0473(s)	1.08(M)
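The 4.77× training-time ratio quoted in the text follows from the Table 4 entries if they are read as minutes and seconds. The sketch below makes that reading explicit; the helper name `parse_min_sec` is hypothetical, and interpreting 5’10 as 5 min 10 s is an assumption about the table notation.

```python
def parse_min_sec(text: str) -> int:
    """Convert a string such as "5'10" (minutes'seconds) to seconds."""
    minutes, seconds = text.split("'")
    return int(minutes) * 60 + int(seconds)


if __name__ == "__main__":
    up_down = parse_min_sec("5'10")   # variant, 310 s per epoch
    down_up = parse_min_sec("1'05")   # proposed MRRN, 65 s per epoch
    print(round(up_down / down_up, 2))  # 4.77
```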

4.5 Investigation of different feature fusion strategies for TSL

To study different strategies for fusing the LR features from the FRG and the transferred features from the TSL, we also compare the proposed MRRN with another variant. In this variant, the modified residual blocks are replaced by original residual blocks, and the transferred features from the TSL are added to the up-sampled features from the FRG. The number of residual blocks in the TSL is set to T = 0 in this experiment. The results of the comparative experiments are depicted in Fig. 5, where ‘Base’ and ‘plus’ represent the proposed network without TSL and the variant, respectively. As Fig. 5 shows, the proposed MRRN (labeled ‘concate’) achieves the best results. This is because the proposed network concatenates the transferred features and the LR features to learn the reconstruction features, while the variant merely uses the transferred features as a residual of the LR features. Concatenation lets the FRG receive both the LR features and the transferred features, thereby implicitly capturing the relationship between different resolutions. Therefore, the proposed method obtains better results than the variant.

Fig. 5

Performance of different fusion strategies for TSL on Set5 for ×4 SR.

4.6 Comparison with state-of-the-art methods

Results on BI models. To confirm the effectiveness of our MRRN, we compare it with 8 state-of-the-art methods, including SRCNN [9], VDSR [16], SRResNet [21], DRRN [35], MemNet [36], LapSRN [19], and SRMDNF [45]. The results of the compared methods are obtained with the code released with the corresponding papers. As in [24], a self-ensemble method is adopted to further improve the performance of MRRN, and the resulting model is denoted as MRRN+. Quantitative PSNR and SSIM results are listed in Table 5. MRRN+ achieves the best results on all benchmarks, and our MRRN also outperforms all other methods with less inference time than MemNet and LapSRN. This is mainly because MRRN adopts multi-resolution feature learning and transferred skip links, which reduce computation while maintaining SR performance (as shown in Section 4.3). The visual quality of the SR images obtained by these methods on BI models for ×4 SR is shown in Fig. 6. For ‘8023’ from B100, only the proposed method recovers the stripes on the body of the fish, while many artifacts emerge in the fish body area in the SR images obtained by the compared methods. For ‘Belmondo’ from Manga109, the word ‘COMICS’ recovered by other methods is blurred, while our method recovers it clearly. Similarly, the result for ‘img_099’ from Urban100 obtained by our method is clearer than those obtained by other methods. These examples show that our MRRN achieves more faithful results than the compared methods, which demonstrates its superior performance.
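The self-ensemble idea behind MRRN+ can be sketched as follows. This is a generic illustration, not the authors' implementation: it assumes the common geometric variant with eight inputs (four rotations, each with and without a horizontal flip), operates on a single-channel image stored as nested lists, and uses the hypothetical names `rot90`, `flip`, `self_ensemble`, and `nearest_x2` (a stand-in for the SR network).

```python
from typing import Callable, List

Image = List[List[float]]


def rot90(img: Image) -> Image:
    """Rotate a 2-D image by 90 degrees counter-clockwise."""
    return [list(col) for col in zip(*img)][::-1]


def flip(img: Image) -> Image:
    """Mirror a 2-D image horizontally."""
    return [row[::-1] for row in img]


def self_ensemble(model: Callable[[Image], Image], lr: Image) -> Image:
    """Average the model outputs over 8 flipped/rotated copies of the input."""
    outputs = []
    for flipped in (False, True):
        for k in range(4):
            aug = flip(lr) if flipped else lr
            for _ in range(k):
                aug = rot90(aug)
            out = model(aug)
            # Undo the augmentation on the output: rotate back, then un-flip.
            for _ in range((4 - k) % 4):
                out = rot90(out)
            if flipped:
                out = flip(out)
            outputs.append(out)
    h, w, n = len(outputs[0]), len(outputs[0][0]), len(outputs)
    return [[sum(o[i][j] for o in outputs) / n for j in range(w)] for i in range(h)]


def nearest_x2(img: Image) -> Image:
    """Stand-in 'network': nearest-neighbour x2 up-sampling."""
    return [[v for v in row for _ in range(2)] for row in img for _ in range(2)]


if __name__ == "__main__":
    lr_image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    sr_image = self_ensemble(nearest_x2, lr_image)
    print(len(sr_image), len(sr_image[0]))  # 4 6
```

Because the stand-in model commutes with flips and rotations, the ensemble here returns the same image as a single pass; with a trained network the eight outputs differ slightly and their average is what improves the result, at the price of eight forward passes.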

Table 5
The performance of different methods for scale factors ×2, ×3 and ×4 SR on BI degradation model. Parameters are counted on networks for ×4 task. The average time is tested on Urban100 for ×4 SR

Dataset	Scale	Bicubic	SRCNN	VDSR	DRRN	MemNet	LapSRN	SRMD	MRRN	MRRN+
Complexity	Time(s)	–	26.56	0.3863	0.0135	0.2664	0.2019	0.0314	0.0512	0.6270
	Params.	–	7 K	0.66 M	0.30 M	2.91 M	0.81 M	1.5 M	5.07 M	5.07 M
Set5	×2	33.66/.9299	36.66/.9542	37.53/.9590	37.74/.9591	37.78/.9597	37.52/.9591	37.79/.9601	38.03/.9605	38.10/.9608
	×3	30.39/.8682	32.75/.9090	33.67/.9210	34.03/.9244	34.09/.9248	33.82/.9277	34.12/.9254	34.40/.9270	34.52/.9281
	×4	28.42/.8104	30.48/.8628	31.35/.8830	31.68/.8888	31.74/.8893	31.54/.8850	31.96/.8925	32.17/.8947	32.29/.8963
Set14	×2	30.24/.8688	32.45/.9067	33.05/.9130	33.23/.9136	33.28/.9142	33.08/.9130	33.32/.9159	33.55/.9175	33.70/.9191
	×3	27.55/.7742	29.30/.8215	29.78/.8320	29.96/.8349	30.00/.8350	29.87/.8320	30.04/.8382	30.33/.8425	30.43/.8441
	×4	26.00/.7027	27.50/.7513	28.02/.7680	28.21/.7721	28.26/.7723	28.19/.7720	28.35/.7787	28.57/.7815	28.66/.7833
B100	×2	29.56/.8431	31.36/.8879	31.90/.8960	32.05/.8973	32.08/.8978	31.08/.8950	32.05/.8985	32.19/.8998	32.24/.9003
	×3	27.21/.7385	28.41/.7863	28.83/.7990	28.95/.8004	28.96/.8001	28.82/.7980	28.97/.8025	29.09/.8052	29.15/.8062
	×4	25.96/.6675	26.90/.7101	27.29/.7260	27.38/.7284	27.40/.7281	27.32/.7270	27.49/.7337	27.57/.7363	27.63/.7377
Urban100	×2	26.88/.8403	29.50/.8946	30.77/.9140	31.23/.9188	31.31/.9195	30.41/.9101	31.33/.9204	32.20/.9294	32.37/.9308
	×3	24.46/.7349	26.24/.7989	27.14/.8290	27.53/.8378	27.56/.8376	27.07/.8280	27.57/.8398	28.21/.8536	28.36/.8562
	×4	23.14/.6577	24.52/.7221	25.18/.7540	25.44/.7638	25.50/.7630	25.21/.7560	25.68/.7731	26.11/.7868	26.25/.7899
Manga109	×2	30.30/.9339	35.60/.9633	37.22/.9750	37.60/.9736	37.72/.9740	37.27/.9740	38.07/.9761	38.67/.9711	38.89/.9776
	×3	26.95/.8556	30.48/.9117	32.01/.9340	32.42/.9359	32.51/.9369	32.21/.9350	33.00/.9403	33.47/.9440	33.77/.9457
	×4	24.89/.7866	27.58/.8555	28.83/.8870	29.18/.8914	29.42/.8942	29.09/.8900	30.09/.9024	30.36/.9074	30.67/.9104
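As a reading aid for the first number in each cell of Table 5, the following is a minimal sketch of PSNR between a super-resolved image and its HR ground truth. It assumes 8-bit images with a peak value of 255; the evaluation protocol details (for example, which colour channel is used or whether borders are cropped) are not stated in this section, so the sketch leaves them out. The function name `psnr` is hypothetical.

```python
import numpy as np


def psnr(sr, hr, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    sr = np.asarray(sr, dtype=np.float64)
    hr = np.asarray(hr, dtype=np.float64)
    mse = np.mean((sr - hr) ** 2)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)
```

Since PSNR is logarithmic in the mean squared error, a gain of 0.3 dB corresponds to roughly a 7% reduction in MSE.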

Fig. 6

Visual comparison of different methods for ×4 SR on the BI degradation model. The three images are ‘8023’ from B100, ‘Belmondo’ from Manga109, and ‘img_099’ from Urban100, respectively.

Results on BD and DN models. Following [47], we also test MRRN on the BD and DN degradation models for ×3 SR. The BD benchmarks are obtained by blurring the HR images with a Gaussian kernel of size 7 × 7 and a standard deviation of 1.6. The DN benchmarks are obtained by applying bicubic down-sampling to the HR images and then adding Gaussian noise with a noise level of 30. The MRRN models for the BD and DN degradations are fine-tuned from the network trained on the BI degradation model. Six state-of-the-art methods are compared with the proposed method: SRCNN [9], FSRCNN [8], VDSR [16], IRCNN_C [44], IRCNN_G [44], and SRMDNF [45]. The quantitative results are reported in Table 6. MRRN+ and MRRN again achieve better results than the other methods. Visual results on the BD and DN test benchmarks are shown in Figs. 7 and 8, respectively. The results for BD-degraded test images obtained by SRCNN, VDSR, and DRRN are blurred. Compared with the other methods, MRRN alleviates blurring artifacts and recovers clear textures for BD-degraded images. In addition, MRRN can handle the noise in the LR input, as shown in Fig. 8, while the other methods do not remove the noise well. Therefore, the proposed MRRN is more effective and robust than the other methods on BD and DN degraded images.
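The two degradations can be sketched as follows for a single-channel image. This is an illustration under stated assumptions rather than the benchmark-generation code: the BD sketch assumes that the blurred image is reduced by direct ×3 sub-sampling, the DN sketch assumes that the input has already been bicubically down-sampled and that "noise level 30" means a standard deviation of 30 on the 0 to 255 intensity scale, and all function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(size=7, sigma=1.6):
    """Normalised 2-D Gaussian kernel (7 x 7, sigma 1.6 by default)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()


def blur(img, kernel):
    """Filter a 2-D image with `kernel`, replicating edge pixels."""
    pad = kernel.shape[0] // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            out += kernel[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out


def degrade_bd(hr, scale=3):
    """BD sketch: Gaussian blur, then direct sub-sampling (assumption)."""
    blurred = blur(hr.astype(np.float64), gaussian_kernel())
    return blurred[::scale, ::scale]


def add_noise(lr, level=30.0, seed=0):
    """DN sketch: add Gaussian noise to an already down-sampled image."""
    rng = np.random.default_rng(seed)
    noisy = lr.astype(np.float64) + rng.normal(0.0, level, size=lr.shape)
    return np.clip(noisy, 0.0, 255.0)
```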

Table 6

The performance of different methods for scale factor ×3 SR on BD and DN degradation models

Dataset	Model	Bicubic	SRCNN	VDSR	DRRN	MemNet	LapSRN	SRMD	MRRN	MRRN+
Set5	BD	28.34/.8161	31.63/.8888	26.23/.8124	33.30/.9159	33.38/.9182	29.55/.8246	34.09/.9242	34.51/.9272	34.64/.9283
	DN	24.14/.5445	27.16/.7672	24.18/.6932	27.72/.7872	24.85/.7205	26.18/.7430	27.74/.8026	28.51/.8169	28.56/.8180
Set14	BD	26.12/.7106	28.52/.7924	24.44/.7106	29.67/.8269	29.73/.8292	27.33/.7135	30.11/.8364	30.47/.8434	30.56/.8449
	DN	23.14/.4828	25.49/.6580	23.02/.5856	25.92/.6786	23.84/.6091	24.68/.6300	26.13/.6974	26.60/.7114	26.65/.7127
B100	BD	26.02/.6733	27.76/.7526	24.86/.6832	28.63/.7903	28.65/.7922	26.46/.6572	28.98/.8009	29.17/.8063	29.23/.8074
	DN	22.94/.4461	25.11/.6151	23.41/.5556	25.52/.6345	23.89/.5688	24.52/.5850	25.64/.6495	25.94/.6595	25.97/.6603
Urban100	BD	23.20/.6661	25.31/.7612	22.04/.6745	26.75/.8145	26.77/.8154	24.89/.7172	27.50/.8370	28.31/.8550	28.48/.8578
	DN	21.63/.4701	23.32/.6500	21.15/.5682	23.83/.6797	21.96/.6018	22.63/.6205	24.28/.7092	24.96/.7393	25.05/.7418
Manga109	BD	25.03/.7987	28.79/.8851	23.04/.7927	31.66/.9260	31.15/.9245	28.68/.8574	32.97/.9391	33.96/.9457	34.28/.9474
	DN	23.08/.5448	25.78/.7889	22.39/.7111	26.41/.8130	23.18/.7466	24.74/.7701	26.72/.8424	27.96/.8591	28.10/.8616

Fig. 7

The visual comparison of different methods for scale factor ×3 SR on BD degradation model.

Fig. 8

The visual comparison of different methods for scale factor ×3 SR on DN degradation model.

4.7 Analysis of network efficiency

The complexity of the different methods is compared in Table 5. Inference time and the number of network parameters are used to measure time and space complexity, respectively. Although MRRN has the most parameters, its running time is much lower than that of DRRN, MemNet, and LapSRN. Moreover, the running time of MRRN is close to that of methods with far fewer parameters. This is attributed to the multi-resolution feature learning strategy, which decreases the computational complexity by reducing the resolution of the features.
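The effect of feature resolution on computation can be made concrete with a small worked example. The cost of a convolution layer in multiply-accumulate operations (MACs) grows with the spatial size of its output, whereas its parameter count does not depend on resolution at all. The layer sizes below (64 channels, 3 × 3 kernels, a 64 × 64 feature map) are hypothetical and chosen only for illustration; they are not taken from the MRRN architecture.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """MACs of one stride-1 convolution producing a height x width output."""
    return height * width * c_in * c_out * kernel * kernel


def conv_params(c_in, c_out, kernel=3):
    """Weight count of the same layer (bias ignored); independent of resolution."""
    return c_in * c_out * kernel * kernel


full = conv_macs(64, 64, 64, 64)     # full-resolution features
half = conv_macs(32, 32, 64, 64)     # features down-sampled by 2
quarter = conv_macs(16, 16, 64, 64)  # features down-sampled by 4
```

Halving the feature resolution cuts the MACs of a layer by a factor of 4 while leaving its parameter count unchanged, which is one way a network can have many parameters and still run quickly.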

5 Conclusion

In this paper, we proposed MRRN to improve image resolution for better cognition of images in cognitive computing systems. Imitating image pyramid decomposition and reconstruction, the proposed MRRN builds a feature pyramid in its basic building unit, the MRRU. The core components of the feature pyramid are multi-resolution feature learning (MRFL) and transferred skip links (TSLs). As the ablation studies show, MRFL is key to reducing the computational cost, while TSLs improve the quality of the reconstructed features and ensure superior performance. The combination of MRFL and TSLs enables the proposed network to reduce computational complexity while achieving satisfying performance. Compared with state-of-the-art methods on several benchmarks, the proposed MRRN achieves the best performance, which demonstrates its superiority over these methods.

Acknowledgments

This work is sponsored by the National Natural Science Foundation of China (grant nos. 61711540303 and 61701327).

References

1. Agustsson and Timofte, NTIRE 2017 challenge on single image super-resolution: Dataset and study, In CVPRW, (2017).
2. Badrinarayanan, Kendall and Cipolla, SegNet: A deep convolutional encoder-decoder architecture for image segmentation, IEEE TPAMI 39(12) (2017), 2481–2495.
3. Bevilacqua, Roumy, Guillemot and Alberi-Morel M.L., Low-complexity single-image super-resolution based on nonnegative neighbor embedding, In BMVC, (2012).
4. Bonanomi, Balletti, Lecca, Anisetti, Rizzi and Damiani, I3D: a new dataset for testing denoising and demosaicing algorithms, MTAP (2018), pp. 1–28.
5. Casolla, Cuomo, Di Cola V.S. and Piccialli, Exploring unsupervised learning techniques for the internet of things, IEEE T Industr Inform (2019).
6. Chen, Yang, Jeon, Anisetti and Liu, A trusted medical image super-resolution method based on feedback adaptive weighted dense network, Artificial Intelligence in Medicine (2020).
7. Chen, Wang, Peng, Zhang, Yu and Sun, Cascaded pyramid network for multi-person pose estimation, In CVPR, (2018), pp. 7103–7112.
8. Dong, Loy C.C. and Tang, Accelerating the super-resolution convolutional neural network, In ECCV (2016), pp. 391–407.
9. Dong, Loy C.C., He and Tang, Image super-resolution using deep convolutional networks, TPAMI (2016).
10. Guo, Hu, Ye and Zhang, Multi-frame super-resolution reconstruction via kernel regression regularized sparse learning, J Intell Fuzzy Syst 33(5) (2017), 3051–3058.
11. Guo, Seyed H.M.T., Huu and Monga, Deep wavelet prediction for image super-resolution, In Proceedings of the CVPR Workshop (2017), pp. 104–113.
12. He, Zhang, Ren and Sun, Identity mappings in deep residual networks, In ECCV, pp. 630–645. Springer, (2016).
13. Huang, Singh and Ahuja, Single image super-resolution from transformed self-exemplars, In CVPR, (2015).
14. Jeon, Anisetti, Lee, Bellandi, Damiani and Jeong, Concept of linguistic variable-based fuzzy ensemble approach: application to interlaced HDTV sequences, T F Syst 17(6) (2009), 1245–1258.
15. Jeon, Anisetti, Wang and Damiani, Locally estimated heterogeneity property and its fuzzy filter application for deinterlacing, Inform Sci 354 (2016), 112–130.
16. Kim, Lee J.K. and Lee K.M., Accurate image super-resolution using very deep convolutional networks, In CVPR, (2016).
17. Kim, Lee J.K. and Lee K.M., Deeply-recursive convolutional network for image super-resolution, In CVPR, (2016).
18. Kingma D.P. and Ba, Adam: A method for stochastic optimization, arXiv preprint, (2014).
19. Lai, Huang, Ahuja and Yang, Deep Laplacian pyramid networks for fast and accurate super-resolution, In CVPR, (2017).
20. LeCun, Bengio and Hinton, Deep learning, Nature 521(7553) (2015), 436–444.
21. Ledig, Theis, Huszár, Caballero, Cunningham, Acosta, Aitken, Tejani, Totz, Wang, et al., Photo-realistic single image super-resolution using a generative adversarial network, In CVPR, (2017), pp. 4681–4690.
22. Li, Fang, Mei and Zhang, Multi-scale residual network for image super-resolution, In Proceedings of the ECCV (2018), pp. 517–532.
23. Li, Yang, Liu, Yang, Jeon and Wu, Feedback network for image super-resolution, In CVPR, (2019), pp. 3867–3876.
24. Lim, Son, Kim, Nah and Lee K.M., Enhanced deep residual networks for single image super-resolution, In CVPRW, (2017).
25. Martin, Fowlkes, Tal, Malik, et al., A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, In ICCV, (2001).
26. Matsui, Ito, Aramaki, Fujimoto, Ogawa, Yamasaki and Aizawa, Sketch-based manga retrieval using Manga109 dataset, MTAP, (2017).
27. Newell, Yang and Deng, Stacked hourglass networks for human pose estimation, In European Conference on Computer Vision, pp. 483–499. Springer, (2016).
28. Pan, Yu, Huang, Hu, Zhang, Ma and Sun, Super-resolution based on compressive sensing and structural self-similarity for remote sensing images, TGRS, (2013).
29. Piccialli, Casolla, Cuomo, Giampaolo and Di Cola V.S., Decision making in IoT environment through unsupervised learning, IEEE Intell Syst (2019).
30. Piccialli, Cuomo, Di Cola V.S. and G.C., A machine learning approach for IoT cultural data, J Amb Intel Hum Comp (2019), 1–12.
31. Schulter, Leistner and Bischof, Fast and accurate image upscaling with super-resolution forests, In CVPR, (2015), pp. 3791–3799.
32. Shi, Caballero, Huszár, Totz, Aitken A.P., Bishop, Rueckert and Wang, Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, In CVPR, (2016).
33. Singh and Ahuja, Learning ramp transformation for single image super-resolution, CVIU.
34. Sun, Xiao, Liu and Wang, Deep high-resolution representation learning for human pose estimation, In Proceedings of CVPR, (2019), pp. 5693–5703.
35. Tai, Yang and Liu, Image super-resolution via deep recursive residual network, In CVPR, (2017).
36. Tai, Yang, Liu and Xu, MemNet: A persistent memory network for image restoration, In ICCV, (2017).
37. Wheeler F.W., Liu and Tu P.H., Multi-frame super-resolution for face recognition, In IEEE Int Conf Biom, pp. 1–6. IEEE, (2007).
38. Wu, Anisetti, Wu, Damiani and Jeon, Bayer demosaicking with polynomial interpolation, TIP 25(11) (2016), 5369–5382.
39. , Li and Lin, Medical image restoration method via multiple nonlocal prior constraints, J Intell Fuzzy Syst 38(1) (2020), 5–19. IOS Press.
40. Song, Guo, Wu, Yang, Wang J.e. and Tao, A new model for quorum sensing and image simulation of plant rhizosphere microorganisms, J Intell Fuzzy Syst 37(1) (2019), 263–274.
41. Yang, Wright, Huang T.S. and Ma, Image super-resolution via sparse representation, TIP 19(11) (2010), 2861–2873.
42. Zareapoor, Shamsolmoali and Yang, Learning depth super-resolution by using multi-scale convolutional neural network, J Intell Fuzzy Syst 36(2) (2019), 1773–1783.
43. Zeyde, Elad and Protter, On single image scale-up using sparse-representations, In Curves and Surfaces, (2010).
44. Zhang, Zuo, Gu and Zhang, Learning deep CNN denoiser prior for image restoration, In CVPR, (2017).
45. Zhang, Zuo and Zhang, Learning a single convolutional super-resolution network for multiple degradations, In CVPR, (2017).
46. Zhang and Wu, An edge-guided image interpolation algorithm via directional filtering and data fusion, TIP (2006).
47. Zhang, Tian, Kong, Zhong and Fu, Residual dense network for image super-resolution, In CVPR (2018).
48. Zhong, Shen, Yang, Lin and Zhang, Joint subbands learning with clique structures for wavelet domain super-resolution, In NIPS (2018).