Abstract
Data cognition plays an important role in cognitive computing. Cognition of low-resolution (LR) images is a long-standing problem because LR images carry insufficient information about objects. For better cognition of LR images, this paper proposes a multi-resolution residual network (MRRN) that improves image resolution for cognitive computing systems. In MRRN, a multi-resolution feature learning (MRFL) strategy is introduced to achieve satisfying performance at low computational cost. Inspired by image pyramids, a feature pyramid is designed to implement multi-resolution feature learning in the building unit of the proposed MRRN. Specifically, multi-resolution residual units (MRRUs), each consisting of a feature pyramid decomposition stage and a feature reconstruction stage, are introduced as the building units of the proposed network. To obtain informative features, transferred skip links (TSLs) are utilized to transfer fine-grain residual features from the pyramid decomposition stage to the reconstruction stage. The effectiveness of MRFL and TSL is demonstrated by ablation experiments. In addition, tests on standard benchmarks indicate the superiority of the proposed MRRN over other state-of-the-art methods.
Introduction
A key to cognitive computing is understanding the input data. Visual data, such as images and videos, accounts for the vast majority of the information that humans receive [4]. It is also an important part of the input data of cognitive computing systems. As reported in [6, 38], images with higher resolution enable recognition algorithms to obtain better results, which means that higher resolution contributes to better cognition of images. This is because high-resolution (HR) images provide far more significant structures and texture details than low-resolution (LR) images. Although HR images facilitate machine cognition and subsequent analysis, their acquisition is often limited by many factors, such as expensive hardware. Therefore, to help cognitive computing systems obtain better cognition of LR images at low cost, this paper proposes an image super-resolution (SR) method based on deep learning to improve image resolution.
Image super-resolution (SR) is an effective and promising software-based alternative for obtaining an HR image from an LR one. Existing SR methods include interpolation-based [46], reconstruction-based [39], and learning-based [9, 42] methods. Interpolation-based SR methods are efficient, but blurred edges are inevitable in their results. Reconstruction-based methods are time-consuming, and their performance depends heavily on parameter settings. Learning-based SR methods fall into two types: conventional [30] and deep [5, 29] learning-based methods. Conventional learning-based methods, e.g., those based on sparse representation [41], compressed sensing [28], and random forests [31], have limited representation ability. Benefiting from their strong representation ability, deep learning-based SR methods have recently improved SR performance significantly [9, 48].
However, most existing deep learning-based SR methods hardly consider multi-resolution feature learning. Li et al. [22] proposed a multi-scale residual network (MSRN) to capture multi-scale information with convolutional layers of different kernel sizes. Nevertheless, the features in MSRN still have a single resolution. Guo et al. [11] combined the wavelet transform and a convolutional neural network for image SR. However, the wavelet transform was only adopted to transform images, and the transformed images were then used to supervise network training. Inspired by multi-resolution analysis in classical image processing, such as pyramid decomposition, we suggest that LR features can represent the coarse information of features and that HR residual features can represent their fine-grain information.
To explore multi-resolution feature learning for image SR, we design a feature pyramid and propose a multi-resolution residual network (MRRN). The multi-resolution feature learning strategy enables the network to obtain results competitive with general feature learning at lower computational cost. Specifically, we propose a multi-resolution residual unit (MRRU) as the building unit of our MRRN, in which a feature pyramid is devised to learn multi-resolution features. In an MRRU, multi-resolution features are first obtained by residual feature extracting groups (RFEGs) in the feature pyramid decomposition stage. Then, HR features are reconstructed in the reconstruction stage by feature refining groups (FRGs). The purpose of multi-resolution features is to reduce computational costs and to learn the relationship among different resolutions, which helps in obtaining an advanced mapping representation from LR to HR. To recover informative HR features, transferred skip links (TSLs) are adopted to transfer hierarchical features from each level of the decomposition stage to the corresponding level of the reconstruction stage. In summary, the contributions of this paper are as follows: (1) An effective multi-resolution residual network (MRRN) is proposed to improve image resolution for better cognition of LR images in cognitive computing systems. (2) Imitating image pyramid decomposition and reconstruction, feature pyramid decomposition and reconstruction are designed in the building unit (i.e., MRRU) of the proposed network. (3) Combining multi-resolution feature learning (MRFL) and transferred skip links (TSLs), the proposed network obtains satisfying performance at much lower computational cost.
The subsequent structure of this paper is organized as follows. First, some previous work related to our method is described in Section 2. Then, we describe the baseline and the proposed MRRN in Section 3. The experimental results are presented in Section 4. Finally, the conclusion is summarized in Section 5.
Related work
Deep learning based super-resolution
Dong et al. [9] proposed SRCNN, the first deep convolutional neural network for image SR, which contained only three layers. The three layers corresponded to LR feature extraction, non-linear mapping, and image reconstruction in the sparse representation-based method, respectively. Subsequently, some advanced methods focused on designing deeper or wider networks to improve SR performance [16, 47]. Kim et al. [16] presented a global residual learning strategy and a gradient clipping strategy in their VDSR. They also proposed a deeply-recursive convolutional network (DRCN) to reduce the parameters of deep SR networks [17]. However, these methods share a common deficiency: the input LR image is interpolated to the HR size, so the model takes more computing time than the same model with an LR-size input. To make networks output an HR image from an LR-size input, Dong et al. [8] utilized deconvolution to up-sample feature maps in FSRCNN. Shi et al. [32] also proposed a pixel-shuffle layer for up-sampling feature maps. Recently, some state-of-the-art methods have been proposed for image SR, including DRRN [35], MemNet [36], SRResNet [21], and SRMD [45].
Although there are lots of SR networks, the proposed network differs from them in two aspects. First, a multi-resolution feature learning strategy is utilized in the building unit, while most of the other SR networks only use features of one resolution.
Second, a modified residual block is introduced in our multi-resolution residual unit, which can directly back-propagate the gradients of deep layers to shallow layers even when the number of input channels differs from the number of output channels in a residual block.
Residual learning
There are two types of residual learning: global residual learning [16] and local residual learning [12]. Global residual learning was introduced by VDSR [16] to learn the residual image of the up-sampled LR image, while local residual learning was introduced by ResNet [12] to learn residual features. ResNet also proposed a classic building block, i.e., the residual block. Adopting the residual block, Ledig et al. [21] proposed SRResNet and SRGAN for image SR. Because the batch normalization in the original residual block changes the diversity of feature values, which is unsuitable for image SR, Lim et al. [24] modified the vanilla residual block and proposed EDSR for image SR. Combining local and global residual learning in a recurrent unit, DRRN could train a very deep network for image SR. Different from these methods, the proposed network learns residual features from lower-resolution features and combines features of two different resolutions to learn the residual for higher-resolution features.
Multi-resolution networks
There are two categories of multi-resolution networks, i.e., sequential multi-resolution networks [2, 27] and parallel multi-resolution networks [34]. In sequential multi-resolution networks the multi-resolution features are processed one after another, whereas in parallel multi-resolution networks they are processed side by side. Encoder-decoder networks [2] can be viewed as the earliest sequential multi-resolution networks, as the resolution of feature maps is reduced in the encoding stage and increased in the decoding stage. Encoder-decoder networks are applied to various vision tasks, such as segmentation [2]. To improve their performance, Newell et al. [27] proposed an hourglass network that combines the features of the same resolution in the encoder and decoder through a skip connection. In cascading pyramid networks [7], multi-resolution features are cascaded to obtain the highest-resolution features. To maintain high-resolution features throughout the whole network, a high-resolution network utilized parallel multi-resolution features. However, this parallel multi-resolution feature learning requires substantial computational resources because of its numerous features and convolutional layers.
Although the multi-resolution residual unit is also an hourglass and pyramid architecture, it differs from the networks above in three aspects. First, those networks are proposed for high-level vision tasks, such as segmentation and human pose estimation, while the proposed method is designed for image SR, a low-level vision task. Second, the hourglass and pyramid architecture is adopted inside each multi-resolution residual unit, whereas the above networks adopt this structure for the entire network. Third, residual groups are used in each resolution stage in MRRUs, while the hourglass network adopts vanilla convolutional layers.
The proposed method
Preliminary
As illustrated in Fig. 1, the baseline of the proposed network is an EDSR-like network. Compared with the original EDSR, the convolutional layer after the last residual block is removed, and the pixel-shuffle layer is replaced by deconvolutional layers. There are D × (2G + 1) × R residual blocks in the baseline. The residual blocks can be formulated by the following equation:
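Assuming the baseline keeps the EDSR-style block (two convolutions with a ReLU between them and no batch normalization), a residual block takes the form sketched below. The symbols $x_l$, $W_1$, $W_2$, and $\sigma$ are illustrative and are not taken from the original notation; biases are omitted and $*$ denotes convolution.

```latex
x_{l+1} = x_l + W_2 * \sigma\left(W_1 * x_l\right), \qquad \sigma(z) = \max(0, z)
```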

The architecture of the baseline.
Similar to the baseline, as shown in Fig. 2, the proposed network is constructed by D units (i.e., MRRUs) and each unit consists of 2G + 1 residual groups, where each group includes R residual blocks.
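The block count implied by this layout can be checked with a few lines of arithmetic. The sketch below simply evaluates D × (2G + 1) × R; the function name is hypothetical, and the values D = 5, G = 2, R = 2 are the settings reported for the final network.

```python
def count_residual_blocks(d: int, g: int, r: int) -> int:
    """Residual blocks in D units, each holding 2G + 1 groups of R blocks."""
    groups_per_unit = 2 * g + 1
    return d * groups_per_unit * r


# Setting reported for the final network: D = 5, G = 2, R = 2.
total_blocks = count_residual_blocks(d=5, g=2, r=2)
print(total_blocks)
```

This counts only the blocks inside residual groups; any residual blocks placed in transferred skip links would come on top of it.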

The architecture of our MRRN.
As shown in Fig. 2, the proposed MRRN consists of three parts: a feature extractor module (FEM), a non-linear mapping module (NLMM), and an image reconstruction module (IRM). First, shallow LR features are extracted from the input LR image by FEM. Then, the shallow features are fed into NLMM, which consists of several MRRUs. Finally, the deep features output by NLMM are used by IRM to generate the HR output. The details of the three steps are described as follows.
We adopt the L1 loss function in the training stage because networks trained with the L1 loss converge faster than networks trained with the L2 loss. Given a training set
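With $N$ LR/HR training pairs, an L1 objective is conventionally written as the mean absolute error between the network output and the ground truth. The form below is a sketch under that assumption; the symbols $I_{LR}^{i}$, $I_{HR}^{i}$, $H_{MRRN}$, and $\Theta$ are illustrative names and may differ from the original notation.

```latex
\mathcal{L}(\Theta) = \frac{1}{N} \sum_{i=1}^{N}
  \left\| H_{MRRN}\left(I_{LR}^{i}; \Theta\right) - I_{HR}^{i} \right\|_{1}
```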
Inspired by classic multi-resolution image processing technologies, such as image pyramid methods, we argue that features of different resolutions also carry information of different levels. For example, after pooling, the resolution of features is reduced and some information is lost. However, the resolution-reduced features can still represent the basic information of the original features. The resolution-reduced features are akin to the low-frequency LR images in image pyramids. The residual features between the original features and the resolution-reduced features can represent fine-grain information and are akin to the sub-band images in image pyramids. Based on this inspiration, a feature pyramid is devised in the multi-resolution residual unit (MRRU), as shown in Fig. 3-(a). To show the feature pyramid clearly, an illustration of the features in MRRU is presented in Fig. 3-(b). Each MRRU has two stages, feature pyramid decomposition and feature reconstruction, which are detailed as follows.
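The image-pyramid analogy can be illustrated on a single small feature map. The sketch below is the classical (non-learned) counterpart of the idea, not the MRRU itself: it uses 2 × 2 max pooling for down-sampling and, as a simplifying assumption, nearest-neighbour up-sampling in place of bilinear interpolation. In the network the fine-grain residual is produced by learned residual groups rather than by explicit subtraction. All function names are hypothetical.

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2-D list (even height and width)."""
    return [
        [max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
         for j in range(0, len(x[0]), 2)]
        for i in range(0, len(x), 2)
    ]


def upsample_nearest_2x(x):
    """Nearest-neighbour 2x up-sampling (stand-in for bilinear interpolation)."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def decompose(x):
    """Split a feature map into a coarse LR map and a fine-grain residual."""
    coarse = max_pool_2x2(x)
    up = upsample_nearest_2x(coarse)
    residual = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(x, up)]
    return coarse, residual


def reconstruct(coarse, residual):
    """Rebuild the full-resolution map from the coarse map and the residual."""
    up = upsample_nearest_2x(coarse)
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(up, residual)]


feature = [
    [1, 2, 0, 1],
    [3, 4, 1, 0],
    [5, 1, 2, 2],
    [0, 2, 3, 6],
]
coarse, residual = decompose(feature)
# The coarse map plus the residual recovers the original exactly.
assert reconstruct(coarse, residual) == feature
```

The coarse map plays the role of the low-frequency LR image, and the residual plays the role of the sub-band image that restores the detail lost by pooling.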

(a) is the architecture of MRRU. (b) is the illustration of feature maps in MRRU.
As shown in Fig. 3-(a), in each RFEG, R residual blocks are stacked to extract informative features at the current level. Therefore, RFEG can be formulated as follows:
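A stack of R residual blocks is a composition of block functions. The expression below is a sketch assuming plain stacking with no additional group-level skip connection; $F_{in}$, $F_{out}$, and $B_r$ (the $r$-th residual block) are illustrative symbols rather than the original notation.

```latex
F_{out} = B_R\left(B_{R-1}\left(\cdots B_1\left(F_{in}\right) \cdots\right)\right)
```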
Similar to RFEG, each FRG also consists of R residual blocks. However, as shown in Fig. 3-(a), the input of the first residual block is the concatenation of the up-sampled feature from the preceding residual group and the fine-grain residual from TSL. Therefore, we modify the vanilla residual block with identity mapping [12] to handle the case in which the number of input channels differs from the number of output channels. The modified residual block is formulated as follows:
The architecture details of the proposed network are shown in Table 1. The numbers of MRRUs and FRGs are set to 5 and 2 (i.e., D = 5, G = 2), respectively. Two residual blocks (i.e., R = 2) are placed in each group. A single residual block is adopted in each TSL, i.e., T = 1. Except for the first convolutional layer in the modified residual blocks, the numbers of input and output channels of the convolutional layers in residual blocks are set to 64. Their kernel size, stride, and padding are set to 3, 1, and 1, respectively. In IRM, three different deconvolutional layers are used to up-sample features for the different scales.
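The choice of kernel size 3, stride 1, and padding 1 keeps the spatial size of the feature maps unchanged, which is what allows the residual additions to line up. A minimal check using the standard convolution output-size formula is sketched below; the function name is hypothetical, and the 48-pixel input is the training patch size.

```python
def conv_output_size(n: int, k: int, s: int, p: int) -> int:
    """Spatial size after a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1


# A 48-pixel side stays at 48 with kernel 3, stride 1, padding 1.
same_size = conv_output_size(48, k=3, s=1, p=1)
# Without padding the same kernel would shrink the map by 2 pixels.
shrunk_size = conv_output_size(48, k=3, s=1, p=0)
print(same_size, shrunk_size)
```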
The architecture details of the proposed network. Deconv/Conv (C_in, C_out, k, s, p) represents the deconvolutional/convolutional layer with C_in input channels and C_out output channels. k, s, and p denote the kernel size, stride, and padding, respectively. I-ResBlock denotes the original residual block with identity mapping. M-ResBlock denotes the modified residual block.
Dataset and training setting
We use DIV2K [1] as the training set and five standard benchmarks as test sets: Set5 [3], Set14 [43], B100 [25], Urban100 [13], and Manga109 [26]. Both the training set and the test sets are commonly used in the SR community. Following [47], three types of degradation are adopted for testing: bicubic down-sampling (denoted as BI), blurring followed by bicubic down-sampling (denoted as BD), and bicubic down-sampling followed by additive Gaussian noise (denoted as DN). We apply data augmentation to the training set, including random cropping, flipping, and rotation. In the training stage, a mini-batch contains 16 patches, each cropped to 48 × 48. Adam [18] is used to optimize the network with an initial learning rate of 10⁻⁴. The total number of training iterations for our MRRN is 10⁶, and the learning rate is halved every 2 × 10⁵ iterations. All the experiments in this paper are implemented in PyTorch and run on desktops with GTX1080Ti/RTX2080Ti GPUs.
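The learning-rate schedule described above is a step decay. The sketch below evaluates it directly from the stated numbers (initial rate 10⁻⁴, halved every 2 × 10⁵ iterations, 10⁶ iterations in total); the constant and function names are hypothetical.

```python
INITIAL_LR = 1e-4        # initial learning rate for Adam
HALVE_EVERY = 200_000    # iterations between halvings
TOTAL_ITERS = 1_000_000  # total training iterations


def learning_rate(iteration: int) -> float:
    """Step decay: halve the rate once per completed HALVE_EVERY iterations."""
    return INITIAL_LR * 0.5 ** (iteration // HALVE_EVERY)


# Rates at the start of each decay stage.
for start in range(0, TOTAL_ITERS, HALVE_EVERY):
    print(start, learning_rate(start))
```

In PyTorch, the same behaviour is usually obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=200000, gamma=0.5)` stepped once per iteration.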
Investigation of G, R, and T
In this subsection, we investigate the basic hyper-parameters of MRRN: the number of FRGs (denoted as G), the number of residual blocks in each residual group (denoted as R), and the number of residual blocks in each TSL (denoted as T). To reduce computing time and resources, the basic number of feature maps is set to 32 (i.e., M = 32), which is the same as the setting for the ablation study in Section 5.3. When we investigate one of the hyper-parameters, the other two are fixed. As shown in Table 2, a larger G, R, or T leads to higher performance, because the network becomes deeper and its parameters increase. However, a larger G, R, or T also consumes much more computing time and resources. Besides, it can be observed that the performance of the network is more sensitive to R than to G and T. Considering the trade-off between performance and computational cost, we set G = 2, R = 2, and T = 1 for the final network.
The performance and parameters analysis for different hyper-parameters
To validate the effectiveness of multi-resolution feature learning (MRFL) and transferred skip links (TSL), we analyze the performance and computational costs of four networks: the baseline, MRRN without MRFL, MRRN without TSL, and MRRN. The basic number of features of the convolutional layers is set to 32 in this experiment. The results are shown in Table 3. First, the network lacking TSL obtains much worse results than MRRN and the baseline. This is because the network without TSL cannot transfer the middle-level features of the feature pyramid to the subsequent FRGs to reconstruct HR features, thereby losing information in HR feature reconstruction. Naturally, the performance of the network without TSL decreases dramatically. This comparison confirms that TSL is indispensable in the proposed network. Then, comparing MRRN with the baseline and MRRN without MRFL, we find that their performance and parameter counts are very close. However, the multiplication and addition operations (MADDs) of the proposed network are much fewer than those of the baseline and the network without MRFL. This is because MRFL reduces the resolution of features, thereby saving a large amount of computation. Overall, both MRFL and TSL are necessary to realize the feature pyramid in the proposed network. These comparative experiments show that TSL improves performance while MRFL reduces computational cost.
Investigation of the effectiveness of MRFL and TSL. MADDs are calculated under the task of reconstructing a 1280 × 720 HR RGB image
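How feature resolution drives MADDs can be sketched for a single convolutional layer, whose cost is commonly counted as H × W × C_in × C_out × k². The example assumes ×4 SR, so that a 1280 × 720 HR output corresponds to 320 × 180 LR features, and a 3 × 3 convolution with 32 input and 32 output channels; these choices and the function name are illustrative and do not reproduce the figures in the table.

```python
def conv_madds(height: int, width: int, c_in: int, c_out: int, k: int) -> int:
    """Multiply-add operations of one k x k convolution (bias ignored)."""
    return height * width * c_in * c_out * k * k


# Full-resolution LR features (assumed x4 SR: 1280 x 720 -> 320 x 180).
full_res = conv_madds(180, 320, c_in=32, c_out=32, k=3)
# The same layer after the feature resolution is halved in each dimension.
half_res = conv_madds(90, 160, c_in=32, c_out=32, k=3)
print(full_res, half_res, full_res // half_res)
```

Halving the resolution in both dimensions cuts the cost of the layer to a quarter while leaving its parameter count unchanged, which is consistent with MRFL reducing MADDs at nearly constant model size.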
To investigate different MRFL strategies, we compare the proposed network with a variant in which the positions of the max-pooling operation and the bi-linear interpolation are swapped. In other words, in the MRRU of the variant, the features are first up-sampled by interpolation and then pooled by max-pooling. The PSNR curves of these networks on the validation set are shown in Fig. 4. ‘down-up’ and ‘up-down’ refer to the proposed network and the variant, respectively. ‘baseline’ is the network described in Section 3. A quantitative analysis on a test set is presented in Table 4. Although the baseline and the variant obtain higher results than the proposed MRRN, they take more computing time. For example, the training time of the variant is 4.77 times that of MRRN. This is mainly because the resolution of the middle-level features is increased in the variant, which requires more computation and naturally takes more computing time. The performance improvement brought by the variant is small compared with its much higher computational cost. Therefore, the proposed network is more practical in realistic scenarios than the variant.
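For reference, PSNR is derived from the mean squared error between a reference and an estimate. The sketch below uses the standard definition on flat lists of pixel values with a peak value of 255; the function name is hypothetical, and details such as the colour channel or border cropping used for the reported numbers are not specified here.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


# An error of one grey level per pixel gives a mean squared error of 1.
example_psnr = psnr([100, 100], [101, 99])
print(round(example_psnr, 2))
```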

Study performance of different MRFL strategies on Set5 for ×4 factor SR.
The performance, training time, and test time for ×4 factor SR. The training time is the average time taken to train an epoch on DIV2K. The test time is the average time per image on Urban100
To study different strategies for fusing the LR features from FRG and the transferred features from TSL, we also compare the proposed MRRN with another variant. In this variant, the modified residual blocks are replaced by original residual blocks, and the transferred features from TSL are added to the up-sampled features from FRG. The number of residual blocks in TSL is set to T = 0 in this experiment. The results of the comparative experiments are depicted in Fig. 5. ‘Base’ and ‘plus’ represent the proposed network without TSL and the variant in this experiment, respectively. From Fig. 5, the proposed MRRN (labeled as ‘concate’) achieves the best results. This is because the proposed network combines transferred features and LR features to learn reconstruction features, while the variant merely adopts the transferred features as the residual of the LR features. Concatenation lets the FRG receive both the LR features and the transferred features, thereby implicitly capturing the relationship between different resolutions. Therefore, the proposed method obtains better results than the variant.
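The two fusion strategies differ in what they hand to the following residual block. The sketch below represents a feature tensor as a list of channels (each channel a flat list of values) and contrasts element-wise addition (‘plus’) with channel concatenation (‘concate’); the function names and toy values are hypothetical.

```python
def fuse_add(upsampled, transferred):
    """'plus': element-wise sum, so the channel count is unchanged."""
    assert len(upsampled) == len(transferred)
    return [
        [a + b for a, b in zip(ch_up, ch_tr)]
        for ch_up, ch_tr in zip(upsampled, transferred)
    ]


def fuse_concat(upsampled, transferred):
    """'concate': stack along the channel axis, so the channel count doubles."""
    return upsampled + transferred


up = [[1, 2], [3, 4]]        # 2 channels from the preceding residual group
tsl = [[10, 20], [30, 40]]   # 2 channels transferred by the skip link

added = fuse_add(up, tsl)
stacked = fuse_concat(up, tsl)
print(len(added), len(stacked))
```

Because concatenation doubles the number of channels, the first convolution of the following block must accept more input channels than it outputs, which is the case the modified residual block is designed to handle.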

Study performance of different fusion strategies for TSL on Set5 for ×4 factor SR.
The performance of different methods for scale factors ×2, ×3 and ×4 SR on BI degradation model. Parameters are counted on networks for ×4 task. The average time is tested on Urban100 for ×4 SR

The visual comparison of different methods for scale factor ×4 SR on BI degradation. The three images are ‘8023’ from B100, ‘Belmondo’ from Manga109, and ‘img_099’ from Urban100, respectively.
The performance of different methods for scale factor ×3 SR on

The visual comparison of different methods for scale factor ×3 SR on BD degradation model.

The visual comparison of different methods for scale factor ×3 SR on DN degradation model.
The algorithm complexity of different methods is analyzed in Table 5. The inference time and the number of network parameters are used to evaluate the time and space complexity, respectively. Although our MRRN has the largest number of parameters, its running time is much shorter than that of DRRN, MemNet, and LapSRN. Besides, the running time of MRRN is very close to that of methods with few parameters. This is attributed to the multi-resolution feature learning strategy, which decreases computational complexity by reducing the resolution of features.
Conclusion
In this paper, we proposed MRRN to improve image resolution for better cognition of images in cognitive computing systems. Imitating image pyramid decomposition and reconstruction, the proposed MRRN realizes a feature pyramid in its basic building unit, i.e., the MRRU. The core components of the feature pyramid are multi-resolution feature learning (MRFL) and transferred skip links (TSL). As shown by the ablation studies, MRFL is key to reducing computational cost, while TSL improves the quality of the reconstructed features and ensures superior performance. The combination of MRFL and TSL enables the proposed network to reduce computational complexity while achieving satisfying performance. Compared with state-of-the-art methods on several benchmarks, the proposed MRRN has achieved the best performance, which demonstrates the superiority of MRRN over these state-of-the-art methods.
Acknowledgments
This work is sponsored by the National Natural Science Foundation of China (grant nos. 61711540303 and 61701327).
