SuperVidConform: Conformation detail-preserving network (CDPN) for video super-resolution

Abstract

Deep learning-based methods are used extensively in video super-resolution (VSR) applications. Several VSR methods focus primarily on improving the fine patterns within reconstructed video frames, but they frequently overlook the crucial aspect of preserving conformation details, particularly sharpness. As a result, reconstructed video frames often fail to meet expectations. In this paper, we propose a Conformation Detail-Preserving Network (CDPN) named SuperVidConform. It focuses on restoring local region features and maintaining the sharper details of video frames. The primary goal of this work is to generate a high-resolution (HR) frame from its corresponding low-resolution (LR) frame. The method consists of two parts: (i) the proposed model decomposes conformation details from the ground-truth HR frames to provide additional information for the super-resolution process, and (ii) these video frames are passed to the temporal modelling SR network, which learns local region features through residual learning that links the intra-frame redundancies within video sequences. The proposed approach is designed and validated using the VID4, SPMC, and UDM10 datasets. The experimental results show that the proposed model improves PSNR by 0.43 dB (VID4), 0.78 dB (SPMC), and 0.84 dB (UDM10). Further, the CDPN model sets a new performance baseline on a self-generated surveillance dataset.

Keywords

Super-resolution, image super-resolution, video super-resolution, recurrent network, residual learning

1 Introduction

The rising popularity of the internet and social media is fueling an increasing need for image and video processing methods. In the computer vision domain, image processing includes techniques to enhance and recover a high-resolution (HR) image from its low-resolution (LR) counterpart [1]. Super-resolution (SR) is an emerging technique for increasing the spatial resolution of media files (images and videos). The field of SR is divided into two categories, video super-resolution (VSR) and image super-resolution (ISR), depending on the number of input frames. Most application domains, such as surveillance, academia, medicine, and business, require pleasant and high visual quality. This is vital because low-quality frames can make it challenging to identify crucial information, especially in the surveillance field, which relates to security [2, 3]. VSR models trained on high-quality datasets boost the accuracy of video surveillance while also enhancing video quality in a time-sensitive manner. Therefore, VSR techniques are needed to upscale LR images into HR images. VSR methods exploit the relationships between frames and create more precise and detailed results by utilizing available patterns. VSR, in a broad sense, can be seen as an extension of ISR in which algorithms process frame-by-frame information without additional hardware demand.

Many VSR algorithms have been put forth in recent years; traditional approaches [4–6] and deep learning (DL) methods [7–11] make up most of them. Unlike other methods, learning-based methods, such as sparse representation and example-based approaches, use data to learn the mapping from LR to HR. The works in [12, 13] used traditional approaches with the kernel regression method for motion estimation. Liu and Sun [14] proposed a Bayesian approach that simultaneously estimates motion, blur kernel, and noise level to reconstruct HR frames. Similarly, Ma et al. [15] employed the expectation-maximization (EM) method to calculate the blur kernel while ensuring improved recovery of HR frames. However, these models are still insufficient to accommodate different video scenarios. DL-based approaches [7–9, 16–18] typically perform well on many publicly available benchmark datasets due to the nonlinear learning potential of deep neural networks. Moreover, DL has had remarkable success in many fields [7–11] and has received much attention.

Deep neural networks, including convolutional neural networks (CNN), generative adversarial networks (GAN), and recurrent neural networks (RNN), have been widely used to enhance video resolution [10]. Several existing DL models mainly focus on increasing CNN depth to perform better. However, adding more layers to a neural network does not necessarily guarantee an improvement in performance, and it can lead to difficulties during training, such as vanishing or exploding gradients [13]. It can be challenging to restore an HR video from its LR counterpart with precision because VSR is an ill-posed problem [10, 11]. Existing literature shows two types of motion compensation (MC) VSR approaches: 1) explicit [8, 9, 17, 19, 20] and 2) implicit [21–25] MC methods. Kappeler et al. [19] suggest warping all nearby frames to the reference frame using optical flow estimates. The VESPCN [26] approach combines spatial and temporal networks to create a new approach for VSR. However, inadequate motion estimation and alignment in these methods can lead to artifacts. Additionally, estimating optical flow can be computationally intensive and limit the practical use of these methods. In contrast, implicit motion compensation-based methods do not require estimating or aligning motion between frames. According to the studies above, even though VSR has made remarkable strides, three primary problems remain in DL-based VSR methods:

1) Inadequate visual perception performance: Most existing DL-based VSR techniques yield perceptually undesirable results that frequently have simplified textures and exclude higher-frequency information [24, 26]. They generally produce unnatural outcomes with artefacts, including blurring and aliasing.

2) Increased computational overhead: Most existing VSR methods [9, 20] depend on motion estimation and motion compensation (MEMC), and most of them use an optical flow algorithm. However, estimating optical flow raises the computational cost of the model.

3) It is frequently impossible to achieve good visual quality and high accuracy simultaneously: Methods frequently produce outcomes of either high accuracy with less visual quality [8, 22] or low accuracy with high visual quality [27, 28].

In this study, we propose a new method that captures features precisely in order to produce high-quality frames. Compared to other networks, the CDPN method prioritizes both accuracy and visually pleasing results. The network is designed to learn the structural details and the fine details from the ground-truth HR frame to achieve high accuracy and detailed reconstruction. As a result, the proposed CDPN method can recover more structural details with high accuracy. The primary contributions of this research article are as follows:

1. We propose a novel conformation detail preserving network (CDPN), named SuperVidConform, that learns structural information from the ground-truth HR image, which allows it to achieve high accuracy and visually pleasing results.

2. We achieved the optimal learning performance based on a recurrent residual network (RRN) with a hidden state. The network utilization of motion information is optimized implicitly, which leads to superior performance compared to previous state-of-the-art methods on publicly available benchmark datasets.

3. The proposed method is evaluated on a surveillance dataset, and the results show that it performs better with the lightweight recurrent architecture.

The rest of the paper is structured as follows: Section 2 reviews related work. The architecture of CDPN is described step by step in Section 3. The results of the experiments are presented in Section 4, and finally, Section 5 concludes with a summary of the findings.

2 Related work

As mentioned in Section 1, VSR is a challenging problem. Restoring an HR image from its LR counterpart is difficult, as multiple possible solutions exist for any given LR image. It is also hard to recover all the detailed and overall information in the original LR image. However, videos have temporal correlations among adjacent frames, which can be helpful for super-resolution. This has motivated the development of several recent VSR approaches [1, 13, 29–32] that aim to exploit this temporal correlation effectively. Hence, this paper explores VSR models from three aspects: temporal correlation, exploitation of motion information, and recurrent networks.

2.1 Temporal correlation

Temporal correlations built into the VSR architecture [22, 25, 26, 33–35] are popular in end-to-end learning frameworks. Two strategies are widely used to model temporal information: temporal concatenation and temporal aggregation. The former is a widely used VSR method [17, 19, 20, 36] that concatenates several frames to preserve temporal information. This method extends ISR to multiple input images. However, the concatenation method can make training difficult, as it fails to correctly represent the various motion regimes within a single input sequence. The latter, temporal aggregation, runs multiple SR inferences that operate in different motion regimes, as proposed by [8, 11, 22, 28], to overcome the dynamic motion issue in VSR. The final layer constructs an SR frame by combining the results of all branches.

2.2 Exploitation of motion information

Several DL methods have recently been proposed to tackle the VSR problem either explicitly or implicitly, such as [30, 37]. Most VSR methods use explicit MC, which applies MEMC directly and then performs information fusion and upsampling. These methods include the joint motion compensation module proposed in VESPCN [26], the SPMC module introduced by Tao et al. [20], the task-oriented flow module proposed by Xue et al. [9], and the frame-recurrent module proposed by Sajjadi et al. [8]. They estimate motion from frame to frame and align the results to a reference frame. However, a significant disadvantage of these techniques is the high computational cost caused by MEMC. Conversely, implicit MC techniques reduce the computational load of video SR by using motion information implicitly rather than explicitly. This can be done by incorporating motion information into the network in a way that does not require additional computation. Methods with implicit MC [21–23, 25] develop an improved module that fully utilizes complementary information from different frames. Jo et al. [23] predicted a dynamic upsampling filter to recover the image, while Yi Huang et al. [25] fused spatial-temporal information. Huang et al. [38] used a bidirectional recurrent convolutional network, and Fuoli et al. [21] used a recurrent architecture in feature space. The proposed method in this paper utilizes an implicit MC technique to avoid the heavy computational burden of explicit MEMC. This method also takes advantage of the relationship between the current frame and the hidden state to actively use historical data within the hidden state, improving performance and lowering the risk of error accumulation.

2.3 Recurrent neural network

Recurrent networks have recently been used widely in video processing tasks such as VSR. These networks [11, 22, 28, 38–40] can handle temporal input and output by analyzing sequential data, combining data from every frame while keeping track of their individual hidden states. However, a major challenge with this approach is that, when working with long sequences of frames, training can become difficult due to gradient vanishing [41–43]. To address this problem, the proposed method utilizes a Recurrent Residual Network (RRN), which uses a residual mapping between layers and includes identity skip connections, helping to avoid the risk of gradient vanishing during training. Therefore, motivated by these VSR aspects of temporal correlation, exploitation of motion information, and the use of recurrent networks for temporal input and output, this work proposes a general framework for encoding long-range video.

3 Conformation detail preserving network

This section explains the methodology and general architecture of the proposed method for temporal modeling. The system is broken down into three parts: decomposition of the original images, modeling the passage of time, and a method for measuring performance. The process of breaking down the images preserves essential information about their structure. The temporal network blends several successive frames with a benchmark frame as its input. The loss function as a performance measurement helps to improve the network performance by considering motion information.

3.1 Network framework

The proposed CDPN comprises the following components: a decomposition of images to preserve essential details, a module for extracting features, a module for increasing resolution, and a module for combining these elements to produce the final high-resolution result, as shown in Fig. 1. In the following, the notations I_t, I_{t-1}, I_{t_C}, I_{t-1_C}, I_HRC, I_HR, and I_GT denote the current and previous input, the conformation details of the current and previous input, the recovered details of the image, the final high-resolution output, and the original high-resolution image that serves as the reference point, respectively. Equation (1) decomposes the input from the current and previous frames to get the corresponding conformation details from the image.

Fig. 1

Schematic illustration of the proposed method.

$F_{oi} = H_{decomp} (I_{t}, I_{t - 1})$ (1)

$F_{o1} = H_{f} (I_{t_{C}}), \quad F_{o2} = H_{f} (I_{t - 1_{C}})$

where H_decomp (·) denotes the decomposition operation. The extracted details F_oi (i = 1, 2) are then passed to Equation (2):

$F_{DFi} = H_{RRN} (F_{oi}) \quad (i = 1, 2)$ (2)

where H_RRN (·) denotes the residual RNN feature extraction module, which uses residual learning to extract deep features (DF). The extracted detail features F_DFi are then upscaled via the upsampling module to eliminate the pixelation effect and estimate extra image details:

$F_{UPi} = H_{UP} (F_{DFi}) \quad (i = 1, 2)$ (3)

where H_UP (·) and F_UPi refer to the upsampling module and the upsampled features, respectively. This approach employs a post-upsampling method that relies on the high-level detail obtained from the low-level space and yields better VSR outcomes than a predefined upsampling method. The recovered conformation details I_HRC are then estimated from F_UPi, where i = 1, 2:

$I_{HRC} = H_{REC} (F_{UPi})$ (4)

where H_REC (·) stands for the reconstruction from the upsampled features. Finally, the recovered conformation details I_HRC are fed into a composition module to produce the final HR image I_HR:

$I_{HR} = H_{comp} (I_{HRC})$ (5)

where H_comp (·) represents the composition module.

3.2 Conformation detail-learning

Recovering the high-resolution image from its low-resolution counterpart is difficult because the problem is ill-posed. To overcome this challenge, we use the conformation detailing approach shown in Fig. 2. This method learns the high-frequency information, such as texture, separately, thereby enhancing the final image quality.

Fig. 2

Illustration of the decomposition process.

The high-frequency information in an image, such as textures, can be extracted by taking the difference between the original image and a blurred version [27, 44]. This method has been used in low-level image processing tasks, such as identifying boundaries and assessing image quality [45].

Motivated by the above findings, and in order to recover high-frequency information, the input image is separated into its high-frequency components, which are obtained by subtracting a Gaussian-blurred version of the image from the original image. These high-frequency details are then used in a composition module to generate the final high-resolution image, as shown in Equation (5).
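The decomposition can be illustrated with a minimal sketch that takes the high-frequency layer as the original image minus its Gaussian-blurred version. The function names, the blur width `sigma`, the kernel `radius`, and the edge-replication padding are assumptions made for illustration; the paper does not state the filter settings used for the decomposition.

```python
import math


def gaussian_kernel(sigma, radius):
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, replicating the edge pixels."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xx]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred


def transpose(image):
    return [list(col) for col in zip(*image)]


def gaussian_blur(image, sigma=1.0, radius=2):
    """Separable Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma, radius)
    horizontal = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))


def decompose(image, sigma=1.0, radius=2):
    """Split an image into a blurred layer and a high-frequency layer."""
    low = gaussian_blur(image, sigma, radius)
    high = [[p - b for p, b in zip(img_row, low_row)]
            for img_row, low_row in zip(image, low)]
    return low, high


# A vertical step edge: the high-frequency layer is non-zero only near the edge.
edge = [[0.0, 0.0, 255.0, 255.0] for _ in range(4)]
low, high = decompose(edge)
```

By construction, adding the two layers back together returns the original image, which is why the detail layer can be learned separately and recombined later.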

3.3 Recurrent residual network

This work comprehensively studies different temporal modeling frameworks, which include 2D CNN and RNN. Fig. 3(a) and (b) illustrate these networks, respectively, and Fig. 3(c) shows the proposed hidden state architecture. In a 2D CNN, the input frames are concatenated. An RNN, on the other hand, uses fewer frames as input and processes a video sequence recurrently. Typically, the input to the hidden state at a particular time step t is composed of three components: the previous output o_{t-1}, the previous hidden state representation h_{t-1}, and the two neighboring frames I_{t-1} and I_t. Intuitively, pixels in consecutive frames of a video sequence are often similar. The result is improved by using information from the previous time step, which contains high-frequency details, to refine the t-th time step. However, as in other video processing tasks, an RNN in VSR [21] can be affected by gradient vanishing [41–43]. The recurrent residual network (RRN) is proposed to overcome this problem; it uses residual learning and includes identity skip connections [37].

Fig. 3

Frameworks for modelling temporal data that are frequently used: A) 2D CNN, B) RNN, C) Proposed RRN.

The recurrent residual network (RRN) design ensures that information flows smoothly, allowing it to retain texture information over a long period and making it easier for RNNs to handle longer videos. It may also reduce the chance of gradient vanishing during training. At each time step t, the RRN produces two outputs, h_t and o_t, which are then used to guide the next time step, t + 1:

$x_{0} = σ (W_{conv2D} \{[I_{t - 1}, I_{t}, o_{t - 1}, h_{t - 1}]\})$ (6)

$x_{k} = g (x_{k - 1}) + F (x_{k - 1}), \quad k ∈ [1, K]$ (7)

$h_{t} = σ (W_{conv2D} \{x_{K}\})$

$o_{t} = W_{conv2D} \{x_{K}\}$

Here σ (·) denotes the ReLU function. In the k-th residual block, the term g (x_{k-1}) denotes an identity mapping, that is, g (x_{k-1}) = x_{k-1} in Equation (7). In contrast, F (x_{k-1}) denotes the learned residual mapping.
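The data flow of Equations (6) and (7) can be traced with a deliberately simplified toy in which scalar multiplications stand in for the 2D convolutions. The names `rrn_step` and `run_sequence`, the weight dictionary, and the form chosen for the residual branch F are assumptions for illustration only and are not the authors' implementation.

```python
def relu(value):
    return max(0.0, value)


def rrn_step(frame_prev, frame_cur, out_prev, hidden_prev, weights, num_blocks=5):
    """One recurrent step with scalars standing in for feature maps."""
    # Equation (6): fuse both frames, the previous output, and the hidden state.
    inputs = (frame_prev, frame_cur, out_prev, hidden_prev)
    x = relu(sum(w * v for w, v in zip(weights["in"], inputs)))
    # Equation (7): K residual blocks with an identity skip, g(x) = x.
    for _ in range(num_blocks):
        residual = weights["res"] * relu(x)  # toy stand-in for F(x)
        x = x + residual
    hidden = relu(weights["hidden"] * x)  # h_t, passed to step t + 1
    output = weights["out"] * x           # o_t, passed to step t + 1
    return hidden, output


def run_sequence(frames, weights, num_blocks=5):
    """Process a list of scalar frames recurrently, starting from zeros."""
    hidden, output = 0.0, 0.0
    outputs = []
    for t in range(1, len(frames)):
        hidden, output = rrn_step(frames[t - 1], frames[t], output, hidden,
                                  weights, num_blocks)
        outputs.append(output)
    return outputs


toy_weights = {"in": (0.25, 0.25, 0.25, 0.25), "res": 0.1,
               "hidden": 1.0, "out": 1.0}
toy_outputs = run_sequence([1.0, 1.0, 1.0], toy_weights, num_blocks=2)
```

Setting `weights["res"]` to zero leaves x unchanged across the blocks, which makes the role of the identity path in Equation (7) easy to see.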

3.4 Loss function

The CDPN is optimized using a training loss function, defined as the weighted sum of its loss terms:

$L (Θ) = ω_{1} \ell^{1} (I_{HR}, I_{GT})$ (8)

The symbol Θ represents the set of parameters for the proposed network, as shown in Fig. 2. The term ℓ¹ (I_HR, I_GT) represents the overall loss for the network, where I_HR is the final HR output of the CDPN, and I_GT is the ground-truth HR image. The term ω_1 represents the weight of this loss.

In the field of VSR, various loss functions have been proposed to guide the optimization of networks, such as pixel loss [8, 22, 46, 47] (e.g., ℓ¹ loss) and content loss [25, 31]. The ℓ¹ loss function is a simple yet effective way to obtain a high peak signal-to-noise ratio (PSNR) with low complexity. The pixel loss encourages the recovered HR image (I_HR) to be visually similar to its ground truth (I_GT). Thus, to achieve a high PSNR, high-quality regional recovery, and accurate edges, the proposed method uses the pixel-wise ℓ¹ loss:

$\ell^{1} (I_{HR}, I_{GT}) = \ell_{pix}^{1} (I_{HR}, I_{GT})$ (9)

Specifically, the ℓ¹ loss function is defined as follows. Given a set of N training data pairs $\{I_{HR}^{i}, I_{GT}^{i}\}_{i = 1}^{N}$, where $I_{HR}^{i}$ is a recovered image and $I_{GT}^{i}$ is the corresponding ground-truth image, the ℓ¹ loss function measures the difference between the recovered and ground-truth images using the ℓ¹ norm:

$\ell_{pix}^{1} (I_{HR}, I_{GT}) = \frac{1}{N} \sum_{i = 1}^{N} {∥ I_{HR}^{i} - I_{GT}^{i} ∥}_{1}$ (10)
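As a small worked example, Equation (10) averages the L1 norm of the difference between each recovered frame and its ground truth over the N training pairs. The sketch below assumes images are given as nested lists of pixel values; the function name `l1_pixel_loss` is hypothetical.

```python
def l1_pixel_loss(recovered_batch, ground_truth_batch):
    """Mean over N image pairs of the summed absolute pixel differences."""
    if len(recovered_batch) != len(ground_truth_batch):
        raise ValueError("both batches must contain the same number of images")
    total = 0.0
    for recovered, ground_truth in zip(recovered_batch, ground_truth_batch):
        # L1 norm of the difference image for one training pair.
        total += sum(abs(r - g)
                     for rec_row, gt_row in zip(recovered, ground_truth)
                     for r, g in zip(rec_row, gt_row))
    return total / len(recovered_batch)


recovered = [[[1.0, 2.0]], [[0.0, 0.0]]]
targets = [[[0.0, 0.0]], [[0.0, 1.0]]]
loss = l1_pixel_loss(recovered, targets)  # (3 + 1) / 2 = 2.0
```

Deep learning frameworks often also divide by the number of pixels per image; that variant differs from this one only by a constant scale factor.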

3.5 Discussions

1) Difference to the CNN Model: CNN models [46, 48] detect image features adaptively. The differences between the CNN-based model and the proposed CDPN model are as follows: (i) The CNN-based model is PSNR-oriented and focuses on pursuing high accuracy only, whereas the CDPN aims to achieve both high visual quality and high PSNR. (ii) Unlike a CNN-based model that only uses gradient maps to recover images, CDPN uses gradient features from the original image to improve reconstruction. (iii) The CNN-based model only considers spatial correlation to extract features of LR images, whereas CDPN uses both spatial and temporal correlations in the network to increase performance. (iv) The CNN-based model uses optical flow algorithms for alignment, which increases the complexity, while CDPN does not require any particular alignment algorithm, resulting in reduced complexity. The experimental results are presented in Section 4.

2) Difference to the RNN Model: RNN is a recurrent-based VSR model [8, 49, 50], and the key differences between RNN and CDPN are: (i) RNN only stacks hidden layers, while CDPN includes several residual groups with skip connections, allowing more low-frequency information to be preserved and inter-dependencies to be accounted for. (ii) RNN only uses contextual information within a limited area and cannot consider local information. In contrast, CDPN alleviates this issue by introducing the decomposition of the original image, which captures spatial contextual information and conformation details. Section 4 presents the experimental results, which aim to balance high PSNR and good visual quality.

3) Requirement of residual connection in the hidden state of RNN: Identity mapping is utilized in the hidden state, resulting in an advanced model [45, 48, 50]. The baseline model that uses the hidden state performs best in PSNR. However, as the number of blocks increases, it suffers from vanishing gradients. Residual connections are therefore added in the hidden state to stabilize training as the number of blocks increases. The results show that identity mapping helps to improve VSR performance while keeping training stable. Adding more blocks can help the RRN perform even better.
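The stabilizing effect of the identity mapping can be seen by unrolling Equation (7) with g (x_{k-1}) = x_{k-1}. What follows is the standard residual-network argument, given here as a sketch rather than a derivation from this paper; I denotes the identity matrix.

```latex
x_{K} = x_{0} + \sum_{k=1}^{K} F(x_{k-1}),
\qquad
\frac{\partial x_{K}}{\partial x_{0}}
  = \left( I + \frac{\partial F(x_{K-1})}{\partial x_{K-1}} \right)
    \cdots
    \left( I + \frac{\partial F(x_{0})}{\partial x_{0}} \right)
```

Every factor contains the identity term, so the gradient reaching x_0 keeps a direct path from x_K even when the Jacobians of F are small, whereas a plain stack of K layers would multiply K such small Jacobians together.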

4 Experimental analysis

4.1 Datasets

We used the publicly available Vimeo-90k [9] dataset for model training. This dataset consists of 90k video scenes with high visual quality. Gaussian blur with σ = 1.6 was applied to the dataset, with a patch size of 64×64, followed by downsampling with a 4× scale factor. The proposed method was evaluated on the VID4 [14], SPMC [20], and UDM10 [25] standard benchmark datasets. VID4 comprises four scenes, each with its own motion. The more recent validation sets, SPMC and UDM10, feature a variety of scenes with frames that are far higher in resolution than those of VID4. For a fair comparison, the quantitative results are evaluated using PSNR and the structural similarity index measure (SSIM).
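For reference, PSNR follows the standard definition 10·log10(MAX² / MSE). The sketch below assumes single-channel images given as nested lists and a peak value `max_value` of 255; the function name `psnr` is hypothetical.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_errors = [(r - e) ** 2
                      for ref_row, est_row in zip(reference, estimate)
                      for r, e in zip(ref_row, est_row)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# With a peak value of 1.0 and a uniform error of 0.1, the MSE is 0.01,
# which corresponds to roughly 20 dB.
score = psnr([[0.0, 0.0]], [[0.1, 0.1]], max_value=1.0)
```

Because PSNR is logarithmic in the mean squared error, a gain of a few tenths of a dB corresponds to a measurable reduction in pixel error.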

4.2 Implementation details

This work considers two models for a given temporal method: CDPN-S and CDPN-L, where S denotes five and L denotes ten 2D residual blocks. Each residual block comprises two convolutional layers with a ReLU activation in between. Each convolutional layer has 128 channels and a 3×3 filter size. This method uses sub-pixel convolution to increase the resolution of the LR features to HR [35]. At the initial time step t_0, the previous estimate is set to zero. To train the CDPN models, the learning rate starts at 1×10^-4 and is decayed by a factor of 0.1 every 60 epochs, for a total of 70 epochs. The models are trained using a pixel-wise loss function and the Adam [19] optimizer with parameters β1 = 0.9, β2 = 0.999, and a weight decay of 5×10^-4. All experiments were conducted using Python 3.6.4 and PyTorch 1.1.
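Sub-pixel convolution pairs a convolution that outputs C·r² channels with a rearrangement of those channels into an r-times larger spatial grid. The sketch below shows only the rearrangement step on nested lists; the function name `pixel_shuffle` and the channel ordering (the one commonly used by deep learning libraries) are assumptions, since the paper does not spell out these details.

```python
def pixel_shuffle(features, scale):
    """Rearrange (C * scale^2, H, W) nested lists into (C, H * scale, W * scale)."""
    channels_in = len(features)
    if channels_in % (scale * scale) != 0:
        raise ValueError("channel count must be divisible by scale squared")
    channels_out = channels_in // (scale * scale)
    height = len(features[0])
    width = len(features[0][0])
    output = [[[0] * (width * scale) for _ in range(height * scale)]
              for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(scale):
            for j in range(scale):
                # Channel that supplies sub-pixel offset (i, j) of output channel c.
                source = features[c * scale * scale + i * scale + j]
                for h in range(height):
                    for w in range(width):
                        output[c][h * scale + i][w * scale + j] = source[h][w]
    return output


# Four 1x1 channels become one 2x2 channel.
upscaled = pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], scale=2)
```

Since the convolution itself runs at LR resolution and only the final rearrangement produces the HR grid, this kind of post-upsampling keeps most of the computation cheap.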

This work comprehensively studies and compares two temporal modeling methods, 2D CNN and RNN. Table 1 compares the specifications of the models, including their respective numbers of residual blocks. We used 5 and 10 residual blocks for the S and L variants, respectively.

Table 1
Specification comparison of temporal modeling methods

Method	2D CNN	Proposed	2D CNN	Proposed
Variant	S	S	L	L
Blocks	5	5	10	10
Input Frames	7	Recurrent	7	Recurrent
# Param. [M]	2.8	2.1	4.3	3.7
Runtime [ms]	97	30	116	46

The results are measured using the luminance (Y) channel, with the L1 loss applied to all pixels between the ground truth frames and the network output. Further, PSNR (dB) values of VID4, SPMC and UDM10 benchmark datasets of these temporal modeling methods are given in Table 2.

Table 2

PSNR values of VID4, SPMC, and UDM10

Dataset	2D CNN	Proposed	2D CNN	Proposed
Variant	S	S	L	L
VID4	26.72	27.54	26.96	27.91
SPMC	29.05	30.06	29.51	30.37
UDM10	37.67	38.66	38.15	39.32

4.3 Comparisons with state-of-the-arts

This section compares CDPN with other state-of-the-art VSR methods to demonstrate its effectiveness, including Bicubic, FRVSR [8], DUF [23], RBPN [22], and RLSP [21]. For a thorough comparison, this work considers both explicit and implicit MC methods.

1) Quantitative comparison: Table 3 compares the proposed method quantitatively with state-of-the-art video super-resolution methods, both with and without alignment, including CNN-based and RNN-based methods. For a fair comparison with the reported methods, this work uses three benchmark datasets with a scale factor of ×4. Specifically, comparing the recovered results at ×4 scale on the VID4, SPMC, and UDM10 datasets relative to the RLSP method, the proposed CDPN shows an improvement of 0.43 dB, 0.78 dB, and 0.84 dB, respectively, in terms of PSNR. Like CDPN, RLSP passes historical information in feature space without ME. The main reason for this performance is improved feature extraction from conformation detail information such as repetitive patterns, edges, and textures. In many VSR methods, recovering this information is a difficult task. Moreover, the conformation detail-learning enables the network to focus on learning the detailed information of the original image and obtains satisfying results. The RRN with residual learning focuses on learning abundant local features in images and achieves a high PSNR.

Table 3
Quantitative comparison on the VID4, SPMC, and UDM10 datasets for 4× VSR (PSNR in dB / SSIM)

Method	Bicubic	FRVSR [8]	DUF [23]	RBPN [22]	RLSP [21]	Proposed
Param. [M]	N/A	5.1	5.8	12.8	4.3	3.7
Runtime [ms]	N/A	129	1393	3482	50	46
VID4	21.80/0.5426	26.48/0.8104	27.38/0.8329	27.17/0.8205	27.48/0.8388	27.91/0.8527
SPMC	23.29/0.6385	28.16/0.8421	29.63/0.8719	29.73/0.8663	29.59/0.8762	30.37/0.8911
UDM10	28.47/0.8523	37.09/0.9522	38.48/0.9605	38.66/0.9596	38.48/0.9606	39.32/0.9681
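The PSNR gains quoted relative to RLSP follow directly from the table entries, as the short check below shows. The dictionary simply restates the RLSP and Proposed PSNR values from Table 3; the variable names are hypothetical.

```python
# PSNR values (dB) copied from Table 3 for RLSP and the proposed method.
table3_psnr = {
    "VID4": {"RLSP": 27.48, "Proposed": 27.91},
    "SPMC": {"RLSP": 29.59, "Proposed": 30.37},
    "UDM10": {"RLSP": 38.48, "Proposed": 39.32},
}

# Gain of the proposed method over RLSP, rounded to two decimals.
gains = {name: round(values["Proposed"] - values["RLSP"], 2)
         for name, values in table3_psnr.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} dB")
```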

2) Qualitative Comparison: Most existing VSR methods, such as FRVSR, DUF, RBPN, and RLSP, focused on PSNR-based results and not perceptual details [8, 21–23]. The proposed temporal method reconstructs consistent frames with fewer flickering artefacts than other VSR methods. Figure 4 shows the superior visual quality of our CDPN in qualitative results.

Fig. 4

Qualitative comparison on the VID4, SPMC and UDM10 datasets for 4× VSR.

All datasets contain scenes with rich information, such as sharp edges and fine textures. The proposed method produces sharper edges in the calendar scene, finer texture in the RMVTG_011 scene, and a clearer pattern in the auditorium scene, with fewer artefacts. In contrast, the remaining methods struggle to recover high-definition outcomes, as they suffer from unpleasant blurring artefacts and unclear structure. The proposed CDPN method obtains sharper results by learning conformation details. The results indicate that the proposed method generates more realistic and natural outputs than other methods.

3) Model Parameter Size Comparison: Table 4 compares the number of parameters of the proposed method with those of state-of-the-art methods. The proposed method produces better results with fewer parameters, making it the most parameter-efficient of the compared methods and demonstrating strong performance with less computational overhead.

Table 4

Model size and PSNR (dB) on the VID4 dataset at 4× scale

Model	Parameter	PSNR
FRVSR [8]	5.1M	26.48
DUF [23]	5.8M	27.38
RBPN [22]	12.8M	27.17
RLSP [21]	4.3M	27.48
Proposed	3.7M	27.91

4.4 Ablation study

The ablation study considers models with 5 and 10 residual blocks. Table 5 displays the results of the ablation study performed on the surveillance dataset for a scale factor of 4 to evaluate the impact of conformation details.

Table 5
Analysis of the proposed method on the surveillance dataset with different models for scale factor ×4 (values in PSNR, dB)

Model Index	Model 1	Model 2	Model 3	Model 4
Residual blocks-5	√	√	×	×
Residual blocks-10	×	×	√	√
Kernel_1	√	×	√	×
Kernel_3	×	√	×	√
PSNR (Without conformation details)	23.39	23.43	24.21	24.32
PSNR (With conformation details)	23.48	23.52	24.34	24.48

To investigate the impact of the kernel on the given dataset, this work used kernel_1 on Model 1 and Model 3, and kernel_3 on Model 2 and Model 4. With more residual blocks, Model 3 and Model 4 reconstructed better images and achieved higher performance than Model 1 and Model 2. Figure 5 shows a further qualitative comparison on surveillance data with 5 and 10 residual blocks.

Fig. 5

Qualitative comparison of surveillance dataset for 4× VSR.

Moreover, this work plotted PSNR over time to examine the information flow of the various temporal modeling methods. Figure 6 shows the results for the calendar sequence in VID4, the RMVTG_011 sequence in SPMC, and the auditorium sequence in UDM10. The method without conformation details falls behind after a few frames. The proposed method outperforms it by incorporating conformation details, allowing for information accumulation over time.

Fig. 6

Information flow for VID4, SPMC, UDM10, and Surveillance data on sequences over time.

More interestingly, the CDPN-based method keeps improving, whereas the method without conformation detailing suffers from performance degradation. Information from the previous hidden state is complementary for restoring missing details. In addition, this work also plotted the information flow on surveillance data with and without conformation details, which again demonstrates high performance. From the above experimental results and analyses, the findings of this work are that both the network architecture and the proposed loss contribute to the visual improvements, and that the network architecture plays a crucial role in achieving high performance.

5 Conclusion

In this research work, we proposed a new method called SuperVidConform. It is designed to improve the accuracy of VSR by producing results with both high PSNR and visually pleasing quality. The SuperVidConform architecture consists of conformation detail learning and an RNN with residual learning. The conformation detail-learning enables the network to focus on learning the detailed information of the original image. It is supervised by the ground-truth HR image and obtains satisfying results. The RNN with residual learning focuses on learning the abundant local features contained in images and achieves high PSNR. Further, the proposed network was also applied to surveillance data, where it obtains sufficient sharpness recovery and achieves model optimization. Comparing the recovered results at ×4 scale on the VID4, SPMC, and UDM10 datasets with respect to RLSP, the proposed CDPN showed an improvement of 0.43 dB, 0.78 dB, and 0.84 dB in terms of PSNR. The experimental results showed that the CDPN outperforms other methods on various benchmark datasets and the surveillance dataset, not only in PSNR but also in visual quality.



Table 1
Specification comparison of temporal modeling methods

| Method | 2D CNN-S | Proposed-S | 2D CNN-L | Proposed-L |
| --- | --- | --- | --- | --- |
| Blocks | 5 | 5 | 10 | 10 |
| Input frames | 7 | Recurrent | 7 | Recurrent |
| # Param. [M] | 2.8 | 2.1 | 4.3 | 3.7 |
| Runtime [ms] | 97 | 30 | 116 | 46 |