Progressive Dunhuang murals inpainting based on recurrent feature reasoning network

Abstract

The Dunhuang murals, notably the paintings on the interior walls of China’s Dunhuang Grottoes, are considered international cultural treasure. The Dunhuang murals were ruined to varied degrees after a lengthy period of erosion. Deep learning networks were utilized to reconstruct broken parts of murals in order to better preserve their important historical and cultural values. Due to the presence of various damages, such as large peeling, mold and scratches, and multi-scale objects in the mural, a simple porting of existing working methods is suboptimal. In this paper, we propose a progressive Dunhuang murals inpainting (PDMI) based on recurrent feature reasoning network to progressively infer the pixel values of hole centers by a progressive approach, aiming to obtain visually reasonable and semantically consistent inpainted results. PDMI consists mainly of the FFC-based recurrent feature reasoning (RFR) module and Multi-scale Knowledge Consistent Attention (MKCA) module. The RFR module first fills in the feature value at the feature map’s hole border, then utilizes the obtained feature value as a clue for further inference. The module steadily improved the limitation of hole centers, making the inpainted results more explicit; MKCA enables feature maps in RFR to handle richer background information from distant location information in a flexible manner while preventing misuse. After several round-robin inferences provide multiple feature maps, these feature maps are fused using an adaptive feature weighted fusion mechanism, then the fused feature maps decode back to RGB image. Experiments on a publicly available dataset and a self-made Dunhuang mural dataset reveal that the proposed method outperforms the comparison algorithm in both qualitative and quantitative aspects.

Keywords

Image inpainting Dunhuang murals progressive inpainting feature reasoning

1 Introduction

The Mogao Grottoes in Dunhuang are the largest and most abundant treasure house of Buddhist grotto murals in the world. The murals and scriptures stored in them have precious research value. However, due to the damage of natural weathering and the influence of human factors, the murals in the caves have suffered serious disasters such as falling off, scratches, fading, cracks [1], which need to be protected urgently. Therefore, it is extremely important to study the restoration of the damaged Dunhuang murals. Traditional mural restoration methods rely on manual restoration, which has the risk of irreversibility. Therefore, there is a growing interest in using digital restoration techniques for the protection and preservation of ancient murals, which has become a hot research topic in recent years [2 –8].

Image inpainting as a way of mural conservation can be divided into traditional methods and learning-based methods. Traditional exemplar-based image inpainting methods [9 –11] usually through copy-and-paste or information propagation for inpainting. These methods can complete images with small broken areas or repetitive patterns to some extent, but they often do not perform satisfactorily enough for images with large broken areas or complex structures. This is mainly because these methods focus only on the low-level features of the image and are not able to understand the semantic information of the image, which makes it difficult to produce reasonable inpainted results.

In recent years, deep learning-based methods for image inpainting have made significant progress. These methods can be broadly classified into two categories: one-stage methods and two-stage methods. One-stage methods, such as those described in [12 –15], aim to synthesize both reasonable semantic contents and fine details in a single network. On the other hand, two-stage methods use two serial networks, a coarse network and a refinement network, to deal with coarse and fine contents separately. Two-stage methods have explored the use of coarse-to-fine inpainting models in various ways. For example, they may use a coarse network to generate intermediate outputs such as edges [16], structural information [17], segmentation maps [18], which are then used as guidance for filling in the missing details in the refinement network. Although these methods have achieved better results than before, they apply an encoder-decoder architecture, assuming that once a damaged image is encoded, it should contain enough information for reconstruction and can be inpainted in a one-shot. For minor damage, this assumption is reasonable since pixels in a local area may have significant association, and hence pixels can be inferred from the surrounding environment. However, as the damaged region of the image grows larger, the distance between the known and unknown pixels increases, the correlation between the hole center and the hole boundary decreases, and the constraint of the hole center lessens. In this situation, it is impossible to efficiently recover the pixel at the center of the hole from the information in the known area, resulting in semantic ambiguity in the network.

To tackle the aforementioned limitation, Li et al. [19] proposed the Recurrent Feature Reasoning network (RFR-Net), which employs a progressive inpainting strategy to gradually reinforce the constraints on the hole centers. The RFR-Net network is composed of two key components: the Recurrent Feature Reasoning (RFR) Module and the Knowledge Consistent Attention (KCA). The RFR Module recurrently predicts the hole boundaries of the feature maps and utilizes them as guidelines for further inference. As the RFR Module advances, it progressively intensifies the constraints on the hole center, leading to more explicit and accurate inpainted results. To capture information from remote areas in the feature map for RFR, the KCA is integrated into the RFR network, which enhances the network’s ability to reason over the entire feature space and to maintain coherence in the inpainting process.

Inspired by the RFR-Net, we designed a progressive Dunhuang murals inpainting (PDMI) based on recurrent feature reasoning network to address various possible damages in mural images (such as large peeling, scratches, and mold), as well as multi-scale objects that may have different scales of the same or similar objects in the mural. Compared to RFR-Net, our main differences are as follows. First, the RFR Module we designed is based on Fast Fourier Convolution (FFC) [20]. FFC has a global receptive field, which can better obtain global information. Secondly, we designed a multi-scale knowledge consistency attention module, and compared with a single scale KCA, our MKCA can better capture multi-scale information. Finally, we designed a weighted adaptive fusion module for fusing multiple feature maps, while using an average approach in RFR ignores the importance of feature maps generated at different stages. In summary, the following are the main contributions of our work:

Multi-scale Knowledge Consistency Attention (MKCA) module is proposed to flexibly borrow background information, avoid misuse of background information, and ensure the consistency of patch exchange process between recursions.

Based on FFC framework, the RFR module is designed to enhance reasoning efficiency. It enables integration of multi-scale mural information from preliminary layers, thereby improving the efficiency of detailed complex representations.

An adaptive weighted fusion mechanism is designed to fuse feature maps generated at different inference stages to obtain more refined results.

2 Related work

There are two types of traditional inpainting methods: geometry-based approaches and patch-based methods. The geometry-based approaches work by using partial differential equations to propagate the local structure from the outside to the interior of the hole. Patch-based approaches pick comparable patches from a database or an undamaged environment and paste them into damaged parts based on patch distance measurements (e.g. euclidean distance, screening distance [21], etc.). PatchMatch stands up among other previous traditional approaches [9]. PatchMatch fills holes faster than other traditional approaches, however the inpainting will be poor if no matching texture is identified in the image. Patch-based image inpainting approaches can yield precise results that resemble the backdrop. Patch-based approaches, on the other hand, struggle to yield semantically appropriate results due to their incapacity to capture the global semantics of images. There are three main problems with all traditional methods. First, they fail to capture global structure and image semantics. Second, they cannot fill complex holes [9, 13]. Finally, the time cost of completing a large hole is high.

To produce semantically consistent and visually reasonable results, most learning-based methods rely on a CNN or a GAN [22, 23]. Pathak et al. [12] proposed Context Encoders, the first attempt to inpaint images with large damaged areas effectively. To ensure global and local consistency of foreground and background, Iizuka et al. [13] introduced a double discriminator, but post-processing need to use Poisson blending to reduce artifacts [24]. Yu et al. [25] proposed a framework for coarse to fine image inpainting that includes context attention modules. The convolution layer was modified to handle irregular masks by Liu et al. [14] and Yu et al. [26]. However, in the case of a large broken area, these methods cannot restore semantically reasonable results.

Recent research has focused on progressive image inpainting. To ensure structural consistency, Xiong et al. [27] and Nazeri et al. [16] driven image completion by contour/edge constraints. Xu et al. [28] presented a three-stage progressive inpainting network. To complete end-to-end progressive inpainting, Li et al. [29] incorporated a gradually reconstructed structural map as an extra training target. The approaches presented above attempt to tackle the inpainting challenge by applying structural prior restrictions, however they lack the information in the backbone to recover the deeper pixels in the hole. Zhang et al. [30] filled the image step by step using cascaded generators. Guo et al. [31 directly inpainted images of original size using a single-stage feed-forward network. 32] employed the Onion-peel approach to progressively restore the video data with the aid of the reference frame content, achieving accurate content borrowing. However, these approaches are typically constrained by traditional progressive inpainting or multi-stage procedures.

3 Proposed method

In this section, we will introduce the PDMI network in detail. The PDMI network takes the masked image as input and first maps it through the Mapping module to obtain the feature map F_in. The RFR module then identifies the area to be inferred in F_in, progressively fills in the missing regions, and generates a feature map at each stage. The feature merge operator is used to fuse the feature maps generated by each stage, and the final feature map is fed into the decoder to obtain the inapinted image. To establish long-term dependencies on the pixels in the feature map, the MKCA module is utilized to assist the RFR module. The complete network architecture can be observed in Fig. 1, which provides a better understanding of the entire image inpainting process.

Fig. 1

Network architecture. The network takes the masked image as input and maps it through the mapping module to obtain the feature map F_in. The RFR module progressively infers the missing area in F_in and generates a feature map at each stage. The feature maps are fused using the feature merge operator, and the final feature map is decoded to obtain the inpainted image.

3.1 Mapping module

The mapping module, which consists of a downsampling convolution and three vanilla convolutions, transforms the masked image X into the convolutional feature space, generating the feature input F_in for the feature inference module. This can be mathematically expressed as:

$F_{in} = mapping (X)$ (1) The resulting feature input F_in contains high-level semantic information from the input image and is used for subsequent feature reasoning.

3.2 Recurrent feature reasoning module

The recurrent feature reasoning module contains three modules: 1) an area identification module to identify the area to be filled in the current recursion; 2) a feature reasoning module to infer the content in the identified region; and 3) the feature merge module to fuse intermediate feature maps. The module alternates between area identification and feature reasoning. After filling the holes, the feature maps generated during inference are combined to form a fixed-channel feature map.

3.2.1 Area identification

Partial convolution [14] can be used to identify the area to be updated in each recursion. However, the mask obtained through partial convolution layer is not learnable, making the network difficult to train. Therefore, LAM in LBAM is used to replace partial convolution, and the difference between the updated mask and the input mask is defined as the area to be inferred in current recursion [33]. After convolution computation, LAM layers update masks and renormalize feature maps. A LAM layer is described formally as follows. Let F^* denote the feature maps produced by LAM layers. $f_{x, y, z}^{*}$ represent the feature value at positions x, y in the z^th channel. In this layer, W_z is the z^th convolution kernel. f_x,y and $m_{x, y}^{c}$ are the input feature patch and input mask M^c patch located at x, y position respectively. The feature map computed by the LAM layer can then be expressed as:

$f_{x, y, z}^{*} = W_{z}^{T} (f_{x, y} ⊙ m_{x, y}^{c})$ (2) where $m_{x, y}^{c} \in g_{A} (M^{c})$ is used for feature re-normalization.

$\begin{matrix} g_{A} (M^{c}) \\ = {\begin{matrix} a exp (- γ_{l} {(M^{c} - μ)}^{2}), & M^{c} < μ \\ 1 + (a - 1) exp (- γ_{r} {(M^{c} - μ)}^{2}), & Other \end{matrix} \end{matrix}$ (3) where a, μ, γ_l, and γ_r are the learnable parameters, we initialize them like in [33]. The update mask generated by this layer is expressed as:

$M^{'} = g_{M} (M^{c}) = {(ReLU (M^{c}))}^{α}$ (4) According to the above equation, we can get new masks with smaller holes after each LAM layer.

We cascade two LAM layers as the area identification module. After passing through the area identification module, the feature maps are fed to the feature reasoning module after being processed by the normalization layer and activation function. Note that the holes in the mask remain unchanged during the current recursion.

3.2.2 Feature reasoning module

After obtaining the region to be inferred, the feature reasoning module estimates the feature value of the region. Continuously estimating the feature value of the region to be inferred at the feature map level helps to obtain high-quality feature value, which high-quality features can make the restoration effect more in line with human visual perception and facilitate subsequent reasoning. To obtain high-quality feature value, we design an FFC-based network architecture. FFC has a receptive field that covers the entire image, allowing the use of global context at early layers. High receptive field is essential in image inpainting, allowing the network to utilize more valid information. FFC are split into local ones encoded by vanilla convolutions and global ones encoded by the spectral transform. The spectral transform is consisted of Fast Fourier Transform (FFT), Conv → BatchNorm → ReLu, and the inverse FFT. And both the real and imaginary parts are confirmed in the Conv2D after FFT. After the inverse FFT, local and global features are combined as the final output. The illustration of FFC is available in Fig. 2.

Fig. 2

The schematic illustration of the Fast Fourier convolution (FFC).

Multi-scale Knowledge Consistent Attention The attention module in image inpainting aims to infer the corrupted area by borrowing features from faraway spatial positions to produce better quality features [25]. In MKCA, the scores of different scales are composed of the scores accumulated recursively and proportionally previously, which maintains consistency between the attention feature maps and allows it to process more abundant background information flexibly while avoiding misuse [34].

Let Fⁱ denote the i^th recursive input feature map. As shown in Fig. 3, we initiatively divide Fⁱ into multi-scale patches of size 1 × 1 and 3 × 3 and compute the inter-patch normalized inner product:

Fig. 3

The schematic illustration of the multi-scale knowledge consistent attention (MKCA) module.

${\hat{sim}}_{p, q, p^{'}, q^{'}}^{i} = < \frac{f_{p, q}}{| | f_{p, q} | |}, \frac{f_{p^{'}, q^{'}}}{| | f_{p^{'}, q^{'}} | |} >$ (5) where ${\hat{sim}}_{x, y, x^{'}, y^{'}}^{i}$ represents the similarity between the patches of Fⁱ centered at (p, q) in the hole region and that centered at (p′, q′). The attention scores are then smoothed by averaging the similarity of target pixels in neighboring regions.

$\begin{matrix} sim'_{p, q, p^{'}, q^{'}}^{i} \\ = \frac{\sum_{m, n \in {- k, \dots, k}} {\hat{sim}}_{p + m, q + n, p^{'}, q^{'}}}{k \times k} \end{matrix}$ (6)

Then, apply softmax function along p′ - q′ dimension to normalize inter-patch similarity. The produced score map is marked score′. We follow the operation in [19] to calculate the attention score in scoreⁱ for a pixel: if the pixel is valid, its attention score in the current recursion is the weighted sum of the attention scores of the current recursion and the previous recursion. Specifically, if the pixel at position (p, q) is a valid pixel in the previous recursion (i.e., the mask value $m_{p, q}^{i - 1}$ is 1), we use a learnable parameter λ to adaptively combine the final score of pixels in the last recursion with the score calculated in this recursion, as shown below:

$\begin{matrix} {score}_{p, q, p^{'}, q^{'}}^{i} \\ = λ score'_{p, q, p^{'}, q^{'}}^{i} + (1 - λ) {score}_{p, q, p^{'}, q^{'}}^{i - 1} \end{matrix}$ (7)

If the pixel was invalid in the previous recursion, no additional operations are performed, and the final attention score for the pixel in the current recursion is calculated as follows:

${score}_{p, q, p^{'}, q^{'}}^{i} = score'_{p, q, p^{'}, q^{'}}^{i}$ (8)

Then, the attention score is used to reconstruct the feature map. Formally, the reconstruction feature map at position (p, q) is calculated as follows:

${\hat{f^{i}}}_{p, q} = \sum_{p^{'} \in 1, \dots, W q^{'} \in 1, \dots, H} {score}_{p, q, p^{'}, q^{'}}^{i} f_{p^{'}, q^{'}}^{i}$ (9)

Finally, concatenate the generated feature maps, denoted by <φ_att11, φ_att33>. then feature maps fed into a squeeze-and-excitation (SE) to decide which level of detail is the most important one on current image [35]. The final output of the module could be denoted as:

$φ_{out} = f_{Conv} (f_{SE} ((< φ_{att 11}, φ_{att 33} >))$ (10)

3.2.3 Feature merging module

When the feature map passes through the feature reasoning module several times until it is completely filled, directly using the last feature map to generate the output, gradient vanishing may occur and it will corrupt the signal generated in the previous recursion. RFR-Net uses an adaptive merging scheme to solve this problem. The difference is that we use the adaptive weighting scheme to fuse the feature maps generated in different recursion, and the adaptive learning of the respective weights of the feature value fusion, rather than treating all the feature values equally [36]. Similarly, the values in the output feature map are only based on the features that have filled the corresponding positions.

Formally, we define Fⁱ as the i^th feature map generated by the feature reasoning module, and f_x,y,z as the values at positions x, y, z in the feature map F. Mⁱ is the binary mask of the feature map Fⁱ. The value at the output feature map $\hat{F}$ can be defined as: ${\hat{f}}_{x, y, z} = \frac{\sum_{i = 1}^{N} α_{x, y, z}^{i} m_{x, y, z}^{i} f_{x, y, z}^{i}}{\sum_{1}^{N} m_{x, y, z}^{i}}, \sum_{i = 1}^{N} α_{x, y, z}^{i} m_{x, y, z}^{i} = 1$ (11) where N is the number of feature map.

3.3 Decode module

The decode module, consisting of one upsampling operation and three vanilla convolutions, generates the inpainted image in the RGB image space from the fused feature map. This can be expressed as:

$F_{out} = decode (\hat{f})$ (12)

The decode module is responsible for generating the final inpainted image from the fused feature map.

3.4 Loss function

The loss functions used in this paper mainly include perceptual loss, style loss, multi-scale edge loss, hole loss, and valid loss.

The perceptual loss and style loss compare the semantic difference between the generated image and the ground truth [37]. To extract semantic features, a VGG19 network pre-trained on the ImageNet database is used [38, 39]. A loss function of this type can effectively teach the model the structural and textural information of an image. The following is a representation of these loss functions. The feature map of the i^th pooling layer in fixed VGG-16 is denoted by φ_{pool
_i}. The size of φ_{pool
_i} is represented by H_i × W_i × C_i. The perceptual loss can then be written as:

$L_{pcp} = \sum_{i = 1}^{N} \frac{1}{H_{i} W_{i} C_{i}} | φ_{{pool}_{i}}^{gt} - φ_{{pool}_{i}}^{pred} |_{1}$ (13)

Again, the style loss is calculated as follows:

$\begin{matrix} φ_{{pool}_{i}}^{style} = φ_{{pool}_{i}} φ_{{pool}_{i}}^{T}, \\ L_{st} = \sum_{i = 1}^{N} & \frac{1}{C_{i} \times C_{i}} | \frac{1}{H_{i} W_{i} C_{i}} (φ_{{pool}_{i}}^{{style}_{gt}} - φ_{{pool}_{i}}^{{style}_{pred}}) |_{1} \end{matrix}$ (14) where gram matrix $φ_{{pool}_{i}}^{style}$ can be regarded as the style indicator of the image, which is realized by establishing the correlation between channels.

We use a multi-scale edge loss to guide structure generation, forcing the model to pay more attention to the structure of the generated image, thereby implicitly encouraging the generator to leverage relevant structural knowledge when inpainting.

$L_{ms} = \sum_{s}^{n_{s}} | | C_{pred}^{(s)} - C_{gt}^{(s)} | |_{1}$ (15) where n_s is the number of total scales.

The adversarial loss (L_adv) quantifies the inpainting verisimilitude. In addition, L_valid and L_hole are also used in the model, which calculate the L₁ difference in unmasked and masked regions, respectively. So the total loss function is:

$\begin{matrix} L_{total} & = λ_{hole} L_{hole} + λ_{valid} L_{valid} + λ_{st} L_{st} \\ + λ_{pcp} L_{pcp} + λ_{adv} L_{adv} + λ_{ms} L_{ms} \end{matrix}$ (16) where the weight coefficients λ_hole, λ_valid, λ_st, λ_pcp, λ_adv, λ_ms are hyperparameters of the proposed inpaint- ing model. They are set to 1, 6, 0.05, 120, 0.5 and 0.5 in our experiments, respectively.

4 Experiments

In this section, experiments on the Dunhuang mural dataset were conducted out, and the visualization results and index comparison were displayed. In addition, to further validate the algorithm’s effectiveness, we ran experiments on the public dataset Paris StreetView Dataset. Our computing device is equipped with a 2.1GHz Intel(R) Xeon(R) CPU and a 12GB NVIDIA Titan Xp GPU. Our programming environment is PyTorch v1.1, which is running on the Ubuntu 18.04 operating system.

4.1 Datasets

Dunhuang Mural Dataset: As the dataset source, choose 3000 different Dunhuang mural images, primarily Tang Dynasty murals, and expand the dataset to obtain 12,000 mural datasets, 10,000 mural images in the training dataset and 2,000 mural images in the test dataset. The mask dataset is from [14], and it contains 12000 sheets.

Paris StreetView Dataset [40]: The dataset contains 14,900 training images and 100 test images, primarily of real-world street scenes, such as buildings, trees, streets, and sky.

4.2 Inpainting of Dunhuang mural dataset

We conducted extensive experiments on the Dunhuang mural dataset to further validate the effectiveness of the proposed method. The experiment is evaluated in comparison to other advanced image inpainting methods. So that the benefits of this method can be demonstrated intuitively. We demonstrated the visualization and numerical comparison of partial inpainting results of Dunhuang murals in Fig. 4 and Table 1. The results of Pconv restore are prone to color artifacts, as shown in Fig. 4, whereas Rethink, RFR, and CTSDG all show varying degrees of structural distortion in the first image, and we successfully obtain more reasonable visual results on all images.

Fig. 4

Comparison of mural inpainting results of random masked by different algorithms. From left to right are: (a) GT, (b) Input, (c) PConv [14], (d) Rethink [41], (e) RFR [19], (f) CTSDG [42], (g) Ours.

Table 1

Numerical comparison on Dunhuang Mural Dataset

images	PConv		Rethink		RFR		CTSDG		Ours
	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM
1	22.90	0.786	24.81	0.824	26.26	0.866	27.45	0.881	27.46	0.886
2	22.90	0.786	24.81	0.824	26.16	0.850	27.02	0.861	27.24	0.863
3	23.09	0.793	25.09	0.827	21.94	0.808	22.46	0.817	23.25	0.826
4	26.08	0.866	26.42	0.876	28.21	0.885	29.11	0.890	29.38	0.891
5	20.34	0.802	22.17	0.827	22.71	0.843	22.81	0.846	22.93	0.848
6	19.55	0.817	21.64	0.844	23.19	0.857	23.45	0.869	24.56	0.870

We further evaluate the inpainting results in Fig. 4 using PSNR and SSIM index, as shown in Table 1. PSNR and SSIM are important quantitative evaluation index to measure image quality. The larger the PSNR value, the less distortion, that is, the better the restoration effect; SSIM is an index to measure the similarity of two images, the higher the value, the better the recovery effect, and the smaller the image distortion. From Table 1, it can be found that the proposed algorithm is superior to other comparative algorithms in PSNR and SSIM, indicating that the proposed method has better repair quality, which verifies the effectiveness of the proposed method for the repair of artificially damaged murals.

4.3 Inpainting of paris StreetView dataset

The inpainting results in Fig. 4 are further evaluated using the PSNR and SSIM indexes, as shown in Table 1. PSNR and SSIM are important quantitative image quality evaluation index. The greater the PSNR value, the less distortion, and thus the better the restoration effect; SSIM is an index that measures the similarity of two images; the greater the value, the better the recovery effect and the less image distortion. According to Table 1, the proposed algorithm outperforms other comparative algorithms in PSNR and SSIM, indicating that the proposed method has better inpainting quality, confirming the proposed method’s effectiveness for the inpainting of artificially damaged murals.

Fig. 5

Visualization of inpainting results of different inpainting algorithms on Paris StreetView Dataset. From left to right are: (a) GT, (b) Input, (c) PConv [14], (d) Rethink [41], (e) RFR [19], (f) CTSDG [42], (g) Ours.

Table 2

Numerical comparison on Paris StreetView Dataset

methods	0% -10%		10% -20%		20% -30%		30% -40%		40% -50%
	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM
Pconv	30.44	0.918	28.21	0.875	26.27	0.822	24.71	0.768	23.06	0.702
Rethink	31.36	0.936	29.40	0.906	27.37	0.866	25.46	0.814	23.48	0.751
RFR	32.45	0.950	29.80	0.916	27.50	0.873	25.82	0.826	24.21	0.767
CTSDG	31.16	0.933	29.20	0.901	27.28	0.861	25.56	0.810	24.02	0.751
Ours	31.40	0.936	29.85	0.916	27.55	0.880	25.86	0.831	24.23	0.767

4.4 Ablation studies

In this part, we validate the effectiveness of the multi-scale knowledge-consistent attention module on the Paris Street View dataset, and as a control group, we kept the same settings and use single-scale knowledge-consistent attention instead of multi-scale knowledge-consistent attention. The experimental results are shown in Fig. 6. As can be seen from Fig. 6, multi-scale attention can make better use of contextual information, and our multi-scale attention can mine more information than single-scale attention to guide the inpainting process in the face of larger breakage.

Fig. 6

Qualitative comparisons of MKCA module. From left to right are: (a) GT, (b) Input, (c) the control group: keep the same settings, use single-scale KCA to replace MKCA, (d) Ours.

5 Conclusion

In this paper, a progressive mural inpainting method based on recurrent feature reasoning is proposed. To begin, it infers the area to be inferred on the feature level repeatedly until the hole is completely filled or the specified number of inferences are reached. After obtaining several inferred feature maps, the adaptive weighted fusion mechanism is used to fuse them, and the fusion map is decoded to obtain the final prediction map. This recursive design can gradually enrich the information of the mask area and produce an inpainting result with clear semantics. Compared with comparison algorithms, our method has achieved competitive results both qualitatively and quantitatively. This shows that the algorithm is effective for inpainting complex Dunhuang murals with large damage. Although our method achieves certain results, there are still some limitations, such as our model does not take into account the color consistency of the mural. In the future, we will further design the loss function and model architecture for the characteristics of the mural.

Conflicts of interest The authors declare that they have no conflict of interest.

Data availability statement Data openly available in a public repository.

Footnotes

Acknowledgment

This work was supported in part by the Gansu Provincial Department of Education University Teachers Innovation Fund Project (No.2023B-056), the Fundamental Research Funds for the Central Universities (No. 31920230137, 31920230030, 31920220130), the Gansu Provincial First-class Discipline Program of Northwest Minzu University (No.11080305), the Leading Talent of National Ethnic Affairs Commission (NEAC), and the Young Talent of NEAC, and the Innovative Research Team of NEAC (2018) 98.

References

Pan

and Lu Dm , Digital protection and restoration of dunhuang mural, Acta Simulata Systematica Sinica (2003), 3.

Zhou

, Liu

, Shang

, Huang

, Li

and Jia

, Inpainting digital dunhuang murals with structure-guided deep network, ACM Journal on Computing and Cultural Heritage 15(4) (2022), 1–25.

Liu

, Du

, Li

, Wang

and Liu

, Dunhuang mural line drawing based on bi-dexined network and adaptive weight learning, In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Springer, (2022), pp. 279–292.

Liu

, He

, Du

, Zhang

and Wang

, Dunhuang murals contour generation network based on convolution and self-attention fusion, Applied Intelligence (2023), pp. 1–13.

Bhele

, Shriramwar

and Agarkar

, An efficient texture-structure conserving patch matching algorithm for inpainting mural images, Multimedia Tools and Applications (2023), pp. 1–22.

Singh

, Maiti

, Saini

, et al., Ancient indian murals digital restoration through image inpainting, In: 2023 10th International Conference on Signal Processing and Integrated Networks (SPIN), IEEE, (2023), pp. 635–640.

, Chang

and Wang

, Fragments inpainting for tomb murals using a dual-attention mechanism gan with improved generators, Applied Sciences 13(6) (2023), 3972.

Wang

, Li

, Liu

, Du

and Gao

, Dunhuang mural line drawing based on multi-scale feature fusion and sharp edge learning, Neural Processing Letters (2023), pp. 1–14.

Barnes

, Shechtman

, Finkelstein

and Goldman

D.B.

, Patchmatch: A randomized correspondence algorithm for structural image editing, ACM Trans Graph 28(3) (2009), 24.

10.

Ballester

, Bertalmio

, Caselles

, Sapiro

and Verdera

, Filling-in by joint interpolation of vector fields and gray levels, IEEE Transactions on Image Processing 10(8) (2001), 1200–1211.

11.

Komodakis

and Tziritas

, Image completion using efficient belief propagation via priority scheduling and dynamic pruning, IEEE Transactions on Image Processing 16(11) (2007), 2649–2661.

12.

Pathak

, Krahenbuhl

, Donahue

, Darrell

and Efros

A.A.

, Context encoders: Feature learning by inpainting, In: Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 2536–2544.

13.

Iizuka

, Simo-Serra

and Ishikawa

, Globally and locally consistent image completion, ACM Transactions on Graphics (ToG) 36(4) (2017), 1–14.

14.

Patel

, Kulkarni

, Sahni

and Vyas

, Image inpainting using partial convolution, arXiv preprint arXiv:210808791, (2021).

15.

Zeng

, Fu

, Chao

and Guo

, Learning pyramid-context encoder network for high-quality image inpainting, In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 1486–1494.

16.

Nazeri

, Ng

, Joseph

, Qureshi

F.Z.

and Ebrahimi

, Edgeconnect: Generative image inpainting with adversarial edge learning, arXiv preprint arXiv:190100212, (2019).

17.

Ren

, Yu

, Zhang

, Li

T.H.

, Liu

and Li

, Structureflow: Image inpainting via structure-aware appearance flow, In: Proceedings of the IEEE/CVF international conference on computer vision, (2019), pp. 181–190.

18.

Song

, Yang

, Shen

, Wang

, Huang

and Kuo

C.C.J.

, Spg-net: Segmentation prediction and guidance network for image inpainting, arXiv preprint arXiv:180503356, (2018).

19.

, Wang

, Zhang

, Du

and Tao

, Recurrent feature reasoning for image inpainting, In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), pp. 7760–7768.

20.

Chi

, Jiang

and Mu

, Fast fourier convolution, Advances in Neural Information Processing Systems 33 (2020), 4479–4488.

21.

Lowe

D.G.

, Object recognition from local scale-invariant features, In: Proceedings of the seventh IEEE international conference on computer vision, Ieee, vol 2 (1999), pp. 1150–1157.

22.

Yeh

R.A.

, Chen Yian

, Lim

, Schwing

A.G.

, Hasegawa-Johnson

and Do

M.N.

, Semantic image inpainting with deep generative models, In: Proceedings of the IEEE conference on computer vision and pattern recognition, (2017), pp. 5485–5493.

23.

Goodfellow

, Pouget-Abadie

, Mirza

, Xu

, Warde-Farley

, Ozair

, Courville

and Bengio

, Generative adversarial networks, Communications of the ACM 63(11) (2020), 139–144.

24.

Pérez

, Gangnet

and Blake

, Poisson image editing, In: ACM SIGGRAPH 2003 Papers, (2003), pp. 313–318.

25.

, Lin

, Yang

, Shen

, Lu

and Huang

T.S.

, Generative image inpainting with contextual attention, In: Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 5505–5514.

26.

, Lin

, Yang

, Shen

, Lu

and Huang

T.S.

, Free-form image inpainting with gated convolution, In: Proceedings of the IEEE/CVF international conference on computer vision, (2019), pp. 4471–4480.

27.

Xiong

, Yu

, Lin

, Yang

, Lu

, Barnes

and Luo

, Foreground-aware image inpainting, In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 5840–5848.

28.

, Wang

and Zhu

, Progressive image painting outside the box with edge domain, Journal of Intelligent & Fuzzy Systems 39(1) (2020), 371–381.

29.

, He

, Zhang

, Du

and Tao

, Progressive reconstruction of visual structure for image inpainting, In: Proceedings of the IEEE/CVF international conference on computer vision, (2019), pp. 5962–5971.

30.

Zhang

, Hu

, Luo

, Zuo

and Wang

, Semantic image inpainting with progressive generative networks, In: Proceedings of the 26th ACM international conference on Multimedia, (2018), pp. 1939–1947.

31.

Yan

, Li

, Zuo

and Shan

, Shift-net: Image inpainting via deep feature rearrangement, In: Proceeding of the European conference on computer vision (ECCV), (2018), pp. 1–17.

32.

S.W.

, Lee

J.Y.

and Kim

S.J.

, Onion-peel networks for deep video completion, In: Proceedings of the IEEE/CVF international conference on computer vision, (2019), pp. 4403–4412.

33.

Xie

, Liu

, Li

, Cheng

M.M.

, Zuo

, Liu

, Wen

and Ding

, Image inpainting with learnable bidirectional attention maps, In: Proceedings of the IEEE/CVF international conference on computer vision, (2019), pp. 8858–8867.

34.

Wang

, Li

, Zhang

and Du

, Musical: Multi-scale image contextual attention learning for inpainting, In: IJCAI, (2019), pp. 3748–3754.

35.

, Shen

and Sun

, Squeeze-and-excitation networks, In: Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 7132–7141.

36.

Liu

, Huang

and Wang

, Learning spatial fusion for single-shot object detection, arXiv preprint arXiv:191109516, (2019).

37.

Johnson

, Alahi

and Fei-Fei

, Perceptual losses for real-time style transfer and super-resolution, In: European conference on computer vision, Springer, (2016), pp. 694–711.

38.

Simonyan

and Zisserman

, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:14091556, (2014).

39.

Russakovsky

, Deng

, Su

, Krause

, Satheesh

, Ma

, Huang

, Karpathy

, Khosla

, Bernstein

, et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115(3) (2015), 211–252.

40.

Doersch

, Singh

, Gupta

, Sivic

and Efros

, What makes look like paris? ACM Transactions on Graphics 31(4) (2012).

41.

Liu

, Jiang

, Song

, Huang

and Yang

, Rethinking image inpainting via a mutual encoder-decoder with feature equalizations, In: European Conference on Computer Vision, Springer, (2020), pp. 725–741.

42.

Guo

, Yang

and Huang

, Image inpainting via conditional texture and structure dual generation, In: Proceedings of the IEEE/CVF international conference on computer vision, (2021), pp. 14134–14143.