Tangut character image generation based on cycle-consistent adversarial networks

Abstract

Tangut characters were created by the Tangut people of the Western Xia (Xi Xia) Dynasty in ancient China and are over 1000 years old. In deep-learning-based recognition studies on Tangut characters, the lack of category-complete datasets has been problematic. Data augmentation cannot supply character categories for which no samples in the target style exist, whereas image generation can effectively solve this problem. In this study, we treat the generation of Tangut characters in antique book calligraphy styles as the problem of learning a mapping from existing printed styles to personalized antique book calligraphy styles. We present M-ResNet, a multi-scale feature extraction residual unit, and Tangut-CycleGAN, a model for generating Tangut characters that combines M-ResNet with a cycle-consistent adversarial network (CycleGAN). This method uses unpaired data to generate Tangut character images in the calligraphy style of ancient books. To enhance the response of the model to significant channels, a squeeze-and-excitation (SE) module is added to Tangut-CycleGAN, yielding the Tangut-CycleGAN+SE method for generating images of Tangut characters. This method is not only suitable for Tangut character image generation but can also effectively generate calligraphy with aesthetic value. In addition, we propose an overall quality evaluation metric, FA (Fréchet inception distance + Accuracy), which combines style discrepancy and content accuracy to evaluate the quality of generated character images.

Keywords

Tangut character; CycleGAN; unpaired data; image generation; evaluation metric

1 Introduction

Recognizing Tangut characters with deep-learning-based methods can help researchers quickly and effectively study the history, culture, society, and other aspects of the Western Xia Dynasty.

Most of the Tangut characters in ancient books were printed with movable type or handwritten; they are calligraphic characters with a certain artistry. Moreover, there is a distinct style discrepancy between characters of the same category, which further increases the difficulty of recognizing Tangut characters. Recognition methods based on deep learning require sufficient data, yet datasets of Tangut characters are scarce. Generating Tangut characters with calligraphy styles similar to those of ancient books yields a category-complete dataset, which addresses the lack of data faced by deep-learning-based Tangut character recognition. In this study, the target style is the calligraphy style of ancient Tangut books, as shown in Fig. 1.

Fig. 1

Tangut characters are generated in the target calligraphy style of ancient books. (a) The Four-Corner System (of coding Tangut characters) of Tangut characters. (b) Modern printed Tangut characters. (c) Tangut characters in the calligraphy style of ancient books. (d) Generation of Tangut characters in the calligraphy style of ancient books.

In related work on character generation, alphabet-based languages have an extremely limited number of letters; for example, English has only 26 letters, and its style transfer amounts to transferring different combinations of these 26 letters, whereas Tangut has 6077 different characters. Therefore, it is more difficult to generate such characters in different styles. Because Tangut characters in the style of ancient books are available for only a limited number of categories, there is a need to automatically generate Tangut characters in such styles from a small character dataset.

Tangut characters are like Chinese characters in that all of them are square shaped. Although no studies on the generation of Tangut characters have yet appeared, several existing approaches to the generation of Chinese characters can be drawn upon. Most earlier studies [1–3] have relied on a hierarchical representation of simple strokes, which first decomposes Chinese characters into strokes and then recombines the strokes to imitate a personalized writing style. Because this approach focuses only on the local representation of the Chinese character rather than the overall style, the shape, size, and stroke position of each new character need to be adjusted. By contrast, zi2zi [4] uses paired character images as the training data and learns to transform the font style using pix2pix [5]. However, in the task of generating Tangut characters in the calligraphy style of ancient books, the number of labeled Tangut characters is small, and thus paired training data for Tangut characters are difficult to obtain. In addition, because ancient-style Tangut characters come in different styles, it is more important to learn the overall style of the characters than to imitate the style of an individual character. Therefore, the use of unpaired Tangut characters is more appropriate for generating such characters in an ancient book calligraphy style. Fig. 2 shows an example of paired and unpaired training data. The paired training data are a character set containing both styles of the same character, whereas the unpaired dataset does not necessarily have such a correspondence.

Fig. 2

Paired and unpaired data. (a) The paired data consist of instances $\{x_{i}, y_{i}\}_{i = 1}^{N}$ , representing the correspondence between x_i and y_i. (b) The unpaired data contain the source set $\{x_{i}\}_{i = 1}^{N} (x_{i} \in X)$ and the target set $\{y_{i}\}_{i = 1}^{N} (y_{i} \in Y)$ , but x_i and y_i do not necessarily match.

1.1 Motivation

The lack of corresponding data is an important reason for the low accuracy of Tangut character recognition and for the difficulty of its study, since data are one of the essential conditions for research methods based on deep learning. The manual collection of Tangut character images is a difficult task, which is not only laborious but also limited in categories and quantity. This has seriously hindered the progress of Tangut character recognition. Generating high-quality Tangut character images by deep learning techniques will greatly promote the development of Tangut character research. Therefore, the design of an image generation network for Tangut characters is an important topic. However, there is currently no generation model for Tangut characters. The main motivation of this paper is to design a Tangut character generation model and to expand the Tangut character database with it.

1.2 Contributions

In this study, for unpaired Tangut characters, the printed Tangut characters are generated into Tangut characters with an ancient book calligraphy style using the Tangut-CycleGAN+SE model. The main contributions of this study are as follows:

We present M-ResNet, a multi-scale feature extraction residual unit, and Tangut-CycleGAN, a model for generating Tangut characters that combines M-ResNet with a CycleGAN. To enhance the response of the model to significant channels, an SE module is added to Tangut-CycleGAN, yielding the Tangut-CycleGAN+SE method for generating images of Tangut characters.

An overall evaluation metric, FA, for generated character images is proposed based on the style discrepancy (FID) and content accuracy (Accuracy).

For Tangut characters, using the model proposed in this study, we can generate high-quality character images, which extends the scale and quality of the Tangut character dataset and solves the problem of missing ancient styles in some categories of the dataset.

1.3 Organization

The rest of the paper is organized as follows. Section 2 describes the work related to image and character generation. Section 3 presents the methodological design of this paper. Section 4 details the related experiments and the analysis based on the experiments. Section 5 gives the conclusion and future directions of the paper.

2 Related work

2.1 Generative adversarial networks

Generative adversarial networks (GANs) [6, 7] are powerful generative models that have achieved impressive results in extensive computer vision tasks, such as image restoration [8, 9], image generation [10, 11], image editing [12], representation learning [11, 13], image-to-image translation [14], and natural language processing tasks including speech synthesis [15] and cross-language learning [16]. GANs frame the design of generative models as a game between two competing networks: a generative network generates synthetic data given the input noise, and the discriminative network distinguishes whether the output of the generative network is real data or not. Formally, the game between the generator and discriminator has the following minimax objective:

$\begin{matrix} min_{G} max_{D} V (D, G) = E_{x \sim p_{data} (x)} [log D (x)] \\ + E_{z \sim p_{z} (z)} [log (1 - D (G (z)))] \end{matrix}$ (1)

where x ∼ p_data (x) is a sample of the input data, z ∼ p_z (z) is a random noise sample, G (z) is the image generated by the neural network generator G, and D (·) is the probability that the input data are real.

cGANs: Because a GAN does not require a pre-built model, its training process is largely unconstrained, which makes the network less controllable for larger images. To address this problem, Mirza et al. proposed conditional generative adversarial nets (cGANs) [17]. This method turns an unsupervised GAN into a supervised one by introducing a conditional variable y as a constraint on both the generator G and the discriminator D. With the condition y added to G and D, the original two-player minimax objective of a GAN becomes the following conditional minimax objective: $\begin{matrix} min_{G} max_{D} V (D, G) = E_{x \sim p_{data} (x)} [log D (x | y)] \\ + E_{z \sim p_{z} (z)} [log (1 - D (G (z | y)))] \end{matrix}$ (2)

pix2pix and zi2zi: Here, pix2pix [5] uses cGANs to implement image-to-image transformations. The method not only learns the image-to-image mapping relationship, but also the loss function applied to train the mapping relationship, solving the drawbacks of traditional methods that require manually designed loss functions. However, this method must use pairs of training data, resulting in limited applications.

In addition, zi2zi [4] embeds multiple font style categories on top of pix2pix, enabling an end-to-end character style transfer. Compared to earlier methods of generating and transferring the style of the character as a combination of font strokes and radicals [18, 19], zi2zi discards complex auxiliary information such as stroke marks, allowing for a more flexible generation of multiple styles of fonts and reduced usage costs.

CycleGAN: Cycle-consistent adversarial networks (CycleGANs) use unpaired training data to learn an image-style transformation [14]. The network consists of two generators and two discriminators that form two cycles. The generator is built from Residual Network (ResNet) units [20], and the discriminator is a PatchGAN [5] classifier. The generator G_X→Y generates an image Y_fake with the style of the target domain Y from the image X_real in the source domain X, and the discriminator D_Y determines whether the style of the generated image is that of the target domain. Then, Y_fake is passed through the generator G_Y→X to generate an image $X_{fake}^{'}$ with the style of the source domain. Finally, the cycle consistency loss constrains $X_{fake}^{'}$ to be consistent with X_real.

2.2 Image-to-image translation

Image-to-image transformation methods can be divided into two types according to the type of training data: methods using paired training data and methods using unpaired training data.

Paired methods include applying a nonparametric structure model [21] to a single input-output training image pair and using a dataset of input-output examples to learn a parametric transformation function [22] with convolutional neural networks (CNNs); both rely on paired training data. The pix2pix framework proposed by Isola et al. [5] has achieved excellent results in several areas, including semantic layout [23] and sketch-based image synthesis [24]. This framework uses the widely applied cGANs to learn the mapping from input images to output images.

CycleGANs conduct a style transformation of the images on unpaired training datasets and can effectively address the challenge of image generation when data are unpaired or difficult to pair.

2.3 Handwritten Chinese character generation

Studies [18, 25] have treated Chinese characters as combinations of radicals and strokes and have generated characters from their shapes by transforming the shapes of strokes into a parametric representation, which is a concise method of stroke generation. However, this approach is costly and inflexible. By contrast, zi2zi [4], an end-to-end method for generating Chinese characters, is a new attempt at Chinese character style translation using generative adversarial networks. Although the method can directly change the style of Chinese characters, its generated fonts suffer from fuzziness and a lack of stroke fluency, which is even worse on unpaired training data. In the literature [19], an end-to-end style generation model based on generative adversarial networks was able to generate better Chinese characters, with some degree of improvement compared to zi2zi. However, it still relies on paired training data and requires more data for training, which increases the training cost.

2.4 Tangut character generation

The structure of a Tangut character is more complex than that of a Chinese character, and Tangut character style transformation is limited by the difficulty of obtaining paired training data. In this study, we propose Tangut-CycleGAN, a Tangut character generation model for unpaired training data based on CycleGAN and a multi-scale feature extraction residual unit, which can effectively generate Tangut character images in the specified style.

3 Method

3.1 Network architecture

This study aims to transfer a print-style Tangut character into a Tangut character having an ancient book calligraphy style. For this task, the model is trained using unpaired training data to generate a character in the target domain style from a character in the source domain style, such that the generated character is as close as possible to the target domain style while keeping the same content as the source character. In this study, such a task is defined as learning a mapping G_X→Y : X → Y from the source domain font style X to the target domain font style Y, where the training data samples are $\{x_{i}\}_{i = 1}^{N} (x_{i} \in X)$ and $\{y_{i}\}_{i = 1}^{N} (y_{i} \in Y)$ . This study draws on CycleGAN-based ideas to transform character images from different domains using unpaired data. The structure and processing flow of CycleGAN are shown in Fig. 3.

Fig. 3

Structure and processing flow of CycleGAN. The cycle marked by the red arrow is the process by which the source domain X generates the target domain Y. The cycle marked by the blue arrow is the process by which the target domain Y generates the source domain X. Here G_X→Y denotes the generator that generates the target domain Y from the source domain X, and D_Y is the discriminator that determines whether the generated image has the same distribution as the target domain. Similarly, G_Y→X denotes the generator that generates the source domain X from the target domain Y, and D_X is the discriminator that determines whether the generated image has the same distribution as the source domain. The generators G_X→Y and G_Y→X, and the discriminators D_X and D_Y, respectively use the same structure.

In this study, we propose Tangut-CycleGAN, in which G : X → Y is equivalent to the generator of the GANs, which consists of an encoder, a transfer module, and a decoder, the structure of which is shown in Table 1. Conv-Norm-ReLU indicates a Convolution-InstanceNorm-ReLU layer and Deconv-Norm-ReLU is a Fractional-strided-convolution-InstanceNorm-ReLU layer. The architecture of the transfer module is more flexible. In CycleGAN, the transfer module contains six ResNet units:

Table 1

The structure of Tangut-CycleGAN

Module	Component	Specifications
Generator	Encoder	7×7 Conv-Norm-ReLU, 64 filters, stride 1
		5×5 Conv-Norm-ReLU, 128 filters, stride 2
		3×3 Conv-Norm-ReLU, 256 filters, stride 2
	Transfer module	M-ResNet unit
		M-ResNet unit
		M-ResNet unit
	Decoder	3×3 Deconv-Norm-ReLU, 128 filters, stride 1/2
		5×5 Deconv-Norm-ReLU, 64 filters, stride 1/2
		7×7 Deconv-Norm-ReLU, 3 filters, stride 1
Discriminator		PatchGAN

$\begin{matrix} x_{l} = H_{l} (x_{l - 1}) + x_{l - 1} \end{matrix}$ (3)

where x_l-1 and x_l are the input and output of the l-th ResNet unit, respectively, and H_l denotes the composite function of the convolution operation, batch normalization (BN) [26], and rectified linear unit (ReLU) [27]. ResNet uses an identity skip connection to bypass the nonlinear transformation in order to facilitate gradient back-propagation.

In this study, a multi-scale feature extraction residual unit (M-ResNet) is proposed to process the input at multiple filter sizes, the network structure of which is shown in Fig. 4. Filters with different convolutional kernel sizes capture different levels of detail: smaller kernels have a smaller receptive field and capture fine local detail, whereas larger kernels capture information over a larger range and improve the overall credibility. To reduce the number of parameters and the computational cost of the convolutions, the M-ResNet unit uses the grouped convolution shown in Fig. 5. When the number of groups is 1, the convolution is the standard convolution; when the number of groups is G, the total number of parameters is reduced to $\frac{1}{G}$ of the original.

Fig. 4

Structure of ResNet unit and M-ResNet unit.

Fig. 5

Grouped convolution.
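The parameter saving from grouped convolution can be checked with a short calculation. The sketch below counts convolution weights (bias terms ignored) for a hypothetical layer with 256 input and 256 output channels; the function name `conv_weight_count` and the channel counts are illustrative assumptions, not values taken from the M-ResNet implementation.

```python
def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights in a 2-D convolution, ignoring biases."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees in_channels / groups input channels.
    return out_channels * (in_channels // groups) * kernel_size * kernel_size


# Hypothetical 5x5 layer with 256 input and 256 output channels.
standard = conv_weight_count(256, 256, 5, groups=1)
grouped = conv_weight_count(256, 256, 5, groups=4)
print(standard, grouped, standard // grouped)
```

With G = 4 the weight count falls to one quarter of the standard convolution, in line with the 1/G reduction stated in the text.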

One part of the M-ResNet unit uses a structure similar to that of the residual network, and the other part uses convolutional kernels of different sizes to capture different levels of detail. The features extracted from the two parts are then concatenated in a multi-scale feature extraction residual unit in the following manner: $x_{l} = H_{l} (x_{l - 1}) + x_{l - 1} + \underset{i = 0}{\overset{n}{concat}} (H_{i} (x_{l - 1}))$ (4)

where H_l (x_l-1) + x_l-1 denotes the ResNet unit and H_i denotes the composite function of a convolution with the i-th kernel size, BN, and ReLU. The index i distinguishes the kernel sizes: for i = 0, the convolution kernel size is $K_{0}^{2} = 3 \times 3$ and the number of groups is 1. With each increase in i by 1, the kernel side length K_i increases by 2, and the number of groups from low to high is 1, 4, 8,..., n; that is, from i = 2 onward, the number of groups is twice that of the previous branch. In addition, the M-ResNet unit is efficient, flexible, and scalable.
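The kernel-size and group schedule described above can be listed explicitly. This sketch assumes the reading that branch i uses a kernel of side 3 + 2i, with group counts 1, 4, 8, 16, ... (doubling from i = 2 onward); the helper name `branch_schedule` is hypothetical and the number of branches is chosen only for illustration.

```python
def branch_schedule(num_branches):
    """Return (kernel_side, groups) for branches i = 0 .. num_branches - 1."""
    schedule = []
    for i in range(num_branches):
        kernel_side = 3 + 2 * i  # 3, 5, 7, ...
        if i == 0:
            groups = 1  # the 3x3 branch is a standard convolution
        else:
            groups = 4 * 2 ** (i - 1)  # 4, 8, 16, ...
        schedule.append((kernel_side, groups))
    return schedule


for i, (k, g) in enumerate(branch_schedule(4)):
    print(f"branch {i}: {k}x{k} kernel, {g} groups")
```

Pairing larger kernels with more groups keeps the cost of the wide-receptive-field branches in check, since the weight count grows with the square of the kernel side but shrinks in proportion to the number of groups.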

The input to the generator G_X→Y in the Tangut-CycleGAN comes from the source domain X. First, the source domain image is passed through the encoder to obtain lower-dimensional features of the source domain. The output of the encoder is then passed through the transfer module to obtain the features in the target domain. Finally, the output features of the transfer module are decoded by the decoder.
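To make the flow through the generator concrete, the following sketch traces the spatial size of a 128×128 input through the encoder and decoder layers listed in Table 1. The padding and output-padding values are assumptions chosen so that stride-2 layers exactly halve or double the resolution; the source does not state them, and the transfer module is assumed to preserve the resolution.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding, output_padding):
    """Output side length of a transposed (fractionally strided) convolution."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


size = 128
trace = [size]
# Encoder layers as (kernel, stride); padding (kernel - 1) // 2 is assumed.
for kernel, stride in [(7, 1), (5, 2), (3, 2)]:
    size = conv_out(size, kernel, stride, (kernel - 1) // 2)
    trace.append(size)
# Transfer module (M-ResNet units): assumed to keep the spatial size unchanged.
# Decoder layers as (kernel, stride, output_padding); "stride 1/2" in Table 1
# is read as a transposed convolution with stride 2.
for kernel, stride, out_pad in [(3, 2, 1), (5, 2, 1), (7, 1, 0)]:
    size = deconv_out(size, kernel, stride, (kernel - 1) // 2, out_pad)
    trace.append(size)
print(trace)
```

Under these assumptions the encoder reduces the image to a 32×32 feature map, and the decoder restores the original 128×128 resolution.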

The adversarial discriminator D uses a PatchGAN with a size of 70×70 to evaluate whether the generated image meets the expectations of the target domain. The mapping G_Y→X : Y → X and corresponding discriminator D_X are similarly defined.
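The 70×70 figure refers to the receptive field of a single PatchGAN output unit. The sketch below derives it for the layer configuration commonly used for the 70×70 PatchGAN (five 4×4 convolutions with strides 2, 2, 2, 1, 1); that configuration is an assumption here, since the text only states the patch size.

```python
def receptive_field(layers):
    """Receptive field of one output unit for a stack of (kernel, stride) convs."""
    field = 1
    # Walk from the output back to the input, growing the field layer by layer.
    for kernel, stride in reversed(layers):
        field = (field - 1) * stride + kernel
    return field


# Assumed layout: three stride-2 layers followed by two stride-1 layers.
patchgan_layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
print(receptive_field(patchgan_layers))
```

Each output value of the discriminator therefore judges one 70×70 patch of the input rather than the whole image.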

3.2 Loss function

The loss function of the Tangut-CycleGAN consists of two main components: an adversarial loss function and a cycle consistency loss function. The purpose of the adversarial loss function is to match the generated image to the target domain image in terms of the data distribution. For the mapping function G_X→Y : X → Y and its discriminator D_Y, the objective is $\begin{matrix} L_{GAN} (G_{X \to Y}, D_{Y}, X, Y) = E_{y \sim p_{data} (y)} [log D_{Y} (y)] \\ + E_{x \sim p_{data} (x)} [log (1 - D_{Y} (G_{X \to Y} (x)))] \end{matrix}$ (5) where G_X→Y attempts to generate an image G_X→Y (x) similar to the Y domain image, and D_Y is used to discern the difference between G_X→Y (x) and the real image y. The game between the generator G_X→Y and discriminator D_Y has the minimax objective $min_{G_{X \to Y}} max_{D_{Y}} L_{GAN} (G_{X \to Y}, D_{Y}, X, Y)$ . The adversarial loss for the mapping G_Y→X : Y → X and its discriminator D_X is defined similarly, and the game between the generator G_Y→X and discriminator D_X has the minimax objective $min_{G_{Y \to X}} max_{D_{X}} L_{GAN} (G_{Y \to X}, D_{X}, Y, X)$ . In addition, the cycle transformations can transform the generated image back into its original style, with the cycle consistency loss defined as follows: $\begin{matrix} L_{cyc} (G_{X \to Y}, G_{Y \to X}) \\ = E_{x \sim p_{data} (x)} [{∥ G_{Y \to X} (G_{X \to Y} (x)) - x ∥}_{1}] \\ + E_{y \sim p_{data} (y)} [{∥ G_{X \to Y} (G_{Y \to X} (y)) - y ∥}_{1}] \end{matrix}$ (6)

Cycle consistency loss can be seen as a regularisation term, the strength of which is adjusted by λ. The overall objective of the Tangut-CycleGAN is $\begin{matrix} L_{total} (G_{X \to Y}, G_{Y \to X}, D_{X}, D_{Y}) \\ = L_{GAN} (G_{X \to Y}, D_{Y}, X, Y) \\ + L_{GAN} (G_{Y \to X}, D_{X}, Y, X) \\ + λ L_{cyc} (G_{X \to Y}, G_{Y \to X}) \end{matrix}$ (7) where λ controls the relative importance of the two objectives. The optimization problem to be solved is $\begin{matrix} G_{X \to Y}^{*}, G_{Y \to X}^{*} = \arg \\ min_{G_{X \to Y}, G_{Y \to X}} max_{D_{X}, D_{Y}} L_{total} (G_{X \to Y}, G_{Y \to X}, D_{X}, D_{Y}) \end{matrix}$ (8)
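A small numerical sketch of the adversarial, cycle consistency, and total losses on toy data may help fix the notation. Images are flat lists of floats, the generators and discriminators are hypothetical stand-in functions, and expectations are replaced by sample means; none of this reflects the actual network code.

```python
import math


def adversarial_loss(disc, gen, real_targets, sources):
    """Sample estimate of L_GAN: E[log D(y)] + E[log(1 - D(G(x)))]."""
    real_term = sum(math.log(disc(y)) for y in real_targets) / len(real_targets)
    fake_term = sum(math.log(1.0 - disc(gen(x))) for x in sources) / len(sources)
    return real_term + fake_term


def l1_distance(a, b):
    """L1 distance between two flattened images."""
    return sum(abs(p - q) for p, q in zip(a, b))


def cycle_loss(g_xy, g_yx, xs, ys):
    """Sample estimate of L_cyc: forward plus backward reconstruction error."""
    forward = sum(l1_distance(g_yx(g_xy(x)), x) for x in xs) / len(xs)
    backward = sum(l1_distance(g_xy(g_yx(y)), y) for y in ys) / len(ys)
    return forward + backward


def total_loss(g_xy, g_yx, d_x, d_y, xs, ys, lam=10.0):
    """L_total = L_GAN(X->Y) + L_GAN(Y->X) + lam * L_cyc."""
    return (adversarial_loss(d_y, g_xy, ys, xs)
            + adversarial_loss(d_x, g_yx, xs, ys)
            + lam * cycle_loss(g_xy, g_yx, xs, ys))


# Hypothetical stand-ins: the generators shift pixel values, and the
# discriminator always answers 0.5, as if it could not tell real from fake.
def g_xy(img):
    return [p + 0.5 for p in img]


def g_yx(img):
    return [p - 0.25 for p in img]


def undecided(img):
    return 0.5


xs = [[0.0, 1.0], [1.0, 0.0]]
ys = [[0.5, 1.5], [1.5, 0.5]]
print(cycle_loss(g_xy, g_yx, xs, ys))
print(total_loss(g_xy, g_yx, undecided, undecided, xs, ys))
```

Because the two toy generators are not exact inverses, a round trip leaves a residual of 0.25 per pixel, which the cycle term penalises; with λ = 10 this term dominates the total.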

3.3 Tangut-CycleGAN+SE

In previous vision tasks, the SE module [28] achieved excellent results in ImageNet 2017. The SE module consists of two steps: squeeze and excitation. The squeeze step obtains a globally compressed descriptor of the feature map through global average pooling. The excitation step obtains the weight of each channel in the feature map through a two-layer fully connected bottleneck structure, and the reweighted feature map is used as the input to the next layer of the network. SE is used in MobileNetV3 [29], MSENet [30], FuSENet [31], and other computer vision models, and has achieved fruitful results.
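The squeeze, excitation, and reweighting steps can be sketched on a tiny feature map in plain Python. The weight matrices below are made-up numbers rather than trained values, biases are omitted, and the function name `se_block` is hypothetical.

```python
import math


def se_block(feature_map, w1, w2):
    """Apply squeeze-and-excitation to a feature map given as [channel][row][col]."""
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_map]
    # Excitation, layer 1: fully connected reduction followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, layer 2: fully connected expansion followed by a sigmoid.
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
               for row in w2]
    # Scale: multiply every channel by its weight.
    scaled = [[[weights[c] * v for v in row] for row in ch]
              for c, ch in enumerate(feature_map)]
    return scaled, weights


# Two 2x2 channels with means 2.0 and 1.0.
fmap = [[[2.0, 2.0], [2.0, 2.0]],
        [[1.0, 1.0], [1.0, 1.0]]]
w1 = [[1.0, -1.0]]     # 2 channels -> 1 hidden unit (bottleneck)
w2 = [[2.0], [-2.0]]   # 1 hidden unit -> 2 channel weights
scaled, weights = se_block(fmap, w1, w2)
print(weights)
```

With these illustrative weights the first channel is emphasised and the second suppressed, which is the channel-wise reweighting the SE module is meant to provide.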

Based on the M-ResNet unit, this study adds the SE attention module to the transfer module to improve the overall quality and details of the generated images. In this study, the SE attention module is added to the first and third layers of the M-ResNet unit, and the final Tangut character generation model Tangut-CycleGAN+SE is constructed. The generator structure of Tangut-CycleGAN+SE is shown in Fig. 6.

Fig. 6

Tangut-CycleGAN+SE generator structure. The source domain image is fed into the transfer module through the encoder E with low-dimensional spatial features, and the output of the transfer module, i.e., the extraction of the target domain style features, is finally decoded by the decoder D.

Tangut-CycleGAN+SE draws on CycleGAN-based ideas to transform Tangut character images of different domains using unpaired data. The M-ResNet unit has an efficient, flexible, and scalable design. Its different receptive fields increase the capability to extract features at different levels, which widens the field of view of the network and allows target images to be generated more accurately. At the same time, grouped convolution is used to reduce model parameters, lower training difficulty, and improve model efficiency. Since the output of the generative network should be both content-accurate and stylistically similar to the target domain, attention mechanisms are fused at different locations of the network to improve the content and style quality of the generated images.

4 Experiment

4.1 Datasets

Tangut-character dataset. TCD [32] is a Tangut character database for Tangut character recognition. Its images are scanned from more than ten Tangut documents, covering various styles such as ancient movable type printing, modern printing, and handwriting. The proportion of characters from ancient books to those from modern books is about 8:2. TCD contains 6,077 classes of Tangut characters: 2,110 classes of characters from ancient books, totaling 94,459 characters, and 6,077 classes of characters from modern printed editions, totaling 30,165 characters. The ratio of the training, test, and validation sets is 6:2:2. TCD is one of the most comprehensive, largest, and stylistically richest databases of Tangut characters available.

TCD-G is a dataset used to generate images of Tangut characters. It is composed of 6,077 classes of modern printed Tangut characters and some randomly selected Tangut characters in the ancient book calligraphy style from the TCD; the images are resized to a pixel resolution of 128×128 and then binarized. The TCD-G is divided into a training set and a test set. The training set contains two folders, A and B. Folder A contains 818 different images of modern printed Tangut characters, whereas folder B contains 818 images of Tangut characters in the ancient book calligraphy style. The test set has the same structure as the training set, with the only difference being that folder A in the test set holds 6,077 different Tangut characters so that the generation results for all character classes can be verified.

Lanting calligraphy dataset [33]. The Lanting calligraphy dataset is made up of character images isolated from the most famous work, Lantingji Xu, of the greatest calligrapher in Chinese history, Wang Xizhi, and consists of 324 semi-cursive characters. These 324 semi-cursive characters are used as the target styles, and 324 SIMKAI characters are used as the source styles. For the experiments used in this study, all character images were resized to a pixel resolution of 128×128.

Similar to TCD-G, the Lanting calligraphy dataset is divided into a training set and a test set, where the training set contains two folders, A and B. Folder A contains 300 different images of SIMKAI fonts, whereas folder B contains 300 images of semi-cursive characters. The test set is organized in the same way, with folders A and B each containing 24 images.

4.2 Evaluation metrics

There are already many evaluation metrics for generative models, most of which are evaluated independently in terms of both style and content, lacking an overall quality evaluation of character generation models. To evaluate the performance of the Tangut character generation model, in this study, an overall evaluation metric, FA, that combines content and style is proposed.

Style evaluation. The inception score (IS) considers only the generated samples and not the real data, resulting in an inability to reflect the distance between real data and the generated samples. Instead, the Fréchet inception distance (FID) [34] is calculated as the distance between the real and generated samples in the feature space. The smaller the FID value is, the higher the quality of the generated image. The FID is calculated using the following formula: $\begin{matrix} FID (x, g) = {∥ μ_{x} - μ_{g} ∥}_{2}^{2} \\ + Tr (\Sigma_{x} + \Sigma_{g} - 2 {(\Sigma_{x} \Sigma_{g})}^{\frac{1}{2}}) \end{matrix}$ (9) where μ_x denotes the mean of the real image features, μ_g denotes the mean of the generated image features, $\Sigma_{x}$ denotes the covariance matrix of the real image features, and $\Sigma_{g}$ denotes the covariance matrix of the generated image features.
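For intuition, the FID formula can be evaluated in the one-dimensional case, where the covariance matrices reduce to scalar variances and the trace term becomes (σ_x − σ_g)². The sketch below uses made-up scalar features and the population variance; a real FID uses high-dimensional Inception features and a matrix square root, which this simplification does not cover.

```python
import math


def mean_and_variance(values):
    """Sample mean and population variance of a list of scalars."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def fid_1d(real_features, generated_features):
    """Frechet distance between 1-D Gaussians fitted to two feature sets."""
    mu_x, var_x = mean_and_variance(real_features)
    mu_g, var_g = mean_and_variance(generated_features)
    # Scalar form of the formula: the matrix square root is an ordinary sqrt.
    return (mu_x - mu_g) ** 2 + var_x + var_g - 2.0 * math.sqrt(var_x * var_g)


real = [0.0, 2.0, 4.0]        # hypothetical scalar features of real images
generated = [1.0, 3.0, 5.0]   # same spread, mean shifted by 1
print(fid_1d(real, generated))
print(fid_1d(real, real))
```

When the two feature sets have the same spread, only the squared difference of the means contributes, and identical sets give a distance of zero (up to rounding).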

Content evaluation. HCCR-CNN9Layer [35] surpasses the recognition accuracy of HCCR-GoogLeNet [36] in handwritten character recognition and has achieved excellent recognition results on the Chinese handwritten dataset CASIA-HWDB. In this study, the model was also tested on the TCD; its Top-1 and Top-5 accuracies are listed in Table 2. The HCCR-CNN9Layer model has a high recognition performance and can therefore be used to judge the content accuracy of the generated characters. In this study, the character recognition accuracy (Accuracy) is referred to as ACC; the higher the value of ACC, the higher the accuracy of the generated image content. The ACC is expressed as follows: $\begin{matrix} ACC = \frac{TP + TN}{P + N} \end{matrix}$ (10) where (TP + TN) denotes the number of correctly predicted samples and (P + N) denotes the total number of samples.

Table 2

The recognition accuracy of HCCR-CNN9Layer on the TCD dataset

Database	Top-1(%)	Top-5(%)
TCD	93.18	96.11

Overall evaluation. Building on the two evaluation metrics FID and ACC, this study proposes an evaluation metric FA that integrates style discrepancy with content accuracy, calculated as follows: $\begin{matrix} FA (FID, ACC) = FID + λ (\frac{1}{ACC} - 1) \end{matrix}$ (11) where FID ∈ [0, + ∞), ACC ∈ (0, 1], and FA ∈ [0, + ∞). The smaller the value of FA, the higher the overall quality of the generated image and the more closely it resembles the target domain. FA converts the value of ACC into the reciprocal $\frac{1}{ACC}$ and subtracts 1 so that the accuracy term starts from zero. The parameter λ then scales $\frac{1}{ACC} - 1$ to an order of magnitude similar to that of the FID score. Empirically, λ = 100 is an appropriate choice. Note that this λ is distinct from the weight λ of the cycle consistency loss in Eq. (7).
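The FA formula is simple enough to state as a few lines of code. The sketch below uses hypothetical FID and ACC values, not results from the experiments, to show how two outputs with the same FID are separated by their content accuracy; the function name `fa_score` is an illustrative choice.

```python
def fa_score(fid, acc, lam=100.0):
    """FA = FID + lam * (1 / ACC - 1); lower is better."""
    if fid < 0:
        raise ValueError("FID must be non-negative")
    if not 0.0 < acc <= 1.0:
        raise ValueError("ACC must lie in (0, 1]")
    return fid + lam * (1.0 / acc - 1.0)


# Hypothetical scores: identical FID, different content accuracy.
print(fa_score(fid=50.0, acc=0.8))
print(fa_score(fid=50.0, acc=0.5))
print(fa_score(fid=50.0, acc=1.0))
```

With perfect accuracy the penalty term vanishes and FA equals the FID; as ACC falls, the penalty grows quickly, so a stylistically plausible but unrecognizable character is scored poorly.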

To verify the validity of the overall evaluation metric, this study trained a Tangut character generation model using CycleGAN, generated nine groups of Tangut characters, and evaluated these nine groups of character images using each of the three evaluation metrics. The results are shown in Figs. 7 and 8. In Fig. 7, the accuracy is rescaled to a value from zero to 100 for observation purposes. As shown in Fig. 7, FA provides a more objective evaluation of the overall quality of the generated images than either the FID or ACC alone. Two cases in which the FID or ACC alone gives a misleading assessment are analysed separately below.

Fig. 7

Performance of the 3 different evaluation metrics on the 9 sets of generated images.

Fig. 8

Three sets of evaluation metrics for the nine sets of Tangut character images.

Case 1: When the value of the FID is low, the overall quality of the generated image is not necessarily high.

As shown in Fig. 7, the FID score of the group 4 images is similar to those of groups 2 and 3. Judged by the FID alone, the quality of the images generated in group 4 should be similar to that of the two preceding groups. However, the ACC score of the group 4 images is lower than those of the two preceding groups, indicating that the content of these images is less accurate; in this situation, the FID score does not accurately reflect the quality of the generated images. By contrast, the FA score proposed in this study is higher for group 4 than for the two preceding groups, indicating that its overall quality is lower. The Tangut character images of groups 2, 3, and 4 and their three evaluation metrics are shown in Fig. 8.

Case 2: When the value of ACC is high, the overall quality of the generated image is not necessarily high.

As shown in Fig. 7, the ACC score of group 2 was slightly higher than that of group 1. Based on the ACC score, the overall quality of the images of group 2 should be higher than that of group 1. However, group 1 had better results than group 2 in terms of style similarity, as shown in Fig. 8, and it can be seen from the observation that group 1 generates better results than group 2. The scores of the evaluation metrics FA proposed in this study are slightly lower for group 1 than those for group 2, indicating that the overall quality of the generated images is higher for group 1 than for group 2. The three sets of evaluation metrics for groups 1 and 2 of the Tangut character images are shown in Fig. 8.

It is also apparent from the scores of several other groups that the evaluation metric FA more accurately represents the overall quality of the images generated by the model. In the experiments conducted in this study, the performance of the model was evaluated in both quantitative and qualitative terms.

4.3 Experiment environment and implementation details

The generative model architecture in this study was implemented using the PyTorch platform, and the recognition model architecture was implemented using the TensorFlow platform, with all parameters left at their default values.

Prior to training, the datasets were pre-processed. For the generation experiments, the training images were only resized to a resolution of 128×128 pixels; no other pre-processing (e.g. cropping or flipping) was applied. In all training runs, the regularisation strength was set to λ = 10, and the model was trained for a total of 200 epochs using the ADAM [37] optimizer with a batch size of 6. The learning rate was 0.0002 for the first 100 epochs, after which it decayed linearly to zero. The number of iterations in each epoch depends on which of the two styles has the larger number of images.
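The schedule described above can be sketched in a few lines. This is a minimal illustration, treating each of the 200 training passes as one epoch; the function names, the 0-based epoch index, the exact decay formula, and the rounding up of the iteration count are assumptions, not the training code used in this study.

```python
import math

BASE_LR = 0.0002        # initial learning rate stated in the text
CONSTANT_EPOCHS = 100   # passes trained at the constant base rate
DECAY_EPOCHS = 100      # passes over which the rate decays linearly to zero
BATCH_SIZE = 6          # batch size stated in the text


def learning_rate(epoch: int) -> float:
    """Learning rate for a 0-based epoch index (assumed decay formula)."""
    if epoch < CONSTANT_EPOCHS:
        return BASE_LR
    remaining = CONSTANT_EPOCHS + DECAY_EPOCHS - epoch
    return BASE_LR * max(remaining, 0) / DECAY_EPOCHS


def iterations_per_epoch(num_source: int, num_target: int) -> int:
    """Iterations per pass, driven by the style with more images."""
    return math.ceil(max(num_source, num_target) / BATCH_SIZE)


# Hypothetical image counts, used only to show the call.
print(learning_rate(150), iterations_per_epoch(100, 250))
```

Under these assumptions the rate is halved midway through the decay phase and reaches zero after the 200th pass.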

To keep running times short, all image generation experiments in this study were conducted on a PC with a 3.40-GHz Intel(R) Core(TM) i7-6700 CPU, 16 GB of RAM, an NVIDIA TITAN V 20 GB GPU, and Windows 10. All scoring tests and recognition experiments were conducted on another PC with a 3.50-GHz Intel(R) Xeon(R) E5-1620 v3 CPU, 64 GB of RAM, a 39.9-GB NVIDIA GeForce RTX 2080 GPU, and Windows 10.

4.4 Generated results of Tangut characters

In this study, we use printed Tangut characters as the source font and processed ancient book calligraphy style Tangut characters as the target font. Because the number of characters available in some styles is limited in practical applications, the goal of this study is to train the model with as small a training set as possible and to test the style transformation effect on the test set.

4.4.1 Ablation experiments using M-ResNet units

The purpose of this experiment was to determine the optimal number of M-ResNet units; the number of M-ResNet units in the transfer module was varied from 1 to 6. The results are shown in Table 3. When the number of M-ResNet units is between 1 and 5, the FID and FA values are smaller than those of CycleGAN (ResNet-6) [14], which shows that the M-ResNet unit has a clear advantage in terms of style detail and overall image quality. The overall quality of the generated characters is highest when the number of M-ResNet units is 1, 2, or 3, for which the FA scores are reduced by 14.21, 13.10, and 11.67, respectively, compared to ResNet-6.

Table 3
Experiment results for the three metrics of M-ResNet unit versus ResNet unit

Model	FID	ACC(%)	FA
ResNet-6 [14]	79.96	94.97	85.25
M-ResNet-6	103.63	96.72	107.02
M-ResNet-5	79.64	95.11	84.78
M-ResNet-4	68.60	86.60	83.91
M-ResNet-3	61.73	89.40	73.58
M-ResNet-2	65.26	93.55	72.15
M-ResNet-1	66.01	95.21	71.04
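The FA column can be checked against the FID and ACC columns with a short sketch. Equation (11) is not reproduced in this section, so the form FA = FID + λ(1/ACC − 1) used below is an assumption chosen to be consistent with the partial derivatives given in the Appendices, and λ = 100 is inferred from the table values rather than stated here; the function name `fa_score` is hypothetical.

```python
LAMBDA = 100.0  # assumed weight, inferred from Table 3 (not stated in this section)


def fa_score(fid: float, acc_percent: float, lam: float = LAMBDA) -> float:
    """Assumed form of FA: FID + lam * (1/ACC - 1), with ACC as a fraction."""
    acc = acc_percent / 100.0
    if not 0.0 < acc <= 1.0:
        raise ValueError("ACC must lie in (0, 100] percent")
    return fid + lam * (1.0 / acc - 1.0)


# (FID, ACC %) pairs copied from Table 3
table3 = {
    "ResNet-6": (79.96, 94.97),
    "M-ResNet-3": (61.73, 89.40),
    "M-ResNet-2": (65.26, 93.55),
    "M-ResNet-1": (66.01, 95.21),
}

baseline = fa_score(*table3["ResNet-6"])
for name, (fid, acc) in table3.items():
    fa = fa_score(fid, acc)
    print(f"{name}: FA = {fa:.2f}, reduction vs ResNet-6 = {baseline - fa:.2f}")
```

For these four rows the computed values agree with the listed FA scores to within rounding, which supports the assumed form without confirming it.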

4.4.2 Ablation experiments using SE modules

(1) Ablation experiments with the addition of single-SE modules

This experiment builds on the experiment described in Section 4.4.1 and examines how adding the SE module to the model affects image generation. We added the SE module to M-ResNet-1, M-ResNet-2, and M-ResNet-3; the results are shown in Table 4. Here, +SE0, +SE1, and +SE2 denote the addition of an SE module after the first, second, and third M-ResNet units, respectively, and every later occurrence of +SEn has the same meaning. Among all model combinations, the FA scores of M-ResNet-2+SE0 and M-ResNet-3+SE2 were 67.91 and 71.79, respectively, the two lowest among the models in Table 4. These are decreases of 4.24 and 1.79, respectively, relative to the same models before the SE module was introduced. Among the three M-ResNet-3 combinations, channel attention placed in the early stage of the network tends to focus more on image content, whereas channel attention placed at the end of the network tends to focus more on image style.

Table 4
Experiment results of adding SE modules to M-ResNet-1, M-ResNet-2, and M-ResNet-3

Model	FID	ACC(%)	FA
M-ResNet-1+SE0	82.44	96.77	85.77
M-ResNet-2+SE0	58.69	91.55	67.91
M-ResNet-2+SE1	72.12	96.48	75.76
M-ResNet-3+SE0	68.70	93.21	75.98
M-ResNet-3+SE1	73.24	89.79	84.61
M-ResNet-3+SE2	58.26	88.08	71.79
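The +SEn naming used in Table 4 can be made concrete with a small sketch that lists the blocks of a transfer module in order. It assumes that index n is 0-based and that an SE block directly follows M-ResNet unit n; the function names and block labels are hypothetical and only illustrate the naming convention, not the network implementation.

```python
from typing import Iterable, List


def build_transfer_module(num_units: int, se_after: Iterable[int] = ()) -> List[str]:
    """Ordered block labels: an SE block follows every unit listed in se_after."""
    positions = set(se_after)
    invalid = sorted(n for n in positions if not 0 <= n < num_units)
    if invalid:
        raise ValueError(f"SE index out of range: {invalid}")
    blocks: List[str] = []
    for unit in range(num_units):
        blocks.append(f"M-ResNet[{unit}]")
        if unit in positions:
            blocks.append(f"SE[{unit}]")
    return blocks


def model_name(num_units: int, se_after: Iterable[int] = ()) -> str:
    """Model label in the style used in the tables, e.g. M-ResNet-3+SE2."""
    suffix = "".join(f"+SE{n}" for n in sorted(set(se_after)))
    return f"M-ResNet-{num_units}{suffix}"


print(model_name(3, [2]), build_transfer_module(3, [2]))
```

With this reading, M-ResNet-3+SE2 places its single SE block after the last of the three units, and M-ResNet-2+SE0 places it after the first of two.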

(2) Ablation experiments with the addition of multiple SE modules

In the ablation experiments with a single SE module, adding channel attention after the first and third M-ResNet units was found to improve the overall quality of the generated images in terms of both content and style. Therefore, further experiments were conducted with different combinations of the location and number of SE modules, the results of which are shown in Table 5. The FA score of the M-ResNet-3+SE0+SE2 model, with SE modules added after the first and third M-ResNet units, was reduced by 27.24 compared to that of ResNet-6 and by 15.57 compared to that of M-ResNet-3. It was also more than 10 lower than those of the other multi-SE combinations in Table 5. The experiment results show that the M-ResNet-3+SE0+SE2 model is the most effective for style transformation among the models compared.

Table 5

Experiment results of adding multiple SE modules to M-ResNet-2 and M-ResNet-3

Model	FID	ACC(%)	FA
M-ResNet-2+SE0+SE1	72.72	96.97	75.84
M-ResNet-3+SE0+SE1	69.88	88.67	82.65
M-ResNet-3+SE0+SE2	49.31	91.99	58.01
M-ResNet-3+SE1+SE2	69.63	90.42	80.22
M-ResNet-3+SE0+SE1+SE2	57.36	90.08	68.37

4.4.3 Method comparison experiment

This experiment compares the proposed model with other models in both quantitative and qualitative terms. Among them, natural style transfer [38] used VGG-19 as the pre-trained model, with relu4_2 of VGG-19 applied for the content loss and relu1_1, relu2_1, relu3_1, relu4_1, and relu5_1 used for the style loss; it was trained on paired data and its best results were selected, whereas the other methods used unpaired data. Coordinate attention (CA) [39] is an attention mechanism that embeds location information into channel attention. It encodes precise location information into channel relationships and long-range dependencies through two steps: coordinate information embedding and coordinate attention generation. It considers not only the relationship between channels but also the location information in the feature space. CA has been widely used in classical mobile networks such as MobileNet V2 [40], MobileNeXt [41], and EfficientNet [42]. CA0 and CA2 denote the addition of CA modules in the first and third bottleneck layers, respectively. M-ResNet-3+CA0 and M-ResNet-3+CA0+CA2 are the two best-performing CA combinations among all those tested in this study.
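The two steps of coordinate attention named above can be sketched for a single feature-map channel. This is a simplified illustration, not the module of [39] and not the code used in this study: the learned convolutions that turn the pooled descriptors into attention weights are replaced by the identity, and the names `Channel` and `coordinate_attention` are hypothetical.

```python
import math
from typing import List

Channel = List[List[float]]  # one feature-map channel, indexed [row][col]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def coordinate_attention(channel: Channel) -> Channel:
    """Reweight one channel with row-wise and column-wise attention gates."""
    height, width = len(channel), len(channel[0])
    # Step 1: coordinate information embedding (directional average pooling).
    row_desc = [sum(row) / width for row in channel]
    col_desc = [sum(channel[i][j] for i in range(height)) / height
                for j in range(width)]
    # Step 2: coordinate attention generation. The learned transform of the
    # real module is replaced by the identity here (assumption for brevity).
    row_gate = [sigmoid(v) for v in row_desc]
    col_gate = [sigmoid(v) for v in col_desc]
    return [[channel[i][j] * row_gate[i] * col_gate[j] for j in range(width)]
            for i in range(height)]


# Toy 2x2 channel, used only to show the call.
print(coordinate_attention([[1.0, 0.0], [0.0, 1.0]]))
```

Because the gates are indexed by row and by column separately, each output value depends on where it sits in the feature map, which is the location information that plain channel attention discards.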

In Table 6, the ACC values of the proposed method differ little from those of the other methods, but the proposed method significantly outperforms them on the other two metrics, FID and FA, where the reduction in both metrics is more than 10 in comparison to DenseNet CycleGAN (DenseNet-5) [33] and ResNet-6. In Fig. 9, it can be seen that the proposed method handles the details of character images better; in particular, the M-ResNet-3+SE0+SE2 model generates more realistic images and captures the strokes of the characters more faithfully.

Table 6
Experiment results of different methods on the Tangut character dataset

Model	FID	ACC(%)	FA
DenseNet-5 [33]	106.67	85.15	124.10
ResNet-6 [14]	79.96	94.97	85.25
M-ResNet-3+CA0	62.41	90.62	72.76
M-ResNet-3+CA0+CA2	65.43	89.26	77.46
M-ResNet-3 (Our)	61.73	89.40	73.58
M-ResNet-3+SE0+SE2 (Our)	49.31	91.99	58.01

Fig. 9

Different methods used to generate ancient book calligraphy style Tangut characters. (a) Printed characters, (b)–(f) characters generated using natural style transfer, DenseNet-5, ResNet-6, M-ResNet-3, and M-ResNet-3+SE0+SE2, respectively, and (g) actual ancient book calligraphy style Tangut characters. The images generated through natural style transfer have major flaws and even errors in the content. The parts circled in red in the first row of (c)–(e) show various degrees of defects in content or style, whereas the character generated in (f) is closer to the actual ancient book calligraphy style Tangut characters in terms of content and style.

Figure 10 shows the generation results of M-ResNet-3+SE0+SE2 for 10 similar Tangut characters that were not used in the model training, and Fig. 11 shows a comparison of the results of 10 randomly selected images of Tangut characters generated by different models. As shown in Figs. 10 and 11, the Tangut characters generated by M-ResNet-3+SE0+SE2 are closer to the content and style of the Tangut characters found in ancient book calligraphy.

Fig. 10

Results generated using M-ResNet-3+SE0+SE2 for 10 similar Tangut characters with Four-Corner System 2486–2495. The first row shows the printed characters; the second row shows the generated ancient book calligraphy style characters. None of these Tangut characters were involved in the training, yet the generated characters are clearly legible, of high overall quality, and similar in calligraphy style and aesthetics to those of an ancient Tangut book.

Fig. 11

Results of the proposed method and several other approaches for generating ancient book calligraphy style Tangut characters. (a) Printed characters, (b)–(e) characters generated using DenseNet-5, ResNet-6, M-ResNet-3, and M-ResNet-3+SE0+SE2, respectively, and (f) actual ancient book calligraphy style Tangut characters. Among them, the characters generated in (b)–(e) show improvements in content, style, and details, particularly those generated in (e), which are overall the most similar to actual ancient book calligraphy style Tangut characters.

4.4.4 Recognition experiments

In this study, the generated Tangut characters were added to the original Tangut character dataset (TCD) for training and testing, and the experiment results are shown in Fig. 12. When 5, 10, and 15 images generated by the proposed model were added to each category of Tangut characters on top of the TCD dataset, the accuracy of Tangut character recognition increased steadily. For TCD+15, the Top-1 accuracy increased by 3.31% and the Top-10 accuracy increased by 1.63% compared with TCD. The results show that the Tangut characters generated in this study help to improve the accuracy of Tangut character recognition and can effectively extend the Tangut character dataset.

Fig. 12

Recognition accuracy after adding the generated Tangut characters.
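Top-1 and Top-10 accuracy are read here in the usual sense: a sample counts as correct when its true category is among the k highest-scoring predictions. A minimal sketch of that computation follows; the function name is hypothetical and the scores and labels are made-up toy values, not data from the TCD experiments.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k top-scoring classes."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and equally long")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example with three samples and three categories.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, 1), top_k_accuracy(scores, labels, 2))
```

Top-k accuracy can only stay the same or rise as k grows, which is why a gain in Top-10 accuracy is typically smaller than the corresponding gain in Top-1 accuracy.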

4.5 Experiment results of calligraphy

To verify the general validity of the proposed method, an experiment was carried out on the calligraphic characters of Lantingji Xu. As shown in Fig. 13, character images were generated using four methods, and the results show that these methods captured the overall character style of Wang Xizhi and generated reasonable character images. Compared with ResNet-6, M-ResNet-3 and M-ResNet-3+SE0+SE2 generate images with fewer missing strokes, and the stroke thickness, stroke fluency, and overall quality of the characters generated by the proposed method are better than those of the other methods, indicating that the proposed method yields better results in terms of key details and overall style. However, all four methods are limited in their ability to learn certain features of Wang Xizhi’s semi-cursive style. For example, for the character ‘qing’ in the last row, the strokes circled in red are joined in the Lanting calligraphy, whereas none of the four methods learned this joined style.

Fig. 13

Chinese calligraphic characters in the Lanting calligraphy dataset. (a) Lanting calligraphy true characters, (b)–(e) Wang Xizhi style characters generated using DenseNet-5, ResNet-6, M-ResNet-3, and M-ResNet-3+SE0+SE2, respectively.

5 Conclusion

In this study, we formulated the generation of ancient book calligraphy style Tangut characters as learning a mapping from printed Tangut characters to ancient book calligraphy style Tangut characters. We combined the M-ResNet unit with an SE attention mechanism to design Tangut-CycleGAN+SE, a generative model for Tangut character style transfer that generates Tangut character images with accurate content and a style close to that of ancient book calligraphy. On this basis, an overall quality evaluation metric, FA, was proposed by combining content accuracy and style discrepancy. Extensive experiments demonstrated the effectiveness of this method, and excellent results were achieved in the generation of Tangut characters.

The proposed Tangut character generative model can generate high-quality images of ancient book calligraphy style Tangut characters, which to a certain extent fills the gap left by the lack of ancient book calligraphy style fonts for many character categories. Nevertheless, many problems and directions deserve further research, mainly in the following three aspects.

Optimize the network: design a more lightweight generation network, improve the quality of the generated images, and enrich the image styles of the Tangut character database.

The method in this paper can only generate single characters. Generating whole pages of text is an important future direction, which would greatly improve generation efficiency and open up a new direction for text recognition.

Tangut character generation currently learns only a single style mapping. One-to-many style mapping is also an important research direction, and the method will be extended to other scripts to generate fonts in a variety of styles, for example art fonts and personalized signatures.

Acknowledgments

This work was supported by the Basic Scientific Research in Central Universities of North Minzu University (FWNX21, 2021KJCX09), the Natural Science Foundation of Ningxia Province (2020AAC3215, 2022AAC03268), the Innovation Team of Computer Vision and Virtual Reality of North Minzu University, and the National Natural Science Foundation of China (61462002).

Appendices

Derivation of Equation (11):

∵ The smaller the FID value is, the closer the style of the generated image is to the target style, and the partial derivative of Equation (11) with respect to FID is ${FA}_{FID}^{'} (FID, ACC) = 1 > 0$.

∴ FA has a positive relationship with FID; in addition, the smaller the FID value is, the smaller the FA value, indicating that the style of the generated image is closer to the target style.

∵ As shown in Fig. 14, $f (ACC) = \frac{1}{ACC}$ and $g (ACC) = ACC$ are reciprocals of each other, and the smaller the value of $g (ACC)$ within the interval (0, 1], the more steeply $f (ACC)$ grows. In addition, $\frac{1}{ACC} - 1$ and $f (ACC)$ have the same monotonicity. The accuracy of the content is crucial to the overall quality of a generated character image. Therefore, it is reasonable and necessary for $f (ACC)$ to sharply increase the penalty on the overall quality of the generated image when the content does not match that of the source domain. (As shown in Fig. 8, although groups 5 and 9 have similar FID values, group 9 has a 48.47% decrease in ACC compared to group 5, whereas its FA increases by 123.96, so the penalty grows sharply.) The partial derivative of Equation (11) with respect to ACC is ${FA}_{ACC}^{'} (FID, ACC) = - \frac{\lambda}{{ACC}^{2}} < 0$.

∴ FA has a negative relationship with ACC: as ACC increases, FA decreases, and the rate of decrease becomes slower. The larger the ACC value is, the smaller the FA value, indicating that the content of the generated image is more accurate.

Fig. 14

Graph of $f (x) = \frac{1}{x}$ (red) and f (x) = x (blue) when x ∈ (0, 4).

∴ In summary, the smaller the FID value and the larger the ACC value, the smaller the FA value is, indicating that the character image generated by the model is more similar to the target domain image. When FID = 0 and ACC = 1, the FA obtains a minimum value of zero.
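The properties used in this derivation can be collected in one place. Equation (11) itself is not reproduced in this part of the paper, so the first line below is an assumed form, chosen because it is consistent with both partial derivatives and with the minimum value stated above; the second derivative is an added step that makes the slowing rate of decrease explicit.

```latex
\begin{align*}
FA(FID, ACC) &= FID + \lambda\left(\frac{1}{ACC} - 1\right),
  \qquad ACC \in (0, 1],\ \lambda > 0 \\
\frac{\partial FA}{\partial FID} &= 1 > 0 \\
\frac{\partial FA}{\partial ACC} &= -\frac{\lambda}{ACC^{2}} < 0 \\
\frac{\partial^{2} FA}{\partial ACC^{2}} &= \frac{2\lambda}{ACC^{3}} > 0 \\
FA(0, 1) &= 0 + \lambda\,(1 - 1) = 0
\end{align*}
```

The positive second derivative means that FA is convex in ACC: a loss of accuracy is penalized more heavily when ACC is already low, and the penalty flattens out as ACC approaches 1.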

Declaration of competing interest

The authors declare that there is no conflict of interest related to the submission of the paper.

References

1. Xu, Jin, Jiang and Lau F.C., Automatic generation of personal chinese handwriting by capturing the characteristics of personal handwriting, in: Twenty-First IAAI Conference, 2009, pp. 191–196.

2. Xu, Lau F.C., Cheung W.K. and Pan, Automatic generation of artistic chinese calligraphy, IEEE Intelligent Systems 20(3) (2005), 32–39.

3. Liu, Xu and Lin, Automatic generation of personalized chinese handwriting characters, in: 2012 Fourth International Conference on Digital Home, IEEE, 2012, pp. 109–116.

4. Tian, zi2zi: Master chinese calligraphy with conditional adversarial networks, Internet, https://github.com/kaonashi-tyc/zi2zi (2017).

5. Isola, Zhu J.Y., Zhou and Efros A.A., Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.

6. Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville and Bengio, Generative adversarial nets, Advances in Neural Information Processing Systems 27 (2014), 1–9.

7. Zhao, Mathieu and LeCun, Energy-based generative adversarial network, arXiv preprint arXiv:1609.03126 (2016).

8. Pathak, Krahenbuhl, Donahue, Darrell and Efros A.A., Context encoders: Feature learning by inpainting, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.

9. Iizuka, Simo-Serra and Ishikawa, Globally and locally consistent image completion, ACM Transactions on Graphics (ToG) 36(4) (2017), 1–14.

10. Denton E.L., Chintala, Szlam and Fergus, Deep generative image models using a laplacian pyramid of adversarial networks, Advances in Neural Information Processing Systems 28 (2015), 1486–1494.

11. Radford, Metz and Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434 (2015).

12. Zhu J.Y., Krähenbühl, Shechtman and Efros A.A., Generative visual manipulation on the natural image manifold, in: European Conference on Computer Vision, Springer, 2016, pp. 597–613.

13. Mathieu M.F., Zhao J.J., Zhao, Ramesh, Sprechmann and LeCun, Disentangling factors of variation in deep representation using adversarial training, Advances in Neural Information Processing Systems 29 (2016), 5047–5055.

14. Zhu J.Y., Park, Isola and Efros A.A., Unpaired image-to-image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.

15. Oord A.V.D., Dieleman, Zen, Simonyan, Vinyals, Graves, Kalchbrenner, Senior and Kavukcuoglu, Wavenet: A generative model for raw audio, arXiv preprint arXiv:1609.03499 (2016).

16. Joty, Nakov, Màrquez and Jaradat, Cross-language learning with adversarial neural networks: Application to community question answering, arXiv preprint arXiv:1706.06749 (2017).

17. Mirza and Osindero, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784 (2014).

18. Mao and Pan, A style transformation method for printed chinese characters, in: International Conference on Image and Graphics, Springer, 2015, pp. 494–503.

19. Chang and Gu, Chinese typography transfer, arXiv preprint arXiv:1707.04904 (2017).

20. He, Zhang, Ren and Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

21. Efros A.A. and Leung T.K., Texture synthesis by non-parametric sampling, in: Proceedings of the Seventh IEEE International Conference on Computer Vision 2 (1999), 1033–1038.

22. Long, Shelhamer and Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.

23. Karacan, Akata, Erdem and Erdem, Learning to generate images of outdoor scenes from attributes and semantic layouts, arXiv preprint arXiv:1612.00215 (2016).

24. Sangkloy, Lu, Fang, Yu and Hays, Scribbler: Controlling deep image synthesis with sketch and color, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5400–5409.

25. Sun, Luo and Lu, Unsupervised typography transfer, arXiv preprint arXiv:1802.02595 (2018).

26. Ioffe and Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, PMLR, 2015, pp. 448–456.

27. Glorot, Bordes and Bengio, Deep sparse rectifier neural networks, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2011, pp. 315–323.

28. Hu, Shen and Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.

29. Howard, Sandler, Chen, Wang, Chen L.C., Tan and Chu V.G., Searching for mobilenetv3, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1314–1324.

30. Bodapati J.D., Shareef S.N., Naralasetti and Mundukur N.B., Msenet: multi-modal squeeze-and-excitation network for brain tumor severity prediction, International Journal of Pattern Recognition and Artificial Intelligence 35(07) (2021), 2157005.

31. Roy S.K., Dubey S.R., Chatterjee and Chaudhuri B.B., Fusenet: Fused squeeze-and-excitation network for spectral-spatial hyperspectral image classification, IET Image Processing 14(8) (2020), 1653–1661.

32. , Cao, Ma, Wei and Hao, End-to-end tangut character database building and recognition method, IET Image Processing (2022) 1–14.

33. Chang, Zhang, Pan and Meng, Generating hand-written chinese characters using cyclegan, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018, pp. 199–207.

34. Heusel, Ramsauer, Unterthiner, Nessler and Hochreiter, Gans trained by a two time-scale update rule converge to a local nash equilibrium, Advances in Neural Information Processing Systems 30 (2017), 6626–6637.

35. Xiao, Jin, Yang, Sun and Chang, Building fast and compact convolutional neural networks for offline handwritten chinese character recognition, Pattern Recognition 72 (2017), 72–81.

36. Zhong, Jin and Xie, High performance offline hand-written chinese character recognition using googlenet and directional feature maps, in: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), IEEE, 2015, pp. 846–850.

37. Kingma D.P. and Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).

38. Gatys L.A., Ecker A.S. and Bethge, A neural algorithm of artistic style, arXiv preprint arXiv:1508.06576 (2015).

39. Hou, Zhou and Feng, Coordinate attention for efficient mobile network design, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13713–13722.

40. Howard, Zhmoginov, Chen L.C., Sandler and Zhu, Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation (2018).

41. Zhou, Hou, Chen, Feng and Yan, Rethinking bottleneck structure for efficient mobile network design, in: European Conference on Computer Vision, Springer, 2020, pp. 680–697.

42. Tan and Le, Efficientnet: Rethinking model scaling for convolutional neural networks, in: International conference on machine learning, PMLR, 2019, pp. 6105–6114.