Perceptual metric-guided human image generation

Abstract

Pose transfer, which synthesizes a new image of a target person in a novel pose, is valuable in several applications. Pose transfer based on generative adversarial networks (GANs) offers a new route to person re-identification (re-ID). Typical perceptual metrics, such as Detection Score (DS) and Inception Score (IS), have been used only to assess visual quality after generation in the pose transfer task. Consequently, existing GAN-based methods do not directly benefit from these metrics, which are highly correlated with human ratings. In this paper, a perceptual metrics guided GAN (PIGGAN) framework is proposed to optimize the generation process for the pose transfer task directly. Specifically, a novel and general model, the Evaluator, is designed to match the GAN. Accordingly, a new Sort Loss (SL) is constructed to optimize the perceptual quality. Moreover, PIGGAN is highly flexible and extensible and can incorporate both differentiable and non-differentiable metrics to optimize the pose transfer process. Extensive experiments show that PIGGAN can generate photo-realistic results and quantitatively outperforms state-of-the-art (SOTA) methods.

Keywords

Deep learning, GAN, pose transfer, attention, human pose

1. Introduction

Human pose transfer is an instantiation task that synthesizes a new image of a target person in a novel pose given a single image of the person [1, 2]. A few generated samples can be seen in Fig. 1. This topic is valuable in several applications, like video generation, movie making, and person re-identification [3, 4, 5, 6, 7, 8].

Figure 1.

Generated samples by PIGGAN based on Market-1501.

Many promising frameworks have been proposed for the pose transfer task [9, 10, 11]. To generate realistic images, previous work has followed two main ideas. The first kind of methods [12, 13] split the generation process in a coarse-to-fine manner. The work in [9] generated the target images in two different stages. The authors further proposed a two-stage framework that can manipulate three different parts of the source image [16]. However, these methods require complicated training procedures and high computational cost. The second kind of methods [14, 15, 16, 17] generate human images in an end-to-end manner. Tang et al. [17] proposed the Xing generator to learn a deformable translation mapping between the source image and the target image. The Xing generator consists of two different generation branches and can effectively update the person's shape and appearance. However, the goals of the generation phase and the evaluation phase are not completely consistent [18, 13]. Specifically, in the evaluation phase, these models employ Structural Similarity (SSIM) and other perceptual indexes to evaluate the gap between the generated human image and the target human image [19, 20]. In the generation phase, by contrast, the mean square error loss (MSELoss, which computes the mean squared difference between output and target) and the L1 norm loss (L1Loss, which computes the absolute difference between output and target) are currently used as the loss functions [9, 21, 22, 23, 24].

The aforementioned methods fail to synthesize images with fine-grained features because the goals of the generation phase and the evaluation phase are not completely consistent [3, 25, 26, 27]. To address this drawback, this paper introduces a general model, the Evaluator. The Evaluator can simulate any evaluation indicator and give the optimizer an explicit goal (as a loss function) for optimizing the model. Specifically, following the protocol of [28], the Evaluator employs a Siamese CNN, because the Siamese CNN structure performs extremely well in learning a general similarity function between image patches [25, 29, 9, 14]. It can simulate a perceptual index by learning to sort [3, 30, 31, 32]. It is worth noting that the Evaluator does not directly give a specific absolute value, but rather gives a sorting value for the generated image. Likewise, in the real world, people do not give a specific value when evaluating pictures; they sort and compare them to give a relative value [3, 33, 34, 35]. Compared with previous works, the framework uses the trained Evaluator to provide an SL, which can directly measure the quality of the generated picture [36, 37].

To train the proposed Evaluator, the results of different pose transfer algorithms (which have different perceptual index values) were labeled to build a new dataset. The Evaluator is trained on this dataset, after which it can sort the generated pictures according to their perceptual index values. Note that this does not mean that the upper bound of the Evaluator is the best of the chosen human generation methods. Nevertheless, the performance of the Evaluator is largely determined by the chosen pose transfer networks. To achieve the best performance, this work selected three SOTA models from the past three years: $PG^{2}$ , DSC, and PATN [1, 12, 2]. With the Evaluator, the generator can generate highly ranked pictures.

Yet, several great challenges still need to be addressed during the PIGGAN training process. Human pose transfer, which transforms a person's posture into another while keeping the appearance details, has attracted enormous attention recently [1, 2]. Most previous works utilize multiple perceptual indexes to evaluate generated images because a single perceptual index cannot fully quantify their quality. However, setting too many targets for the generator makes PIGGAN difficult to train [38, 21, 22]. Moreover, different generated images may have opposite ranking orders on different evaluation metrics in some cases. For these reasons, this paper employs SSIM, Inception Score (IS), and Detection Score (DS) to give the optimizer a clear goal for optimizing perceptual quality. These perceptual indexes are widely used in most recent works [39, 40, 41, 42]. This paper also uses the recently introduced Percentage of Correct Keypoints (PCKh), which measures the shape consistency of generated images [17].

The proposed method achieves excellent qualitative and quantitative results on challenging benchmarks. Moreover, the method can be used to improve person re-identification. Specifically, the main contributions are:

This study introduces a novel model, the Evaluator, that can simulate any perceptual metric in the pose transfer task and thus integrates the evaluation indicators into the generative model.

A novel SL is proposed that encourages the model to optimize the generator with multiple perceptual metrics and achieve state-of-the-art performance. This paper also introduces a texture attention module (TAM) to guide the model by hinting “where to add more texture”.

The proposed PIGGAN is highly flexible and can simulate any perceptual metric to optimize the pose transfer process. Experiments show that PIGGAN outperforms SOTA methods.

Quantitative and qualitative evaluations demonstrate the advantage of the proposed method over the state of the art and show its capability to alleviate data insufficiency for person re-identification.

2. Related works

2.1 Human image generation

Human image generation has recently become a crucial sub-area of computer vision [9, 43, 44, 45]. Lassner et al. [37] combined a GAN and a VAE to generate full-body images with a differentiable two-stage model. In [46], Zanfir et al. proposed a similar 3D model to explicitly capture the body deformations. To better model appearance, Zhao et al. [47] adopted a more general approach for synthesizing human images from a single view. Similarly, the work in [48] presented a modular GAN that generates unseen poses using training pairs of images and poses.

2.2 Pose transfer

GANs are widely used in multimedia areas, including works on pose transfer modeling [41, 49, 40]. These works can be divided into two categories. The first group of works splits the generation process in a coarse-to-fine manner. The work in [12] generated the target images in two different stages. The authors further proposed a two-stage framework that can manipulate three different parts of the source image [13]. However, these methods require complicated training procedures and high computational cost. In contrast, this study introduces an end-to-end generation method that obtains better qualitative results.

The second line of works generates human images in an end-to-end manner. Zhu et al. [1] presented pose-attentional transfer blocks (PATB) to optimize their model using the attention mechanism. However, PATB cannot capture long-range dependencies to transfer the precise regions of the target image features. Likewise, Siarohin et al. [2] adopted a GAN-based approach with deformable skip connections. Tang et al. [17] proposed the Xing generator to learn a deformable translation mapping between the source image and the target image. The Xing generator consists of two different generation branches and can effectively update the person's shape and appearance. However, most previous works [1, 2, 17] use the mean square error loss and the L1 norm loss as the loss functions in the generation phase, whereas perceptual indexes such as SSIM are used only to evaluate the generated images. As a result, the goals of the generation phase and the evaluation phase are not completely consistent. In contrast, this paper designs a data-derived Evaluator that matches the GAN well. In this way, PIGGAN can integrate the perceptual metrics directly into the loss function of the GAN.

This paper is organized as follows. In Section 3, we present the proposed PIGGAN network. In Section 4, we present the experimental results and application to person re-identification. Finally, Section 5 presents our conclusions.

3. The proposed method

Given a person image, the pose transfer model aims to generate an image of the person in another pose [39, 50, 51]. Previous works have achieved considerable success in image synthesis [52, 53, 54]. The proposed framework consists of one generator and six discriminators. To guide the pose transfer process, this paper adopts the Human Pose Estimator (HPE) [1] to obtain the 2D human body poses. Specifically, for a fair comparison, this paper adopts the 18 joints of the human body extracted with OpenPose [1].

3.1 Overview of PIGGAN

Rank dataset. First, different pose transfer methods were employed to generate target human images on public pose transfer datasets. Then a chosen perceptual metric (e.g. SSIM) was computed on the generated images. To achieve the best performance, this work employed four perceptual metrics (SSIM, IS, DS, PCKh) [1,4,9]. After that, two images of the same content were selected to form a pair, and the images in each pair were ranked according to the quality score calculated by the chosen perceptual metric. These pair-wise data form the rank dataset. The associated sorting labels of every image pair were set to (100, 10, 1) (the one with the highest value is set to 1). The rank dataset is then used as the learning material of the Evaluator, which is trained to learn the sorting orders. In particular, given two images $x_{1}$ , $x_{2}$ , the ranking scores $s_{1}$ and $s_{2}$ can be obtained by

$\displaystyle s_{1}=R(x_{1};\Theta_{R}),$ (1)

$\displaystyle s_{2}=R(x_{2};\Theta_{R}),$ (2)

where $\Theta_{R}$ denotes the network weights. $R(.)$ represents the mapping function of the Evaluator. The sorting score can be formulated as:

$\displaystyle\left\{\begin{array}{lll}s_{1}<s_{2}&\text{ if }&m_{x_{1}}<m_{x_{2}}\\ s_{1}>s_{2}&\text{ if }&m_{x_{1}}>m_{x_{2}},\end{array}\right.$ (3)

where $m_{x_{1}}$ and $m_{x_{2}}$ indicate the quality scores of image $x_{1}$ and image $x_{2}$ , respectively.
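As an illustration of how the pair-wise rank data could be assembled, the following Python sketch forms every pair of outputs generated for one target image and records their order. The function name `build_rank_pairs`, the method names, and the numeric values are hypothetical and are not taken from the experiments. The label uses the same sign convention as $\gamma$ in Eq. (4), and whether $m$ is the raw metric value or the sorting label is left to the caller.

```python
from itertools import combinations


def build_rank_pairs(m_values):
    """Form all image pairs for one target image and label their order.

    `m_values` maps a (hypothetical) method name to the value m_x used for
    ordering that method's output. Returns (name_1, name_2, label) tuples,
    where label = 1 if m_x1 < m_x2 and label = -1 if m_x1 > m_x2.
    Ties are skipped because Eq. (3) defines no order for them.
    """
    pairs = []
    for (name_1, m_1), (name_2, m_2) in combinations(m_values.items(), 2):
        if m_1 == m_2:
            continue
        pairs.append((name_1, name_2, 1 if m_1 < m_2 else -1))
    return pairs


# Illustrative values only (assumed, not measured).
example = {"method_a": 0.25, "method_b": 0.29, "method_c": 0.31}
pairs = build_rank_pairs(example)
```

With three methods, each target image yields three ordered pairs, so the rank dataset grows quadratically with the number of methods whose outputs are compared.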

Figure 2.

The structure of the Evaluator.

Figure 3.

The structure of PIGGAN.

Train Evaluator. Figure 2 shows the structure of the Evaluator. After training, the Evaluator can sort images on the basis of their perceptual scores. It is trained with the following margin-based ranking loss:

$\displaystyle L(s_{1},s_{2};\gamma)=\max(0,(s_{1}^{2}-s_{2}^{2})\cdot\gamma+z),\quad\left\{\begin{array}{lll}\gamma=1&\text{ if }&m_{x_{1}}<m_{x_{2}}\\ \gamma=-1&\text{ if }&m_{x_{1}}>m_{x_{2}},\end{array}\right.$ (4)

where $s_{1}$ and $s_{2}$ denote the sorting scores of the synthesized images, $\gamma$ indicates their sorting label, and the margin $z$ controls the distance between $s_{1}$ and $s_{2}$ .
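To make the behaviour of Eq. (4) concrete, the following Python sketch evaluates the loss for a single pair exactly as the equation is written, including the squared scores. The function names and the margin value `z=0.5` are assumptions chosen for illustration; the value of $z$ used in the experiments is not stated here.

```python
def sort_label(m_1, m_2):
    """Return gamma: 1 if m_x1 < m_x2, -1 if m_x1 > m_x2."""
    if m_1 == m_2:
        raise ValueError("Eq. (4) defines no label for equal quality scores")
    return 1 if m_1 < m_2 else -1


def rank_loss(s_1, s_2, gamma, z=0.5):
    """Eq. (4) as written: max(0, (s_1^2 - s_2^2) * gamma + z).

    z = 0.5 is an assumed margin used only for this example.
    """
    return max(0.0, (s_1 ** 2 - s_2 ** 2) * gamma + z)


gamma = sort_label(0.25, 0.31)           # m_x1 < m_x2, so gamma = 1
loss_ordered = rank_loss(0.2, 1.0, gamma)   # scores agree with the label
loss_violated = rank_loss(1.0, 0.2, gamma)  # scores contradict the label
```

When the two scores are ordered consistently with the label and separated by more than the margin, the loss is zero; otherwise the loss is positive and pushes the scores apart in the labeled direction.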

Train PIGGAN. Figure 3 shows the structure of PIGGAN. Once the Evaluator is well trained, it is assembled into PIGGAN to generate visually pleasing images.

3.2 Generator

As shown in Fig. 3, the generator takes three inputs: the condition image $I_{C}$ and a pair of poses ( $P_{C}$ , $P_{t}$ ). This study utilizes several PATBs to acquire the texture from $I_{C}$ . As in [1], this paper introduces a texture attention module (TAM). Figure 3 shows the structure of the TAM. The input of TAM is the feature $F_{t-1}$ produced by the PATB module. The model utilizes TAM to obtain a refined feature $F_{t}$ . Specifically, TAM aims to guide the model by hinting “where to add more texture”. In this way, TAM makes $F_{t}$ truly guide the transformation of image features.

3.3 Discriminators

The discriminators in this work are composed of $D_{A}$ , $D_{S}$ , $\textit{Evaluator}_{\textit{SSIM}}$ , $\textit{Evaluator}_{IS}$ , $\textit{Evaluator}_{DS}$ , and $\textit{Evaluator}_{\textit{PCKh}}$ . The four Evaluators $\textit{Evaluator}_{\textit{SSIM}}$ , $\textit{Evaluator}_{IS}$ , $\textit{Evaluator}_{DS}$ , and $\textit{Evaluator}_{\textit{PCKh}}$ are trained separately, before PIGGAN is trained. The results of different pose transfer algorithms (which have different perceptual index values) were labeled to build a new dataset. The Evaluators are trained on this dataset, and all four have the same structure. Specifically, this work selected three SOTA models from the past three years: $PG^{2}$ , DSC, and PATN [1, 12, 55]. These structures can lead the network to generate more realistic and sharper images with rich details.

$D_{A}$ and $D_{S}$ Discriminators.

As in [1], this paper adopts two discriminators, the pose discriminator $D_{S}$ and the appearance discriminator $D_{A}$ . $D_{A}$ is designed to measure how likely it is that the generated image contains the identical person in $I_{C}$ . $D_{S}$ is designed to judge how well $P_{c}$ aligns with $P_{t}$ . The output of $D_{A}$ is the appearance consistency score $S_{ac}$ , and the output of $D_{S}$ is the pose consistency score $S_{pc}$ . More details can be found in [1].

Evaluator. The Evaluator consists of two identical network branches. Each branch consists of several fully connected, convolutional, Mish, and pooling layers. This study labels the results of different SOTA pose transfer algorithms (which have different perceptual index values) to build a new dataset. Specifically, PIGGAN obtains the sorting order of any two pose transfer images according to their perceptual values. Then the proposed Evaluator is employed to learn the ranking orders. Note that the aim of the model is not to obtain a specific absolute value for the generated image, since the model is only concerned with sorting information [56, 57, 40, 52].

3.4 Training

The full loss function for the pose transfer network is denoted as:

$\displaystyle\mathcal{L}_{\textit{full}}=\alpha\mathcal{L}_{\textit{GAN}}+\mathcal{L}_{\textit{combL1}}+\beta\mathcal{L}_{R},$ (5)

where $\mathcal{L}_{\textit{GAN}}$ denotes the adversarial loss and $\mathcal{L}_{\textit{combL1}}$ denotes the combined L1 loss. $\mathcal{L}_{R}$ represents SL. $\alpha$ denotes the weight of $\mathcal{L}_{\textit{GAN}}$ . $\beta$ denotes the weight of $\mathcal{L}_{R}$ . The $\mathcal{L}_{\textit{GAN}}$ is defined as:

$\displaystyle\mathcal{L}_{\textit{GAN}}=\mathbb{E}_{P_{t}\in\mathcal{I}_{S},(I_{c},I_{t})\in\mathcal{P}}\{\log D(I_{c},I_{t},P_{t})\}+\mathbb{E}_{S_{t}\in\mathcal{P}_{S},P_{c}\in\mathcal{P},P_{g}\in\hat{\mathcal{P}}}\{1-\log D(G(I_{c},I_{t},P_{t},I_{g}))\},$ (6)

where $\mathcal{P}_{S}$ represents the distribution of person poses, $\mathcal{P}$ represents the distribution of real person images, $\hat{\mathcal{P}}$ represents the distribution of fake person images, and $I_{g}$ represents the generated image. $D$ is a fully convolutional discriminator conditioned on ( ${I}_{C}$ , ${I}_{t}$ , ${P}_{t}$ , ${I}_{g}$ ).

The $\mathcal{L}_{\textit{combL1}}$ is defined as:

$\displaystyle\mathcal{L}_{\textit{combL1}}=\lambda_{1}\|P_{g}-P_{t}\|_{1}+\lambda_{2}\mathcal{L}_{\textit{per}},$ (7)

where

$\displaystyle\mathcal{L}_{\textit{per}}=\frac{1}{\textit{WHC}}\sum_{\rho}\lambda_{\rho}\|\phi_{\rho}(P_{g})-\phi_{\rho}(P_{t})\|_{1},$ (8)

where $\rho$ denotes the layer index, $\phi_{\rho}$ represents the output of layer $\rho$ , and $W$ , $H$ , $C$ represent the spatial width, height, and depth of $\phi_{\rho}$ .
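The combined L1 loss of Eqs. (7) and (8) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: images and feature maps are passed as flattened lists, the feature maps $\phi_{\rho}$ are assumed to be precomputed by some feature extractor that is not specified here, and the default weights `lam1=1.0` and `lam2=1.0` are placeholders rather than the values used in the experiments.

```python
def l1_norm(a, b):
    """Sum of absolute differences between two equally sized flat lists."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same number of elements")
    return sum(abs(x - y) for x, y in zip(a, b))


def perceptual_loss(feats_g, feats_t, layer_weights, whc):
    """Eq. (8): (1 / WHC) * sum over layers of lambda_rho * L1 feature gap.

    feats_g / feats_t hold one flattened feature map per layer for the
    generated and the target image; whc is the product W * H * C.
    """
    total = sum(
        w * l1_norm(f_g, f_t)
        for w, f_g, f_t in zip(layer_weights, feats_g, feats_t)
    )
    return total / whc


def comb_l1_loss(img_g, img_t, feats_g, feats_t, layer_weights, whc,
                 lam1=1.0, lam2=1.0):
    """Eq. (7): lambda_1 * ||P_g - P_t||_1 + lambda_2 * L_per."""
    pixel_term = l1_norm(img_g, img_t)
    return lam1 * pixel_term + lam2 * perceptual_loss(
        feats_g, feats_t, layer_weights, whc
    )


# Toy inputs: a two-pixel image and one two-element feature map.
loss = comb_l1_loss([0.0, 1.0], [0.5, 1.0],
                    [[1.0, 2.0]], [[1.0, 4.0]], [1.0], whc=2)
```

In the toy example the pixel term is 0.5 and the perceptual term is 2.0 / 2 = 1.0, so the two terms contribute on comparable scales only if $\lambda_{1}$ and $\lambda_{2}$ are chosen accordingly.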

The $\mathcal{L}_{R}$ is defined as:

$\displaystyle\mathcal{L}_{R}=\operatorname{Softplus}(R(G(x_{i}))),$ (9)

where $R(G(x_{i}))$ represents the sorting score of the synthesized human image.
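The following Python sketch shows how the SL of Eq. (9) and the weighted sum of Eq. (5) fit together. The function names are hypothetical, and the loss values and weights in the example are arbitrary placeholders, since the values of $\alpha$ and $\beta$ are not given here. Softplus is computed in a numerically stable form.

```python
import math


def softplus(x):
    """Softplus(x) = log(1 + exp(x)), written to avoid overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sort_loss(rank_score):
    """Eq. (9): Softplus of the Evaluator score R(G(x_i))."""
    return softplus(rank_score)


def full_loss(l_gan, l_comb_l1, l_r, alpha, beta):
    """Eq. (5): alpha * L_GAN + L_combL1 + beta * L_R."""
    return alpha * l_gan + l_comb_l1 + beta * l_r


# Placeholder values, chosen only to exercise the functions.
l_r = sort_loss(0.0)
total = full_loss(l_gan=1.0, l_comb_l1=2.0, l_r=3.0, alpha=0.5, beta=0.1)
```

Because Softplus is monotonically increasing, minimizing $\mathcal{L}_{R}$ drives the Evaluator score of the generated image downward, and the gradient stays smooth for both positive and negative scores.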

4. Experiments

4.1 Datasets

This study evaluates PIGGAN on the person re-ID dataset Market-1501, which contains images of 1,501 persons [1]. All images are resized to 128 $\times$ 64 pixels. Performing the pose transfer task on Market-1501 is challenging because of the diversity of viewpoint, pose, illumination, and background. As in [1], 12,000 testing pairs and 263,632 training pairs are adopted for Market-1501. This study also carries out experiments on the DeepFashion dataset (In-shop Clothes Retrieval Benchmark) [58, 14, 3, 2]. This dataset is composed of 52,712 in-shop clothes images. As in [1], 12,000 testing pairs and 101,966 training pairs are adopted for DeepFashion.

For the quantitative study, previous works [12, 9, 3, 2] employed IS, SSIM, mask-IS, mask-SSIM, and DS as their metrics. These metrics evaluate the differences between the generated image and the target image from different aspects. In [1], PCKh was introduced to further evaluate the shape consistency between generated images and real ones. This paper employs all the above metrics for fair comparison.
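For readers unfamiliar with PCKh, the following Python sketch shows the usual form of the metric: a keypoint counts as correct when its distance to the reference keypoint is at most a fraction of the head size. The function name, the threshold factor `alpha=0.5`, and the coordinates are assumptions for illustration; the exact evaluation protocol follows [1] and is not reproduced here.

```python
import math


def pckh(pred, ref, head_size, alpha=0.5):
    """Fraction of keypoints within alpha * head_size of their reference.

    pred and ref are equally long lists of (x, y) keypoints, e.g. joints
    detected on the generated image and on the target image.
    alpha = 0.5 is the commonly used threshold and is an assumption here.
    """
    if len(pred) != len(ref) or not ref:
        raise ValueError("pred and ref must be non-empty and equally long")
    threshold = alpha * head_size
    correct = sum(
        1
        for (p_x, p_y), (r_x, r_y) in zip(pred, ref)
        if math.hypot(p_x - r_x, p_y - r_y) <= threshold
    )
    return correct / len(ref)


# Toy example with three joints and a head size of 10 pixels.
score = pckh([(10, 10), (20, 20), (30, 35)],
             [(10, 11), (20, 26), (30, 30)], head_size=10)
```

In the toy example the three joints lie 1, 6, and 5 pixels from their references, so two of the three fall within the 5-pixel threshold.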

Table 1
Comparison with the SOTA

Model	Market-1501						DeepFashion
	SSIM	IS	Mask-SSIM	Mask-IS	DS	PCKh	SSIM	IS	DS	PCKh
Ma et al. [12]	0.253	3.460	0.792	3.435	–	–	0.762	3.090	–	–
Ma et al. [13]	0.099	3.483	0.614	4.491	–	–	0.614	3.228	–	–
Siarohin et al. [2]	0.290	3.185	0.805	3.502	0.72	–	0.756	3.439	0.96	–
Ma et al. [12]	0.261	3.495	0.782	3.367	0.39	0.73	0.773	3.163	0.951	0.89
Siarohin et al. [2]	0.291	3.230	0.807	3.502	0.72	0.94	0.760	3.362	0.967	0.94
Esser et al. [54]	0.266	2.965	0.793	3.549	0.72	0.92	0.763	3.440	0.972	0.92
Zhu et al. [1]	0.311	3.323	0.811	3.773	0.74	0.94	0.773	3.209	0.976	0.96
Yang et al. [15]	–	–	–	–	–	–	0.780	3.230	–	–
Li et al. [16]	0.315	3.487	0.814	3.867	–	0.94	0.775	3.338	–	0.95
Tang et al. [17]	0.313	3.506	0.816	3.872	–	0.93	0.778	3.476	–	0.95
Ours	0.327	3.331	0.818	3.812	0.74	0.95	0.789	3.314	0.979	0.96
Real Data	1.000	3.890	1.000	3.706	0.74	1.00	1.000	4.053	0.968	1.00

Figure 4.

Generated samples by PIGGAN based on Market-1501 dataset. Zoom in for better details.

Figure 5.

Generated samples by PIGGAN based on DeepFashion dataset. Zoom in for better details.

4.2 Comparison with previous work

Quantitative comparison. The quantitative comparison on Market-1501 and DeepFashion can be found in Table 1. Note that the results on Market-1501 are not reported for [28] because the authors have released neither the generated images nor the code for the Market-1501 dataset. Since previous works do not provide the data split, this work re-evaluates these methods on the testing set. Overall, PIGGAN outperforms baseline models on most metrics, which verifies the efficacy of the network. Specifically, for Market-1501, PIGGAN has the best result in terms of the SSIM, mask-SSIM, DS, and PCKh metrics. The IS scores are higher than those obtained by [1] but lower than the scores obtained by [2]. This is because the images generated by [2] look very sharp but are probably relatively easy for a detector to recognize. For DeepFashion, PIGGAN reports the highest performance according to the SSIM and DS metrics. The PCKh score of PIGGAN on DeepFashion is 0.96, the same as the best baseline model, which almost hits the upper limit. All in all, the results indicate that the synthesized images are sharp and realistic.

Qualitative comparison. Figures 4 and 5 give some typical qualitative examples. Overall, PIGGAN generates sharp and plausible human images with high visual quality. Moreover, PIGGAN can restore detailed appearance attributes. Specifically, on the Market-1501 dataset, PIGGAN gives the correct clothing leg layouts in the target pose (first and fourth rows). Besides, even if the condition image is blurred (second row) or has complex clothing patterns (third row), PIGGAN can learn the style of the garments and maintain these features in the generated images. As for DeepFashion, even if the condition image is blurred (second row) or has complex clothing patterns (fourth row), PIGGAN also keeps appearance consistency. Moreover, PIGGAN preserves fine texture and cloth styles and achieves the best appearance consistency.

Comparison of computation complexity. Table 2 shows the computation complexity results. All methods were evaluated on one NVIDIA Titan RTX GPU. Following the protocol of [1], this study only takes the GPU time into account. The XingGAN model takes 22 hours (approximately 700 epochs) to achieve the results reported in Table 1. However, because PIGGAN can simulate these evaluation indicators and give the optimizer a clear goal, the proposed method takes only 10 hours (about 400 epochs) to achieve the results reported in Table 1.

Table 2
Comparison of computation complexity

Method	Params	Time
Ma et al. [12]	437.09 M	36 h
Siarohin et al. [2]	82.08 M	27 h
Esser et al. [54]	139.36 M	24 h
Zhu et al. [1]	41.36 M	22 h
Tang et al. [17]	40.26 M	22 h
Ours	41.77 M	10 h

Table 3

User study (%)

Model	Market-1501		DeepFashion
	R2G	G2R	R2G	G2R
Ma et al. [12]	11.2	5.5	9.2	14.9
Siarohin et al. [2]	22.67	50.24	12.42	24.61
Zhu et al. [1]	32.23	63.47	19.14	31.78
Tang et al. [17]	34.26	64.73	20.19	32.31
Ours	37.65	67.87	23.17	34.83

Table 4

The results of the ablation study

Model	Market-1501						DeepFashion
	SSIM	IS	Mask-SSIM	Mask-IS	DS	PCKh	SSIM	IS	DS	PCKh
Baseline	0.311	3.323	0.811	3.773	0.74	0.94	0.773	3.209	0.976	0.96
Evaluator	0.324	3.327	0.815	3.797	0.75	0.95	0.776	3.302	0.978	0.96
Full	0.327	3.331	0.818	3.812	0.74	0.95	0.789	3.314	0.979	0.96

Table 5

The re-ID results using images generated by different methods

Aug.model	Portion p										Aug.ratio
	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9	1	2	3
None (res)	9.2	27.6	41.5	50.3	56.2	58.8	61.2	62.7	63.8	65.3	–	–
VUNet [54]	49.8	51.7	53.7	54.5	56.6	58.4	59.4	61.3	62.9	65.3	63.9	64.1
Deform [2]	51.9	53.9	55.4	56.1	57.6	59.4	60.5	62.2	63.5	65.3	64.2	64.6
PATN [1]	52.6	54.5	56.5	56.6	60.3	60.9	62.1	63.3	64.8	65.3	65.3	65.7
Ours	53.4	55.8	56.9	57.4	60.7	61.4	62.6	63.9	64.9	65.3	65.7	65.9

User study. This work performs a user study with 30 users for both datasets [59, 60, 61]. Following the protocol in [1, 62, 63, 57, 64], 55 real and 55 synthesized images were shown in random order, each for one second. The user judges each image (real/fake) immediately. Note that the first 10 images are used for practice. The results in Table 3 show that PIGGAN can synthesize more photo-realistic images.
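The two rates reported in Table 3 can be computed from the raw judgements as in the Python sketch below. It assumes the usual reading of the column names, namely that R2G is the percentage of real images judged as generated and G2R is the percentage of generated images judged as real; this reading, the function name, and the toy data are assumptions and are not stated in the protocol above.

```python
def user_study_rates(judgements):
    """Compute (R2G, G2R) in percent from per-image judgements.

    judgements is a list of (is_real, judged_real) boolean pairs.
    R2G: share of real images judged as generated (assumed definition).
    G2R: share of generated images judged as real (assumed definition).
    """
    real = [judged for is_real, judged in judgements if is_real]
    fake = [judged for is_real, judged in judgements if not is_real]
    if not real or not fake:
        raise ValueError("need at least one real and one generated image")
    r2g = 100.0 * sum(1 for judged in real if not judged) / len(real)
    g2r = 100.0 * sum(1 for judged in fake if judged) / len(fake)
    return r2g, g2r


# Toy data: 4 real images (1 judged fake), 4 generated images (3 judged real).
toy = [(True, True), (True, True), (True, True), (True, False),
       (False, True), (False, True), (False, True), (False, False)]
r2g, g2r = user_study_rates(toy)
```

Under this reading, a higher G2R means that more synthesized images pass as real, which is the sense in which the rows of Table 3 are compared.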

4.3 Ablation study

This study performs an ablation study to further analyze the impact of each component in PIGGAN. “Baseline” means using only the PATN branch. “Evaluator” means adopting the proposed Evaluator in the baseline branch. “Full” is the complete PIGGAN model.

Baseline. PATN [1] is used as the baseline.

Evaluator. The Evaluator was introduced to process the generated human image, where other settings were the same as in the baseline.

Full. The full framework is shown in Fig. 3.

As shown in Table 4, the Evaluator improves on the baseline for most metrics (e.g. SSIM rises from 0.311 to 0.324 on Market-1501). “Full” is slightly better than “Evaluator” on both datasets, although “Evaluator” has a better DS result on the Market-1501 dataset.

4.4 Application to re-ID

Many person-related vision tasks, such as re-ID, suffer from insufficient training data [65, 66, 67, 68, 69, 70, 71, 72, 73]. A good person pose transfer method can augment the datasets of person-related vision tasks by generating realistic person images [74, 75, 76, 77, 78]. Person re-identification has been drawing much attention from both academia and industry for its important applications in security and surveillance [79, 80, 81, 82]. This paper also evaluates PIGGAN on the mainstream re-ID dataset Market-1501. Following the protocol in [1, 12], a portion p of the real data was randomly selected as the reduced training set. Meanwhile, this paper employs the same data augmentation. The results in Table 5 show that PIGGAN achieves consistent improvements over baseline models, suggesting that the proposed method can generate more realistic human images and be more effective for the re-ID task. The numeric improvements are steady across the different portions p.

5. Conclusions

For the pose transfer task, this study proposes a novel perceptual index guided GAN framework. Specifically, a general model, the Evaluator, is designed to match the GAN. This study trains the proposed Evaluator to simulate perceptual metrics and constructs the SL to optimize the perceptual quality. The discriminators of this work are composed of $D_{A}$ , $D_{S}$ , $\textit{Evaluator}_{\textit{SSIM}}$ , $\textit{Evaluator}_{IS}$ , $\textit{Evaluator}_{DS}$ , and $\textit{Evaluator}_{\textit{PCKh}}$ . Extensive experiments show that PIGGAN can generate photo-realistic human images and significantly improve the computational efficiency. Moreover, the proposed method substantially alleviates the insufficient training data problem. The Evaluator is also suitable for other conditioned generation tasks. In the future, we will deploy the proposed Evaluator to several applications, such as video generation, movie making, and person re-identification.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant No. 62072348), the National Key R&D Program of China (Grant No. 2019YFC1509604), the Science and Technology Major Project of Hubei Province (Next-Generation AI Technologies) (Grant No. 2019AEA170), and the National Natural Science Foundation of China (Grant No. 62102268).

References

Zhu

Huang

Shi

Wang

Bai

. Progressive pose attention transfer for person image generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2019. pp. 2347–2356.

Siarohin

Sangineto

Lathuiliere

Sebe

. Deformable gans for pose-based human image generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. pp. 3408–3416.

Liang

Wang

Tian

Zou

. PCGAN: Partition-Controlled Human Image Generation. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33; 2019. pp. 8698–8705.

Cheng

Behzadan

Noshadravan

. Deep learning for post-hurricane aerial damage assessment of buildings. Computer-Aided Civil and Infrastructure Engineering. 2021; 36(6): 695–710.

Snaiki

. A knowledge-enhanced deep reinforcement learning-based shape optimizer for aerodynamic mitigation of wind-sensitive structures. Computer-Aided Civil and Infrastructure Engineering. 2021; 36(6): 733–746.

Liu

Wang

Chen

Wei

. Dynamic event-based state estimation for delayed artificial neural networks with multiplicative noises: A gain-scheduled approach. Neural Networks. 2020; 132: 211–219.

Cheng

Wang

Wei

Liu

. On Adaptive Learning Framework for Deep Weighted Sparse Autoencoder: A Multiobjective Evolutionary Algorithm. IEEE Transactions on Cybernetics. 2020.

Zhao

. Optimal state estimation for finite-field networks with stochastic disturbances. Neurocomputing. 2020; 414: 238–244.

Huang

Loy

. Dense intrinsic appearance flow for human pose transfer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2019. pp. 3693–3702.

10.

Song

Zhang

Liu

Mei

. Unsupervised person image generation with semantic parsing transformation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2019. pp. 2357–2366.

11.

Grigorev

Sevastopolsky

Vakhitov

Lempitsky

. Coordinate-Based Texture Inpainting for Pose-Guided Human Image Generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2019. pp. 12135–12144.

12.

Jia

Sun

Schiele

Tuytelaars

Van Gool

. Pose guided person image generation. In: Advances in Neural Information Processing Systems; 2017. pp. 406–416.

13.

Sun

Georgoulis

Van Gool

Schiele

Fritz

. Disentangled person image generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. pp. 99–108.

14.

Sun

Xiao

Liu

Wang

. Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2019. pp. 5693–5703.

15.

Yang

Wang

Zhang

Wang

Gao

Ren

, et al. Region-adaptive texture enhancement for detailed person image synthesis. In: 2020 IEEE International Conference on Multimedia and Expo; 2020.

16.

Zhang

Liu

Lai

Dai

. PoNA: Pose-guided non-local attention for human pose transfer. IEEE Transactions on Image Processing. 2020; 29: 9584–9599.

17.

Tang

Bai

Zhang

Torr

Sebe

. Xinggan for person image generation. In: European Conference on Computer Vision. Vol. 12370; 2020. pp. 717–734.

18.

Liu

Dohler

Deng

. Vibrotactile quality assessment: Hybrid metric design based on SNR and SSIM. IEEE Transactions on Multimedia. 2020; 22(4): 921–933.

19.

Liu

Zhou

Geng

. Automatic seizure detection based on S-Transform and deep convolutional neural network. International Journal of Neural Systems. 2020; 30(04).

20.

Feng

Halm-Lutterodt

Tang

Mecum

Mesregah

, et al. Automated MRI-based deep learning model for detection of Alzheimer’s disease process. International Journal of Neural Systems. 2020; 30(06).

21.

Haoran

Chen

Pan

. MLFS-CCDE: Multi-objective large-scale feature selection by cooperative coevolutionary differential evolution. Memetic Comput. 2021; 13(1).

22.

Liang

Zeng

. 3D mesh simplification with feature preservation based on whale optimization algorithm and differential evolution. Integrated Computer-Aided Engineering. 2020; 27(4): 417–435.

23.

Leming

Górriz

Suckling

. Ensemble deep learning on large, mixed-site fMRI datasets in autism and other tasks. International Journal of Neural Systems. 2020; 30(07).

24.

Lozano

Suárez

Soto-Sánchez

Garrigós

Martínez-Alvarez

Ferrández

, et al. Neurolight: A deep learning neural interface for cortical visual prostheses. International Journal of Neural Systems. 2020; 30(09).

25.

Radford

Metz

Chintala

. Unsupervised representation learning with deep convolutional generative adversarial networks. In: 4th International Conference on Learning Representations; 2015.

26. Johnson, Alahi. Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision. Vol. 9906; 2016. pp. 694–711.

27. Zhang, Liu, Dong, Qiao. RankSRGAN: Generative adversarial networks with ranker for image super-resolution. In: Proceedings of the IEEE International Conference on Computer Vision; 2019. pp. 3096–3105.

28. Zeng, Wang, Liu, Alsaadi, et al. Deep-reinforcement-learning-based images segmentation for quantitative analysis of gold immunochromatographic strip. Neurocomputing. 2021; 425: 173–180.

29. Sahoo, Hoi. Recent advances in deep learning for object detection. Neurocomputing. 2020; 396: 39–64.

30. Wang, Tan. Improving metaheuristic algorithms with information feedback models. IEEE Transactions on Cybernetics. 2017; 49(2): 542–555.

31. Wei, Wang. Hybrid annealing krill herd and quantum-behaved particle swarm optimization. Mathematics. 2020; 8(9).

32. Gao, Wang, Pedrycz. Solving fuzzy job-shop scheduling problem using DE algorithm improved by a selection mechanism. IEEE Transactions on Fuzzy Systems. 2020; 28(12): 3265–3275.

33. Zhang, Liu, Wang, Zhao, Maybank. Self-taught semisupervised dictionary learning with nonnegative constraint. IEEE Transactions on Industrial Informatics. 2019; 16(1): 532–543.

34. Shi, Bai, Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2016; 39(11): 2298–2304.

35. Bai, Liu. Multiple comparative attention network for offline handwritten Chinese character recognition. In: 2019 International Conference on Document Analysis and Recognition; 2019. pp. 595–600.

36. Isola, Zhu, Zhou, Efros. Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. pp. 1125–1134.

37. Lassner, Pons-Moll, Gehler. A generative model of people in clothing. In: Proceedings of the IEEE International Conference on Computer Vision; 2017. pp. 853–862.

38. Bai, Yang, Huang, Dou. Deep-person: Learning discriminative deep features for person re-identification. Pattern Recognition. 2020; 98.

39. Luo, Zeng, Yaqian. A novel whale optimization algorithm with filtering disturbance and non-linear step. International Journal of Bio-Inspired Computation. 2021.

40. Zhang. DRCDN: Learning deep residual convolutional dehazing networks. The Visual Computer. 2020; 36(9): 1797–1808.

41. Quan. A multi-phase blending method with incremental intensity for training detection networks. The Visual Computer. 2021; 37(2): 245–259.

42. Chen, Zhang. A full migration BBO algorithm with enhanced population quality bounds for multimodal biomedical image registration. Applied Soft Computing. 2020; 93.

43. Pumarola, Agudo, Sanfeliu, Moreno-Noguer. Unsupervised person image synthesis in arbitrary poses. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. pp. 8620–8628.

44. Sánchez-Reolid, Martínez-Rodrigo, López, Fernández-Caballero. Deep support vector machines for the identification of stress condition from electrodermal activity. International Journal of Neural Systems. 2020; 30(07).

45. Kim, Lee. Style-controlled synthesis of clothing segments for fashion image manipulation. IEEE Transactions on Multimedia. 2020; 22(2): 298–310.

46. Zanfir, Popa, Zanfir, Sminchisescu. Human appearance transfer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. pp. 5391–5399.

47. Zhao, Cheng, Liu, Jie, Feng. Multi-view image generation from a single-view. In: ACM International Conference on Multimedia; 2018. pp. 383–391.

48. Balakrishnan, Zhao, Dalca, Durand, Guttag. Synthesizing images of humans in unseen poses. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. pp. 8340–8348.

49. Pan. Learning social representations with deep autoencoder for recommender system. World Wide Web. 2020; 23(4): 2259–2279.

50. Abualigah, Khader, Hanandeh. A combination of objective functions and hybrid krill herd algorithm for text document clustering analysis. Engineering Applications of Artificial Intelligence. 2018; 73: 111–125.

51. Kwon, Mun. A method to minimize the data size of a lightweight model for ship and offshore plant structure using part characteristics. Journal of Marine Science and Engineering. 2020; 8(10): 763.

52. Zhu. Avoiding critical members in a team by redundant assignment. IEEE Transactions on Systems, Man, and Cybernetics: Systems. 2018; 50(7): 2729–2740.

53. Kwon, Mun. Part recognition-based simplification of triangular mesh models for ships and plants. The International Journal of Advanced Manufacturing Technology. 2019; 105(1): 1329–1342.

54. Adeli, Hung. Fuzzy neural network learning model for image recognition. Integrated Computer-Aided Engineering. 1993; 1(1): 43–55.

55. Neverova, Alp Guler, Kokkinos. Dense pose transfer. In: European Conference on Computer Vision. Vol. 11207; 2018. pp. 123–138.

56. Goodfellow, Pouget-Abadie, Mirza, Warde-Farley, Ozair, et al. Generative adversarial nets. In: Advances in Neural Information Processing Systems; 2014. pp. 2672–2680.

57. Liang, Zeng, Luo. An improved Loop subdivision to coordinate the smoothness and the number of faces via multi-objective optimization. Integrated Computer-Aided Engineering. 2021.

58. Liu, Luo, Qiu, Wang, Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. pp. 1096–1104.

59. Zhang, Zheng, Zhao. A generative adversarial network for travel times imputation using trajectory data. Computer-Aided Civil and Infrastructure Engineering. 2021; 36(2): 197–212.

60. Maeda, Kashiyama, Sekimoto, Seto, Omata. Generative adversarial network for road damage detection. Computer-Aided Civil and Infrastructure Engineering. 2021; 36(1): 47–60.

61. Mishra, Piciarelli, Foresti. A neural network for image anomaly detection with deep pyramidal representations and dynamic routing. International Journal of Neural Systems. 2020; 30(10).

62. Hung, Adeli. Parallel backpropagation learning algorithms on Cray Y-MP8/864 supercomputer. Neurocomputing. 1993; 5(6): 287–302.

63. Adeli, Hung. An adaptive conjugate gradient learning algorithm for efficient training of neural networks. Applied Mathematics and Computation. 1994; 62(1): 81–102.

64. Adeli, Hung. Machine learning: Neural networks, genetic algorithms, and fuzzy systems; 1999.

65. Salimans, Goodfellow, Zaremba, Cheung, Radford, Chen. Improved techniques for training GANs. In: Advances in Neural Information Processing Systems; 2016. pp. 2234–2242.

66. Ahmadlou, Adeli. Enhanced probabilistic neural network with local decision circles: A robust classifier. Integrated Computer-Aided Engineering. 2010; 17(3): 197–210.

67. Rafiei, Adeli. A new neural dynamic classification algorithm. IEEE Transactions on Neural Networks and Learning Systems. 2017; 28(12): 3074–3083.

68. Pereira, Piteri, Souza, Papa, Adeli. FEMa: A finite element machine for fast learning. Neural Computing and Applications. 2020; 32(10): 6393–6404.

69. Alam KMR, Siddique, Adeli. A dynamic ensemble learning algorithm for neural networks. Neural Computing and Applications. 2020; 32(12): 8675–8690.

70. Cai, Liu. Probabilistic vehicle weight estimation using physics-constrained generative adversarial network. Computer-Aided Civil and Infrastructure Engineering. 2021; 36(6): 781–799.

71. Gao, Zhai, Mosalam. Balanced semisupervised generative adversarial network for damage assessment from low-data imbalanced-class regime. Computer-Aided Civil and Infrastructure Engineering. 2021; 36(9): 1094–1113.

72. Snell, Ridgeway, Liao, Roads, Mozer, Zemel. Learning to generate images with perceptual similarity metrics. In: IEEE International Conference on Image Processing; 2017. pp. 4277–4281.

73. Chen, Luo. Multi-objective self-organizing optimization for constrained sparse array synthesis. Swarm and Evolutionary Computation. 2020; 58.

74. Benamara, Val-Calvo, Alvarez-Sanchez, Díaz-Morcillo, Ferrandez-Vicente, Fernández-Jover, et al. Real-time facial expression recognition using smoothed deep neural network ensemble. Integrated Computer-Aided Engineering. 2021; 28(1): 97–111.

75. Macias-Garcia, Galeana-Perez, Medrano-Hermosillo, Bayro-Corrochano. Multi-stage deep learning perception system for mobile robots. Integrated Computer-Aided Engineering. 2021; 28(2): 191–205.

76. Jose Gomez-Silva, de la Escalera, Maria Armingol. Back-propagation of the Mahalanobis distance through a deep triplet learning model for person re-identification. Integrated Computer-Aided Engineering. 2021; 28(3): 277–294.

77. Peng, Xie, Wei. A deep Fourier neural network for seizure prediction using convolutional neural network and ratios of spectral power. International Journal of Neural Systems. 2021; 31(08).

78. Ozdemir, Cura, Akan. Epileptic EEG classification by using time-frequency images for deep learning. International Journal of Neural Systems. 2021; 31(08).

79. Kim, Park, Adeli. Evolutionary learning based sustainable strain sensing model for structural health monitoring of high-rise buildings. Applied Soft Computing. 2017; 58: 576–585.

80. Rafiei, Adeli. NEEWS: A novel earthquake early warning model using neural dynamic classification and neural dynamic optimization. Soil Dynamics and Earthquake Engineering. 2017; 100: 417–427.

81. Zhao, Zhang, Dong, Yuan, Zheng. Graph attention network with focal loss for seizure detection on electroencephalography signals. International Journal of Neural Systems. 2021; 31(7).

82. Mao, Jin, Miao, Cichocki. The influence of visual attention on the performance of a novel tactile P300 brain-computer interface with cheeks-stim paradigm. International Journal of Neural Systems. 2021; 31(4).
