Deep Image Spatial Transformation for Person Image Generation
This work addresses the problem of generating realistic person images in new poses, with applications such as video animation and view synthesis; it represents an incremental improvement in spatial transformation methods.
The paper tackles pose-guided person image generation by proposing a differentiable global-flow local-attention framework to spatially transform source images into target poses, achieving superior results in both subjective and objective experiments.
Pose-guided person image generation aims to transform a source person image into a target pose. This task requires spatial manipulation of the source data. However, Convolutional Neural Networks are limited in their ability to spatially transform their inputs. In this paper, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level. Specifically, our model first calculates the global correlations between sources and targets to predict flow fields. Then, the flowed local patch pairs are extracted from the feature maps to calculate the local attention coefficients. Finally, we warp the source features using a content-aware sampling method with the obtained local attention coefficients. The results of both subjective and objective experiments demonstrate the superiority of our model. In addition, further results in video animation and view synthesis show that our model is applicable to other tasks requiring spatial transformation. Our source code is available at https://github.com/RenYurui/Global-Flow-Local-Attention.
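
The last two steps of the pipeline (extracting flowed local patch pairs, then sampling the source features with local attention coefficients) can be illustrated with a minimal sketch in plain Python. This is not the released implementation. The names `bilinear` and `local_attention_warp` are hypothetical, the flow field is taken as given rather than predicted from global correlations, and a softmax over per-position dot products stands in for whatever learned module produces the attention coefficients in the actual model.

```python
import math

def bilinear(feat, y, x):
    """Sample feat[H][W][C] at a real-valued (y, x); out-of-range taps count as zero."""
    h, w, c = len(feat), len(feat[0]), len(feat[0][0])
    y0, x0 = math.floor(y), math.floor(x)
    out = [0.0] * c
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                for k in range(c):
                    out[k] += wy * wx * feat[yy][xx][k]
    return out

def local_attention_warp(src, tgt, flow, n=3):
    """Warp source features toward the target layout (illustrative sketch only).

    src, tgt: feature maps as nested lists [H][W][C].
    flow:     [H][W] of (dy, dx) offsets pointing from each target location
              into the source map (assumed given here).
    n:        odd side length of the local patch.
    """
    h, w, c = len(tgt), len(tgt[0]), len(src[0][0])
    r = n // 2
    out = [[None] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            dy, dx = flow[i][j]
            src_patch, scores = [], []
            for a in range(-r, r + 1):
                for b in range(-r, r + 1):
                    s = bilinear(src, i + dy + a, j + dx + b)  # flowed source patch
                    t = bilinear(tgt, i + a, j + b)            # target patch
                    src_patch.append(s)
                    # Assumption: dot-product similarity replaces a learned predictor.
                    scores.append(sum(p * q for p, q in zip(s, t)))
            m = max(scores)
            exps = [math.exp(v - m) for v in scores]  # numerically stable softmax
            z = sum(exps)
            attn = [e / z for e in exps]
            # Content-aware sampling: attention-weighted sum over the source patch.
            out[i][j] = [
                sum(attn[p] * src_patch[p][k] for p in range(n * n))
                for k in range(c)
            ]
    return out
```

With `n=1` the attention collapses to a single weight of one, so the function reduces to ordinary bilinear flow warping; larger patches let the target content decide how the neighbouring source features are blended.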