Deep Spatial Transformation for Pose-Guided Person Image Generation and Animation
This work addresses the challenge of spatial manipulation in person image generation, with applications such as animation and view synthesis, though the contribution is incremental in that it builds on existing spatial transformation methods.
The paper tackles pose-guided person image generation and animation by proposing a differentiable global-flow local-attention framework that spatially transforms inputs at the feature level. Experiments on both image generation and animation are reported to demonstrate the model's superiority.
Pose-guided person image generation and animation aim to transform a source person image to target poses. These tasks require spatial manipulation of the source data. However, Convolutional Neural Networks have limited ability to spatially transform their inputs. In this paper, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level. This framework first estimates global flow fields between sources and targets. Then, the corresponding local source feature patches are sampled with content-aware local attention coefficients. We show that our framework can spatially transform the inputs in an efficient manner. We further model temporal consistency for the person image animation task to generate coherent videos. The experimental results of both image generation and animation tasks demonstrate the superiority of our model. In addition, results on novel view synthesis and face image animation show that our model is applicable to other tasks requiring spatial transformation. The source code of our project is available at https://github.com/RenYurui/Global-Flow-Local-Attention.
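The two-step idea described in the abstract (follow a flow field to a source location, then blend a local source patch using content-aware weights) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It makes several assumptions: the flow field is given rather than estimated by a network, `flow[0]` and `flow[1]` hold vertical and horizontal offsets in pixels, and the attention coefficients are a softmax over dot-product similarity between each sampled source feature and the target feature, which stands in for whatever learned mechanism the paper uses. The names `bilinear_sample` and `flow_local_attention` are hypothetical.

```python
import numpy as np


def bilinear_sample(feat, y, x):
    """Bilinearly sample feat (C, H, W) at a real-valued (y, x), clamped to the border."""
    _, H, W = feat.shape
    y = float(np.clip(y, 0, H - 1))
    x = float(np.clip(x, 0, W - 1))
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[:, y0, x0] + wx * feat[:, y0, x1]
    bottom = (1 - wx) * feat[:, y1, x0] + wx * feat[:, y1, x1]
    return (1 - wy) * top + wy * bottom


def flow_local_attention(src, tgt, flow, n=3):
    """Reassemble source features at every target location.

    src, tgt: (C, H, W) feature maps; flow: (2, H, W) offsets (assumed given).
    For each target location, an n x n source patch centred at the
    flow-displaced position is sampled and blended with softmax weights.
    """
    C, H, W = src.shape
    r = n // 2
    out = np.zeros((C, H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            # Step 1: follow the flow to the matching source location.
            cy, cx = i + flow[0, i, j], j + flow[1, i, j]
            # Step 2: sample the local source patch around that location.
            patch = np.stack([
                bilinear_sample(src, cy + dy, cx + dx)
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ])  # shape (n * n, C)
            # Step 3: content-aware weights (assumed: dot-product similarity + softmax).
            scores = patch @ tgt[:, i, j]
            scores = scores - scores.max()  # subtract the max for numerical stability
            weights = np.exp(scores) / np.exp(scores).sum()
            # Step 4: attention-weighted sum of the patch features.
            out[:, i, j] = weights @ patch
    return out
```

With `n=1` and a zero flow field the function returns the source features unchanged, which is a convenient sanity check. Every step is built from differentiable operations, so the same structure could in principle be trained end to end in an autodiff framework, although the explicit Python loops here are written for clarity rather than speed.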