CV AI Nov 24, 2020

RIN: Textured Human Model Recovery and Imitation with a Single Image

arXiv:2011.12024v41.2

Originality Highly original

AI Analysis

This work reconstructs a textured 3D human model from a single image and uses it for imitation. That matters for applications that need 3D human avatars from limited input, such as virtual try-on or animation, and it improves on methods that struggle with unstable human appearance.

This paper introduces RIN, a volume-based framework that reconstructs a textured 3D human model from a single image and uses it for human imitation. It addresses the challenge of estimating the full human texture from a single view by proposing a U-Net-like front-to-back translation network, which enables reliable estimation of the back view and yields results competitive with methods that take multi-view input.

Human imitation has become a topical problem recently, driven by the ability of GANs to disentangle human pose and body content. However, the latest methods pay little attention to 3D information, and they need a large number of input images to avoid self-occlusion. In this paper, we propose RIN, a novel volume-based framework for reconstructing a textured 3D model from a single picture and imitating a subject with the generated model. Specifically, to estimate most of the human texture, we propose a U-Net-like front-to-back translation network. With both front and back images as input, the textured volume recovery module allows us to color a volumetric human. A sequence of 3D poses then guides the colored volume via Flowable Disentangle Networks as a volume-to-volume translation task. To project volumes to a 2D plane during training, we design a differentiable depth-aware renderer. Our experiments demonstrate that our volume-based model is adequate for human imitation, and that the back view can be estimated reliably using our network. While prior works based on either a 2D pose or a semantic map often fail when a human's appearance is unstable, our framework can still produce concrete results, which are competitive with those imagined from multi-view input.
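
The abstract mentions a differentiable depth-aware renderer that projects volumes to a 2D plane but gives no details. As a rough illustration of the general idea only, the sketch below uses a soft z-buffer: each voxel along a viewing ray is weighted by its occupancy and by an exponential falloff in depth, so nearer occupied voxels dominate the pixel while every operation stays smooth in the inputs. The function name `render_depth_aware`, the temperature `tau`, the grayscale color, and the convention that slice `z = 0` is closest to the camera are all assumptions of this sketch, not the paper's design.

```python
import math


def render_depth_aware(occupancy, color, tau=0.1, eps=1e-8):
    """Project a colored volume onto the image plane with a soft z-buffer.

    occupancy[z][y][x] is in [0, 1]; color[z][y][x] is a scalar intensity.
    Slice z = 0 is assumed to be the one closest to the camera.
    """
    depth = len(occupancy)
    height = len(occupancy[0])
    width = len(occupancy[0][0])
    image = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weighted_color = 0.0
            weight_sum = 0.0
            for z in range(depth):
                # Nearer voxels get exponentially larger weights;
                # empty voxels (occupancy 0) contribute nothing.
                w = occupancy[z][y][x] * math.exp(-z / tau)
                weighted_color += w * color[z][y][x]
                weight_sum += w
            # eps keeps rays that hit no voxel at intensity 0.
            image[y][x] = weighted_color / (weight_sum + eps)
    return image


# Two depth slices, one row, two pixels.
occupancy = [[[1.0, 0.0]], [[1.0, 1.0]]]
color = [[[0.2, 0.9]], [[0.8, 0.5]]]
image = render_depth_aware(occupancy, color)
```

With a small `tau`, the first pixel is close to the front voxel's intensity and the second pixel falls through the empty front voxel to the one behind it. For deep volumes the exponential weights underflow, so a practical version would normalize them in log space and would run in a framework with automatic differentiation.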
