CV · Dec 13, 2022

Structured 3D Features for Reconstructing Controllable Avatars

arXiv:2212.06820v3 · 25 citations · h-index: 65
Originality: Highly original
AI Analysis

This paper addresses the problem of creating detailed, editable 3D human avatars from limited input, for applications such as virtual try-on and animation. It represents a strong, specific gain rather than a broad paradigm shift.

The paper tackles the problem of creating controllable 3D avatars from a single image by introducing Structured 3D Features (S3F), a model built on an implicit 3D representation that pools pixel-aligned image features onto semantically labeled 3D points sampled from a parametric human mesh. Because the points can move freely in 3D space, the model can also capture accessories, hair, and loose clothing. The result is a transformer-based framework that generates animatable 3D reconstructions with albedo and illumination decomposition, and that surpasses the previous state of the art on monocular 3D reconstruction and on albedo and shading estimation.

We introduce Structured 3D Features (S3F), a model based on a novel implicit 3D representation that pools pixel-aligned image features onto dense 3D points sampled from the surface of a parametric, statistical human mesh. The 3D points have associated semantics and can move freely in 3D space. This allows for optimal coverage of the person of interest, beyond just the body shape, which in turn helps in modeling accessories, hair, and loose clothing. Building on this, we present a complete 3D transformer-based attention framework which, given a single image of a person in an unconstrained pose, generates an animatable 3D reconstruction with albedo and illumination decomposition, using a single end-to-end model that is trained semi-supervised and requires no additional postprocessing. We show that our S3F model surpasses the previous state of the art on various tasks, including monocular 3D reconstruction as well as albedo and shading estimation. Moreover, we show that the proposed methodology allows novel view synthesis, relighting, and re-posing of the reconstruction, and can naturally be extended to handle multiple input images (e.g. different views of a person, or the same view in different poses, as in video). Finally, we demonstrate the editing capabilities of our model for 3D virtual try-on applications.
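The idea of pooling pixel-aligned image features onto 3D points can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the pinhole camera model, the bilinear sampling scheme, and every name below (`project_points`, `bilinear_sample`, `pool_pixel_aligned_features`, `focal`, `center`) are assumptions made for illustration. The sketch only shows the generic step of projecting 3D points into the image and reading a feature vector at each projected location.

```python
import numpy as np


def project_points(points, focal, center):
    """Project (N, 3) camera-space points to (N, 2) pixel coordinates.

    Assumes a simple pinhole camera (hypothetical choice for this sketch).
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    u = focal * x / z + center[0]
    v = focal * y / z + center[1]
    return np.stack([u, v], axis=1)


def bilinear_sample(feature_map, uv):
    """Sample an (H, W, C) feature map at continuous pixel coords (N, 2)."""
    h, w, _ = feature_map.shape
    # Clamp to the image so points projecting outside reuse border features.
    u = np.clip(uv[:, 0], 0, w - 1)
    v = np.clip(uv[:, 1], 0, h - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, w - 1)
    v1 = np.minimum(v0 + 1, h - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    # Interpolate along u on the two neighbouring rows, then along v.
    top = feature_map[v0, u0] * (1 - du) + feature_map[v0, u1] * du
    bottom = feature_map[v1, u0] * (1 - du) + feature_map[v1, u1] * du
    return top * (1 - dv) + bottom * dv


def pool_pixel_aligned_features(points, feature_map, focal, center):
    """Return one (C,) feature vector per 3D point, shape (N, C)."""
    uv = project_points(points, focal, center)
    return bilinear_sample(feature_map, uv)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feature_map = rng.normal(size=(64, 48, 16))   # stand-in for CNN features
    # Stand-in for points sampled from a body mesh, placed in front of the camera.
    points = rng.uniform([-0.5, -1.0, 2.0], [0.5, 1.0, 3.0], size=(100, 3))
    feats = pool_pixel_aligned_features(points, feature_map, focal=50.0, center=(24.0, 32.0))
    print(feats.shape)
```

In a pipeline of the kind the abstract describes, the resulting per-point features would then be passed to a downstream network (the abstract mentions a transformer-based attention framework); that stage is outside the scope of this sketch.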

