CV · Feb 13, 2025

LIFe-GoM: Generalizable Human Rendering with Learned Iterative Feedback Over Multi-Resolution Gaussians-on-Mesh

Jing Wen, Alexander G. Schwing, Shenlong Wang

arXiv:2502.09617v1 · 17.49 citations · h-index: 4 · ICLR

Originality: Highly original

AI Analysis

This work provides a faster and higher-quality method for generalizable human avatar rendering from sparse inputs, which is significant for applications requiring real-time animation and high visual fidelity.

This paper tackles generalizable human avatar rendering from sparse inputs, aiming for fast reconstruction and high-resolution rendering. The authors propose an iterative feedback update framework for the canonical human shape representation and a coupled-multi-resolution Gaussians-on-Mesh representation. Their method reconstructs an animatable representation in less than 1 s, renders views at 95.1 FPS at 1024×1024 resolution, and achieves a PSNR of 24.65 on THuman2.0, outperforming the state of the art in rendering quality.

Generalizable rendering of an animatable human avatar from sparse inputs relies on data priors and inductive biases extracted from training on large data to avoid scene-specific optimization and to enable fast reconstruction. This raises two main challenges. First, unlike iterative gradient-based adjustment in scene-specific optimization, generalizable methods must reconstruct the human shape representation in a single pass at inference time. Second, rendering should be computationally efficient yet high-resolution. To address both challenges, we augment the recently proposed dual shape representation, which combines the benefits of a mesh and Gaussian points, in two ways. To improve reconstruction, we propose an iterative feedback update framework, which successively improves the canonical human shape representation during reconstruction. To achieve computationally efficient yet high-resolution rendering, we study a coupled-multi-resolution Gaussians-on-Mesh representation. We evaluate the proposed approach on the challenging THuman2.0, XHuman and AIST++ data. Our approach reconstructs an animatable representation from sparse inputs in less than 1 s, renders views at 95.1 FPS at $1024 \times 1024$ resolution, and achieves PSNR/LPIPS*/FID of 24.65/110.82/51.27 on THuman2.0, outperforming the state of the art in rendering quality.
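To give the reported PSNR some intuition, the following is a minimal sketch of the standard PSNR definition, 10·log10(MAX² / MSE), in plain Python. It is not the authors' evaluation code: the function names `psnr` and `mse_for_psnr`, the flat-list image layout, and the assumption that pixel values lie in [0, 1] are all illustrative, and the abstract does not say whether the paper applies masking or cropping before computing the metric.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images.

    Images are given as flat sequences of pixel values of equal length.
    max_value is the largest possible pixel value (assumed 1.0 here).
    """
    if len(reference) != len(rendered):
        raise ValueError("images must have the same number of pixels")
    if len(reference) == 0:
        raise ValueError("images must not be empty")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_for_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR formula: the MSE that corresponds to a given PSNR."""
    return max_value ** 2 / 10 ** (psnr_db / 10.0)


if __name__ == "__main__":
    # Toy two-pixel example: each pixel is off by 0.1, so the MSE is 0.01.
    print(psnr([0.0, 1.0], [0.1, 0.9]))
    # MSE implied by the PSNR value quoted in the abstract, under the [0, 1] assumption.
    print(mse_for_psnr(24.65))
```

Under the [0, 1] pixel-range assumption, a PSNR of 24.65 dB corresponds to a mean squared error of roughly 0.0034. The conversion is purely arithmetic and says nothing about how the error is distributed across the image.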

