CV | Nov 22, 2022

Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition

arXiv:2211.12368v1 | 126 citations | h-index: 45
Originality: Incremental advance
AI Analysis

This work addresses the efficiency bottleneck of real-time talking portrait synthesis in applications such as virtual avatars and video conferencing. It is an incremental improvement over existing NeRF methods.

The paper tackles the slow training and inference of dynamic Neural Radiance Fields (NeRF) for talking portraits. It proposes an efficient framework that decomposes the representation into three low-dimensional feature grids, achieving real-time synthesis and faster convergence while maintaining high-fidelity rendering.

While dynamic Neural Radiance Fields (NeRF) have shown success in high-fidelity 3D modeling of talking portraits, the slow training and inference speed severely obstruct their potential usage. In this paper, we propose an efficient NeRF-based framework that enables real-time synthesizing of talking portraits and faster convergence by leveraging the recent success of grid-based NeRF. Our key insight is to decompose the inherently high-dimensional talking portrait representation into three low-dimensional feature grids. Specifically, a Decomposed Audio-spatial Encoding Module models the dynamic head with a 3D spatial grid and a 2D audio grid. The torso is handled with another 2D grid in a lightweight Pseudo-3D Deformable Module. Both modules focus on efficiency under the premise of good rendering quality. Extensive experiments demonstrate that our method can generate realistic and audio-lips synchronized talking portrait videos, while also being highly efficient compared to previous methods.
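The decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `DenseGrid` class, the `encode_audio_spatial` function, the dense storage, the multilinear interpolation, and the assumption that the audio signal has already been compressed to a 2D coordinate in [0, 1] are all hypothetical choices made for illustration. The idea shown is only that a 3D position and a 2D audio coordinate are looked up in two separate low-dimensional grids and the resulting features are concatenated, instead of querying one joint 5D grid.

```python
import itertools
import random


class DenseGrid:
    """Dense N-dimensional feature grid with multilinear interpolation.

    Hypothetical stand-in for a learned feature grid; coordinates lie in [0, 1].
    """

    def __init__(self, ndim, resolution, feat_dim, seed=0):
        if resolution < 2:
            raise ValueError("resolution must be at least 2")
        rng = random.Random(seed)
        self.ndim = ndim
        self.res = resolution
        self.feat_dim = feat_dim
        # One small random feature vector per grid vertex (would be learned).
        self.table = {
            idx: [rng.uniform(-1e-2, 1e-2) for _ in range(feat_dim)]
            for idx in itertools.product(range(resolution), repeat=ndim)
        }

    def __call__(self, coords):
        if len(coords) != self.ndim:
            raise ValueError("coordinate dimension does not match grid")
        # Clamp to [0, 1] and scale to continuous vertex coordinates.
        scaled = [min(max(c, 0.0), 1.0) * (self.res - 1) for c in coords]
        # Lower corner of the enclosing cell, kept inside the grid.
        lo = [min(int(s), self.res - 2) for s in scaled]
        frac = [s - l for s, l in zip(scaled, lo)]
        out = [0.0] * self.feat_dim
        # Blend the 2**ndim corners of the cell with multilinear weights.
        for corner in itertools.product((0, 1), repeat=self.ndim):
            weight = 1.0
            for f, bit in zip(frac, corner):
                weight *= f if bit else 1.0 - f
            vertex = self.table[tuple(l + b for l, b in zip(lo, corner))]
            for i, value in enumerate(vertex):
                out[i] += weight * value
        return out


def encode_audio_spatial(spatial_grid, audio_grid, xyz, audio_xy):
    """Concatenate features from a 3D spatial grid and a 2D audio grid."""
    return spatial_grid(xyz) + audio_grid(audio_xy)


# Example query: one 3D sample point and one 2D audio coordinate (assumed inputs).
spatial = DenseGrid(ndim=3, resolution=8, feat_dim=4, seed=1)
audio = DenseGrid(ndim=2, resolution=8, feat_dim=2, seed=2)
features = encode_audio_spatial(spatial, audio, (0.3, 0.5, 0.7), (0.2, 0.9))
```

In a full model, the concatenated feature vector would be passed to a small network that predicts density and color. The storage argument for decomposing is simple arithmetic: at a resolution of 64 per axis, a 3D grid plus a 2D grid has 64^3 + 64^2 = 266,240 vertices, whereas a joint 5D grid would have 64^5 = 1,073,741,824.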

Code Implementations: 2 repos
