CV · AS · Feb 20, 2020

A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors

arXiv:2002.08700v2 · 4 citations
AI Analysis

This work addresses the problem of generating natural-looking virtual news anchors efficiently enough for low-delay applications. It represents an incremental improvement in neural lip-sync techniques.

The paper tackles the challenge of synthesizing high-resolution, photorealistic virtual news anchors with a neural lip-sync framework: Temporal Convolutional Networks map audio to mouth movements, and a neural rendering network translates the resulting facial map into photorealistic frames. The authors report improved visual appearance and efficiency over existing methods.

Lip sync has emerged as a promising technique for generating mouth movements from audio signals. However, synthesizing a high-resolution and photorealistic virtual news anchor is still challenging. Lack of natural appearance, visual consistency, and processing efficiency are the main problems with existing methods. This paper presents a novel lip-sync framework specially designed for producing high-fidelity virtual news anchors. A pair of Temporal Convolutional Networks are used to learn the cross-modal sequential mapping from audio signals to mouth movements, followed by a neural rendering network that translates the synthetic facial map into a high-resolution and photorealistic appearance. This fully trainable framework provides end-to-end processing that outperforms traditional graphics-based methods in many low-delay applications. Experiments also show the framework has advantages over modern neural-based methods in both visual appearance and efficiency.
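As a rough illustration of the building block behind a Temporal Convolutional Network, the sketch below implements a causal dilated 1-D convolution in plain Python and stacks a few such layers. It is not the authors' implementation: the causal variant, the kernel size, the dilation schedule, and the function names `causal_dilated_conv1d`, `receptive_field`, and `tcn_stack` are all assumptions chosen for illustration, and the abstract does not state the paper's actual layer configuration.

```python
from typing import List, Sequence


def causal_dilated_conv1d(
    x: Sequence[float], kernel: Sequence[float], dilation: int = 1
) -> List[float]:
    """Causal dilated 1-D convolution: y[t] = sum_k kernel[k] * x[t - k * dilation].

    Samples before the start of the sequence are treated as zero, so y[t]
    never depends on future inputs and the output has the same length as x.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # implicit left zero-padding
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Input steps that can influence one output step, one conv layer per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)


def tcn_stack(
    x: Sequence[float],
    kernels: Sequence[Sequence[float]],
    dilations: Sequence[int],
) -> List[float]:
    """Apply causal dilated convolutions in sequence, each followed by ReLU."""
    h = list(x)
    for kernel, d in zip(kernels, dilations):
        h = [max(0.0, v) for v in causal_dilated_conv1d(h, kernel, d)]
    return h


# Hypothetical setup: kernel size 2, dilations doubling per layer.
dilations = [1, 2, 4]
kernels = [[1.0, 1.0]] * len(dilations)
impulse = [1.0] + [0.0] * 9
response = tcn_stack(impulse, kernels, dilations)
```

With these assumed settings, a single input step influences exactly `receptive_field(2, dilations)` output steps, which is why doubling the dilation per layer lets a shallow stack cover a long audio context. In a lip-sync setting the scalar sequence would be replaced by per-frame audio feature vectors and the output by mouth-shape parameters, but those dimensions are not specified in the abstract.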

