CV, CL, SD, AS, IV · Feb 28, 2023

UniFLG: Unified Facial Landmark Generator from Text or Speech

arXiv:2302.14337v2 · 11 citations · h-index: 10
Originality: Incremental advance
AI Analysis

This work integrates text-driven and speech-driven talking face generation, which enables use when data is limited, such as for speakers without facial video data. The contribution appears incremental because it combines existing frameworks.

The paper proposes UniFLG, a unified system that generates facial landmarks from either text or speech. It uses end-to-end text-to-speech to extract latent representations common to both modalities, and it reports higher naturalness in both speech synthesis and landmark generation than the state-of-the-art text-driven method.

Talking face generation has been extensively investigated owing to its wide applicability. The two primary frameworks used for talking face generation are a text-driven framework, which generates synchronized speech and talking faces from text, and a speech-driven framework, which generates talking faces from speech. To integrate these frameworks, this paper proposes a unified facial landmark generator (UniFLG). The proposed system exploits end-to-end text-to-speech not only for synthesizing speech but also for extracting a series of latent representations that are common to text and speech, and feeds them to a landmark decoder to generate facial landmarks. We demonstrate that our system achieves higher naturalness in both speech synthesis and facial landmark generation compared to the state-of-the-art text-driven method. We further demonstrate that our system can generate facial landmarks from speech of speakers without facial video data or even speech data.
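
The data flow described in the abstract (two input modalities mapped into one shared latent sequence, which a single landmark decoder consumes) can be sketched as follows. This is a toy illustration, not the authors' model: the class name, the linear maps with random weights, the latent size, the mel-spectrogram size, and the landmark count of 68 are all assumptions made for the example, and the real system uses a trained end-to-end text-to-speech network.

```python
import random

LATENT_DIM = 8       # assumed size of the shared latent space
MEL_DIM = 80         # assumed size of one speech feature frame
NUM_LANDMARKS = 68   # assumed landmark count; the abstract does not state one


def make_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyUnifiedLandmarkGenerator:
    """Hypothetical stand-in: two encoders share one landmark decoder."""

    def __init__(self, vocab_size, seed=0):
        rng = random.Random(seed)
        self.text_embedding = make_matrix(vocab_size, LATENT_DIM, rng)
        self.speech_projection = make_matrix(LATENT_DIM, MEL_DIM, rng)
        self.landmark_decoder = make_matrix(NUM_LANDMARKS * 2, LATENT_DIM, rng)

    def encode_text(self, token_ids):
        # One latent vector per input token (a simplification).
        return [list(self.text_embedding[t]) for t in token_ids]

    def encode_speech(self, mel_frames):
        # One latent vector per speech frame, in the same latent space.
        return [matvec(self.speech_projection, frame) for frame in mel_frames]

    def decode_landmarks(self, latents):
        # The decoder only sees latents, so it is agnostic to the input modality.
        frames = []
        for z in latents:
            flat = matvec(self.landmark_decoder, z)
            frames.append([(flat[2 * i], flat[2 * i + 1])
                           for i in range(NUM_LANDMARKS)])
        return frames

    def landmarks_from_text(self, token_ids):
        return self.decode_landmarks(self.encode_text(token_ids))

    def landmarks_from_speech(self, mel_frames):
        return self.decode_landmarks(self.encode_speech(mel_frames))


model = ToyUnifiedLandmarkGenerator(vocab_size=40)
from_text = model.landmarks_from_text([3, 17, 5])
from_speech = model.landmarks_from_speech([[0.1] * MEL_DIM for _ in range(4)])
```

Because both paths end in the same `decode_landmarks` call, the decoder never needs to know whether its input came from text or speech, which is the integration idea the abstract describes.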

