CV, SD, AS | Jun 6, 2023

Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis

arXiv:2306.03504v2 | 8 citations | h-index: 32
Originality: Incremental advance
AI Analysis

The paper addresses a novel problem in the digital human industry, creating talking avatars from minimal data, though the advance is incremental because it builds on existing TTS and neural rendering techniques.

The paper tackles the low-resource text-to-talking avatar synthesis task by developing Ada-TTA, which generates high-quality talking portrait videos from arbitrary text using limited training data, achieving realistic, identity-preserving, and audio-visually synchronized results.

We are interested in a novel task, namely low-resource text-to-talking avatar synthesis. Given only a few-minute-long video of a talking person, with its audio track, as the training data and arbitrary texts as the driving input, we aim to synthesize high-quality talking portrait videos corresponding to the input text. This task has broad application prospects in the digital human industry but has not yet been technically achieved due to two challenges: (1) it is challenging for a traditional multi-speaker Text-to-Speech (TTS) system to mimic the timbre of out-of-domain audio; (2) it is hard to render high-fidelity and lip-synchronized talking avatars with limited training data. In this paper, we introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which (1) designs a generic zero-shot multi-speaker TTS model that effectively disentangles the text content, timbre, and prosody; and (2) embraces recent advances in neural rendering to achieve realistic audio-driven talking face video generation. With these designs, our method overcomes the two aforementioned challenges and generates identity-preserving speech and realistic talking person video. Experiments demonstrate that our method can synthesize realistic, identity-preserving, and audio-visually synchronized talking avatar videos.
