CV, IV | Apr 29, 2021

Text2Video: Text-driven Talking-head Video Synthesis with Personalized Phoneme-Pose Dictionary

arXiv:2104.14631v3 | 42 citations
Originality: Incremental advance
AI Analysis

This work addresses video synthesis for applications like virtual avatars or content creation, but it is incremental as it builds on existing talking face generation techniques.

The paper tackles the problem of synthesizing talking-head videos from text by building a personalized phoneme-pose dictionary and using a GAN, resulting in reduced training data needs, improved flexibility, and faster processing times compared to audio-driven methods.

With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesizing video from text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has a number of advantages: 1) it needs only a fraction of the training data used by an audio-driven approach; 2) it is more flexible and is not vulnerable to speaker variation; 3) it significantly reduces preprocessing, training, and inference time. We perform extensive experiments to compare the proposed method with state-of-the-art talking face generation methods on a benchmark dataset and on datasets of our own. The results demonstrate the effectiveness and superiority of our approach.
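
The abstract mentions a phoneme-pose dictionary and interpolated phoneme poses without giving details. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes each phoneme maps to one key pose stored as a flat list of coordinates, that each phoneme is pinned to a frame index, and that in-between frames are filled by plain linear interpolation. The names `interpolate_poses`, `toy_dictionary`, and the two-value pose layout are hypothetical, and the paper's actual interpolation scheme may differ.

```python
from typing import Dict, List, Sequence, Tuple

Pose = List[float]  # flattened keypoint coordinates (hypothetical layout)


def interpolate_poses(
    dictionary: Dict[str, Pose],
    timed_phonemes: Sequence[Tuple[str, int]],
) -> List[Pose]:
    """Return one pose per frame, blending linearly between key poses.

    `timed_phonemes` is a list of (phoneme, frame_index) pairs with
    strictly increasing frame indices. Linear blending is an assumption
    made for illustration only.
    """
    if not timed_phonemes:
        return []
    frames: List[Pose] = []
    for (ph_a, t_a), (ph_b, t_b) in zip(timed_phonemes, timed_phonemes[1:]):
        if t_b <= t_a:
            raise ValueError("frame indices must be strictly increasing")
        pose_a, pose_b = dictionary[ph_a], dictionary[ph_b]
        for t in range(t_a, t_b):
            w = (t - t_a) / (t_b - t_a)  # 0 at pose_a, approaches 1 near pose_b
            frames.append([(1 - w) * a + w * b for a, b in zip(pose_a, pose_b)])
    # Close the sequence with the key pose of the final phoneme.
    frames.append(list(dictionary[timed_phonemes[-1][0]]))
    return frames


# Toy dictionary: the second coordinate stands in for mouth openness.
toy_dictionary = {"AA": [0.0, 1.0], "M": [0.0, 0.0]}
toy_frames = interpolate_poses(toy_dictionary, [("M", 0), ("AA", 4), ("M", 6)])
```

In a pipeline of the kind the abstract describes, a pose sequence like `toy_frames` would be the input that the GAN turns into video frames; that generation step is outside the scope of this sketch.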

Code Implementations: 1 repo
