CV · Sep 11, 2024

DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures

Steven Hogue, Chenxu Zhang, Hamza Daruger, Yapeng Tian, Xiaohu Guo

arXiv:2409.07649v1 · 17.826 citations · h-index: 28

Originality: Incremental advance

AI Analysis

This work addresses the need for more natural and integrated talking video generation for applications such as virtual avatars, though it is an incremental advance that builds on existing diffusion and motion modeling techniques.

The paper tackles the problem of generating coherent and diverse co-speech gestures in audio-driven talking videos. It introduces DiffTED, a diffusion-based method that produces temporally coherent TED-style videos from a single image, and its experiments show improved gesture diversity and coherence.

Audio-driven talking video generation has advanced significantly, but existing methods often depend on video-to-video translation techniques and traditional generative networks like GANs, and they typically generate talking heads and co-speech gestures separately, leading to less coherent outputs. Furthermore, the gestures produced by these methods often appear overly smooth or subdued, lacking in diversity, and many gesture-centric approaches do not integrate talking head generation. To address these limitations, we introduce DiffTED, a new approach for one-shot audio-driven TED-style talking video generation from a single image. Specifically, we leverage a diffusion model to generate sequences of keypoints for a Thin-Plate Spline motion model, precisely controlling the avatar's animation while ensuring temporally coherent and diverse gestures. This innovative approach utilizes classifier-free guidance, empowering the gestures to flow naturally with the audio input without relying on pre-trained classifiers. Experiments demonstrate that DiffTED generates temporally coherent talking videos with diverse co-speech gestures.
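The classifier-free guidance mentioned in the abstract can be illustrated with a minimal sketch. It assumes one common formulation, in which the denoiser is run once without conditioning and once with the audio condition, and the two noise predictions are blended with a guidance scale. The function name `cfg_combine`, the flat-list representation of a prediction, and the scale values are hypothetical choices for illustration; the paper's actual network and parameterization may differ.

```python
from typing import List


def cfg_combine(
    eps_uncond: List[float],
    eps_cond: List[float],
    guidance_scale: float,
) -> List[float]:
    """Blend unconditional and audio-conditioned noise predictions.

    Assumed formulation: eps = eps_uncond + w * (eps_cond - eps_uncond).
    w = 0 ignores the audio, w = 1 is the plain conditional prediction,
    and w > 1 pushes the sample further toward the audio condition.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical predictions for two keypoint coordinates at one denoising step.
uncond = [0.0, 1.0]
cond = [1.0, 3.0]
guided = cfg_combine(uncond, cond, guidance_scale=2.0)
```

In this formulation no separate classifier is needed: the same model supplies both predictions, typically by randomly dropping the condition during training.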
