DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures
This work addresses the need for more natural and integrated talking video generation in applications such as virtual avatars, though its contribution is incremental, building on existing diffusion and motion modeling techniques.
The paper tackles the problem of generating coherent and diverse co-speech gestures in audio-driven talking videos. It introduces DiffTED, a diffusion-based method that produces temporally coherent TED-style videos from a single image, and reports experiments showing improved gesture diversity and coherence.
Audio-driven talking video generation has advanced significantly, but existing methods often depend on video-to-video translation techniques and traditional generative networks such as GANs. They also typically generate talking heads and co-speech gestures separately, leading to less coherent outputs. Furthermore, the gestures produced by these methods often appear overly smooth or subdued and lack diversity, and many gesture-centric approaches do not integrate talking head generation. To address these limitations, we introduce DiffTED, a new approach for one-shot audio-driven TED-style talking video generation from a single image. Specifically, we leverage a diffusion model to generate sequences of keypoints for a Thin-Plate Spline motion model, precisely controlling the avatar's animation while ensuring temporally coherent and diverse gestures. The approach uses classifier-free guidance, allowing the gestures to follow the audio input without relying on pre-trained classifiers. Experiments demonstrate that DiffTED generates temporally coherent talking videos with diverse co-speech gestures.
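As a rough illustration of the classifier-free guidance mentioned above, the sketch below shows one common way to combine an unconditional and an audio-conditioned noise prediction at a single denoising step. It is not the authors' implementation: the function name `cfg_combine`, the guidance scale, and the toy values are all assumptions made for illustration, and the two input lists stand in for the outputs of a denoising network that is not shown.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance (one common parameterization).

    Starts from the unconditional prediction and moves along the
    direction toward the conditional one, scaled by guidance_scale.
    guidance_scale = 0 gives the unconditional prediction,
    guidance_scale = 1 gives the conditional prediction,
    guidance_scale > 1 pushes further toward the condition.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy stand-ins (assumed values) for flattened keypoint noise predictions
# from the same network, run without and with the audio condition.
eps_uncond = [0.0, 0.5, -1.0, 2.0]
eps_cond = [1.0, 0.5, 1.0, 0.0]

guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=2.0)
print(guided)  # [2.0, 0.5, 3.0, -2.0]
```

Because both predictions come from a single model trained with the condition randomly dropped, no separately trained classifier is needed to steer sampling toward the audio.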