SAiD: Speech-driven Blendshape Facial Animation with Diffusion
This work addresses speech-driven 3D facial animation for creators and researchers, offering incremental improvements in the diversity of generated lip movements and in the efficiency of animation editing.
The paper tackles the challenge of generating diverse, speech-synchronized 3D facial animations by proposing SAiD, a diffusion model with a cross-modality alignment bias, and introduces BlendVOCA, a new benchmark dataset of speech audio paired with blendshape parameters. The results show lip synchronization comparable or superior to baselines, more diverse lip movements, and a streamlined animation editing process.
Despite extensive research, speech-driven 3D facial animation remains challenging because large-scale visual-audio datasets are scarce. Most prior works, which typically learn regression models on a small dataset using the method of least squares, struggle to generate diverse lip movements from speech and require substantial effort to refine the generated outputs. To address these issues, we propose SAiD, a speech-driven 3D facial animation method based on a diffusion model: a lightweight Transformer-based U-Net with a cross-modality alignment bias between the audio and visual modalities to enhance lip synchronization. Moreover, we introduce BlendVOCA, a benchmark dataset of pairs of speech audio and parameters of a blendshape facial model, to address the scarcity of public resources. Our experimental results demonstrate that the proposed approach achieves comparable or superior performance in lip synchronization to baselines, ensures more diverse lip movements, and streamlines the animation editing process.
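The abstract does not spell out the form of the cross-modality alignment bias, so the following is only a minimal sketch of one common way such a bias can work: an additive term on cross-attention scores that keeps each visual (blendshape) frame attending to audio frames near the same point in time. The function names `alignment_bias` and `cross_attention`, the `window` parameter, and the hard-window shape of the bias are all assumptions made for illustration, not the paper's actual design.

```python
import numpy as np

def alignment_bias(n_visual, n_audio, window=1.0):
    """Hypothetical additive attention bias (not taken from the paper).

    Returns an (n_visual, n_audio) matrix that is 0 where an audio frame lies
    within `window` visual frames of the time-aligned position, and a large
    negative number elsewhere.
    """
    # Place frame centers of both sequences on a shared [0, 1] time axis.
    v_pos = (np.arange(n_visual) + 0.5) / n_visual
    a_pos = (np.arange(n_audio) + 0.5) / n_audio
    dist = np.abs(v_pos[:, None] - a_pos[None, :])
    allowed = dist <= window / n_visual
    # A finite penalty (rather than -inf) avoids NaNs if a row is fully masked.
    return np.where(allowed, 0.0, -1e9)

def cross_attention(q, k, v, bias):
    """Scaled dot-product cross-attention with an additive bias on the scores."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_visual, n_audio, dim = 4, 8, 16            # illustrative sizes only
    q = rng.normal(size=(n_visual, dim))         # queries from visual frames
    k = rng.normal(size=(n_audio, dim))          # keys from audio features
    v = rng.normal(size=(n_audio, dim))          # values from audio features
    out, weights = cross_attention(q, k, v, alignment_bias(n_visual, n_audio))
    print(out.shape, weights.shape)
    print(np.round(weights, 2))
```

With a bias of this kind, each visual frame draws only on temporally nearby audio, which is one plausible route to tighter lip synchronization. A softer penalty that grows with temporal distance would be an equally reasonable variant.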