CV, MM, SD · Apr 16

TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

arXiv:2604.14580 · 81.4 · h-index: 13
AI Analysis

Enables real-time deployment of audio-driven talking avatar generation for latency-sensitive applications.

TurboTalk compresses a multi-step audio-driven video diffusion model into a single-step generator, achieving 120x inference speedup while maintaining high quality.

Existing audio-driven video digital human generation models rely on multi-step denoising, which incurs substantial computational overhead and severely limits their deployment in real-world settings. While one-step distillation approaches can significantly accelerate inference, they often suffer from training instability. To address this challenge, we propose TurboTalk, a two-stage progressive distillation framework that compresses a multi-step audio-driven video diffusion model into a single-step generator. We first adopt Distribution Matching Distillation to obtain a strong and stable 4-step student, and then progressively reduce the denoising steps from 4 to 1 through adversarial distillation. To ensure stable training under extreme step reduction, we introduce a progressive timestep sampling strategy and a self-compare adversarial objective, which provides an intermediate adversarial reference that stabilizes progressive distillation. Our method achieves single-step generation of talking avatar videos, boosting inference speed by 120 times while maintaining high generation quality.
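The abstract does not spell out how the step count is lowered from 4 to 1, so the following is only an illustrative sketch of what a progressive step-reduction schedule could look like. The function names (`steps_for_iteration`, `timestep_grid`), the equal-length phases, and the evenly spaced timestep grid are all assumptions made for illustration, not the authors' method.

```python
from typing import List


def steps_for_iteration(it: int, total_iters: int,
                        start_steps: int = 4, end_steps: int = 1) -> int:
    """Hypothetical schedule: split training into equal phases and
    drop the student's denoising step count by one in each phase."""
    num_phases = start_steps - end_steps + 1
    phase_len = max(1, total_iters // num_phases)
    # Clamp so that leftover iterations stay in the final phase.
    phase = min(it // phase_len, num_phases - 1)
    return start_steps - phase


def timestep_grid(num_steps: int, t_max: float = 1.0) -> List[float]:
    """Assumed evenly spaced timesteps from t_max down toward 0 (exclusive)."""
    return [t_max * (num_steps - i) / num_steps for i in range(num_steps)]


if __name__ == "__main__":
    total = 300
    for it in (0, 75, 150, 225, 299):
        k = steps_for_iteration(it, total)
        print(it, k, timestep_grid(k))
```

Under these assumptions the student starts with the 4-step grid produced by the first stage and ends with a single timestep, which corresponds to the one-step generator described in the abstract.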
