CV · Feb 18, 2025

AV-Flow: Transforming Text to Audio-Visual Human-like Interactions

arXiv:2502.13133v1 · 11 citations · h-index: 88
Originality: Highly original
AI Analysis

The work addresses the need for interactive audio-visual avatars in applications such as virtual assistants and entertainment, and it represents a novel method rather than an incremental improvement.

The paper tackles the problem of generating photo-realistic 4D talking avatars from text input, achieving human-like speech synthesis, synchronized lip motion, facial expressions, and head pose, with results outperforming prior work.

We introduce AV-Flow, an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. In contrast to prior work that assumes an existing speech signal, we synthesize speech and vision jointly. We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions and head pose, all generated from just text characters. The core premise of our approach lies in the architecture of our two parallel diffusion transformers. Intermediate highway connections ensure communication between the audio and visual modalities, and thus synchronized speech intonation and facial dynamics (e.g., eyebrow motion). Our model is trained with flow matching, leading to expressive results and fast inference. In the case of dyadic conversations, AV-Flow produces an always-on avatar that actively listens and reacts to the audio-visual input of a user. Through extensive experiments, we show that our method outperforms prior work, synthesizing natural-looking 4D talking avatars. Project page: https://aggelinacha.github.io/AV-Flow/
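The abstract says the model is trained with flow matching but gives no formulation. As a rough illustration only, the sketch below shows the common straight-line variant of flow matching on a toy vector: a velocity predictor is regressed onto `x1 - x0` along the path between noise `x0` and data `x1`, and sampling integrates that velocity with a few Euler steps. This is not the AV-Flow implementation. The function names, the linear path, the Euler solver, and the oracle predictor standing in for a trained network are all assumptions made for the example.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target along the straight-line path: the constant x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = predict_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy setup (hypothetical): a single data point and Gaussian noise.
data = [1.0, -2.0, 0.5]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in data]


def oracle_velocity(x, t):
    """Stand-in for a trained network: with one target point, the ideal
    velocity at (x, t) is (data - x) / (1 - t)."""
    return [(d - xi) / (1.0 - t) for d, xi in zip(data, x)]


loss = flow_matching_loss(oracle_velocity, noise, data, t=0.3)
sample = euler_sample(oracle_velocity, noise, num_steps=10)
```

With the oracle predictor, the loss is zero up to floating-point error and ten Euler steps carry the noise vector onto the data point. Straight paths of this kind are one reason flow-matching samplers can get by with few integration steps, which fits the abstract's claim of fast inference.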

