CV · Mar 13, 2025

Long-Video Audio Synthesis with Multi-Agent Collaboration

arXiv:2503.10719v2 · 5 citations · h-index: 3
Originality: Highly original
AI Analysis

This work addresses a critical problem for film and interactive media: producing immersive, coherent audio dubbing for long videos. It offers a novel method for a known bottleneck rather than an incremental improvement.

The paper tackles the challenge of generating synchronized audio for long-form videos, where existing methods struggle with semantic shifts and misalignment. It proposes LVAS-Agent, a multi-agent framework that decomposes the process into specialized steps and adds mechanisms for refinement and alignment. The authors report superior audio-visual alignment over baseline methods.

Video-to-audio synthesis, which generates synchronized audio for visual content, critically enhances viewer immersion and narrative coherence in film and interactive media. However, video-to-audio dubbing for long-form content remains an unsolved challenge due to dynamic semantic shifts, temporal misalignment, and the absence of dedicated datasets. While existing methods excel on short videos, they falter on long-form content (e.g., movies) due to fragmented synthesis and inadequate cross-scene consistency. We propose LVAS-Agent, a novel multi-agent framework that emulates professional dubbing workflows through collaborative role specialization. Our approach decomposes long-video synthesis into four steps: scene segmentation, script generation, sound design, and audio synthesis. Central innovations include a discussion-correction mechanism for scene/script refinement and a generation-retrieval loop for temporal-semantic alignment. To enable systematic evaluation, we introduce LVAS-Bench, the first benchmark with 207 professionally curated long videos spanning diverse scenarios. Experiments demonstrate superior audio-visual alignment over baseline methods. Project page: https://lvas-agent.github.io
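The four-step decomposition described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the `Scene` type, `segment_scenes`, `run_pipeline`, and the four callables are all hypothetical names, and the bounded review loop is only a loose stand-in for the paper's discussion-correction mechanism. The generation-retrieval loop is not modeled here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Scene:
    start: float                      # scene start time in seconds
    end: float                        # scene end time in seconds
    script: str = ""                  # textual description of what should be heard
    sound_plan: List[str] = field(default_factory=list)
    audio: str = ""                   # stand-in for a waveform or file path


def segment_scenes(duration: float, boundaries: List[float]) -> List[Scene]:
    """Step 1 (hypothetical): cut the timeline at the given boundary timestamps."""
    # Boundaries outside (0, duration) are ignored; the rest are sorted.
    cuts = [0.0] + sorted(b for b in boundaries if 0.0 < b < duration) + [duration]
    return [Scene(start=a, end=b) for a, b in zip(cuts, cuts[1:])]


def run_pipeline(
    duration: float,
    boundaries: List[float],
    write_script: Callable[[Scene, Optional[str]], str],
    review_script: Callable[[Scene], Optional[str]],
    design_sound: Callable[[Scene], List[str]],
    synthesize: Callable[[Scene], str],
    max_rounds: int = 2,
) -> List[Scene]:
    """Run the four steps per scene; every callable is an assumed agent role."""
    scenes = segment_scenes(duration, boundaries)
    for scene in scenes:
        # Step 2: draft a script, then revise it while the reviewer objects.
        scene.script = write_script(scene, None)
        for _ in range(max_rounds):
            feedback = review_script(scene)
            if feedback is None:      # reviewer accepts the script
                break
            scene.script = write_script(scene, feedback)
        # Step 3: turn the script into a list of sound elements.
        scene.sound_plan = design_sound(scene)
        # Step 4: produce audio for the scene.
        scene.audio = synthesize(scene)
    return scenes
```

Passing the roles in as callables keeps the sketch independent of any particular model or tool; in a real system each one would wrap a vision-language model, a retrieval step, or an audio generator.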

