SD · CV · Mar 10, 2025

ReelWave: Multi-Agentic Movie Sound Generation through Multimodal LLM Conversation

arXiv:2503.07217v3 · 1 citation · h-index: 11
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of compelling audio storytelling in movies. The advance is incremental: it builds on existing multimodal frameworks by adding multi-agent coordination.

The paper tackles the problem of generating synchronized audio for movie scenes by introducing a multi-agent framework where a Sound Director agent coordinates on-screen and off-screen sound generation through multimodal LLM conversations, resulting in rich and relevant audio content for video clips.

Current audio generation conditioned by text or video focuses on aligning audio with text/video modalities. Despite excellent alignment results, these multimodal frameworks still cannot be directly applied to compelling movie storytelling involving multiple scenes, where "on-screen" sounds require temporally-aligned audio generation, while "off-screen" sounds contribute to appropriate environment sounds accompanied by background music when applicable. Inspired by professional movie production, this paper proposes a multi-agentic framework for audio generation supervised by an autonomous Sound Director agent, engaging multi-turn conversations with other agents for on-screen and off-screen sound generation through multimodal LLM. To address on-screen sound generation, after detecting any talking humans in videos, we capture semantically and temporally synchronized sound by training a prediction model that forecasts interpretable, time-varying audio control signals: loudness, pitch, and timbre, which are used by a Foley Artist agent to condition a cross-attention module in the sound generation. The Foley Artist works cooperatively with the Composer and Voice Actor agents, and together they autonomously generate off-screen sound to complement the overall production. Each agent takes on specific roles similar to those of a movie production team. To temporally ground audio language models, in ReelWave, text/video conditions are decomposed into atomic, specific sound generation instructions synchronized with visuals when applicable. Consequently, our framework can generate rich and relevant audio content conditioned on video clips extracted from movies.
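The decomposition of text/video conditions into atomic, time-stamped sound instructions that a director hands to role-specific agents might be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the `SoundInstruction` fields, the `kind` labels, and the `route_instructions` helper are all hypothetical names introduced here, and no LLM calls or audio models are involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoundInstruction:
    """One atomic sound request (hypothetical schema, not from the paper)."""
    start_s: float      # start time in seconds, relative to the clip
    end_s: float        # end time in seconds
    description: str    # short text prompt for the sound
    kind: str           # assumed labels: "foley", "music", or "voice"
    on_screen: bool     # True if the sound source is visible in the frame


# Assumed mapping from instruction kind to the agent roles named in the abstract.
AGENT_FOR_KIND = {
    "foley": "Foley Artist",
    "music": "Composer",
    "voice": "Voice Actor",
}


def route_instructions(instructions):
    """Group instructions by responsible agent, ordered by start time."""
    plan = {agent: [] for agent in AGENT_FOR_KIND.values()}
    for ins in sorted(instructions, key=lambda i: i.start_s):
        if ins.end_s <= ins.start_s:
            raise ValueError(f"empty or negative time span: {ins}")
        if ins.kind not in AGENT_FOR_KIND:
            raise ValueError(f"unknown instruction kind: {ins.kind!r}")
        plan[AGENT_FOR_KIND[ins.kind]].append(ins)
    return plan


# Example: a made-up three-instruction plan for a 10-second clip.
example = [
    SoundInstruction(4.0, 10.0, "tense low strings", "music", on_screen=False),
    SoundInstruction(0.5, 1.2, "door slams shut", "foley", on_screen=True),
    SoundInstruction(0.0, 10.0, "distant city traffic", "foley", on_screen=False),
]
plan = route_instructions(example)
for agent, items in plan.items():
    print(agent, [i.description for i in items])
```

Keeping each instruction atomic and time-stamped means that, in a sketch like this, on-screen items can be checked against the video timeline while off-screen items only need a plausible span.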
