EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation
This work addresses fine-grained controllable sound generation for videos, a problem that matters to multimedia creators. The contribution is incremental, however, in that it builds on existing video-text-to-audio methods.
The paper tackles limitations of video-text-to-audio generation, such as visual dominance and weak controllability, by introducing EchoFoley, a task for video-grounded sound generation with event-level control. Its proposed framework, EchoVidia, is reported to improve on recent models by 40.7% in controllability and 12.5% in perceptual quality.
Sound effects form an essential layer of multimodal storytelling, shaping both the emotional atmosphere and the narrative semantics of videos. Despite recent advances in video-text-to-audio (VT2A) generation, the current formulation faces three key limitations: first, an imbalance between visual and textual conditioning that leads to visual dominance; second, the absence of a concrete definition of fine-grained controllable generation; and third, weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley, a new task designed for video-grounded sound generation with both event-level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls such as sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video-instruction-annotation triplets. Building upon this foundation, we propose EchoVidia, a sounding-event-centric agentic generation framework with a slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.
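To make the "when, what, and how" description of a sounding event more concrete, the following is a minimal illustrative sketch in Python. It is not the paper's actual symbolic representation, which the abstract does not spell out: the class name `SoundingEvent`, its fields, and the helpers `insert_event` and `edit_event` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class SoundingEvent:
    """Hypothetical event record; field names are assumptions, not the paper's schema."""
    start: float  # when: onset time in seconds
    end: float    # when: offset time in seconds
    what: str     # what: the sound source or action
    how: str      # how: free-text description of the sound's character

    def __post_init__(self) -> None:
        # Reject events with negative or empty time spans.
        if not 0 <= self.start < self.end:
            raise ValueError("expected 0 <= start < end")


def insert_event(timeline: List[SoundingEvent], event: SoundingEvent) -> List[SoundingEvent]:
    """Return a new timeline with `event` added, ordered by onset then offset."""
    return sorted([*timeline, event], key=lambda e: (e.start, e.end))


def edit_event(timeline: List[SoundingEvent], index: int, **changes) -> List[SoundingEvent]:
    """Return a new timeline in which the event at `index` has the given fields replaced."""
    updated = list(timeline)
    updated[index] = replace(updated[index], **changes)
    return updated


# Example: start with one event, insert a second, then edit how the first one sounds.
timeline = [SoundingEvent(0.0, 1.5, "door", "slow creak")]
timeline = insert_event(timeline, SoundingEvent(0.5, 0.9, "footstep", "soft, on wood"))
timeline = edit_event(timeline, 0, how="sharp slam")
```

Under these assumptions, insertion and editing become operations on a list of event records, which is one plausible way an event-level representation could expose local control; the framework described in the paper may structure this quite differently.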