SD, CV, LG, MM, AS, Aug 21, 2024

Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound

arXiv:2408.11915v3, 21 citations, h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses the labor-intensive task of synchronizing audio with video in multimedia production. It offers an incremental improvement over prior methods by removing the need for costly human timestamp annotations.

The paper tackles automated Foley sound synthesis from video, targeting the poor alignment and controllability of existing systems. It proposes Video-Foley, which uses Root Mean Square (RMS), a frame-level intensity envelope, as a temporal event condition, and reports state-of-the-art performance in audio-visual alignment and controllability.

Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically. Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges. Systems lacking explicit temporal features suffer from poor alignment and controllability, while timestamp-based models require costly and subjective human annotation. We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as an intuitive condition with semantic timbre prompts (audio or text). RMS, a frame-level intensity envelope closely related to audio semantics, acts as a temporal event feature to guide audio generation from video. The annotation-free self-supervised learning framework consists of two stages, Video2RMS and RMS2Sound, incorporating novel ideas including RMS discretization and RMS-ControlNet with a pretrained text-to-audio model. Our extensive evaluation shows that Video-Foley achieves state-of-the-art performance in audio-visual alignment and controllability for sound timing, intensity, timbre, and nuance. Source code, model weights, and demos are available on our companion website (https://jnwnlee.github.io/video-foley-demo).
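To make the RMS condition concrete, the sketch below computes a frame-level RMS envelope from a mono waveform and maps it to integer classes. It is an illustration only, not the authors' implementation: the function names `frame_rms` and `discretize_rms`, the frame and hop sizes, the bin count, and the uniform binning scheme are all assumptions, since the abstract does not specify how Video-Foley frames or discretizes RMS.

```python
import numpy as np

def frame_rms(wave, frame_len=512, hop=128):
    """Frame-level RMS envelope of a mono waveform (1-D float array).

    frame_len and hop are illustrative values, not taken from the paper.
    """
    wave = np.asarray(wave, dtype=np.float64)
    if len(wave) < frame_len:
        # Zero-pad very short inputs so that at least one frame exists.
        wave = np.pad(wave, (0, frame_len - len(wave)))
    n_frames = 1 + (len(wave) - frame_len) // hop
    # Index grid of shape (n_frames, frame_len): row i covers frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return np.sqrt(np.mean(wave[idx] ** 2, axis=1))

def discretize_rms(rms, n_bins=64, max_value=1.0):
    """Map RMS values to integer bins in [0, n_bins - 1].

    Uniform binning is an assumption made for this sketch; the paper's
    actual discretization scheme may differ.
    """
    scaled = np.clip(np.asarray(rms) / max_value, 0.0, 1.0)
    return np.minimum((scaled * n_bins).astype(np.int64), n_bins - 1)

# Toy signal: half a second of silence followed by a 440 Hz tone.
sr = 16000
t = np.arange(sr // 2) / sr
wave = np.concatenate([np.zeros(sr // 2), 0.5 * np.sin(2 * np.pi * 440 * t)])

rms = frame_rms(wave)       # continuous intensity envelope, one value per frame
bins = discretize_rms(rms)  # integer class per frame
```

On this toy signal the envelope is zero during the silence and rises to roughly 0.35 (the tone's amplitude divided by the square root of 2) once the tone starts, which is the kind of onset and intensity cue a temporal event condition carries. Turning the envelope into classes lets a model that predicts it from video treat the task as per-frame classification instead of regression.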

