CV · AI · SD · AS · Jun 13, 2024

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

arXiv:2406.09272v3
24 citations
Originality: Highly original
AI Analysis

This work addresses the problem of generating faithful action sounds from videos, for applications such as film sound effects or virtual reality games. The contribution is incremental in that it builds on existing video-to-audio generation methods, extended with a novel audio-conditioning mechanism.

The paper tackles the problem of generating realistic audio for human actions from egocentric videos. It targets the uncontrolled ambient sounds and hallucinations that arise when video and audio correspond only weakly in the training data. The proposed AV-LDM model outperforms existing methods, allows controllable ambient sound generation, and shows promise for generalization to computer graphics game clips.

Generating realistic audio for human actions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets, Ego4D and EPIC-KITCHENS, and we introduce Ego4D-Sounds -- 1.2M curated clips with action-audio correspondence. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our approach is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.
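The abstract does not say how the retrieval step works, so the following is only a minimal sketch of one plausible reading: at test time, a silent video is matched against a bank of training clips, and the audio of the closest match is used as the conditioning signal. The function names, the use of cosine similarity, and the precomputed embeddings are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_audio_condition(query_embedding, bank):
    """Return the id of the bank clip most similar to the query.

    `bank` is a hypothetical list of (clip_id, embedding) pairs; in this
    sketch the audio of the returned clip would serve as the ambient
    condition for the generator.
    """
    if not bank:
        raise ValueError("retrieval bank is empty")
    best_id, _ = max(
        bank, key=lambda item: cosine_similarity(query_embedding, item[1])
    )
    return best_id
```

For example, with a bank of two hypothetical clips whose embeddings are `[1.0, 0.0]` and `[0.0, 1.0]`, a query embedding of `[0.9, 0.1]` retrieves the first clip.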

