CV AI SD AS

Jun 13, 2024

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman

arXiv:2406.09272v319.424 citations

Originality: Highly original

AI Analysis

This work addresses the problem of generating faithful action sounds from videos, for applications such as film sound effects or virtual reality games. It is incremental in that it builds on existing video-to-audio generation methods, adding a novel audio-conditioning mechanism.

The paper tackles the generation of realistic audio for human actions in egocentric videos. It targets the uncontrolled ambient sounds and hallucinations that arise when the training data has weak correspondence between video and audio. The proposed AV-LDM model outperforms existing methods, allows controllable generation of the ambient sound, and shows promise for generalizing to computer graphics game clips.

Generating realistic audio for human actions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets, Ego4D and EPIC-KITCHENS, and we introduce Ego4D-Sounds -- 1.2M curated clips with action-audio correspondence. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our approach is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.
