CV · MM · SD · AS · Apr 17, 2023

Conditional Generation of Audio from Video via Foley Analogies

arXiv:2304.08490v1 · 67 citations · h-index: 44
Originality: Incremental advance
AI Analysis

This paper addresses a challenge video designers face: creating artistic sound effects that align with on-screen actions yet differ from real-world sounds. The advance is incremental, since it applies conditional generation to Foley sound.

The paper tackles the problem of generating a soundtrack for a silent video from a user-supplied example rather than from the scene's true sound. Through human studies and automated metrics, it shows that the model generates sound that matches the video while varying its output according to the supplied example.

The sound effects that designers add to videos are designed to convey a particular artistic effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges of creating a soundtrack for a video that differs from its true sound, but that nonetheless matches the actions occurring on screen, we propose the problem of conditional Foley. We present the following contributions to address this problem. First, we propose a pretext task for training our model to predict sound for an input video clip using a conditional audio-visual clip sampled from another time within the same source video. Second, we propose a model for generating a soundtrack for a silent input video, given a user-supplied example that specifies what the video should "sound like". We show through human studies and automated evaluation metrics that our model successfully generates sound from video, while varying its output according to the content of a supplied example. Project site: https://xypb.github.io/CondFoleyGen/
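The pretext task in the abstract (predicting sound for an input clip given a conditional audio-visual clip from another time in the same video) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the names `TrainingExample` and `sample_training_example`, the uniform sampling, the fixed clip length, and the requirement that the two clips do not overlap are all assumptions made for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class TrainingExample:
    # Start time (s) of the conditional clip: its video AND audio are model inputs.
    cond_start: float
    # Start time (s) of the input clip: its video is an input, its audio is the target.
    input_start: float
    # Shared clip length in seconds (assumed fixed for this sketch).
    clip_len: float


def sample_training_example(video_duration, clip_len, rng):
    """Pick two non-overlapping clips from one video (hypothetical scheme)."""
    if video_duration < 2 * clip_len:
        raise ValueError("video too short for two non-overlapping clips")
    # Time left over once room for both clips has been reserved.
    slack = video_duration - 2 * clip_len
    # Two sorted offsets inside the slack place the earlier and later clip.
    a, b = sorted(rng.uniform(0.0, slack) for _ in range(2))
    earlier = a
    later = b + clip_len  # at least clip_len after `earlier`, so no overlap
    # Randomly choose which of the two clips acts as the condition.
    if rng.random() < 0.5:
        return TrainingExample(earlier, later, clip_len)
    return TrainingExample(later, earlier, clip_len)


if __name__ == "__main__":
    example = sample_training_example(10.0, 2.0, random.Random(0))
    print(example)
```

Under these assumptions, drawing both clips from the same source video gives a training signal without labels: the conditional clip shows what the video "sounds like", and the held-out audio of the input clip is the prediction target. At test time the conditional clip would instead be the user-supplied example.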

Code Implementations: 1 repo
