SD CV LG MM AS
Oct 23, 2023

SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis

Marco Comunità, Riccardo F. Gramaccioni, Emilian Postolache, Emanuele Rodolà, Danilo Comminiello, Joshua D. Reiss

arXiv:2310.15247v120.731 citations h-index: 26

Originality: Incremental advance

AI Analysis

This work addresses the time-consuming task of synchronizing sound effects with video, which sound designers face in media such as cinema, video games, and animation. It is an incremental advance because it builds on existing diffusion models and multimodal techniques.

The paper tackles the problem of synchronizing sound effects with video in sound design. It proposes a system that extracts the onsets of repetitive actions from video and uses them, together with audio or textual embeddings, to condition a diffusion model that generates a synchronized audio track. This reduces the burden of manual synchronization.

Sound design involves creatively selecting, recording, and editing sound effects for media such as cinema, video games, and virtual/augmented reality. One of the most time-consuming steps in sound design is synchronizing audio with video. In some cases, environmental recordings from video shoots are available and can aid the process. However, in video games and animations no reference audio exists, so event timings must be annotated manually from the video. We propose a system that extracts the onsets of repetitive actions from a video; these onsets are then used, in conjunction with audio or textual embeddings, to condition a diffusion model trained to generate a new synchronized sound effects audio track. In this way, we leave complete creative control to the sound designer while removing the burden of synchronization with video. Furthermore, editing the onset track or changing the conditioning embedding requires much less effort than editing the audio track itself, simplifying the sonification process. We provide sound examples, source code, and pretrained models to facilitate reproducibility.
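
To illustrate why editing an onset track is cheaper than editing audio, here is a minimal sketch of one possible onset-track representation. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `onsets_to_track` and `shift_onset`, the 100 Hz track rate, and the binary encoding are all hypothetical. The diffusion model and the video onset detector are deliberately left out.

```python
from typing import List


def onsets_to_track(onset_times_s: List[float], duration_s: float,
                    rate_hz: int = 100) -> List[int]:
    """Convert onset timestamps (seconds) into a binary track.

    Hypothetical representation: one value per frame at rate_hz,
    1 where an action onset falls, 0 elsewhere.
    """
    n_frames = int(round(duration_s * rate_hz))
    track = [0] * n_frames
    for t in onset_times_s:
        idx = int(round(t * rate_hz))
        if 0 <= idx < n_frames:  # ignore onsets outside the clip
            track[idx] = 1
    return track


def shift_onset(onset_times_s: List[float], index: int,
                delta_s: float) -> List[float]:
    """Return a sorted copy of the onsets with one onset moved by delta_s."""
    edited = list(onset_times_s)
    edited[index] += delta_s
    return sorted(edited)


# Three assumed onsets (e.g. footsteps) in a 3-second clip.
onsets = [0.50, 1.25, 2.00]
track = onsets_to_track(onsets, duration_s=3.0)

# Re-timing the second event is a one-number edit; the audio would then
# be regenerated from the edited track rather than cut and moved by hand.
edited_track = onsets_to_track(shift_onset(onsets, 1, 0.10), duration_s=3.0)
```

In a setup like this, the designer adjusts timestamps or swaps the conditioning embedding, and the generative model is responsible for producing audio that matches the new track.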
