CV MM SD AS · Jul 8, 2024

Read, Watch and Scream! Sound Generation from Text and Video

arXiv:2407.05551v2 · 50 citations · h-index: 5
Originality: Incremental advance
AI Analysis

This paper addresses the limited performance and flexibility of video-to-audio generation for multimedia applications. It is an incremental advance that combines existing approaches.

The paper tackles the challenge of generating audio from video and text by proposing a method that uses video as conditional control for a text-to-audio model, achieving improved quality, controllability, and training efficiency.

Despite the impressive progress of multimodal generative models, video-to-audio generation still suffers from limited performance and offers little flexibility to prioritize sound synthesis for specific objects within the scene. Conversely, text-to-audio generation methods generate high-quality audio but struggle to ensure comprehensive scene depiction and time-varying control. To tackle these challenges, we propose a novel video-and-text-to-audio generation method, called ReWaS, where video serves as a conditional control for a text-to-audio generation model. Specifically, our method estimates the structural information of sound (namely, energy) from the video while receiving key content cues from a user prompt. We employ a well-performing text-to-audio model to consolidate the video control, which is much more efficient than training multimodal diffusion models with massive triplet-paired (audio-video-text) data. In addition, by separating the generative components of audio, our method becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences. Experimental results demonstrate that our method is superior in terms of quality, controllability, and training efficiency. Code and demo are available at https://naver-ai.github.io/rewas.
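
The "energy" control signal in the abstract can be pictured as a per-frame loudness curve over time. The following is a minimal sketch, assuming energy means frame-wise RMS amplitude of a mono waveform. The abstract does not give the paper's exact definition, and the model that predicts this curve from video is not shown. The helper name `frame_energy` and the `frame_len` and `hop` values are hypothetical choices for illustration.

```python
import numpy as np


def frame_energy(wave, frame_len=1024, hop=512):
    """Frame-wise RMS energy of a mono waveform (illustrative helper,
    not the paper's definition of energy)."""
    wave = np.asarray(wave, dtype=np.float64)
    # Pad very short inputs so that at least one full frame exists.
    if len(wave) < frame_len:
        wave = np.pad(wave, (0, frame_len - len(wave)))
    n_frames = 1 + (len(wave) - frame_len) // hop
    # Slice the signal into overlapping frames of length frame_len.
    frames = np.stack(
        [wave[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    # Root mean square per frame gives one energy value per time step.
    return np.sqrt(np.mean(frames ** 2, axis=1))


# Toy signal: half a second of silence, then a 440 Hz tone (assumed 16 kHz).
sr = 16000
t = np.arange(sr) / sr
wave = np.where(t < 0.5, 0.0, 0.5 * np.sin(2 * np.pi * 440 * t))

energy = frame_energy(wave)
```

In this toy signal the curve stays at zero during the silence and rises once the tone starts. A temporal envelope of this kind is what a video-conditioned control would supply, while the text prompt determines what the sound is.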

Code Implementations: 1 repo