SD · AI · AS · Dec 24, 2024

Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance

arXiv:2412.18157v1 · 2 citations · h-index: 16 · ICASSP
Originality: Incremental advance
AI Analysis

This work addresses the problem of producing realistic, temporally aligned audio in video-to-audio generation. The advance is incremental: it builds on pre-trained text-to-audio models and adds adapters to improve performance.

The paper tackles the challenge of generating continuous, synchronized Foley sound for videos with moving visual presence. It proposes Smooth-Foley, a model that uses semantic guidance from textual labels to improve audio-video alignment, yielding higher-quality audio that adheres to physical laws better than the output of existing models.

The video-to-audio (V2A) generation task has drawn attention in the field of multimedia due to its practicality in producing Foley sound. Semantic and temporal conditions are fed to the generation model to indicate sound events and their temporal occurrence. Recent studies on synthesizing immersive and synchronized audio face challenges on videos with moving visual presence. The temporal condition is not accurate enough, while the low-resolution semantic condition exacerbates the problem. To tackle these challenges, we propose Smooth-Foley, a V2A generative model that takes semantic guidance from the textual label throughout generation to enhance both semantic and temporal alignment of the audio. Two adapters are trained to leverage pre-trained text-to-audio generation models. A frame adapter integrates high-resolution frame-wise video features, while a temporal adapter integrates temporal conditions obtained from similarities between visual frames and textual labels. The incorporation of semantic guidance from textual labels achieves precise audio-video alignment. We conduct extensive quantitative and qualitative experiments. Results show that Smooth-Foley performs better than existing models in both continuous sound scenarios and general scenarios. With semantic guidance, the audio generated by Smooth-Foley exhibits higher quality and better adherence to physical laws.
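The abstract says the temporal condition is obtained from similarities between visual frames and textual labels, but gives no formula. The following is only a minimal sketch of that idea, not the authors' implementation. It assumes that each frame and the label have already been embedded in a shared vector space by some joint vision-language encoder (unspecified here), that the similarity is cosine similarity, and that the per-frame scores are min-max rescaled to [0, 1]. The function names `cosine` and `temporal_condition` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction; treat as unrelated
    return dot / (norm_u * norm_v)


def temporal_condition(frame_embeddings, label_embedding):
    """Per-frame similarity to the label, rescaled to [0, 1].

    Assumption: min-max rescaling is an illustrative choice, not
    something stated in the abstract.
    """
    sims = [cosine(f, label_embedding) for f in frame_embeddings]
    lo, hi = min(sims), max(sims)
    if hi == lo:
        # A flat curve carries no timing information.
        return [0.0 for _ in sims]
    return [(s - lo) / (hi - lo) for s in sims]


# Toy 2-D embeddings standing in for real encoder outputs (hypothetical).
frames = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
label = [0.0, 1.0]
curve = temporal_condition(frames, label)
```

In this toy case the curve rises across the three frames as they come to match the label, which is the kind of continuous, per-frame signal a temporal adapter could consume in place of a binary onset mask.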
