Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance
This work addresses the challenge of generating realistic audio from video for applications in media production. It is incremental in that it builds on existing V2A methods, to which it adds a novel guidance approach.
The paper tackles video-to-audio synthesis by proposing a step-by-step method for finer controllability and more realistic audio generation. The method improves the separability of the generated sounds and the overall audio quality, outperforming existing baselines.
We propose a step-by-step video-to-audio (V2A) generation method for finer controllability over the generation process and more realistic audio synthesis. Inspired by traditional Foley workflows, our approach aims to comprehensively capture all sound events induced by a video through the incremental generation of missing sound events. To avoid the need for costly multi-reference video-audio datasets, each generation step is formulated as a negatively guided V2A process that discourages duplication of existing sounds. The guidance model is trained by finetuning a pre-trained V2A model on audio pairs from adjacent segments of the same video, allowing training with standard single-reference audiovisual datasets that are easily accessible. Objective and subjective evaluations demonstrate that our method enhances the separability of generated sounds at each step and improves the overall quality of the final composite audio, outperforming existing baselines.
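The two mechanisms named in the abstract, negative guidance against existing sounds and training pairs drawn from adjacent segments, can be illustrated with a minimal sketch. The abstract does not state the guidance formula, so the combination rule below is an assumption modeled on classifier-free guidance, not the authors' exact formulation. The function names, the weights `w_video` and `w_neg`, and the segment-index layout are all hypothetical.

```python
def negatively_guided_prediction(pred_uncond, pred_video, pred_video_audio,
                                 w_video=4.5, w_neg=1.5):
    """Combine three model outputs in a classifier-free-guidance style.

    ASSUMPTION: this combination rule is illustrative only.
    pred_uncond      : prediction with no conditioning
    pred_video       : prediction conditioned on the video
    pred_video_audio : prediction from a guidance model that also sees the
                       audio generated in earlier steps
    The video term pulls the sample toward sounds that match the video; the
    negative term pushes it away from sounds that duplicate existing audio.
    """
    combined = []
    for u, v, a in zip(pred_uncond, pred_video, pred_video_audio):
        video_direction = v - u        # toward video-consistent audio
        duplication_direction = a - v  # toward already-generated audio
        combined.append(u + w_video * video_direction
                        - w_neg * duplication_direction)
    return combined


def adjacent_segment_pairs(num_samples, segment_len):
    """Index pairs for neighbouring, non-overlapping audio segments.

    ASSUMPTION: a hypothetical way to form training pairs from one clip.
    Each tuple is (start_a, end_a, start_b, end_b) with segment b directly
    following segment a in the same recording.
    """
    starts = range(0, num_samples - 2 * segment_len + 1, segment_len)
    return [(s, s + segment_len, s + segment_len, s + 2 * segment_len)
            for s in starts]


if __name__ == "__main__":
    guided = negatively_guided_prediction(
        [0.0, 0.0], [1.0, 2.0], [1.5, 2.0], w_video=2.0, w_neg=1.0)
    print(guided)
    print(adjacent_segment_pairs(10, 3))
```

Setting `w_neg` to zero reduces the rule to ordinary video-conditioned guidance, which makes the effect of the negative term easy to isolate in experiments.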