CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation

arXiv:2605.18916
Predicted impact: top 60% in MM (last 90 days) · Originality: Incremental advance
AI Analysis

This work addresses counterfactual video foley generation: producing audio whose sound source contradicts the visual evidence, a novel and challenging task for Video&Text-to-Audio (VT2A) models.

CounterFlow introduces a two-phase inference-time sampling method for pretrained flow-matching video-to-audio models that enables generating audio with a sound-source identity contradicting the visual evidence while maintaining temporal synchronization. The method significantly outperforms naive negative prompting and existing baselines in counterfactual video foley generation.
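The two-phase idea can be illustrated with a minimal sketch. It assumes an Euler integrator over a flow-matching velocity field, with video conditioning kept in the first phase and dropped in the second, as the abstract describes. Everything concrete here is hypothetical and not taken from the paper: the `velocity(x, t, video, text)` signature, the guidance formulas, the `switch`, `w_neg`, and `w_text` parameters, and the `toy_velocity` stand-in for a pretrained model.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical model interface: velocity(x, t, video, text) -> dx/dt.
# Passing None for video or text means that condition is dropped.
VelocityFn = Callable[[Sequence[float], float, Optional[str], Optional[str]], Vector]


def two_phase_sample(
    velocity: VelocityFn,
    x0: Sequence[float],
    video: str,
    target_prompt: str,
    implied_source: str,
    steps: int = 20,
    switch: float = 0.5,   # assumed fraction of steps spent in phase 1
    w_neg: float = 2.0,    # assumed strength of source suppression
    w_text: float = 3.0,   # assumed strength of text guidance
) -> Vector:
    """Euler-integrate a flow ODE from t=0 to t=1 in two phases."""
    x = list(x0)
    dt = 1.0 / steps
    switch_step = int(round(switch * steps))
    for i in range(steps):
        t = i * dt
        if i < switch_step:
            # Phase 1: keep the video condition for timing, and push the
            # velocity away from the visually implied source (assumed form).
            pos = velocity(x, t, video, target_prompt)
            neg = velocity(x, t, video, implied_source)
            v = [p + w_neg * (p - n) for p, n in zip(pos, neg)]
        else:
            # Phase 2: drop the video condition and apply classifier-free
            # style guidance toward the target prompt only.
            pos = velocity(x, t, None, target_prompt)
            unc = velocity(x, t, None, None)
            v = [u + w_text * (p - u) for p, u in zip(pos, unc)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, video, text):
    """Stand-in for a pretrained model: 1.0 for the target prompt, else 0.0."""
    return [1.0 if text == "dog barking" else 0.0 for _ in x]
```

A call such as `two_phase_sample(toy_velocity, [0.0], "clip.mp4", "dog barking", "cat meowing", steps=4)` exercises both phases. With a real model, `x0` would be sampled noise and the result a latent to decode into audio.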

We investigate Counterfactual Video Foley Generation, which aims to generate audio that adopts a sound-source identity contradicting the visual evidence while remaining temporally synchronized to a silent video. Existing Video&Text-to-Audio (VT2A) models struggle with this, often remaining anchored to the visually implied sound source when the video and text contents disagree. We present CounterFlow, an inference-time two-phase sampling scheme for pretrained flow-matching VT2A models. Phase 1 builds a video-derived temporal structure while suppressing the visually implied source; Phase 2 drops video conditioning to focus entirely on shaping audio timbre toward the target prompt. CounterFlow substantially improves counterfactual video foley generation compared to naive negative prompting and state-of-the-art baselines. To evaluate replacement quality, we propose a metric that leverages a text-audio co-embedding space to measure both target-prompt evidence and residual leakage of the visually implied source. Video demonstrations and code are available at https://gyubin-lee.github.io/counterflow-demo/
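The evaluation idea of scoring target-prompt evidence and residual source leakage in a shared text-audio embedding space can be sketched as follows. This is an illustration under stated assumptions, not the paper's metric: it assumes embeddings are already computed by some text-audio co-embedding model, uses cosine similarity as the score, and the function name `replacement_scores` and the `margin` field are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def replacement_scores(
    audio_emb: Sequence[float],
    target_text_emb: Sequence[float],
    source_text_emb: Sequence[float],
) -> Dict[str, float]:
    """Score generated audio against the target and the implied source.

    All three inputs are assumed to live in the same co-embedding space.
    """
    target = cosine(audio_emb, target_text_emb)    # evidence for the target prompt
    leakage = cosine(audio_emb, source_text_emb)   # residual implied-source content
    return {
        "target_evidence": target,
        "source_leakage": leakage,
        "margin": target - leakage,  # hypothetical single-number summary
    }
```

Under this sketch, a good replacement would score high on `target_evidence` and low on `source_leakage`. Reporting the two values separately keeps a failure to reach the target distinguishable from a failure to remove the original source.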

