SD · MM · AS · Apr 17

StereoFoley: Object-Aware Stereo Audio Generation from Video

arXiv:2509.18272
AI Analysis

This work addresses the lack of object-aware stereo audio in video-to-audio generation, a critical gap for immersive multimedia applications.

StereoFoley generates stereo audio from video with object-aware spatial accuracy, achieving semantic and temporal fidelity on par with state-of-the-art models. It introduces a synthetic data pipeline and a stereo object-awareness metric, validated by human listening studies.

We present StereoFoley, a video-to-audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz. While recent generative video-to-audio models achieve strong semantic and temporal fidelity, they largely remain limited to mono or fail to deliver object-aware stereo imaging, constrained by the lack of professionally mixed, spatially accurate video-to-audio datasets. First, we develop a base model that generates stereo audio from video, achieving performance on par with state-of-the-art V2A models in both semantic accuracy and synchronization. Next, to overcome dataset limitations, we introduce a synthetic data generation pipeline that combines video analysis, object tracking, and audio synthesis with dynamic panning and distance-based loudness controls, enabling spatially accurate object-aware sound. Finally, we fine-tune the base model on this synthetic dataset, yielding clear object-audio correspondence. Since no established metrics exist, we introduce a stereo object-awareness metric and report it alongside a human listening study; the two evaluations exhibit consistent trends. This work establishes the first end-to-end framework for stereo object-aware video-to-audio generation, addressing a critical gap in the field.
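The "dynamic panning and distance-based loudness controls" step of the synthetic pipeline can be illustrated with a minimal sketch. The abstract does not state which pan law or attenuation model is used, so the constant-power pan law and inverse-distance gain below are assumptions, and the names `pan_gains`, `spatialize`, `x_track`, and `dist_track` are hypothetical. The idea shown is only that a tracked object's horizontal position and distance can drive per-channel gains on a mono source.

```python
import math

def pan_gains(x_norm, distance, ref_distance=1.0):
    """Return (left_gain, right_gain) for one tracked object position.

    x_norm: horizontal position in the frame, 0.0 = far left, 1.0 = far right.
    distance: estimated object distance, same unit as ref_distance.
    Assumptions (not from the paper): constant-power pan law and
    inverse-distance attenuation with no boost closer than ref_distance.
    """
    x = min(max(x_norm, 0.0), 1.0)           # clamp to the visible frame
    theta = x * math.pi / 2                   # map position to [0, pi/2]
    atten = ref_distance / max(distance, ref_distance)
    return math.cos(theta) * atten, math.sin(theta) * atten

def spatialize(mono, x_track, dist_track):
    """Turn a mono signal into (left, right) using per-frame object tracks.

    x_track and dist_track hold one value per video frame; each value is
    held constant over the audio samples that fall inside that frame.
    """
    if not mono:
        return [], []
    n_frames = len(x_track)
    left, right = [], []
    for i, sample in enumerate(mono):
        frame = min(i * n_frames // len(mono), n_frames - 1)
        gain_l, gain_r = pan_gains(x_track[frame], dist_track[frame])
        left.append(sample * gain_l)
        right.append(sample * gain_r)
    return left, right

# An object moving from the left edge to the right edge over two frames.
left, right = spatialize([1.0, 1.0, 1.0, 1.0], [0.0, 1.0], [1.0, 1.0])
```

Holding each gain constant per frame keeps the sketch short but would produce audible steps on real audio; interpolating the gains between frames is the usual remedy.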
