CV · Jan 21

SpatialV2A: Visual-Guided High-fidelity Spatial Audio Generation

Yanan Wang, Linjie Ren, Zihao Li, Junyi Wang, Tian Gan

arXiv:2601.15017v1 · h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses the need for immersive audio in video-to-audio generation, offering a domain-specific improvement for applications like virtual reality and multimedia.

The paper tackles spatial audio generation from video. Existing methods lack spatial fidelity because they rely on mono audio datasets. The paper introduces a new dataset and a framework that substantially outperforms state-of-the-art models in spatial fidelity while maintaining semantic and temporal alignment.

While video-to-audio generation has achieved remarkable progress in semantic and temporal alignment, most existing studies focus solely on these aspects, paying limited attention to the spatial perception and immersive quality of the synthesized audio. This limitation stems largely from current models' reliance on mono audio datasets, which lack the binaural spatial information needed to learn visual-to-spatial audio mappings. To address this gap, we introduce two key contributions: we construct BinauralVGGSound, the first large-scale video-binaural audio dataset designed to support spatially aware video-to-audio generation; and we propose an end-to-end spatial audio generation framework guided by visual cues, which explicitly models spatial features. Our framework incorporates a visual-guided audio spatialization module that ensures the generated audio exhibits realistic spatial attributes and layered spatial depth while maintaining semantic and temporal alignment. Experiments show that our approach substantially outperforms state-of-the-art models in spatial fidelity and delivers a more immersive auditory experience, without sacrificing temporal or semantic consistency. All datasets, code, and model checkpoints will be publicly released to facilitate future research.
