SIREN: Spatially-Informed Reconstruction of Binaural Audio with Vision
The work addresses the lack of spatial audio in consumer videos with a modular, generic solution that requires no task-specific annotations, though it appears incremental because it builds on existing audio-visual methods.
The paper tackles visually guided conversion of monaural audio to binaural audio. It introduces SIREN, which achieves consistent gains on time-frequency and phase-sensitive metrics, with competitive SNR, on the FAIR-Play and MUSIC-Stereo datasets.
Binaural audio delivers spatial cues essential for immersion, yet most consumer videos are monaural due to capture constraints. We introduce SIREN, a visually guided mono-to-binaural framework that explicitly predicts the left and right channels. A ViT-based encoder learns dual-head self-attention to produce a shared scene map and end-to-end L/R attention, replacing hand-crafted masks. A soft, annealed spatial prior gently biases early L/R grounding, and a two-stage, confidence-weighted waveform-domain fusion (guided by mono reconstruction and interaural phase consistency) suppresses crosstalk when aggregating predictions from multiple crops and overlapping windows. Evaluated on FAIR-Play and MUSIC-Stereo, SIREN yields consistent gains on time-frequency and phase-sensitive metrics with competitive SNR. The design is modular and generic, requires no task-specific annotations, and integrates with standard audio-visual pipelines.
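To make the confidence-weighted fusion idea concrete, here is a minimal sketch of how overlapping window predictions might be merged in the waveform domain. It is not the paper's implementation. The function names, the exponential confidence, the `temperature` parameter, and the assumption that the mono signal equals the average of the left and right channels are all hypothetical choices made for illustration. The sketch uses only the mono-reconstruction cue and omits both the interaural phase term and the two-stage structure mentioned in the abstract.

```python
import numpy as np


def mono_consistency_confidence(pred_lr, mono, temperature=0.01):
    """Score one window by how well its L/R prediction reconstructs the mono input.

    pred_lr: array of shape (2, W), predicted left and right waveforms.
    mono:    array of shape (W,), the mono input for the same samples.
    Assumption (not from the paper): mono is the mean of L and R.
    Returns a scalar in (0, 1]; 1 means perfect mono reconstruction.
    """
    recon = 0.5 * (pred_lr[0] + pred_lr[1])
    err = np.mean((recon - mono) ** 2)
    return float(np.exp(-err / temperature))


def fuse_windows(preds, starts, mono, temperature=0.01):
    """Confidence-weighted overlap-add of per-window binaural predictions.

    preds:  list of (2, W_i) arrays, one per window or crop.
    starts: list of integer sample offsets, one per prediction.
    mono:   array of shape (T,), the full mono input.
    Returns an array of shape (2, T); samples no window covers stay zero.
    """
    total = mono.shape[0]
    acc = np.zeros((2, total))
    weight = np.zeros(total)
    for pred, start in zip(preds, starts):
        end = start + pred.shape[1]
        conf = mono_consistency_confidence(pred, mono[start:end], temperature)
        acc[:, start:end] += conf * pred
        weight[start:end] += conf
    # Avoid dividing by zero where no window contributed.
    safe_weight = np.where(weight > 0, weight, 1.0)
    return acc / safe_weight


# Toy usage: two overlapping windows, the second one slightly corrupted.
t = np.arange(12)
mono = np.sin(0.3 * t)
truth = np.stack([1.2 * mono, 0.8 * mono])  # (L + R) / 2 equals mono
good = truth[:, 0:8].copy()
bad = truth[:, 4:12] + 0.1  # constant offset breaks mono consistency
fused = fuse_windows([good, bad], [0, 4], mono)
```

In the overlap region the corrupted window receives a lower weight than the consistent one, so the fused signal sits closer to the consistent prediction than a plain average would. A fuller version would presumably add a phase-based confidence term and aggregate across visual crops as well as across time.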