CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation
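This work improves binaural audio generation for immersive media applications, but the contribution is incremental: it adds targeted enhancements (a conditional normalisation layer, a contrastive objective, and test-time augmentation) to existing methods.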
The paper tackles the problem of binaural audio generation from monaural audio using visual prompts, addressing overfitting to room environments and loss of spatial details, and achieves state-of-the-art accuracy on FAIR-Play and MUSIC-Stereo benchmarks.
Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and lose fine-grained spatial details. In this paper, we propose a new audio-visual binaural generation model incorporating an audio-visual conditional normalisation layer that dynamically aligns the mean and variance of the target difference audio features using visual context, along with a new contrastive learning method to enhance spatial sensitivity by mining negative samples from shuffled visual features. We also introduce a cost-efficient way to utilise test-time augmentation in video data to enhance performance. Our approach achieves state-of-the-art generation accuracy on the FAIR-Play and MUSIC-Stereo benchmarks.
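To make the conditional normalisation idea concrete, here is a minimal sketch of a layer that standardises each audio feature channel and then re-scales and shifts it with values derived from a visual vector. It is not the paper's implementation: the function names, the linear visual-to-affine map, the weights, and the list-of-channels layout are all assumptions made for illustration, and the real model presumably uses learned networks on tensors.

```python
import math
from statistics import fmean


def visual_to_affine(visual_feat, w_gamma, w_beta):
    """Hypothetical linear map from a visual vector to per-channel (gamma, beta).

    gamma is offset by 1.0 so that zero weights leave the scale unchanged.
    """
    gamma = [1.0 + sum(w * v for w, v in zip(row, visual_feat)) for row in w_gamma]
    beta = [sum(w * v for w, v in zip(row, visual_feat)) for row in w_beta]
    return gamma, beta


def conditional_normalise(audio_feat, gamma, beta, eps=1e-5):
    """Standardise each channel, then apply the visually derived scale and shift.

    audio_feat: list of channels, each a list of floats (assumed layout).
    gamma, beta: one scale and one shift per channel.
    """
    out = []
    for channel, g, b in zip(audio_feat, gamma, beta):
        mu = fmean(channel)
        var = fmean([(x - mu) ** 2 for x in channel])
        std = math.sqrt(var + eps)
        out.append([g * (x - mu) / std + b for x in channel])
    return out


# Toy inputs: two audio feature channels and a two-dimensional visual vector.
audio_feat = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 20.0, 20.0]]
visual_feat = [0.5, -1.0]
w_gamma = [[2.0, 0.0], [0.0, -1.0]]  # illustrative weights, not learned
w_beta = [[1.0, 0.0], [0.0, 1.0]]    # illustrative weights, not learned

gamma, beta = visual_to_affine(visual_feat, w_gamma, w_beta)
normalised = conditional_normalise(audio_feat, gamma, beta)
```

After the call, each output channel has a mean close to its beta and a standard deviation close to the magnitude of its gamma, which is the sense in which the visual context sets the mean and variance of the audio features.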