CV | GR | SD | AS | Feb 4, 2023

AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis

arXiv:2302.02088v3 | 69 citations | h-index: 39
Originality: Incremental advance
AI Analysis

This work addresses the challenge of creating immersive audio-visual experiences for applications such as VR/AR. It is an incremental advance, since it builds on existing NeRF techniques.

The paper tackles the problem of synthesizing realistic audio-visual scenes from video recordings. It proposes AV-NeRF, a NeRF-based method that generates new videos with spatial audio along arbitrary camera trajectories, and reports improved performance on both real-world and simulated datasets.

Can machines recording an audio-visual scene produce realistic, matching audio-visual experiences at novel positions and novel view directions? We answer it by studying a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning. Concretely, given a video recording of an audio-visual scene, the task is to synthesize new videos with spatial audios along arbitrary novel camera trajectories in that scene. We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF, in which we implicitly associate audio generation with the 3D geometry and material properties of a visual environment. Furthermore, we present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields. To facilitate the study of this new task, we collect a high-quality Real-World Audio-Visual Scene (RWAVS) dataset. We demonstrate the advantages of our method on this real-world dataset and the simulation-based SoundSpaces dataset.
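The coordinate transformation idea in the abstract (expressing a view direction relative to the sound source) can be illustrated with a small geometric sketch. This is not the authors' implementation: the function name `relative_source_pose`, the 2D top-down simplification, and the choice of output (distance plus signed angle) are all assumptions made for illustration. The point it shows is that a source-relative description stays the same when the whole scene is rotated or translated, which is what lets a model learn a field centred on the source rather than on world coordinates.

```python
import math


def relative_source_pose(cam_xy, view_dir_xy, source_xy):
    """Hypothetical helper: describe the camera pose relative to a sound source.

    Returns (distance, angle), where angle is the signed angle in radians,
    wrapped to [-pi, pi], from the camera's view direction to the direction
    pointing from the camera toward the source. 2D only, for illustration.
    """
    dx = source_xy[0] - cam_xy[0]
    dy = source_xy[1] - cam_xy[1]
    distance = math.hypot(dx, dy)

    # Absolute bearings in world coordinates.
    bearing_to_source = math.atan2(dy, dx)
    bearing_of_view = math.atan2(view_dir_xy[1], view_dir_xy[0])

    # Difference of bearings, wrapped so that small turns stay small.
    diff = bearing_to_source - bearing_of_view
    angle = math.atan2(math.sin(diff), math.cos(diff))
    return distance, angle


def rotate(p, theta):
    """Rotate a 2D point or vector about the origin by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])
```

For example, a camera at the origin looking along +x with a source at (0, 2) gives a distance of 2 and an angle of +pi/2 (the source is to the left). Rotating and translating the camera, view direction, and source together leaves both values unchanged.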

Code Implementations: 1 repo
