SD · CV · LG · RO · AS · Jun 1, 2022

Towards Generalisable Audio Representations for Audio-Visual Navigation

arXiv:2206.00393v1 · 1 citations · h-index: 51
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of generalizing to new audio signals in navigation tasks for AI agents; the contribution is incremental because it builds on existing frameworks.

The paper tackles the problem of improving model generalization to unheard sounds in audio-visual navigation by proposing a contrastive learning-based method with data augmentation, resulting in SPL gains of 13.4% on Replica and 12.2% on MP3D.
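For readers unfamiliar with the metric, SPL (Success weighted by Path Length) is commonly defined as the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the success flag, l_i the shortest-path length, and p_i the length of the path the agent actually took. The sketch below computes it in plain Python. The function name `spl` and the tuple layout are illustrative assumptions, not taken from the paper or its code.

```python
def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    Each episode is a tuple (success, shortest_len, agent_len):
      success      -- True if the agent reached the goal
      shortest_len -- length of the shortest path to the goal
      agent_len    -- length of the path the agent actually took
    The tuple layout is a hypothetical convention for this sketch.
    """
    total = 0.0
    count = 0
    for success, shortest_len, agent_len in episodes:
        count += 1
        denom = max(agent_len, shortest_len)
        # Failed episodes contribute 0; successful ones are scaled by
        # how close the agent's path was to the shortest path.
        if success and denom > 0:
            total += shortest_len / denom
    return total / count if count else 0.0


# One optimal success, one success with a path twice as long, one failure.
score = spl([(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)])
```

A failed episode scores zero regardless of path length, so SPL rewards agents that both reach the goal and do so efficiently.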

In audio-visual navigation (AVN), an intelligent agent must navigate to a continuously sounding object in a complex 3D environment using its audio and visual perceptions. While existing methods attempt to improve navigation performance with precisely designed path planning or intricate task settings, none has improved model generalisation to unheard sounds with the task settings unchanged. We therefore propose a contrastive learning-based method that tackles this challenge by regularising the audio encoder, so that sound-agnostic, goal-driven latent representations can be learnt from audio signals of different classes. In addition, we consider two data augmentation strategies to enrich the training sounds. We demonstrate that our designs can be easily integrated into existing AVN frameworks to obtain an immediate performance gain (13.4%$\uparrow$ in SPL on Replica and 12.2%$\uparrow$ in SPL on MP3D). Our project is available at https://AV-GeN.github.io/.
