SDAIAS · Jan 29, 2024

Enhanced Sound Event Localization and Detection in Real 360-degree audio-visual soundscapes

arXiv:2401.17129v1 · 10 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses audio-visual scene understanding for applications such as robotics and surveillance, but it is an incremental advance because it builds directly on prior models.

The researchers tackled sound event localization and detection in 360-degree audio-visual environments by adapting an audio-only model to incorporate video data, achieving performance improvements over the existing audio-visual baseline.

This technical report details our work towards building an enhanced audio-visual sound event localization and detection (SELD) network. We build on top of the audio-only SELDnet23 model and adapt it to be audio-visual by merging both audio and video information prior to the gated recurrent unit (GRU) of the audio-only network. Our model leverages YOLO and DETIC object detectors. We also build a framework that implements audio-visual data augmentation and audio-visual synthetic data generation. We deliver an audio-visual SELDnet system that outperforms the existing audio-visual SELD baseline.
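The fusion step described above, merging audio and video information before the recurrent layer, can be illustrated with a minimal sketch. The abstract does not say how the two streams are combined, so this example assumes simple per-frame concatenation, and it assumes the video features (for instance, encodings derived from object-detector outputs) arrive at a lower frame rate than the audio features. The names `align_video_to_audio` and `fuse` are hypothetical and are not taken from the authors' code.

```python
from typing import List, Sequence

Frame = List[float]


def align_video_to_audio(video: Sequence[Frame], n_audio_frames: int) -> List[Frame]:
    """Return one video feature vector per audio frame.

    Assumption: each audio frame is paired with the video frame that covers
    the same relative position in the clip (video frames are repeated when
    the video rate is lower than the audio feature rate).
    """
    if not video:
        raise ValueError("video sequence is empty")
    n_video = len(video)
    return [
        list(video[min(t * n_video // n_audio_frames, n_video - 1)])
        for t in range(n_audio_frames)
    ]


def fuse(audio: Sequence[Frame], video: Sequence[Frame]) -> List[Frame]:
    """Concatenate audio and video features at every time step.

    The fused sequence is what a recurrent layer (a GRU in the report)
    would consume. Concatenation is an assumption of this sketch.
    """
    aligned = align_video_to_audio(video, len(audio))
    return [list(a) + v for a, v in zip(audio, aligned)]


# Toy input: 4 audio frames of 3 features, 2 video frames of 2 features.
audio_feats = [[0.1, 0.2, 0.3] for _ in range(4)]
video_feats = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse(audio_feats, video_feats)
```

With this toy input, `fused` holds four frames of five features each: the first two frames carry the first video vector and the last two carry the second. In a real network the same idea would be applied to learned feature maps rather than hand-written lists.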

Code Implementations: 1 repo