CV · SD · AS · Oct 18, 2023

Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation

arXiv:2310.11713v1 · 5 citations · h-index: 30
Originality: Incremental advance
AI Analysis

This paper addresses a limitation of audio-visual sound separation that matters for applications such as video editing and surveillance: it extends separation to sounds that originate outside the camera's view. The advance is incremental, since it builds on existing audio-visual methods.

The paper tackles the separation of invisible sounds in audio-visual scenes, where current methods fail because no visible cue is available. It introduces the AVSA-Sep framework, which separates both visible and invisible sounds through semantic parsing and scene-informed separation.

Audio-visual sound separation typically assumes that the sound sources are visible in the video, which excludes invisible sounds that originate beyond the camera's view. Current methods struggle with such sounds because they lack visible cues. This paper introduces a novel "Audio-Visual Scene-Aware Separation" (AVSA-Sep) framework. It comprises a semantic parser for visible and invisible sounds and a separator for scene-informed separation. AVSA-Sep successfully separates both sound types, and joint training and cross-modal alignment enhance its effectiveness.

