Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation
This work addresses a limitation of audio-visual sound separation, relevant to applications such as video editing and surveillance, by extending separation to sounds whose sources lie beyond the camera's view. The contribution is incremental, since it builds on existing audio-visual methods.
The paper tackles the problem of separating invisible sounds in audio-visual scenes, where current methods fail because no visual cue is available, and introduces the AVSA-Sep framework, which separates both visible and invisible sounds through semantic parsing and scene-informed separation.
The audio-visual sound separation field assumes that sound sources are visible in the video, which excludes invisible sounds whose sources lie beyond the camera's view. Current methods struggle with such sounds because they lack visual cues. This paper introduces a novel "Audio-Visual Scene-Aware Separation" (AVSA-Sep) framework. It comprises a semantic parser for visible and invisible sounds and a separator that performs scene-informed separation. AVSA-Sep successfully separates both sound types, and joint training with cross-modal alignment further improves its effectiveness.
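The two-stage idea described above (a parser that produces semantics for visible and invisible sounds, followed by a separator conditioned on them) can be illustrated with a toy NumPy sketch. This is not the AVSA-Sep implementation: the function names `parse_semantics` and `separate`, the random projection matrices, the orthogonal-residual trick for the invisible query, and the per-frame softmax masks are all assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FREQ, N_TIME, D = 8, 16, 4  # toy spectrogram size and embedding width (assumed)


def parse_semantics(mix_spec, visual_emb, w_audio):
    """Hypothetical parser: one query for the visible source (from the visual
    embedding) and one for invisible sound (from the audio alone)."""
    audio_emb = np.tanh(w_audio @ mix_spec.mean(axis=1))  # shape (D,)
    visible_q = visual_emb / (np.linalg.norm(visual_emb) + 1e-8)
    # Assumption: the invisible query is the part of the audio embedding
    # that the visible query does not explain.
    residual = audio_emb - (audio_emb @ visible_q) * visible_q
    invisible_q = residual / (np.linalg.norm(residual) + 1e-8)
    return {"visible": visible_q, "invisible": invisible_q}


def separate(mix_spec, queries, w_sep):
    """Hypothetical separator: one softmax-normalized time mask per query."""
    frame_feat = w_sep @ mix_spec  # shape (D, N_TIME)
    names = list(queries)
    logits = np.stack([queries[n] @ frame_feat for n in names])  # (S, N_TIME)
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    masks = np.exp(logits)
    masks = masks / masks.sum(axis=0, keepdims=True)  # masks sum to 1 per frame
    return {n: mix_spec * masks[i][None, :] for i, n in enumerate(names)}


# Random stand-ins for a mixture magnitude spectrogram and learned weights.
mix = np.abs(rng.standard_normal((N_FREQ, N_TIME)))
visual = rng.standard_normal(D)
w_audio = rng.standard_normal((D, N_FREQ))
w_sep = rng.standard_normal((D, N_FREQ))

queries = parse_semantics(mix, visual, w_audio)
parts = separate(mix, queries, w_sep)
```

Because the masks are normalized across sources, the separated parts add back up to the mixture. A trained system would replace the random projections with learned networks, and the joint training and cross-modal alignment mentioned above are not modeled here.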