CV · SD · AS · Jul 31, 2023

High-Quality Visually-Guided Sound Separation from Diverse Categories

arXiv:2308.00122v2 · 15 citations · h-index: 39
Originality: Incremental advance
AI Analysis

This paper addresses high-quality sound separation across diverse sound categories, with applications in audio processing. The advance is incremental because it builds on existing generative methods.

The paper tackles audio-visual sound source separation by proposing DAVIS, a diffusion-based generative framework that synthesizes separated sounds directly from Gaussian noise, conditioned on the audio mixture and the visual information. The authors report that it outperforms state-of-the-art discriminative methods on the AVE and MUSIC datasets.

We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression problem, achieving significant progress. However, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated sounds directly from Gaussian noise, conditioned on both the audio mixture and the visual information. With its generative objective, DAVIS is better suited to achieving the goal of high-quality sound separation across diverse sound categories. We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets, and results show that DAVIS outperforms other methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.
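
The generative idea in the abstract (start from Gaussian noise and iteratively denoise, conditioning each step on the mixture and the visual information) can be illustrated with a minimal sketch of standard DDPM ancestral sampling. This is not the authors' implementation: the linear noise schedule, the step count, the array shapes, and every name below (`make_schedule`, `make_toy_predictor`, `sample_separated`) are assumptions made for illustration, and the toy predictor is only a hypothetical stand-in for a trained Separation U-Net.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def make_toy_predictor(alpha_bars):
    """Hypothetical stand-in for a trained, conditioned noise-prediction network.

    It pretends the clean source is the mixture scaled by a scalar "visual gate"
    and returns the noise that would explain x_t under that guess.
    """
    def predict_noise(x_t, t, mixture, visual_feat):
        gate = 1.0 / (1.0 + np.exp(-visual_feat.mean()))  # scalar in (0, 1)
        x0_guess = gate * mixture
        return (x_t - np.sqrt(alpha_bars[t]) * x0_guess) / np.sqrt(1.0 - alpha_bars[t])
    return predict_noise


def sample_separated(mixture, visual_feat, predict_noise, schedule, seed=0):
    """Run the reverse diffusion process, conditioned on mixture and visual features."""
    betas, alphas, alpha_bars = schedule
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(mixture.shape)  # start from pure Gaussian noise
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = predict_noise(x, t, mixture, visual_feat)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # Fresh noise is added at every step except the last one.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x


schedule = make_schedule()
# A 2-D array standing in for a time-frequency representation of the mixture (assumed shape).
mixture = np.abs(np.random.default_rng(1).standard_normal((64, 32)))
visual_feat = np.ones(128)  # placeholder visual embedding
separated = sample_separated(mixture, visual_feat, make_toy_predictor(schedule[2]), schedule)
```

In contrast to a mask-based regression model, which would map the mixture to an output in one forward pass, the sampler above calls the conditioned predictor once per diffusion step. Because the toy predictor always points at the same guess, the final step lands on that guess; a trained network would instead refine its estimate as the noise level drops.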

