AS LG Sep 17, 2025

Diffusion-Based Unsupervised Audio-Visual Speech Separation in Noisy Environments with Noise Prior

arXiv:2509.14379v1, 2 citations, h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses single-microphone speech separation in noisy settings. Its contribution is an incremental improvement obtained by modelling the noise directly rather than treating it as an unstructured residual.

The paper tackles single-microphone speech separation in noisy environments by proposing a generative unsupervised technique that models clean speech and structured noise components, achieving promising performance in challenging acoustic conditions.

In this paper, we address the problem of single-microphone speech separation in the presence of ambient noise. We propose a generative unsupervised technique that directly models both clean speech and structured noise components, training exclusively on these individual signals rather than noisy mixtures. Our approach leverages an audio-visual score model that incorporates visual cues to serve as a strong generative speech prior. By explicitly modelling the noise distribution alongside the speech distribution, we enable effective decomposition through the inverse problem paradigm. We perform speech separation by sampling from the posterior distributions via a reverse diffusion process, which directly estimates and removes the modelled noise component to recover clean constituent signals. Experimental results demonstrate promising performance, highlighting the effectiveness of our direct noise modelling approach in challenging acoustic environments.
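
To make the posterior-sampling idea in the abstract concrete, here is a minimal sketch under strong simplifying assumptions; it is not the authors' implementation. It assumes a toy mixture y = s + n with one "speech" component and one "noise" component, and it uses closed-form Gaussian scores as stand-ins for the learned audio-visual speech score model and the noise score model. The sampler is annealed Langevin dynamics, one common way to realise a reverse diffusion process. The names `gaussian_score` and `separate`, the noise schedule, the step-size rule, and the annealed likelihood variance are all illustrative choices that do not come from the paper.

```python
import numpy as np


def gaussian_score(mean, var):
    """Hypothetical stand-in for a learned score model.

    Returns the score (gradient of the log density) of a Gaussian prior
    N(mean, var) after it has been perturbed by noise of std `sigma`.
    """
    def score(x, sigma):
        return -(x - mean) / (var + sigma ** 2)
    return score


def separate(y, speech_score, noise_score, var_y=1e-2,
             n_levels=50, steps=30, seed=0):
    """Jointly sample (s, n) given y = s + n with annealed Langevin dynamics.

    Assumption: the likelihood is Gaussian with variance var_y, inflated by
    2 * sigma**2 at each noise level to account for the two perturbed terms.
    """
    rng = np.random.default_rng(seed)
    s = rng.standard_normal(y.shape)
    n = rng.standard_normal(y.shape)
    for sigma in np.geomspace(1.0, 0.01, n_levels):
        var_lik = var_y + 2.0 * sigma ** 2
        eps = 0.25 * var_lik  # step size tied to the stiffest (likelihood) term
        for _ in range(steps):
            # Likelihood score; identical for both components because y = s + n.
            resid = (y - s - n) / var_lik
            grad_s = speech_score(s, sigma) + resid  # prior score + data term
            grad_n = noise_score(n, sigma) + resid
            s = s + eps * grad_s + np.sqrt(2 * eps) * rng.standard_normal(y.shape)
            n = n + eps * grad_n + np.sqrt(2 * eps) * rng.standard_normal(y.shape)
    return s, n


# Toy usage: 4000 independent copies of the same scalar observation y = 2.
y = np.full(4000, 2.0)
s_hat, n_hat = separate(y, gaussian_score(1.0, 1.0), gaussian_score(-0.5, 0.25))
print(s_hat.mean(), n_hat.mean())
```

In this Gaussian toy case the posterior is available in closed form, so the sample means can be checked against the exact posterior means, 1 + 1.5/1.26 for the speech component and -0.5 + 0.375/1.26 for the noise component. With learned score networks no such closed form exists, and the speech prior would additionally be conditioned on the visual stream.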
