SD · May 23

Diffusion-based Frameworks for Unsupervised Speech Enhancement

Jean-Eudes Ayilo, Mostafa Sadeghi, Romain Serizel, Xavier Alameda-Pineda

arXiv:2601.09931 · 48.52 citations · h-index: 33

Predicted impact: top 59% in SD · last 90 days · Originality: Incremental advance

AI Analysis

For researchers in speech enhancement, this work provides an unsupervised approach that outperforms existing unsupervised methods and degrades less than several supervised baselines under mismatched conditions.

This paper improves unsupervised speech enhancement by jointly modeling speech and noise as latent variables in a diffusion-based framework, achieving the best quality and intelligibility among unsupervised methods under matched conditions on the WSJ0-QUT and VoiceBank-DEMAND datasets.

This paper addresses unsupervised diffusion-based single-channel speech enhancement (SE). Prior work in this direction combines a score-based diffusion model trained on clean speech with a Gaussian noise model whose covariance is structured by non-negative matrix factorization (NMF). This combination is used within an iterative expectation-maximization (EM) scheme, in which a diffusion-based posterior-sampling E-step estimates the clean speech. We first revisit this framework and propose to explicitly model both speech and acoustic noise as latent variables, jointly sampling them in the E-step instead of sampling speech alone as in previous approaches. We then introduce a new unsupervised SE framework that replaces the NMF noise prior with a diffusion-based noise model, learned jointly with the speech prior in a single conditional score model. Within this framework, we derive two variants: one that implicitly accounts for noise and one that explicitly treats noise as a latent variable. Experiments on WSJ0-QUT and VoiceBank-DEMAND show that explicit noise modeling systematically improves SE performance for both NMF-based and diffusion-based noise priors. Under matched conditions, the diffusion-based noise model attains the best overall quality and intelligibility among unsupervised methods, while under mismatched conditions the proposed NMF-based explicit-noise framework is more robust and suffers less degradation than several supervised baselines. Code, demo, and supplementary materials are publicly available.
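
The NMF-structured Gaussian noise model mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how the NMF parameters are fitted, so the code assumes the standard multiplicative updates for the Itakura-Saito divergence, which correspond to maximum-likelihood fitting of a zero-mean complex Gaussian with variance (WH) per time-frequency bin. The diffusion-based E-step is omitted, and a synthetic noise power spectrogram stands in for the noise estimate it would produce. The function names, matrix sizes, and data are hypothetical.

```python
import numpy as np

def is_divergence(V, V_hat):
    """Itakura-Saito divergence summed over all time-frequency bins."""
    R = V / V_hat
    return float(np.sum(R - np.log(R) - 1.0))

def nmf_noise_update(V, W, H, n_iter=50, eps=1e-10):
    """Fit V ~ W @ H with multiplicative updates under the IS divergence.

    V: (F, T) power spectrogram of the current noise estimate.
    W: (F, K) non-negative spectral templates.
    H: (K, T) non-negative activations.
    """
    for _ in range(n_iter):
        V_hat = W @ H + eps
        W = W * ((V / V_hat**2) @ H.T) / ((1.0 / V_hat) @ H.T + eps)
        V_hat = W @ H + eps
        H = H * (W.T @ (V / V_hat**2)) / (W.T @ (1.0 / V_hat) + eps)
    return W, H

rng = np.random.default_rng(0)
F, T, K = 64, 40, 4  # hypothetical sizes: frequency bins, frames, NMF rank

# Synthetic stand-in for the noise estimate: complex Gaussian noise whose
# per-bin variance follows a rank-K NMF model (assumption for illustration).
W_true = rng.random((F, K)) + 0.1
H_true = rng.random((K, T)) + 0.1
noise = np.sqrt(W_true @ H_true / 2.0) * (
    rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
)
V = np.abs(noise) ** 2 + 1e-8

# Random non-negative initialization, then the M-step-style update.
W0 = rng.random((F, K)) + 0.1
H0 = rng.random((K, T)) + 0.1
d_init = is_divergence(V, W0 @ H0)
W, H = nmf_noise_update(V, W0, H0)
d_final = is_divergence(V, W @ H)
print(d_init, d_final)
```

In an EM loop of the kind the abstract describes, an update like this would alternate with the posterior-sampling E-step, with `V` recomputed from the latest noise estimate at each iteration.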
