ASSDOct 22, 2020

Perceptual Loss based Speech Denoising with an ensemble of Audio Pattern Recognition and Self-Supervised Models

arXiv:2010.11860v140 citations
Originality Incremental advance
AI Analysis

This work addresses speech denoising for applications requiring high perceptual quality, but it is incremental as it builds on existing perceptual loss ideas and focuses on optimizing specific models.

The paper tackles the problem of improving perceptual quality in speech denoising by introducing a perceptual loss framework (PERL) that uses pre-trained models to discourage distortion, and finds that using an acoustic event model (PERL-AE) outperforms state-of-the-art methods on perceptual metrics.

Deep learning based speech denoising still suffers from the challenge of improving perceptual quality of enhanced signals. We introduce a generalized framework called Perceptual Ensemble Regularization Loss (PERL) built on the idea of perceptual losses. Perceptual loss discourages distortion to certain speech properties and we analyze it using six large-scale pre-trained models: speaker classification, acoustic model, speaker embedding, emotion classification, and two self-supervised speech encoders (PASE+, wav2vec 2.0). We first build a strong baseline (w/o PERL) using Conformer Transformer Networks on the popular enhancement benchmark called VCTK-DEMAND. Using auxiliary models one at a time, we find acoustic event and self-supervised model PASE+ to be most effective. Our best model (PERL-AE) only uses acoustic event model (utilizing AudioSet) to outperform state-of-the-art methods on major perceptual metrics. To explore if denoising can leverage full framework, we use all networks but find that our seven-loss formulation suffers from the challenges of Multi-Task Learning. Finally, we report a critical observation that state-of-the-art Multi-Task weight learning methods cannot outperform hand tuning, perhaps due to challenges of domain mismatch and weak complementarity of losses.

Code Implementations2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes