Perceptual Loss based Speech Denoising with an ensemble of Audio Pattern Recognition and Self-Supervised Models
This work addresses speech denoising for applications requiring high perceptual quality, but it is incremental as it builds on existing perceptual loss ideas and focuses on optimizing specific models.
The paper tackles the problem of improving perceptual quality in speech denoising by introducing a perceptual loss framework (PERL) that uses pre-trained models to discourage distortion, and finds that using an acoustic event model (PERL-AE) outperforms state-of-the-art methods on perceptual metrics.
Deep learning based speech denoising still suffers from the challenge of improving perceptual quality of enhanced signals. We introduce a generalized framework called Perceptual Ensemble Regularization Loss (PERL) built on the idea of perceptual losses. Perceptual loss discourages distortion to certain speech properties and we analyze it using six large-scale pre-trained models: speaker classification, acoustic model, speaker embedding, emotion classification, and two self-supervised speech encoders (PASE+, wav2vec 2.0). We first build a strong baseline (w/o PERL) using Conformer Transformer Networks on the popular enhancement benchmark called VCTK-DEMAND. Using auxiliary models one at a time, we find acoustic event and self-supervised model PASE+ to be most effective. Our best model (PERL-AE) only uses acoustic event model (utilizing AudioSet) to outperform state-of-the-art methods on major perceptual metrics. To explore if denoising can leverage full framework, we use all networks but find that our seven-loss formulation suffers from the challenges of Multi-Task Learning. Finally, we report a critical observation that state-of-the-art Multi-Task weight learning methods cannot outperform hand tuning, perhaps due to challenges of domain mismatch and weak complementarity of losses.