SD · AS · Feb 19, 2021

Speech enhancement with weakly labelled data from AudioSet

arXiv:2102.09971v1 · 19 citations
Originality: Incremental advance
AI Analysis

This paper addresses data scarcity in speech enhancement, a problem for both researchers and practitioners, by training on weakly labelled datasets. The advance is incremental, since it builds on existing neural network and source separation techniques.

The paper proposes a speech enhancement framework trained on weakly labelled data from AudioSet, which removes the need for paired noisy and clean speech. It achieves a PESQ of 2.28 and an SSNR of 8.75 dB on the VoiceBank-DEMAND dataset, outperforming the earlier SEGAN system (2.16 and 7.73 dB respectively).
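To make the SSNR figure concrete, here is a minimal sketch of a segmental signal-to-noise ratio: the per-frame SNR in dB between the clean reference and the enhanced output, averaged over frames. The frame length of 256 samples and the clamping range of -10 to 35 dB are common conventions assumed here for illustration. They are not taken from the paper, whose exact evaluation settings may differ. The function name `segmental_snr` is hypothetical.

```python
import math


def segmental_snr(clean, enhanced, frame_len=256, min_db=-10.0, max_db=35.0, eps=1e-10):
    """Average per-frame SNR in dB between a clean reference and an enhanced signal.

    frame_len, min_db and max_db are assumed conventions, not values from the paper.
    """
    n = min(len(clean), len(enhanced))
    frame_values = []
    # Non-overlapping frames; a trailing partial frame is ignored.
    for start in range(0, n - frame_len + 1, frame_len):
        signal_energy = 0.0
        noise_energy = 0.0
        for i in range(start, start + frame_len):
            signal_energy += clean[i] ** 2
            noise_energy += (clean[i] - enhanced[i]) ** 2
        snr_db = 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))
        # Clamp so that silent or perfectly reconstructed frames do not dominate the mean.
        frame_values.append(min(max(snr_db, min_db), max_db))
    if not frame_values:
        raise ValueError("signals are shorter than one frame")
    return sum(frame_values) / len(frame_values)


if __name__ == "__main__":
    reference = [math.sin(0.1 * i) for i in range(1024)]
    attenuated = [0.9 * x for x in reference]
    print(round(segmental_snr(reference, attenuated), 2))
```

A residual of 10% of the reference amplitude corresponds to an energy ratio of 100, so each frame in the usage example sits near 20 dB. PESQ, by contrast, is a standardised perceptual model and is normally computed with a dedicated implementation rather than a few lines of code.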

Speech enhancement is the task of improving the intelligibility and perceptual quality of degraded speech signals. Recently, neural network based methods have been applied to speech enhancement. However, many of these methods require pairs of noisy and clean speech for training. We propose a speech enhancement framework that can be trained with the large-scale weakly labelled AudioSet dataset. Weakly labelled data contain only the audio tags of audio clips, not the onset or offset times of speech. We first apply pretrained audio neural networks (PANNs) to detect anchor segments that contain speech or sound events in audio clips. Then, we randomly mix two detected anchor segments, one containing speech and one containing sound events, into a mixture, and build a conditional source separation network that uses PANNs predictions as soft conditions for speech enhancement. At inference, we input a noisy speech signal together with the one-hot encoding of "Speech" as the condition, and the trained system predicts the enhanced speech. Our system achieves a PESQ of 2.28 and an SSNR of 8.75 dB on the VoiceBank-DEMAND dataset, outperforming the previous SEGAN system, which scores 2.16 and 7.73 dB respectively.
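The data pipeline described in the abstract can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the authors' implementation: the framewise probabilities stand in for the output of a sound event detector such as PANNs, the sliding-window mean used to pick the anchor segment is an assumed selection rule, and all function names (`find_anchor_start`, `mix_segments`, `one_hot`) are hypothetical.

```python
def find_anchor_start(frame_probs, segment_frames):
    """Return the start frame of the window with the highest mean class probability.

    frame_probs: framewise probabilities for one class (e.g. "Speech"),
    assumed to come from a pretrained detector. The sliding-window rule
    is an assumption for illustration.
    """
    if segment_frames <= 0 or segment_frames > len(frame_probs):
        raise ValueError("segment_frames must be between 1 and len(frame_probs)")
    window_sum = sum(frame_probs[:segment_frames])
    best_sum, best_start = window_sum, 0
    for start in range(1, len(frame_probs) - segment_frames + 1):
        # Slide the window by one frame: add the new frame, drop the old one.
        window_sum += frame_probs[start + segment_frames - 1] - frame_probs[start - 1]
        if window_sum > best_sum:
            best_sum, best_start = window_sum, start
    return best_start


def mix_segments(segment_a, segment_b):
    """Add two equal-length anchor segments sample by sample to form a training mixture."""
    if len(segment_a) != len(segment_b):
        raise ValueError("anchor segments must have the same length")
    return [a + b for a, b in zip(segment_a, segment_b)]


def one_hot(label, class_names):
    """Hard condition vector used at inference, e.g. one_hot("Speech", classes)."""
    return [1.0 if name == label else 0.0 for name in class_names]


if __name__ == "__main__":
    speech_probs = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1]
    print(find_anchor_start(speech_probs, segment_frames=3))  # window starting at frame 2
    print(mix_segments([0.5, -0.25], [0.25, 0.25]))
    print(one_hot("Speech", ["Speech", "Music", "Dog"]))
```

During training, the separation network would receive the mixture together with the detector's soft predictions for one of the two segments as the condition, and that segment as the target. At inference the soft condition is replaced by the one-hot vector for "Speech", as the abstract states.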
