AS, Apr 7, 2020
SNR-Based Features and Diverse Training Data for Robust DNN-Based Speech Enhancement
Robert Rehr, Timo Gerkmann
In this paper, we address the generalization of deep neural network (DNN) based speech enhancement to unseen noise conditions when training data is limited in size and diversity. To gain more insight, we analyze the generalization with respect to (1) the size and diversity of the training data, (2) different network architectures, and (3) the chosen features. To address (1), we train networks on the Hu noise corpus (limited size) and the CHiME 3 noise corpus (limited diversity), and also propose a large and diverse dataset collected from freely available sounds. To address (2), we compare a fully-connected feed-forward and a long short-term memory (LSTM) architecture. To address (3), we compare three input features, namely logarithmized noisy periodograms, noise aware training (NAT), and the proposed signal-to-noise ratio (SNR) based noise aware training (SNR-NAT). We confirm that rich training data and improved network architectures help DNNs to generalize. Furthermore, we show via experimental results and an analysis using t-distributed stochastic neighbor embedding (t-SNE) that the proposed SNR-NAT features yield robust and level-independent results in unseen noise, even with simple network architectures and when trained on only small datasets. This is the key contribution of this paper.
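The level independence mentioned in the abstract can be illustrated with a small sketch. It is not the authors' feature pipeline: the function names, the toy per-bin values, and the use of a plain log power ratio as the SNR-type feature are assumptions made for illustration. The sketch also assumes that the noise power estimate scales together with the input signal, as it would for an estimator that tracks the noise from the noisy observation itself.

```python
import math

def log_periodogram(frame_psd):
    """Logarithmized noisy periodogram of one frame (one value per frequency bin)."""
    return [math.log(p) for p in frame_psd]

def nat_features(frame_psd, noise_psd):
    """NAT-style input: log periodogram with a log noise PSD estimate appended."""
    return log_periodogram(frame_psd) + [math.log(n) for n in noise_psd]

def snr_type_features(frame_psd, noise_psd):
    """SNR-type input (assumed form): log ratio of noisy power to noise power."""
    return [math.log(p / n) for p, n in zip(frame_psd, noise_psd)]

# Hypothetical values for three frequency bins of one frame.
frame_psd = [4.0, 0.5, 2.0]
noise_psd = [1.0, 0.25, 2.0]

# Play the same recording back 20 dB louder: every power value scales by 100.
gain = 100.0
loud_frame = [gain * p for p in frame_psd]
loud_noise = [gain * n for n in noise_psd]
```

In this toy setup, `log_periodogram(loud_frame)` differs from `log_periodogram(frame_psd)` by `math.log(gain)` in every bin, and the same shift appears in both halves of the NAT vector, while `snr_type_features` returns the same values at both levels because the gain cancels in the ratio.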
SD, Sep 7, 2017
Normalized Features for Improving the Generalization of DNN Based Speech Enhancement
Robert Rehr, Timo Gerkmann
Enhancing noisy speech is an important task to restore its quality and to improve its intelligibility. In traditional approaches that are not based on machine learning (ML), the parameters required for noise reduction are estimated blindly from the noisy observation, while the actual filter functions are derived analytically based on statistical assumptions. Even though such approaches generalize well to many different acoustic conditions, their noise suppression capability in transient noises is low. To amend this shortcoming, ML methods such as deep learning have been employed for speech enhancement. However, due to their data-driven nature, the generalization of ML based approaches to unknown noise types is still under debate. To improve the generalization of ML based algorithms and to enhance the noise suppression of non-ML based methods, we propose a combination of both approaches. For this, we employ estimates of the a priori signal-to-noise ratio (SNR) and the a posteriori SNR as input features in a deep neural network (DNN) based enhancement scheme. We show that this approach allows ML based speech estimators to generalize quickly to unknown noise types even if only a few noise conditions have been seen during training. Further, the proposed features outperform a competing approach where an estimate of the noise power spectral density is appended to the noisy spectra. Instrumental measures such as Perceptual Evaluation of Speech Quality (PESQ) and short-time objective intelligibility (STOI) indicate strong improvements in unseen conditions when the proposed features are used. Listening experiments confirm the improved generalization of our proposed combination.
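As a rough illustration of the two SNR quantities named in the abstract, the sketch below computes the a posteriori SNR and a decision-directed estimate of the a priori SNR for a single frequency bin over time. The abstract does not state which estimators the authors use, so the decision-directed rule, the Wiener gain used to form the previous clean-speech estimate, the smoothing constant `alpha`, the floor `xi_min`, and the function name are all assumptions. The noise power values are taken as given, for example from a separate noise tracker.

```python
def decision_directed_snr(noisy_psd, noise_psd, alpha=0.98, xi_min=10 ** (-25 / 10)):
    """Track a priori and a posteriori SNR for one frequency bin over time.

    noisy_psd: noisy periodogram values of the bin, one per frame (positive).
    noise_psd: noise power estimates for the same frames (positive).
    alpha, xi_min: assumed smoothing constant and lower bound (about -25 dB).
    """
    xi_track, gamma_track = [], []
    prev_speech_psd = 0.0  # clean-speech power estimate of the previous frame
    for p, n in zip(noisy_psd, noise_psd):
        gamma = p / n                        # a posteriori SNR
        ml_xi = max(gamma - 1.0, 0.0)        # instantaneous estimate of the a priori SNR
        xi = max(alpha * prev_speech_psd / n + (1.0 - alpha) * ml_xi, xi_min)
        gain = xi / (1.0 + xi)               # Wiener gain (assumed estimator)
        prev_speech_psd = gain ** 2 * p      # feeds the next frame's estimate
        xi_track.append(xi)
        gamma_track.append(gamma)
    return xi_track, gamma_track

# Hypothetical bin: noise only, then a speech onset in frames 2 and 3.
xi, gamma = decision_directed_snr([1.0, 1.2, 9.0, 12.0, 1.1], [1.0] * 5)
```

Both tracks are ratios of powers, so scaling the noisy and noise power values by a common factor leaves them unchanged. A feature vector for a network could then be built from these values, for instance after taking logarithms, although that step is also an assumption here.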
SD, Mar 15, 2017
On the Importance of Super-Gaussian Speech Priors for Machine-Learning Based Speech Enhancement
Robert Rehr, Timo Gerkmann
For enhancing noisy signals, machine-learning based single-channel speech enhancement schemes exploit prior knowledge about typical speech spectral structures. To ensure good generalization and to meet requirements in terms of computational complexity and memory consumption, certain methods restrict themselves to learning speech spectral envelopes. We refer to these approaches as machine-learning spectral envelope (MLSE)-based approaches. In this paper, we show by means of theoretical and experimental analyses that for MLSE-based approaches, super-Gaussian priors allow for a reduction of noise between speech spectral harmonics that is not achievable using Gaussian estimators such as the Wiener filter. For the evaluation, we use a deep neural network (DNN)-based phoneme classifier and a low-rank nonnegative matrix factorization (NMF) framework as examples of MLSE-based approaches. A listening experiment and instrumental measures confirm that while super-Gaussian priors yield only moderate improvements for classic enhancement schemes, for MLSE-based approaches super-Gaussian priors clearly make an important difference and significantly outperform Gaussian priors.
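One way to get an intuition for the difference between Gaussian and super-Gaussian priors is a scalar toy problem. It is not one of the estimators evaluated in the paper. The sketch assumes a real-valued observation y = s + n with Gaussian noise and compares the Wiener gain with a numerically integrated MMSE gain under a Laplacian speech prior. The function names, the Laplacian choice, and the integration grid are assumptions.

```python
import math

def wiener_gain(xi):
    """MMSE gain under a Gaussian speech prior; depends only on the a priori SNR xi."""
    return xi / (1.0 + xi)

def laplacian_mmse_gain(y, speech_var, noise_var, n_grid=20001):
    """Gain E[s | y] / y for y = s + n, s ~ Laplace, n ~ Gaussian (y must be nonzero).

    The posterior mean is approximated by a weighted sum over a uniform grid.
    """
    limit = abs(y) + 10.0 * math.sqrt(max(speech_var, noise_var))
    step = 2.0 * limit / (n_grid - 1)
    b = math.sqrt(speech_var / 2.0)  # Laplace scale; the variance is 2 * b**2
    num = den = 0.0
    for i in range(n_grid):
        s = -limit + i * step
        # Unnormalized posterior: Gaussian likelihood times Laplacian prior.
        w = math.exp(-(y - s) ** 2 / (2.0 * noise_var) - abs(s) / b)
        num += s * w
        den += w
    return (num / den) / y

xi = 1.0  # speech_var / noise_var, i.e. an a priori SNR of 0 dB
g_wiener = wiener_gain(xi)
g_weak = laplacian_mmse_gain(0.1, speech_var=1.0, noise_var=1.0)    # weak observation
g_strong = laplacian_mmse_gain(5.0, speech_var=1.0, noise_var=1.0)  # strong observation
```

For a fixed a priori SNR, the Wiener gain is the same for every observation, whereas the Laplacian-prior gain in this toy falls below it for weak observations and rises above it for strong ones. If the a priori SNR comes from a smooth spectral envelope, a gain that also reacts to the observation itself can therefore attenuate low-energy bins more strongly, which is in the spirit of the reduction of noise between harmonics described in the abstract.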