SD · AI · LG · MM · Feb 13, 2015

Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation

arXiv:1502.04149v4 · 466 citations
AI Analysis

This work addresses monaural (single-channel) source separation for audio applications such as speech and music processing, presenting an incremental improvement over existing methods.

The paper tackles monaural source separation by jointly optimizing masking functions and deep recurrent neural networks. It reports gains of 2.30-4.98 dB SDR over NMF models in speech separation, 2.30-2.48 dB GNSDR and 4.32-5.42 dB GSIR over existing models in singing voice separation, and better results than NMF and DNN baselines in speech denoising.

Monaural source separation is important for many real world applications. It is challenging because, with only a single channel of information available, without any constraints, an infinite number of solutions are possible. In this paper, we explore joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks, including monaural speech separation, monaural singing voice separation, and speech denoising. The joint optimization of the deep recurrent neural networks with an extra masking layer enforces a reconstruction constraint. Moreover, we explore a discriminative criterion for training neural networks to further enhance the separation performance. We evaluate the proposed system on the TSP, MIR-1K, and TIMIT datasets for speech separation, singing voice separation, and speech denoising tasks, respectively. Our approaches achieve 2.30--4.98 dB SDR gain compared to NMF models in the speech separation task, 2.30--2.48 dB GNSDR gain and 4.32--5.42 dB GSIR gain compared to existing models in the singing voice separation task, and outperform NMF and DNN baselines in the speech denoising task.
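The abstract does not spell out the masking layer or the discriminative criterion, so the following is only a minimal NumPy sketch of one common way to realize both ideas, not the authors' implementation. It assumes the network emits two raw magnitude estimates per time-frequency bin, that the extra layer is a soft time-frequency mask applied to the mixture magnitude, and that the discriminative term penalizes similarity to the other source with a weight `gamma`. The function names, `eps`, and the default `gamma` value are hypothetical.

```python
import numpy as np


def soft_mask_layer(y1_hat, y2_hat, mixture_mag, eps=1e-8):
    """Rescale two raw network outputs with a soft time-frequency mask.

    The second mask is taken as (1 - mask1), so the two masked outputs
    sum to the mixture magnitude; this is the reconstruction constraint.
    (Assumed form; eps only guards against division by zero.)
    """
    a1 = np.abs(y1_hat)
    a2 = np.abs(y2_hat)
    mask1 = a1 / (a1 + a2 + eps)
    s1 = mask1 * mixture_mag
    s2 = (1.0 - mask1) * mixture_mag
    return s1, s2


def discriminative_loss(s1, s2, y1, y2, gamma=0.05):
    """Squared error to the matching target, minus gamma times the
    squared error to the other source (assumed form of the criterion)."""
    fit = np.sum((s1 - y1) ** 2) + np.sum((s2 - y2) ** 2)
    cross = np.sum((s1 - y2) ** 2) + np.sum((s2 - y1) ** 2)
    return fit - gamma * cross


# Toy usage: random arrays stand in for magnitude spectrograms
# with shape (frequency bins, frames).
rng = np.random.default_rng(0)
z = np.abs(rng.normal(size=(4, 5)))    # mixture magnitude
raw1 = rng.normal(size=(4, 5))         # raw network output for source 1
raw2 = rng.normal(size=(4, 5))         # raw network output for source 2
s1, s2 = soft_mask_layer(raw1, raw2, z)
loss = discriminative_loss(s1, s2, s1, s2, gamma=0.0)
```

Because the mask is part of the forward computation, a training loss evaluated on `s1` and `s2` would propagate gradients through it into the network, which is what optimizing the mask and the network jointly means in this sketch.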
