Multi-channel Speech Separation Using Deep Embedding Model with Multilayer Bootstrap Networks
This work addresses speech separation for applications in noisy, real-world settings, but it is incremental as it builds on existing deep clustering methods.
The paper tackled the problem of speaker-independent speech separation in mismatched and reverberant environments by proposing DPCL++, a variant of deep clustering that uses multilayer bootstrap networks to reduce noise in embedding vectors and incorporates spatial features. The method demonstrated effectiveness in experiments, though no specific numerical results were provided.
Recently, deep clustering (DPCL) based speaker-independent speech separation has drawn much attention, since it needs little prior information about the speakers. However, it still has much room for improvement, particularly in reverberant environments. If the training and test environments are mismatched, which is a common case, the embedding vectors produced by DPCL may contain much noise and many small variations. To deal with this problem, we propose a variant of DPCL, named DPCL++, which applies a recent unsupervised deep learning method, multilayer bootstrap networks (MBN), to further reduce the noise and small variations of the embedding vectors in an unsupervised way at the test stage, which facilitates k-means in producing a good result. MBN builds a gradually narrowed network from the bottom up via a stack of k-centroids clustering ensembles, where the k-centroids clusterings are trained independently by random sampling and one-nearest-neighbor optimization. To further improve the robustness of DPCL++ in reverberant environments, we take spatial features as part of its input. Experimental results demonstrate the effectiveness of the proposed method.
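The MBN construction described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract states only that each layer is an ensemble of k-centroids clusterings built by random sampling and one-nearest-neighbor assignment, and that the network narrows from the bottom up. Everything else here is an assumption made for illustration: the function names `mbn_layer` and `mbn_transform`, the one-hot encoding of each assignment, the layer sizes `ks=(32, 16, 8)`, and the ensemble size `n_clusterings=10`.

```python
import numpy as np


def mbn_layer(x, k, n_clusterings, rng):
    """One layer: an ensemble of independent k-centroids clusterings.

    x has shape (n_points, n_features); k must not exceed n_points.
    """
    n = x.shape[0]
    outputs = []
    for _ in range(n_clusterings):
        # Random sampling: draw k input points to serve as centroids.
        centroids = x[rng.choice(n, size=k, replace=False)]
        # One-nearest-neighbor step: assign each point to its closest centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        # Assumption: represent each assignment as a one-hot indicator vector.
        one_hot = np.zeros((n, k))
        one_hot[np.arange(n), nearest] = 1.0
        outputs.append(one_hot)
    # The layer output concatenates the indicators of all clusterings.
    return np.concatenate(outputs, axis=1)


def mbn_transform(x, ks=(32, 16, 8), n_clusterings=10, seed=0):
    """Stack layers with decreasing k (hypothetical schedule) from the bottom up."""
    rng = np.random.default_rng(seed)
    h = x
    for k in ks:
        h = mbn_layer(h, k, n_clusterings, rng)
    return h
```

In a DPCL++-style pipeline, `x` would hold the DPCL embedding vectors of the test utterance, and the output of `mbn_transform` would then be passed to k-means. Any further post-processing between MBN and k-means is not specified in the abstract and is omitted here.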