Unsupervised training of a deep clustering model for multichannel blind source separation
This work removes the need for strong supervision when training deep clustering for multichannel blind source separation, which matters for audio processing applications where parallel clean recordings are unavailable.
The paper tackles the problem of training neural network-based source separation without parallel clean data by using unsupervised spatial clustering to guide deep clustering training. Initializing the spatial clustering with the deep clustering result yields a 26% relative word error rate reduction over the unsupervised teacher.
We propose a scheme for training neural network-based source separation algorithms from scratch when parallel clean data is unavailable. In particular, we demonstrate that an unsupervised spatial clustering algorithm is sufficient to guide the training of a deep clustering system. We argue that previous work on deep clustering requires strong supervision and elaborate on why this is a limitation. We demonstrate that (a) the single-channel deep clustering system trained according to the proposed scheme alone achieves word error rates similar to those of the multi-channel teacher and (b) initializing the spatial clustering approach with the deep clustering result yields a relative word error rate reduction of 26% over the unsupervised teacher.
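To make the teacher-student idea concrete, the following is a minimal sketch of how a deep clustering loss could be computed against targets produced by a spatial clustering teacher instead of oracle masks. It uses the standard deep clustering objective, the squared Frobenius norm of VV^T - YY^T, in its memory-efficient expanded form. The function name, the array shapes, and the choice to harden the teacher's class posteriors with an argmax are assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def deep_clustering_loss(embeddings, teacher_posteriors):
    """Deep clustering loss with teacher-derived targets (illustrative sketch).

    embeddings:         (N, E) array, one E-dimensional embedding per
                        time-frequency bin, as produced by the student network.
    teacher_posteriors: (N, K) array of class affiliation posteriors from an
                        unsupervised spatial clustering model (hypothetical input).
    """
    n_sources = teacher_posteriors.shape[1]

    # Assumption: turn the teacher's soft posteriors into one-hot affiliations.
    targets = np.eye(n_sources)[np.argmax(teacher_posteriors, axis=1)]

    # ||V V^T - Y Y^T||_F^2 expanded so that no (N, N) matrix is ever formed.
    vtv = embeddings.T @ embeddings   # (E, E)
    vty = embeddings.T @ targets      # (E, K)
    yty = targets.T @ targets         # (K, K)
    return np.sum(vtv ** 2) - 2.0 * np.sum(vty ** 2) + np.sum(yty ** 2)
```

The expanded form keeps the cost linear in the number of time-frequency bins, which matters because N is typically in the tens of thousands per utterance. No clean reference signals appear anywhere in the computation; the only supervision is the teacher's output.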