Teacher-Student MixIT for Unsupervised and Semi-supervised Speech Separation
This work addresses speech separation for audio processing applications. It offers a semi-supervised approach that reduces the need for large labeled datasets, though the contribution is incremental over existing MixIT methods.
The paper tackles unsupervised and semi-supervised speech separation by introducing a teacher-student framework in which a teacher trained with MixIT supplies separated estimates that train a student with PIT. The framework resolves the over-separation issue of MixIT and achieves performance comparable to a fully-supervised system while using one tenth of the supervised data.
In this paper, we introduce a novel semi-supervised learning framework for end-to-end speech separation. The proposed method first uses mixtures of unseparated sources and the mixture invariant training (MixIT) criterion to train a teacher model. The teacher model then estimates separated sources that are used to train a student model with standard permutation invariant training (PIT). The student model can be fine-tuned with supervised data, i.e., paired artificial mixtures and clean speech sources, and further improved via model distillation. Experiments with single- and multi-channel mixtures show that the teacher-student training resolves the over-separation problem observed in the original MixIT method. Further, the semi-supervised performance is comparable to that of a fully-supervised separation system trained using ten times the amount of supervised data.
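The two training criteria named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `mixit_loss` and `pit_loss` are hypothetical, mean squared error is assumed in place of whatever signal-level loss the paper uses, and random arrays stand in for the teacher and student network outputs. The MixIT-style loss searches over all assignments of estimated sources to input mixtures, while the PIT loss searches over all orderings of the estimates against a set of references (here, teacher estimates used as pseudo-targets).

```python
import itertools
import numpy as np


def mse(a, b):
    """Mean squared error between two equally shaped arrays."""
    return float(np.mean((a - b) ** 2))


def mixit_loss(mixtures, est_sources):
    """MixIT-style loss (assumed MSE variant).

    mixtures:    (n_mix, T) mixtures whose sum is the model input
    est_sources: (n_src, T) sources estimated from that summed input
    Each estimated source is assigned to exactly one mixture; the best
    assignment is the one whose remixed estimates match the mixtures.
    """
    n_mix, n_src = mixtures.shape[0], est_sources.shape[0]
    best_loss, best_assign = np.inf, None
    for assign in itertools.product(range(n_mix), repeat=n_src):
        # Binary mixing matrix with exactly one 1 per column.
        mixing = np.zeros((n_mix, n_src))
        mixing[list(assign), np.arange(n_src)] = 1.0
        loss = mse(mixing @ est_sources, mixtures)
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign


def pit_loss(references, est_sources):
    """PIT loss (assumed MSE variant): minimum over orderings of estimates."""
    n_src = references.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n_src)):
        loss = mse(est_sources[list(perm)], references)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy data: four random "sources" combined into two mixtures.
rng = np.random.default_rng(0)
sources = rng.standard_normal((4, 1000))
mixtures = np.stack([sources[0] + sources[2], sources[1] + sources[3]])

# Teacher step: a perfect teacher output is simulated by the true sources.
teacher_loss, assignment = mixit_loss(mixtures, sources)

# Student step: teacher estimates act as pseudo-targets for PIT.
# The student output is simulated as the same signals in swapped order.
pseudo_targets = sources[:2]
student_out = pseudo_targets[::-1]
student_loss, permutation = pit_loss(pseudo_targets, student_out)
```

Both searches are exhaustive, which is practical only for small source counts: the MixIT search grows as `n_mix ** n_src` and the PIT search as `n_src!`.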