SD AI AS Feb 9, 2021

On permutation invariant training for speech source separation

arXiv:2102.04945v2
8 citations
AI Analysis

This work addresses the permutation ambiguity problem in speech source separation for speaker-independent models, offering incremental improvements to existing PIT strategies.

This paper investigates permutation invariant training (PIT) for speaker-independent speech source separation, extending two state-of-the-art PIT strategies. The authors adapt a frame-level PIT (tPIT) and clustering algorithm to work with waveforms and a learned latent space, proposing an efficient clustering loss. They also extend an auxiliary speaker-ID loss with a deep feature loss to reduce local permutation errors in utterance-level PIT (uPIT).

We study permutation invariant training (PIT), which targets the permutation ambiguity problem in speaker-independent source separation models. We extend two state-of-the-art PIT strategies. First, we look at the two-stage speaker separation and tracking algorithm based on frame-level PIT (tPIT) and clustering, which was originally proposed for the STFT domain, and we adapt it to work with waveforms and over a learned latent space. Further, we propose an efficient clustering loss that scales to waveform models. Second, we extend a recently proposed auxiliary speaker-ID loss with a deep feature loss based on "problem agnostic speech features", to reduce the local permutation errors made by utterance-level PIT (uPIT). Our results show that the proposed extensions help reduce permutation ambiguity. However, we also note that the studied STFT-based models are more effective at reducing permutation errors than waveform-based models, a perspective overlooked in recent studies.
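To make the difference between the two PIT variants named in the abstract concrete, here is a minimal NumPy sketch. It is an illustration only, not the authors' implementation: the function names `upit_mse` and `tpit_mse`, the pre-framed input shape `(n_src, n_frames, frame_len)`, and the use of mean squared error as the training loss are all assumptions made for this example. uPIT picks a single speaker permutation for the whole utterance, while tPIT picks the best permutation independently for every frame, which is why tPIT needs a later tracking or clustering stage to link frames to speakers.

```python
import itertools

import numpy as np


def upit_mse(est, ref):
    """Utterance-level PIT (sketch): one permutation for the whole utterance.

    est, ref: arrays of shape (n_src, n_frames, frame_len); this layout
    and the MSE loss are assumptions of this example.
    Returns (loss, best_perm).
    """
    n_src = est.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n_src)):
        # Reorder the estimates so est[perm[i]] is compared with ref[i].
        loss = np.mean((est[list(perm)] - ref) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


def tpit_mse(est, ref):
    """Frame-level PIT (sketch): the best permutation is chosen per frame.

    Returns (loss, list of per-frame permutations).
    """
    n_src = est.shape[0]
    perms = list(itertools.permutations(range(n_src)))
    # per_frame[p, t] = MSE of frame t under permutation p.
    per_frame = np.stack(
        [np.mean((est[list(p)] - ref) ** 2, axis=(0, 2)) for p in perms]
    )
    best_idx = per_frame.argmin(axis=0)
    loss = per_frame.min(axis=0).mean()
    return loss, [perms[i] for i in best_idx]
```

If the estimates are perfect within each frame but the two output channels swap speakers halfway through the utterance (a local permutation error), `tpit_mse` still reaches zero loss because it re-selects the permutation at every frame, whereas `upit_mse` must commit to one assignment and so reports a nonzero loss. The tPIT loss can never exceed the uPIT loss under this formulation.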
