SD · LG · AS · May 28, 2021

DIVE: End-to-end Speech Diarization via Iterative Speaker Embedding

arXiv:2105.13802v1 · 13 citations
Originality: Highly original
AI Analysis

This work addresses speaker diarization for applications like meeting transcription, presenting a novel end-to-end approach that sets a new state-of-the-art.

DIVE is an end-to-end neural diarization algorithm that iteratively builds a representation for each speaker and then predicts each speaker's voice activity conditioned on those representations. It achieves a Diarization Error Rate (DER) of 6.7% on the CALLHOME benchmark, compared with 7.8% for the previous best system.

We introduce DIVE, an end-to-end speaker diarization algorithm. Our neural algorithm presents the diarization task as an iterative process: it repeatedly builds a representation for each speaker before predicting the voice activity of each speaker conditioned on the extracted representations. This strategy intrinsically resolves the speaker ordering ambiguity without requiring the classical permutation invariant training loss. In contrast with prior work, our model does not rely on pretrained speaker representations and optimizes all parameters of the system with a multi-speaker voice activity loss. Importantly, our loss explicitly excludes unreliable speaker turn boundaries from training, which is adapted to the standard collar-based Diarization Error Rate (DER) evaluation. Overall, these contributions yield a system redefining the state-of-the-art on the standard CALLHOME benchmark, with 6.7% DER compared to 7.8% for the best alternative.
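
To make the boundary-exclusion idea concrete, here is a minimal sketch of a multi-speaker voice activity loss that ignores frames near speaker turn boundaries. It is not the authors' implementation: the function names `boundary_mask` and `masked_vad_loss`, the frame-based `collar` parameter, the choice to mask a frame for all speakers whenever any speaker changes state nearby, and the use of plain binary cross-entropy are all assumptions made for illustration.

```python
import math


def boundary_mask(labels, collar):
    """Return one weight per frame: 0.0 near a speaker turn boundary, else 1.0.

    labels: list of per-speaker lists of 0/1 activity, all of equal length.
    collar: number of frames excluded on each side of a boundary
            (hypothetical parameter; the paper's exact value is not assumed).
    """
    num_frames = len(labels[0])
    mask = [1.0] * num_frames
    for speaker in labels:
        for t in range(1, num_frames):
            if speaker[t] != speaker[t - 1]:
                # The boundary lies between frames t-1 and t.
                start = max(0, t - collar)
                stop = min(num_frames, t + collar)
                for u in range(start, stop):
                    mask[u] = 0.0
    return mask


def masked_vad_loss(probs, labels, collar, eps=1e-7):
    """Mean binary cross-entropy over speakers and frames outside the collar.

    probs: list of per-speaker lists of predicted activity probabilities.
    """
    mask = boundary_mask(labels, collar)
    total = 0.0
    weight = 0.0
    for speaker_probs, speaker_labels in zip(probs, labels):
        for p, y, m in zip(speaker_probs, speaker_labels, mask):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            bce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            total += m * bce
            weight += m
    return total / weight if weight > 0 else 0.0


# Two speakers, eight frames, one turn change each.
labels = [
    [0, 0, 0, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0, 0, 0],
]
mask = boundary_mask(labels, collar=1)
```

With this mask, a prediction error on a frame inside the collar contributes nothing to the loss, which mirrors how a collar-based DER evaluation ignores those frames at scoring time.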

