AS, AI, CL, LG
Jun 26, 2024

Speakers Unembedded: Embedding-free Approach to Long-form Neural Diarization

arXiv:2406.18679v1
1 citation
Originality: Incremental advance
AI Analysis

This work addresses speaker diarization for long-form audio in applications such as call analysis. It improves on existing methods, but the advance is incremental because it builds on EEND-vector-clustering.

The paper tackles the problem of generalizing end-to-end neural diarization (EEND) to long-form audio with many speakers. It proposes a framework that applies EEND both locally and globally without separate speaker embeddings, achieving relative DER reductions of 13% on Callhome American English and 10% on RT03-CTS compared with the conventional 1-pass EEND.
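The relative DER reductions quoted above are fractions of the baseline error rather than absolute percentage-point drops. A minimal sketch of that arithmetic follows; the function name and the example DER values are hypothetical and are not taken from the paper.

```python
def relative_der_reduction(baseline_der: float, new_der: float) -> float:
    """Return the DER reduction as a fraction of the baseline DER."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - new_der) / baseline_der


# Hypothetical DER values (in percent), chosen only to illustrate the formula.
example_reduction = relative_der_reduction(baseline_der=10.0, new_der=8.7)
print(f"relative DER reduction: {example_reduction:.0%}")
```

With these illustrative inputs, a drop from 10.0% to 8.7% DER is a 1.3-point absolute reduction but a 13% relative reduction.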

End-to-end neural diarization (EEND) models offer significant improvements over traditional embedding-based Speaker Diarization (SD) approaches but fall short on generalizing to long-form audio with a large number of speakers. The EEND-vector-clustering method mitigates this by combining local EEND with global clustering of speaker embeddings from local windows, but this requires an additional speaker embedding framework alongside the EEND module. In this paper, we propose a novel framework applying EEND both locally and globally for long-form audio without separate speaker embeddings. This approach achieves significant relative DER reductions of 13% and 10% over the conventional 1-pass EEND on the Callhome American English and RT03-CTS datasets, respectively, and marginal improvements over EEND-vector-clustering without the need for additional speaker embeddings. Furthermore, we discuss the computational complexity of our proposed framework and explore strategies for reducing processing times.
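The abstract describes running EEND on local windows of a long recording before a global stage. The windowing step alone might be sketched as below, assuming fixed-length windows with a fixed hop. The function name, the window length, and the hop are hypothetical choices for illustration, and the sketch does not attempt to reproduce the paper's local or global EEND stages.

```python
from typing import List, Tuple


def split_into_windows(num_frames: int, window: int, hop: int) -> List[Tuple[int, int]]:
    """Split a recording of num_frames frames into (start, end) frame ranges.

    Ranges are half-open [start, end). The last window is truncated so that
    it never extends past the end of the recording.
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    windows = []
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break  # the recording is fully covered
        start += hop
    return windows


# Hypothetical sizes: a 10-frame recording, 4-frame windows, no overlap.
print(split_into_windows(num_frames=10, window=4, hop=4))
```

Setting `hop` smaller than `window` yields overlapping windows; whether the paper uses overlap is not stated in the abstract.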
