SD CL AS Nov 5, 2020

BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers

arXiv:2011.02678v2 12.447 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for real-time speaker diarization in applications such as teleconferencing, but the advance is incremental because it builds on the existing EDA-EEND architecture.

The paper tackles online speaker diarization for a variable number of speakers by proposing BW-EDA-EEND, a streaming end-to-end neural system that processes audio incrementally with time complexity linear in the input length. With a 10-second context, the unlimited-latency variant shows only moderate degradation relative to offline EDA-EEND for up to two speakers, and with unlimited context it outperforms a baseline offline clustering system for one to four speakers. The limited-latency variant achieves accuracy comparable to that offline clustering-based system.

We present a novel online end-to-end neural diarization system, BW-EDA-EEND, that processes data incrementally for a variable number of speakers. The system is based on the Encoder-Decoder-Attractor (EDA) architecture of Horiguchi et al., but utilizes the incremental Transformer encoder, attending only to its left contexts and using block-level recurrence in the hidden states to carry information from block to block, making the algorithm complexity linear in time. We propose two variants: For unlimited-latency BW-EDA-EEND, which processes inputs in linear time, we show only moderate degradation for up to two speakers using a context size of 10 seconds compared to offline EDA-EEND. With more than two speakers, the accuracy gap between online and offline grows, but the algorithm still outperforms a baseline offline clustering diarization system for one to four speakers with unlimited context size, and shows comparable accuracy with context size of 10 seconds. For limited-latency BW-EDA-EEND, which produces diarization outputs block-by-block as audio arrives, we show accuracy comparable to the offline clustering-based system.
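The block-wise processing described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `iter_blocks` and `attention_pairs` are hypothetical helper names, the cache holds raw frames rather than encoder hidden states, and the assumption that each block attends to itself plus a fixed-size left context is a simplification. It only shows why a bounded left context keeps the attention cost linear in the number of frames.

```python
from typing import Iterator, List, Tuple


def iter_blocks(
    frames: List[float], block_size: int, context_size: int
) -> Iterator[Tuple[List[float], List[float]]]:
    """Yield (left_context, current_block) pairs in arrival order.

    Assumption: the cache stands in for the hidden states that a real
    model would carry from block to block.
    """
    cache: List[float] = []
    for start in range(0, len(frames), block_size):
        block = frames[start:start + block_size]
        yield list(cache), block
        # Keep only the most recent `context_size` items as left context.
        cache = (cache + block)[-context_size:] if context_size > 0 else []


def attention_pairs(num_frames: int, block_size: int, context_size: int) -> int:
    """Count query-key pairs scored when each block attends to
    its own frames plus the cached left context."""
    frames = [0.0] * num_frames
    return sum(
        len(block) * (len(context) + len(block))
        for context, block in iter_blocks(frames, block_size, context_size)
    )


if __name__ == "__main__":
    for n in (8, 16, 32):
        # Block-wise cost grows linearly; full self-attention grows as n * n.
        print(n, attention_pairs(n, block_size=2, context_size=2), n * n)
```

With a block size of 2 and a context of 2 frames, doubling the input from 8 to 16 frames raises the pair count from 28 to 60, whereas full self-attention would go from 64 to 256. Each additional block costs the same fixed amount, which is the sense in which the complexity is linear in time.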

