SD, LG, AS | Nov 28, 2021

Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information

arXiv:2111.13694v1 | 1 citation
Originality: Incremental advance
AI Analysis

This work addresses speaker diarization for flexible numbers of speakers, integrating textual information for improved accuracy in applications like meeting transcription, though it appears incremental as it builds on existing embedding and encoding techniques.

The paper tackles overlapping speech diarization by reformulating it as a single-label prediction problem using power set encoding. It proposes the SEND method, which predicts the encoded labels from the similarities between speech features and given speaker embeddings, and which can additionally use textual information to reduce diarization errors. In real meeting scenarios, SEND achieves a 34.11% relative improvement over a Bayesian hidden Markov model based clustering baseline.

Overlapping speech diarization is always treated as a multi-label classification problem. In this paper, we reformulate this task as a single-label prediction problem by encoding the multi-speaker labels with the power set. Specifically, we propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels according to the similarities between speech features and given speaker embeddings. Our method is further extended and integrated with downstream tasks by utilizing textual information, which has not been well studied in previous literature. The experimental results show that our method achieves a lower diarization error rate than target-speaker voice activity detection. When textual information is involved, the diarization errors can be further reduced. For the real meeting scenario, our method achieves a 34.11% relative improvement compared with the Bayesian hidden Markov model based clustering algorithm.
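
To make the power set reformulation concrete, here is a minimal sketch of how a per-frame multi-label speaker activity vector can be turned into a single class label and back. The function names `powerset_encode` and `powerset_decode` are hypothetical, and the binary-index mapping is an assumption chosen for illustration; the abstract does not say how the paper orders its power set classes.

```python
from typing import List


def powerset_encode(active: List[int]) -> int:
    """Map a multi-hot speaker activity vector to one class index.

    Assumption: speaker k contributes bit k, so N speakers yield 2**N
    classes. This is one possible bijection, not necessarily the paper's.
    """
    return sum(bit << k for k, bit in enumerate(active))


def powerset_decode(label: int, num_speakers: int) -> List[int]:
    """Recover the multi-hot speaker activity vector from a class index."""
    return [(label >> k) & 1 for k in range(num_speakers)]


# Four frames with three candidate speakers:
# silence, speaker 0 alone, speakers 0 and 1 overlapping, speaker 2 alone.
frames = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
labels = [powerset_encode(f) for f in frames]
decoded = [powerset_decode(y, num_speakers=3) for y in labels]
```

With this mapping each frame carries exactly one label, so overlapping speech becomes an ordinary class rather than several simultaneous binary decisions. The number of classes grows as 2^N in the number of speakers, which is the usual cost of a power set encoding.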

Code Implementations: 2 repos