AS CL Mar 30, 2022

Multi-scale Speaker Diarization with Dynamic Scale Weighting

arXiv:2203.15974v1
29 citations
Originality: Incremental advance
AI Analysis

This work addresses speaker diarization for audio processing applications, offering an incremental improvement over earlier multi-scale techniques.

The paper tackles the trade-off between temporal resolution and speaker representation fidelity in speaker diarization by proposing a multi-scale system with dynamic scale weighting. It reports state-of-the-art diarization error rates of 3.92% on the CALLHOME dataset and 1.05% on the AMI MixHeadset dataset.

Speaker diarization systems are challenged by a trade-off between the temporal resolution and the fidelity of the speaker representation. A multi-scale approach copes with this trade-off by obtaining superior temporal resolution with enhanced accuracy. In this paper, we propose a more advanced multi-scale diarization system based on a multi-scale diarization decoder. Two main contributions in this study significantly improve the diarization performance. First, we use multi-scale clustering as an initialization to estimate the number of speakers and to obtain the average speaker representation vector for each speaker and each scale. Second, we propose the use of 1-D convolutional neural networks that dynamically determine the importance of each scale at each time step. To handle a variable number of speakers and overlapping speech, the proposed system can estimate the number of existing speakers. Our proposed system achieves state-of-the-art performance on the CALLHOME and AMI MixHeadset datasets, with diarization error rates of 3.92% and 1.05%, respectively.
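
The dynamic scale weighting idea from the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that, at each time step, one embedding per scale is compared by cosine similarity with each speaker's per-scale average embedding, and that the per-scale similarities are combined using softmax-normalized scale weights. In the paper the weights come from trained 1-D convolutional networks; here they are hand-set logits. The function names (`softmax`, `cosine`, `weighted_speaker_scores`) and the toy data are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

def weighted_speaker_scores(step_embs, speaker_avgs, scale_logits):
    """Fuse per-scale cosine similarities for one time step.

    step_embs:    K embeddings, one per scale, for the current step
    speaker_avgs: speaker_avgs[s][k] is speaker s's average embedding at scale k
    scale_logits: K unnormalized scale scores (assumed stand-in for the
                  output of the 1-D CNN described in the abstract)
    """
    weights = softmax(scale_logits)
    scores = []
    for per_scale_avgs in speaker_avgs:
        sims = [cosine(e, a) for e, a in zip(step_embs, per_scale_avgs)]
        scores.append(sum(w * s for w, s in zip(weights, sims)))
    return scores

# Toy example: two scales, two speakers, 2-D embeddings.
step_embs = [[1.0, 0.0], [0.0, 1.0]]
speaker_avgs = [
    [[1.0, 0.0], [1.0, 0.0]],  # speaker 0, scales 0 and 1
    [[0.0, 1.0], [0.0, 1.0]],  # speaker 1, scales 0 and 1
]
# Equal logits weight both scales equally, so the two speakers tie.
print(weighted_speaker_scores(step_embs, speaker_avgs, [0.0, 0.0]))
# Favoring scale 0 makes the step look like speaker 0.
print(weighted_speaker_scores(step_embs, speaker_avgs, [10.0, -10.0]))
```

In this toy case the two scales disagree about who is speaking, so the scale weights alone decide the outcome. That is the situation the abstract's per-time-step weighting is meant to resolve.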
