SD AS Dec 14, 2021

End-to-end speaker diarization with transformer

arXiv:2112.07463v1
Originality: Incremental advance
AI Analysis

This paper addresses speaker diarization for multimedia and meeting applications. Its contribution is incremental: it adapts a semantic segmentation method from computer vision to speech.

The paper proposes an end-to-end speaker diarization model called DiFormer that predicts binary masks, vocal activities, and speaker vectors. This lets it handle an unknown number of speakers, overlapping speech, and vocal activity detection in one model. Experiments on multimedia and meeting datasets are reported to show its effectiveness, though the abstract gives no concrete numbers.

Speaker diarization is connected to semantic segmentation in computer vision. Inspired by MaskFormer \cite{cheng2021per}, which treats semantic segmentation as a set-prediction problem, we propose an end-to-end approach that predicts a set of targets consisting of binary masks, vocal activities, and speaker vectors. Our model, which we call \textit{DiFormer}, is based mainly on a speaker encoder and a feature pyramid network (FPN) module that extract multi-scale speaker features. These features are fed into a transformer encoder-decoder, which predicts a set of diarization targets from learned query embeddings. To account for the temporal characteristics of the speech signal, bidirectional LSTMs are inserted into the mask prediction module to improve temporal consistency. Our model handles an unknown number of speakers, speech overlaps, and vocal activity detection in a unified way. Experiments on multimedia and meeting datasets demonstrate the effectiveness of our approach.
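
To make the set-prediction output more concrete, the following is a minimal sketch of how per-query predictions might be turned into speaker segments. It is not the authors' decoding procedure, which the abstract does not describe. The function name `decode_diarization`, the parameters `frame_sec` and `threshold`, and the 0.5 cutoff are all hypothetical choices made for illustration. The sketch assumes each query yields one per-frame mask of probabilities and one vocal activity score, and that a query with a low activity score corresponds to no speaker.

```python
from typing import List, Sequence, Tuple


def decode_diarization(
    masks: Sequence[Sequence[float]],
    activities: Sequence[float],
    frame_sec: float = 1.0,
    threshold: float = 0.5,
) -> List[Tuple[int, float, float]]:
    """Convert per-query predictions into (speaker, start, end) segments.

    Hypothetical decoding rule (an assumption, not taken from the paper):
    a query counts as a speaker only if its activity score reaches the
    threshold, and a frame belongs to that speaker if its mask value
    reaches the threshold.
    """
    segments: List[Tuple[int, float, float]] = []
    speaker = 0
    for mask, activity in zip(masks, activities):
        if activity < threshold:
            continue  # inactive query: treated as "no speaker"
        start = None
        # A trailing 0.0 sentinel closes a run that lasts until the final frame.
        for t, p in enumerate(list(mask) + [0.0]):
            if p >= threshold and start is None:
                start = t  # a speech run begins at frame t
            elif p < threshold and start is not None:
                segments.append((speaker, start * frame_sec, t * frame_sec))
                start = None
        speaker += 1
    return segments


# Three queries over four frames; the third query is inactive.
example_masks = [
    [0.9, 0.8, 0.2, 0.1],
    [0.1, 0.7, 0.9, 0.6],
    [0.9, 0.9, 0.9, 0.9],
]
example_activities = [0.95, 0.8, 0.1]
print(decode_diarization(example_masks, example_activities))
```

Under these assumptions, the number of speakers is simply the number of active queries, and overlapping speech needs no special handling because each query's mask is thresholded independently, so two speakers can both claim the same frame.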

