AS · CL · LG · ML · Aug 4, 2018

Triplet Network with Attention for Speaker Diarization

arXiv:1808.01535v1 · 18 citations
Originality: Incremental advance
AI Analysis

This work addresses speaker diarization for automatic speech processing systems, presenting an incremental improvement over prior methods.

The paper tackles speaker diarization by proposing a triplet network with attention that learns embeddings and the metric end-to-end from sequences, and it reports improved performance on the CALLHOME corpus compared to existing methods.

In automatic speech processing systems, speaker diarization is a crucial front-end component to separate segments from different speakers. Inspired by the recent success of deep neural networks (DNNs) in semantic inferencing, triplet loss-based architectures have been successfully used for this problem. However, existing work utilizes conventional i-vectors as the input representation and builds simple fully connected networks for metric learning, thus not fully leveraging the modeling power of DNN architectures. This paper investigates the importance of learning effective representations from the sequences directly in metric learning pipelines for speaker diarization. More specifically, we propose to employ attention models to learn embeddings and the metric jointly in an end-to-end fashion. Experiments are conducted on the CALLHOME conversational speech corpus. The diarization results demonstrate that, besides providing a unified model, the proposed approach achieves improved performance when compared against existing approaches.
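To make the triplet-loss idea mentioned above concrete, here is a minimal sketch of a generic hinge-style triplet loss over fixed-length embeddings. It is not the paper's implementation: the Euclidean distance, the margin value, and the function names `euclidean` and `triplet_loss` are all assumptions chosen for illustration, and the attention model that would produce the embeddings from speech sequences is omitted.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic hinge-style triplet loss (illustrative, not the paper's exact form).

    anchor and positive are embeddings of segments from the same speaker;
    negative is an embedding of a segment from a different speaker.
    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings, for illustration only.
well_separated = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
violating = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```

In an end-to-end setup such as the one the abstract describes, the embeddings would come from the sequence model itself, so the gradient of a loss like this would update both the representation and the metric together.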
