SD · CL · AS · Oct 22, 2020

Compositional embedding models for speaker identification and diarization with simultaneous speech from 2+ speakers

arXiv:2010.11803v2 · 1 citation
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of identifying and diarizing speakers when several people talk at once. The advance is rated incremental because it builds on existing embedding methods.

The paper tackles speaker diarization with overlapping speech from multiple speakers by proposing a compositional embedding method. The method outperforms traditional embedding methods in multi-person speaker identification and achieves state-of-the-art accuracy on the AMI Headset Mix corpus, with a diarization error rate (DER) of 22.93%.
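As a reminder of what the DER figure measures, here is a minimal sketch of the conventional definition: missed speech, false-alarm speech, and speaker-confusion time, divided by total reference speech time. The page does not spell out the paper's scoring setup (collar, overlap handling), so the function name and the example timings below are illustrative assumptions, not values from the paper.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Return DER as a percentage of total reference speech time.

    All arguments are durations in seconds. This follows the conventional
    definition; the paper's exact scoring options are not given on this page.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return 100.0 * (missed + false_alarm + confusion) / total_speech


# Hypothetical timings (not from the paper): 60 s missed, 30 s false alarm,
# 10 s speaker confusion, over 1000 s of reference speech.
example_der = diarization_error_rate(60.0, 30.0, 10.0, 1000.0)

# Gap between the two DER figures quoted on this page, in percentage points.
absolute_gap = round(23.82 - 22.93, 2)
```

Lower DER is better, so the reported 22.93% improves on the earlier 23.82% by `absolute_gap` percentage points.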

We propose a new method for speaker diarization that can handle overlapping speech from two or more people. Our method is based on compositional embeddings [1]: like standard speaker embedding methods such as x-vector [2], compositional embedding models contain a function f that separates speech from different speakers. In addition, they include a composition function g to compute set-union operations in the embedding space so as to infer the set of speakers within the input audio. In an experiment on multi-person speaker identification using synthesized LibriSpeech data, the proposed method outperforms traditional embedding methods that are only trained to separate single speakers (not speaker sets). In a speaker diarization experiment on the AMI Headset Mix corpus, we achieve state-of-the-art accuracy (DER=22.93%), a slightly lower error rate than the previous best result (DER=23.82% from [3]).
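To make the roles of f and g concrete, here is a toy sketch of how a speaker set might be inferred once embeddings are available. It assumes the output of f is already given as a plain vector, and it replaces the learned composition function g with a hypothetical stand-in (a normalized sum). In the paper g is trained, so this only illustrates the inference pattern: compose the embeddings of each candidate speaker set and pick the set closest to the embedding of the input audio. All names and vectors are made up for illustration.

```python
import math
from itertools import combinations


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def g(a, b):
    """Hypothetical stand-in for the learned composition function g.

    Uses the normalized sum of two embeddings; the paper trains g instead.
    """
    return normalize([x + y for x, y in zip(a, b)])


def compose(embeddings):
    """Fold g over a non-empty list of embeddings to embed a speaker set."""
    result = embeddings[0]
    for e in embeddings[1:]:
        result = g(result, e)
    return result


def infer_speaker_set(query, enrolled, max_speakers=2):
    """Return the enrolled speaker subset whose composed embedding has the
    highest cosine similarity with the (unit-length) query embedding."""
    best_set, best_score = None, -math.inf
    for k in range(1, max_speakers + 1):
        for subset in combinations(sorted(enrolled), k):
            candidate = compose([enrolled[name] for name in subset])
            score = sum(q * c for q, c in zip(query, candidate))
            if score > best_score:
                best_set, best_score = set(subset), score
    return best_set


# Made-up enrolled embeddings, standing in for f applied to single-speaker audio.
enrolled = {
    "alice": normalize([1.0, 0.0, 0.0]),
    "bob": normalize([0.0, 1.0, 0.0]),
    "carol": normalize([0.0, 0.0, 1.0]),
}

# Made-up query embeddings, standing in for f applied to the input audio.
overlap_query = normalize([0.9, 1.1, 0.05])  # resembles alice and bob together
single_query = normalize([1.0, 0.1, 0.0])    # resembles alice alone

overlap_result = infer_speaker_set(overlap_query, enrolled)
single_result = infer_speaker_set(single_query, enrolled)
```

Enumerating every subset grows combinatorially with the number of enrolled speakers, which is why the sketch caps the set size with `max_speakers`.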

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
