SD · AS · Sep 25, 2018

Attention Mechanism in Speaker Recognition: What Does It Learn in Deep Speaker Embedding?

arXiv:1809.09311v1 · 31 citations
Originality: Incremental advance
AI Analysis

The paper aims to improve speaker recognition accuracy for applications such as security and voice assistants. The advance is incremental because it adapts existing attention methods.

This paper investigates decoupling an attention mechanism from its original deep speaker embedding network to assist other systems, showing a 9.0% EER reduction and 3.8% min_Cprimary reduction when applied to i-vector extraction, and further gains when combined with DNN-based VAD.

This paper presents an experimental study on deep speaker embedding with an attention mechanism that has been found to be a powerful representation learning technique in speaker recognition. In this framework, an attention model works as a frame selector that computes an attention weight for each frame-level feature vector, in accord with which an utterance-level representation is produced at the pooling layer in a speaker embedding network. In general, an attention model is trained together with the speaker embedding network on a single objective function, and thus those two components are tightly bound to one another. In this paper, we consider the possibility that the attention model might be decoupled from its parent network and assist other speaker embedding networks and even conventional i-vector extractors. This possibility is demonstrated through a series of experiments on a NIST Speaker Recognition Evaluation (SRE) task, with 9.0% EER reduction and 3.8% min_Cprimary reduction when the attention weights are applied to i-vector extraction. Another experiment shows that DNN-based soft voice activity detection (VAD) can be effectively combined with the attention mechanism to yield further reduction of min_Cprimary by 6.6% and 1.6% in deep speaker embedding and i-vector systems, respectively.
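The frame-selection idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the dot-product scoring with a vector `v`, the function names, and the rule for merging VAD posteriors (multiply, then renormalize) are all assumptions made for illustration. The abstract only states that per-frame attention weights drive the pooling and that soft VAD can be combined with them.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(frames, v):
    """One weight per frame. ASSUMPTION: the score is a plain dot
    product with a parameter vector v; the paper's attention model
    is a trained network and may differ."""
    scores = [sum(x * w for x, w in zip(frame, v)) for frame in frames]
    return softmax(scores)


def weighted_mean_pool(frames, weights):
    """Utterance-level vector as the weighted mean of frame vectors."""
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def combine_with_vad(weights, vad_posteriors):
    """ASSUMPTION: merge soft VAD with attention by multiplying the two
    per-frame values and renormalizing. Hypothetical combination rule."""
    prod = [w * p for w, p in zip(weights, vad_posteriors)]
    total = sum(prod)
    if total == 0.0:
        return list(weights)  # no speech detected: keep the attention weights
    return [p / total for p in prod]


# Toy usage: three 2-dimensional frames, with the middle frame marked non-speech.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(frames, [0.5, -0.5])
weights = combine_with_vad(weights, [1.0, 0.0, 1.0])
utterance_vector = weighted_mean_pool(frames, weights)
```

Because the weights are just a per-frame list, they can be computed by one model and passed to a different pooling or statistics routine, which is the sense of decoupling the abstract describes.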
