AS SD SP · Jan 14, 2020

An Improved Deep Neural Network for Modeling Speaker Characteristics at Different Temporal Scales

arXiv:2001.04584v1 · 5 citations
AI Analysis

This work addresses speaker verification for security and authentication applications, but it is incremental as it builds upon existing x-vector embedding methods.

The paper tackles speaker verification by proposing an improved deep embedding method with multi-scale convolution and a Baum-Welch statistics attention mechanism, and reports performance gains on the NIST SRE16 evaluation set.

This paper presents an improved deep embedding learning method based on a convolutional neural network (CNN) for text-independent speaker verification. Two improvements are proposed for x-vector embedding learning: (1) Multi-scale convolution (MSCNN) is adopted in the frame-level layers to capture complementary speaker information in different receptive fields. (2) A Baum-Welch statistics attention (BWSA) mechanism is applied in the temporal pooling layer to integrate more useful long-term speaker characteristics. Experiments are carried out on the NIST SRE16 evaluation set. The results demonstrate the effectiveness of MSCNN and show that the proposed BWSA can further improve the performance of the DNN embedding system.
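The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The kernel widths (1, 3, 5), the layer sizes, the random weights, and the function names `conv1d_same`, `multi_scale_layer`, and `attentive_stats_pooling` are all assumptions made for illustration. In particular, the attention scores here come from a random linear projection, whereas the paper derives its attention from Baum-Welch statistics; that derivation is not reproduced.

```python
import numpy as np

def conv1d_same(x, w):
    """Temporal convolution with zero 'same' padding.
    x: (T, D_in) frame features; w: (k, D_in, D_out) kernel, k odd."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    # Stack the k time-shifted views of the sequence: (T, k, D_in)
    windows = np.stack([xp[i:i + x.shape[0]] for i in range(k)], axis=1)
    return np.einsum("tki,kio->to", windows, w)

def multi_scale_layer(x, kernels):
    """Run parallel branches with different kernel widths (receptive
    fields), concatenate them on the feature axis, then apply ReLU."""
    branches = [conv1d_same(x, w) for w in kernels]
    return np.maximum(np.concatenate(branches, axis=1), 0.0)

def attentive_stats_pooling(h, scores):
    """Attention-weighted mean and standard deviation over time.
    h: (T, D) frame-level outputs; scores: (T,) unnormalised scores.
    Generic attention pooling, used as a stand-in for BWSA (assumption)."""
    a = np.exp(scores - scores.max())
    a /= a.sum()                      # softmax over frames
    mean = a @ h
    var = a @ (h ** 2) - mean ** 2
    std = np.sqrt(np.clip(var, 1e-8, None))
    return np.concatenate([mean, std])

rng = np.random.default_rng(0)
T, D_in, D_out = 200, 23, 16          # hypothetical sizes
x = rng.standard_normal((T, D_in))
kernels = [rng.standard_normal((k, D_in, D_out)) * 0.1 for k in (1, 3, 5)]

h = multi_scale_layer(x, kernels)               # frame-level features
scores = h @ rng.standard_normal(h.shape[1])    # placeholder attention scores
pooled = attentive_stats_pooling(h, scores)     # utterance-level vector
print(h.shape, pooled.shape)
```

Each branch sees a different temporal context, so concatenating them gives later layers access to several receptive fields at once. With uniform scores the pooling reduces to the plain mean and standard deviation used in standard x-vector statistics pooling.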

