AS, AI, LG, SD · Jul 22, 2018

Unified Hypersphere Embedding for Speaker Recognition

arXiv:1807.08312v1 · 88 citations
Originality: Incremental advance
AI Analysis

This work addresses efficiency and accuracy challenges in speaker recognition for applications such as security and voice assistants. Its contribution is incremental: it builds on existing methods with targeted optimizations.

The paper aims to improve speaker recognition accuracy without extra data or deeper models. It combines data augmentation, a search for the optimal embedding dimensionality, and a new loss function, reporting up to 18% fewer prediction errors, state-of-the-art identification accuracy, and competitive verification accuracy on the VoxCeleb dataset.

Incremental improvements in the accuracy of Convolutional Neural Networks are usually achieved through the use of deeper and more complex models trained on larger datasets. However, enlarging datasets and models increases computation and storage costs and cannot be done indefinitely. In this work, we seek to improve the identification and verification accuracy of a text-independent speaker recognition system without the use of extra data or deeper and more complex models by augmenting the training and testing data, finding the optimal dimensionality of the embedding space, and using more discriminative loss functions. Results of experiments on the VoxCeleb dataset suggest that: (i) Simple repetition and random time-reversion of utterances can reduce prediction errors by up to 18%. (ii) Lower dimensional embeddings are more suitable for verification. (iii) Use of the proposed logistic margin loss function leads to unified embeddings with state-of-the-art identification and competitive verification accuracies.
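
The two augmentations named in finding (i) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "simple repetition" means looping an utterance until it fills a fixed number of samples, and that "random time-reversion" means playing the result backwards with some probability. The function name `augment_utterance` and the parameters `target_len` and `reverse_prob` are hypothetical, chosen only for this example.

```python
import random


def augment_utterance(samples, target_len, rng, reverse_prob=0.5):
    """Repeat a waveform to target_len samples, then randomly time-reverse it.

    Assumed reading of the abstract's augmentations, not the paper's code.
    """
    if not samples:
        raise ValueError("expected a non-empty sequence of samples")
    if target_len <= 0:
        raise ValueError("target_len must be positive")

    # Simple repetition: loop the utterance until it covers target_len, then crop.
    repeats = -(-target_len // len(samples))  # ceiling division
    out = (list(samples) * repeats)[:target_len]

    # Random time-reversion: flip the time axis with probability reverse_prob.
    if rng.random() < reverse_prob:
        out.reverse()
    return out


# Hypothetical usage on a toy five-sample "utterance".
rng = random.Random(0)
toy = [0.0, 1.0, 2.0, 3.0, 4.0]
never_reversed = augment_utterance(toy, 12, rng, reverse_prob=0.0)
always_reversed = augment_utterance(toy, 12, rng, reverse_prob=1.0)
```

Under these assumptions the same function could be applied to both training and testing utterances, which matches the abstract's statement that both sets are augmented; how the paper combines predictions over augmented test utterances is not specified here.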

Code Implementations: 1 repo