SD CL AS · Nov 5, 2018

How to Improve Your Speaker Embeddings Extractor in Generic Toolkits

Hossein Zeinali, Lukas Burget, Johan Rohdin, Themos Stafylakis, Jan Cernocky

arXiv:1811.02066v1 · 16.251 citations

Originality: Synthesis-oriented

AI Analysis

This work aims to make speaker verification methods more accessible to researchers and practitioners, and easier to improve, by moving beyond specialized toolkits. Its contribution is incremental in nature.

The paper implements a speaker embedding extractor in a toolkit more generic than Kaldi and examines training tricks such as normalizing input features and pooled statistics, methods for preventing overfitting, and alternative non-linearities. It also compares TDNN with CNN architectures and two types of attention mechanism. Experiments on the Speakers in the Wild, SRE 2016, and SRE 2018 datasets demonstrate the effectiveness of the implementation.

Recently, speaker embeddings extracted with deep neural networks have become the state-of-the-art method for speaker verification. In this paper we aim to facilitate their implementation in a more generic toolkit than Kaldi, which we anticipate will enable further improvements to the method. We examine several training tricks, such as the effects of normalizing input features and pooled statistics, different methods for preventing overfitting, and alternative non-linearities that can be used instead of Rectified Linear Units. In addition, we investigate the difference in performance between TDNN and CNN, and between two types of attention mechanism. Experimental results on the Speakers in the Wild, SRE 2016 and SRE 2018 datasets demonstrate the effectiveness of the proposed implementation.
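
The abstract does not specify how input features or pooled statistics are normalized, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes per-utterance mean normalization of the input features and the mean-plus-standard-deviation pooling commonly used in DNN speaker embedding extractors. The function names `mean_normalize` and `statistics_pooling`, the array sizes, and the `eps` value are all hypothetical choices made for illustration.

```python
import numpy as np

def mean_normalize(features):
    """Subtract the per-utterance mean from every feature dimension.

    features: array of shape (num_frames, feat_dim).
    Assumption: plain mean normalization over the whole utterance.
    """
    return features - features.mean(axis=0, keepdims=True)

def statistics_pooling(frame_outputs, eps=1e-5):
    """Collapse variable-length frame-level vectors into one fixed-length
    vector by concatenating the per-dimension mean and standard deviation.

    eps is a hypothetical small constant that keeps the square root
    well-behaved when a dimension has zero variance.
    """
    mean = frame_outputs.mean(axis=0)
    std = np.sqrt(frame_outputs.var(axis=0) + eps)
    return np.concatenate([mean, std])

# Made-up example: 200 frames of 24-dimensional features.
rng = np.random.default_rng(0)
utterance = rng.normal(loc=3.0, scale=2.0, size=(200, 24))

normalized = mean_normalize(utterance)
pooled = statistics_pooling(normalized)
print(pooled.shape)  # (48,)
```

In a real extractor the pooling would operate on the outputs of the frame-level layers (for example a TDNN or CNN) rather than directly on the input features; the sketch applies it to the features only to stay self-contained.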
