SD · LG · AS · Oct 1, 2019

Latent space representation for multi-target speaker detection and identification with a sparse dataset using Triplet neural networks

arXiv:1910.01463v2 · 2 citations
AI Analysis

This work addresses speaker recognition for applications with limited data, but it is incremental as it applies an existing neural network method to a specific dataset.

The paper tackles speaker recognition on a sparse dataset with only 3 samples per speaker, using Triplet Neural Networks to build a latent space. When trained on both the train and development sets, the model outperforms the baseline by 23%. When trained on the train set alone, it performs 46% better than the baseline on multi-target speaker identification.

We present an approach to tackle the speaker recognition problem using Triplet Neural Networks. Currently, the $i$-vector representation with probabilistic linear discriminant analysis (PLDA) is the most commonly used technique to solve this problem, due to its high classification accuracy and relatively short computation time. In this paper, we explore a neural network approach, namely Triplet Neural Networks (TNNs), to build a latent space for different classifiers applied to the Multi-Target Speaker Detection and Identification Challenge Evaluation 2018 (MCE 2018) dataset. The training set contains $i$-vectors from 3,631 speakers, with only 3 samples for each speaker, which makes speaker recognition a challenging task. When using the train and development sets for training both the TNN and the baseline model (i.e., similarity evaluation directly on the $i$-vector representation), our proposed model outperforms the baseline by 23%. When reducing the training data to the train set only, our method results in 309 confusions for the multi-target speaker identification task, which is 46% better than the baseline model. These results show that the representational power of TNNs is especially evident when training on small datasets with few instances available per class.
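To make the triplet idea concrete, here is a minimal sketch of how triplets can be formed from speaker labels and scored with the standard margin-based triplet loss. It assumes squared Euclidean distance and a margin of 1.0; the abstract does not state the paper's actual loss, distance, margin, network architecture, or triplet-sampling strategy, so `triplet_loss`, `sample_triplets`, and all parameter values are hypothetical illustrations rather than the authors' implementation.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss, averaged over a batch.

    anchor, positive, negative: arrays of shape (batch, dim).
    margin=1.0 is an assumed value, not taken from the paper.
    """
    # Squared Euclidean distance from each anchor to a same-speaker sample.
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    # Squared Euclidean distance from each anchor to a different-speaker sample.
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    # A triplet contributes loss unless the negative is at least
    # `margin` farther from the anchor than the positive.
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


def sample_triplets(labels, rng):
    """Pick one random (anchor, positive, negative) index triple per sample.

    Random sampling is an assumption; the paper may mine triplets differently.
    """
    labels = np.asarray(labels)
    indices = np.arange(len(labels))
    triplets = []
    for a, lab in enumerate(labels):
        # Candidates with the same speaker label, excluding the anchor itself.
        pos = indices[(labels == lab) & (indices != a)]
        # Candidates from any other speaker.
        neg = indices[labels != lab]
        if len(pos) == 0 or len(neg) == 0:
            continue  # no valid triplet for this anchor
        triplets.append((a, int(rng.choice(pos)), int(rng.choice(neg))))
    return np.array(triplets)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in for i-vectors: 4 speakers, 3 samples each, 8 dimensions.
    labels = np.repeat(np.arange(4), 3)
    vectors = rng.normal(size=(len(labels), 8))
    t = sample_triplets(labels, rng)
    loss = triplet_loss(vectors[t[:, 0]], vectors[t[:, 1]], vectors[t[:, 2]])
    print(f"{len(t)} triplets, loss = {loss:.3f}")
```

In a full model, the three inputs would be the network's embeddings of the $i$-vectors rather than the raw vectors, and the loss would be minimized by gradient descent. With only 3 samples per speaker, each anchor has just 2 possible positives, which is one reason the choice of negatives matters in this setting.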

Code Implementations: 1 repo