ASSDAug 27, 2020

Estimating Uniqueness of I-Vector Representation of Human Voice

arXiv:2008.11985v2
AI Analysis

This work addresses the need for quantifying voice individuality in biometrics, which is incremental as it builds on existing i-vector models with new datasets and a tailored uniqueness measure.

The paper tackled the problem of measuring the uniqueness of human voice using i-vector representations, finding that distinctiveness can reach 43-70 bits for 5-second samples but reduces by about 30 bits under less constrained speech variations, and doubling sample duration increases distinctiveness by around 20 bits.

We study the individuality of the human voice with respect to a widely used feature representation of speech utterances, namely, the i-vector model. As a first step toward this goal, we compare and contrast uniqueness measures proposed for different biometric modalities. Then, we introduce a new uniqueness measure that evaluates the entropy of i-vectors while taking into account speaker level variations. Our measure operates in the discrete feature space and relies on accurate estimation of the distribution of i-vectors. Therefore, i-vectors are quantized while ensuring that both the quantized and original representations yield similar speaker verification performance. Uniqueness estimates are obtained from two newly generated datasets and the public VoxCeleb dataset. The first custom dataset contains more than one and a half million speech samples of 20,741 speakers obtained from TEDx Talks videos. The second one includes over twenty one thousand speech samples from 1,595 actors that are extracted from movie dialogues. Using this data, we analyzed how several factors, such as the number of speakers, number of samples per speaker, sample durations, and diversity of utterances affect uniqueness estimates. Most notably, we determine that the discretization of i-vectors does not cause a reduction in speaker recognition performance. Our results show that the degree of distinctiveness offered by i-vector-based representation may reach 43-70 bits considering 5-second long speech samples; however, under less constrained variations in speech, uniqueness estimates are found to reduce by around 30 bits. We also find that doubling the sample duration increases the distinctiveness of the i-vector representation by around 20 bits.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes