AS CL LG SD
Jul 27, 2020

Evaluating the reliability of acoustic speech embeddings

arXiv:2007.13542v2
32 citations
Originality Synthesis-oriented
AI Analysis

This work addresses the lack of a task-neutral evaluation method for speech embeddings, highlighting the need for better metrics in speech processing research.

The study systematically compared two metrics, ABX discrimination and Mean Average Precision (MAP), across 17 speech embedding methods and 5 languages. The two metrics correlate with each other and with a downstream frequency estimation task, but they diverge substantially in fine-grained distinctions across languages and embedding methods.

Speech embeddings are fixed-size acoustic representations of variable-length speech sequences. They are increasingly used for a variety of tasks ranging from information retrieval to unsupervised term discovery and speech segmentation. However, there is currently no clear methodology to compare or optimise the quality of these embeddings in a task-neutral way. Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods, ranging from supervised to fully unsupervised, and using different loss functions (autoencoders, correspondence autoencoders, siamese). Then we use ABX and MAP to predict performance on a new downstream task: the unsupervised estimation of the frequencies of speech segments in a given corpus. We find that overall, ABX and MAP correlate with one another and with frequency estimation. However, substantial discrepancies appear in the fine-grained distinctions across languages and/or embedding methods. This makes it unrealistic at present to propose a task-independent silver bullet method for computing the intrinsic quality of speech embeddings. There is a need for more detailed analysis of the metrics currently used to evaluate such embeddings.

