AI · Apr 11, 2022

Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning

arXiv:2204.05148v2 · 9 citations · h-index: 37
Originality: Incremental advance
AI Analysis

This work targets two speech processing tasks, query-by-example and spoken term discovery, across multiple languages. It is an incremental advance that builds on existing self-supervised audio representations.

The paper learns speech sequence embeddings with a neural encoder trained on an unsupervised contrastive objective whose positive samples come from a data-augmented k-nearest-neighbors search. It reports state-of-the-art results on query-by-example and spoken term discovery across five languages, by a significant margin.

We introduce a simple neural encoder architecture that can be trained using an unsupervised contrastive learning objective which gets its positive samples from data-augmented k-Nearest Neighbors search. We show that when built on top of recent self-supervised audio representations, this method can be applied iteratively and yield competitive speech sequence embeddings (SSE) as evaluated on two tasks: query-by-example of random sequences of speech, and spoken term discovery. On both tasks our method pushes the state-of-the-art by a significant margin across 5 different languages. Finally, we establish a benchmark on a query-by-example task on the LibriSpeech dataset to monitor future improvements in the field.
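
To make the training signal concrete, here is a minimal sketch of mining positives with a k-nearest-neighbors search and scoring them with a contrastive loss. It is an illustration under assumptions, not the paper's implementation: the abstract does not specify the loss, so the widely used InfoNCE objective is swapped in; the data-augmentation step and the encoder itself are omitted; and the function names (`knn_positives`, `info_nce_loss`), the temperature value, and the random stand-in embeddings are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def knn_positives(embeddings, k):
    """For each row, return the indices of its k nearest neighbours
    by cosine similarity, excluding the row itself."""
    z = l2_normalize(embeddings)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)  # a segment is never its own neighbour
    return np.argsort(-sim, axis=1)[:, :k]


def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE (assumed here): row i of `positives` is the positive for
    anchor i, and every other row in the batch acts as a negative."""
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = (a @ p.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Random vectors stand in for pooled speech-segment embeddings (hypothetical data).
rng = np.random.default_rng(0)
segments = rng.normal(size=(8, 16))

neighbours = knn_positives(segments, k=2)
positives = segments[neighbours[:, 0]]  # closest neighbour as the positive
loss = info_nce_loss(segments, positives)
```

In an iterative scheme like the one the abstract describes, the embeddings produced after one round of training would be fed back into the neighbour search to mine positives for the next round.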

