CL LG Oct 12, 2019

vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations

arXiv:1910.05453v3, 731 citations
Originality: Highly original
AI Analysis

This work addresses the challenge of applying NLP techniques to speech processing by converting continuous audio into discrete tokens. The contribution is incremental in that it builds on the existing wav2vec and BERT frameworks.

The paper tackles the problem of learning discrete speech representations from audio. It proposes vq-wav2vec, which combines a wav2vec-style self-supervised context prediction task with a quantization step (Gumbel-Softmax or online k-means), so that NLP algorithms requiring discrete inputs can be applied directly. BERT pre-training on these representations achieved a new state of the art on TIMIT phoneme classification and WSJ speech recognition.

We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task. The algorithm uses either a gumbel softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community which require discrete inputs. Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
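
To make the two quantization options in the abstract concrete, here is a minimal NumPy sketch of the forward pass only. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy codebook, and the temperature parameter `tau` are hypothetical, and training details such as gradient estimation and codebook updates are left out.

```python
import numpy as np


def kmeans_quantize(z, codebook):
    """Assign each dense vector in z (T, d) to its nearest codebook row (V, d)."""
    # Squared Euclidean distance between every frame and every codeword: (T, V).
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]


def gumbel_softmax_quantize(logits, codebook, tau=1.0, rng=None):
    """Pick one codeword per frame from logits (T, V) perturbed by Gumbel noise."""
    rng = np.random.default_rng() if rng is None else rng
    # Clip the uniform samples so both logarithms below stay finite.
    u = np.clip(rng.uniform(size=logits.shape), 1e-9, 1.0 - 1e-9)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / tau
    # Numerically stable softmax over the codebook dimension.
    y = y - y.max(axis=1, keepdims=True)
    probs = np.exp(y) / np.exp(y).sum(axis=1, keepdims=True)
    # Hard selection: the discrete token is the argmax of the noisy scores.
    indices = probs.argmax(axis=1)
    return indices, codebook[indices], probs


# Toy data (hypothetical): three codewords in 2-D and three dense frames.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
z = np.array([[0.1, -0.1], [0.9, 1.2], [3.5, 4.2]])

km_indices, km_vectors = kmeans_quantize(z, codebook)

# Logits with a large margin, so the Gumbel noise cannot change the winner.
logits = np.array([[100.0, 0.0, 0.0], [0.0, 100.0, 0.0]])
gs_indices, gs_vectors, gs_probs = gumbel_softmax_quantize(
    logits, codebook, tau=1.0, rng=np.random.default_rng(0)
)
```

In both cases the output is a sequence of integer indices, one per audio frame. These indices play the role of word tokens, which is what lets a model such as BERT consume them.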

Code Implementations: 3 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
