CL · Mar 29, 2022

Autoregressive Co-Training for Learning Discrete Speech Representations

arXiv:2203.15840v2 · 8 citations
AI Analysis

This work clarifies how existing self-supervised methods for discrete speech representation learning relate to one another and improves on their performance, though it appears incremental because it builds on and compares against those methods.

The paper proposes a generative model with discrete latent variables for learning discrete speech representations. The model is trained with an information-theoretic co-training objective that subsumes HuBERT-like training and vector quantization, and the learned representation correlates more strongly with phonetic units than either of those approaches.

While several self-supervised approaches for learning discrete speech representation have been proposed, it is unclear how these seemingly similar approaches relate to each other. In this paper, we consider a generative model with discrete latent variables that learns a discrete representation for speech. The objective of learning the generative model is formulated as information-theoretic co-training. Besides the wide generality, the objective can be optimized with several approaches, subsuming HuBERT-like training and vector quantization for learning discrete representation. Empirically, we find that the proposed approach learns discrete representation that is highly correlated with phonetic units, more correlated than HuBERT-like training and vector quantization.

Code Implementations: 1 repo