CL | Jun 4, 2023

An Information-Theoretic Analysis of Self-supervised Discrete Representations of Speech

arXiv:2306.02405v1 | 18 citations | h-index: 35
Originality: Synthesis-oriented
AI Analysis

This paper offers researchers insight into speech representation learning, but its contribution is incremental: it analyzes existing models without introducing new methods.

The paper asks how discrete units from self-supervised speech models relate to phonetic categories. It develops an information-theoretic framework in which each phonetic category is a distribution over discrete units, and finds that the entropy of these distributions reflects the variability of the underlying speech sounds, with phonetically similar sounds having similar distributions. No direct one-to-one correspondence between units and categories exists.

Self-supervised representation learning for speech often involves a quantization step that transforms the acoustic input into discrete units. However, it remains unclear how to characterize the relationship between these discrete units and abstract phonetic categories such as phonemes. In this paper, we develop an information-theoretic framework whereby we represent each phonetic category as a distribution over discrete units. We then apply our framework to two different self-supervised models (namely wav2vec 2.0 and XLSR) and use American English speech as a case study. Our study demonstrates that the entropy of phonetic distributions reflects the variability of the underlying speech sounds, with phonetically similar sounds exhibiting similar distributions. While our study confirms the lack of direct, one-to-one correspondence, we find an intriguing, indirect relationship between phonetic categories and discrete units.
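
The core idea in the abstract, representing each phonetic category as a distribution over discrete units and then comparing entropies and distributions, can be illustrated with a minimal sketch. This is not the authors' implementation. The toy `(phone, unit)` pairs, the function names, and the use of Jensen-Shannon divergence as the similarity measure are all assumptions made for illustration; the abstract does not say which divergence the paper uses.

```python
import math
from collections import Counter, defaultdict


def phone_distributions(pairs):
    """Estimate P(unit | phone) by relative frequency from (phone, unit) pairs."""
    counts = defaultdict(Counter)
    for phone, unit in pairs:
        counts[phone][unit] += 1
    dists = {}
    for phone, unit_counts in counts.items():
        total = sum(unit_counts.values())
        dists[phone] = {u: n / total for u, n in unit_counts.items()}
    return dists


def entropy(dist):
    """Shannon entropy in bits of a distribution given as {unit: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (assumed measure, bounded in [0, 1])."""
    units = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the union of both supports.
    m = {u: 0.5 * (p.get(u, 0.0) + q.get(u, 0.0)) for u in units}

    def kl_to_mixture(a):
        return sum(a[u] * math.log2(a[u] / m[u]) for u in a if a[u] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Hypothetical toy alignment: each frame labelled with a phone and a unit ID.
toy_pairs = [
    ("s", 1), ("s", 1), ("s", 1), ("s", 2),
    ("z", 1), ("z", 2), ("z", 2), ("z", 2),
    ("a", 7), ("a", 8), ("a", 9), ("a", 10),
]

dists = phone_distributions(toy_pairs)
for phone, dist in sorted(dists.items()):
    print(phone, round(entropy(dist), 3))
print("JS(s, z) =", round(js_divergence(dists["s"], dists["z"]), 3))
print("JS(s, a) =", round(js_divergence(dists["s"], dists["a"]), 3))
```

In this toy data, "s" and "z" share units and so have a lower divergence than "s" and "a", whose unit sets are disjoint. A phone spread evenly over many units (here "a") has higher entropy than one concentrated on a few. In practice the pairs would come from a frame-level phonetic alignment matched against the quantizer output of a model such as wav2vec 2.0.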

Code Implementations: 1 repo
