CL · LG · Apr 8

Lexical Tone is Hard to Quantize: Probing Discrete Speech Units in Mandarin and Yorùbá

arXiv:2604.07467 · 34.0 · h-index: 3
AI Analysis

The paper addresses a limitation of speech representation learning for tasks where prosody matters, such as text-to-speech. The contribution is incremental, since it builds on existing quantisation methods.

The paper shows that discrete speech units (DSUs) derived from self-supervised learning encode suprasegmental features such as lexical tone less reliably than segmental structure, using Mandarin and Yorùbá as test cases. It then sketches a two-step K-means procedure, clustering once for phonetic information and again on the residual, that encodes tone better.

Discrete speech units (DSUs) are derived by quantising representations from models trained using self-supervised learning (SSL). They are a popular representation for a wide variety of spoken language tasks, including those where prosody matters. DSUs are especially convenient for tasks where text and speech are jointly modelled, such as text-to-speech and multimodal dialogue systems. But we have found that DSUs encode suprasegmental information less reliably than segmental structure, which we demonstrate in this work using lexical tone, though this limitation likely extends to other suprasegmental features such as prosody. Our investigations using the tone languages Mandarin and Yorùbá show that the SSL latent representations themselves do encode tone, yet DSUs obtained using quantisation tend to prioritise phonetic structure, which makes lexical tone less reliably encoded. This remains true for a variety of quantisation methods, not only the most common, K-means. We conclude that current DSU quantisation strategies have limitations for suprasegmental features, which suggests a need for new, tone-aware (or prosody-aware) techniques in speech representation learning. We point towards a potential form of the solution by performing K-means clustering once to encode phonetic information, then again on the residual representation, which better encodes lexical tone.
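The residual idea in the last sentence of the abstract can be illustrated with a small sketch. This is not the authors' implementation: the 2-D synthetic "frames", the cluster counts, the noise level, and the helper names (`kmeans`, `two_step_kmeans`, `agreement`) are all assumptions made for illustration, and real DSUs would be computed from SSL model features rather than toy points. The sketch only shows the mechanism: a first K-means pass captures the dominant (phone-like) structure, and a second pass on what is left after subtracting the assigned centroid can pick up the weaker (tone-like) structure.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iters):
        codes = assign(points, centroids)
        for j in range(k):
            members = [p for p, c in zip(points, codes) if c == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, assign(points, centroids)


def two_step_kmeans(points, k_first, k_second):
    """Cluster once, subtract each point's centroid, then cluster the residuals."""
    centroids, first_codes = kmeans(points, k_first)
    residuals = [tuple(x - c for x, c in zip(p, centroids[code]))
                 for p, code in zip(points, first_codes)]
    _, second_codes = kmeans(residuals, k_second)
    return first_codes, second_codes


def agreement(codes, labels):
    """Fraction of matching frames, allowing the two code ids to be swapped."""
    same = sum(c == l for c, l in zip(codes, labels)) / len(labels)
    return max(same, 1 - same)


# Hypothetical toy data: a large "phone" offset plus a small "tone" offset.
rng = random.Random(0)
phone_centres = [(0.0, 0.0), (10.0, 0.0)]
tone_offsets = [(0.0, 1.0), (0.0, -1.0)]
frames, phones, tones = [], [], []
for _ in range(200):
    p, t = rng.randrange(2), rng.randrange(2)
    frames.append(tuple(phone_centres[p][d] + tone_offsets[t][d]
                        + rng.gauss(0, 0.2) for d in range(2)))
    phones.append(p)
    tones.append(t)

phone_codes, tone_codes = two_step_kmeans(frames, k_first=2, k_second=2)
print("first pass vs phone labels:", agreement(phone_codes, phones))
print("first pass vs tone labels: ", agreement(phone_codes, tones))
print("second pass vs tone labels:", agreement(tone_codes, tones))
```

In this toy setting the first pass follows the phone labels and carries little tone information, while the second pass on the residuals follows the tone labels. Whether the same separation holds for real SSL features depends on how tone and phonetic content are arranged in the representation, which is the question the paper investigates.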
