AS · CL · Jan 29

Sylber 2.0: A Universal Syllable Embedding

arXiv:2601.22306v1 · 1 citations · h-index: 25
Originality: Incremental advance
AI Analysis

This paper addresses the need for efficient and universal speech tokens in spoken language modeling, offering improvements for tasks like TTS and low-resource ASR, though it builds incrementally on prior syllable-based approaches.

The paper tackles the problem of scaling spoken language modeling by proposing Sylber 2.0, a self-supervised framework for syllable-level speech coding that achieves a low token frequency of around 5 Hz and enables efficient TTS modeling with 72M parameters, performing on par with high-frequency baselines.

Scaling spoken language modeling requires speech tokens that are both efficient and universal. Recent work has proposed syllables as promising speech tokens at low temporal resolution, but existing models are constrained to English and fail to capture sufficient acoustic detail. To address this gap, we present Sylber 2.0, a self-supervised framework for coding speech at the syllable level that enables efficient temporal compression and high-fidelity reconstruction. Sylber 2.0 achieves a very low token frequency of around 5 Hz, while retaining both linguistic and acoustic detail across multiple languages and expressive styles. Experiments show that it performs on par with previous models operating on high-frequency baselines. Furthermore, Sylber 2.0 enables efficient TTS modeling, which can generate speech with intelligibility and quality competitive with SOTA models using only 72M parameters. Moreover, the universality of Sylber 2.0 provides more effective features for low-resource ASR than previous speech coding frameworks. In sum, we establish an effective syllable-level abstraction for general spoken language.

