CL, AS · Jul 27, 2025

ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models

arXiv:2507.20091v2 · 1 citation · h-index: 44
Originality: Incremental advance
AI Analysis

The work addresses a key limitation of speech language models, namely weak prosody handling, which matters for applications such as natural speech generation and understanding. It is, however, an incremental improvement over existing tokenization methods.

The paper tackles the failure of speech language models to capture prosody effectively. It finds that ProsodyLM, which represents each utterance as transcribed text followed by word-level prosody tokens, learns diverse prosody processing capabilities through pre-training alone, such as handling contrastive focus and understanding emotion.

Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody. The existing mainstream paradigm of training speech language models, which converts speech into discrete tokens before feeding them into LLMs, is sub-optimal in learning prosody information -- we find that the resulting LLMs do not exhibit obvious emerging prosody processing capabilities via pre-training alone. To overcome this, we propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. Each speech utterance is first transcribed into text, followed by a sequence of word-level prosody tokens. Compared with conventional speech tokenization schemes, the proposed tokenization scheme retains more complete prosody information, and is more understandable to text-based LLMs. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone, ranging from harnessing the prosody nuances in generated speech, such as contrastive focus, understanding emotion and stress in an utterance, to maintaining prosody consistency in long contexts.
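The tokenization idea in the abstract (transcribed text first, then a sequence of word-level prosody tokens) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the choice of per-word pitch and duration as features, the value ranges, the number of bins, the `<prosody>` separator, and the names `Word`, `quantize`, and `build_sequence` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One transcribed word with hypothetical word-level prosody features."""
    text: str
    pitch_hz: float    # assumed feature: mean pitch over the word
    duration_s: float  # assumed feature: word duration in seconds


def quantize(value: float, lo: float, hi: float, n_bins: int) -> int:
    """Map a value to a bin index in [0, n_bins - 1], clamping out-of-range input."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    frac = (value - lo) / (hi - lo)
    return max(0, min(n_bins - 1, int(frac * n_bins)))


def build_sequence(words: List[Word], n_bins: int = 8) -> List[str]:
    """Return the text tokens, a separator, then two prosody tokens per word.

    The ranges (50-400 Hz, 0-1 s) and the token spellings are illustrative
    assumptions, not the scheme used by ProsodyLM.
    """
    text_part = [w.text for w in words]
    prosody_part: List[str] = []
    for w in words:
        pitch_bin = quantize(w.pitch_hz, 50.0, 400.0, n_bins)
        dur_bin = quantize(w.duration_s, 0.0, 1.0, n_bins)
        prosody_part.append(f"<pitch_{pitch_bin}>")
        prosody_part.append(f"<dur_{dur_bin}>")
    return text_part + ["<prosody>"] + prosody_part


if __name__ == "__main__":
    utterance = [
        Word("I", 120.0, 0.1),
        Word("said", 180.0, 0.25),
        Word("RED", 300.0, 0.5),  # higher pitch, longer duration: emphasized word
    ]
    print(" ".join(build_sequence(utterance)))
```

In this sketch the emphasized word ends up with higher pitch and duration bins than its neighbours, which is the kind of word-aligned signal a text-based LLM could in principle relate to contrastive focus. The actual features and quantization used by ProsodyLM may differ.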

