AS · CL · May 28

MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables

arXiv:2605.29859
AI Analysis

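Speech language models typically use encoders that are optimized separately from the autoregressive model, which can leave the representations suboptimal for downstream tasks. This work optimizes the two jointly, improving performance and reducing generation failures such as prolonged silence and word omissions.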

MELD introduces a discrete latent variable model on mel spectrograms that jointly optimizes the encoder and speech language model, achieving improvements over baselines on zero-shot TTS and STT tasks while alleviating issues like prolonged silence and word omissions.

Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks. To address this limitation, we introduce a discrete latent variable model on mel spectrograms that jointly optimizes the encoder and the speech language model. Joint optimization not only brings improvements over codec-based and other mel-spectrogram-based baselines on zero-shot Text-to-Speech (TTS) and Speech-to-Text (STT) tasks, but also effectively alleviates common issues in autoregressive mel-spectrogram modeling, such as prolonged silence generation and word omissions.
