AS · CL · May 28

MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables

Sung-Lin Yeh, Wei Zhou, Gil Keren, Duc Le, Zhong Meng, Hao Tang, Jay Mahadeokar, Ozlem Kalinli, Alexandre Mourachko

arXiv:2605.29859

Predicted impact: top 43% in AS · last 90 days
Originality: Incremental advance

AI Analysis

For speech language modeling, this work addresses the suboptimality of separately optimized encoders by enabling joint optimization, leading to better performance and fewer artifacts.

MELD introduces a discrete latent variable model on mel spectrograms that jointly optimizes the encoder and speech language model, achieving improvements over baselines on zero-shot TTS and STT tasks while alleviating issues like prolonged silence and word omissions.

Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks. To address this limitation, we introduce a discrete latent variable model on mel spectrograms that jointly optimizes the encoder and the speech language model. Joint optimization not only brings improvements over codec-based and other mel-spectrogram-based baselines on zero-shot Text-to-Speech (TTS) and Speech-to-Text (STT) tasks, but also effectively alleviates common issues in autoregressive mel-spectrogram modeling, such as prolonged silence generation and word omissions.
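
One generic way to read "jointly optimizes the encoder and the speech language model" is as a variational lower bound for a discrete latent variable model. The sketch below is an illustrative assumption, not the objective stated by the authors. The symbols are introduced here only for illustration: x is the mel spectrogram, z_{1:T} is the sequence of discrete latents, q_φ is the encoder, and p_θ covers both the autoregressive prior over latents (the speech language model) and the decoder back to mel frames.

```latex
% Assumed, generic evidence lower bound; not taken from the paper.
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z_{1:T} \mid x)} \Big[
    \log p_\theta(x \mid z_{1:T})
    + \sum_{t=1}^{T} \log p_\theta(z_t \mid z_{<t})
    - \log q_\phi(z_{1:T} \mid x)
\Big]
```

Under this reading, the encoder parameters φ and the language-model parameters θ appear in the same bound, so the language-modeling term shapes the encoder's representations instead of being fit to them afterwards.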
