CL · Sep 18, 2025

Llama-Mimi: Speech Language Models with Interleaved Semantic and Acoustic Tokens

arXiv:2509.14882v1 · 1 citation · h-index: 8
Originality: Incremental advance
AI Analysis

This work targets speech generation for applications that require high acoustic fidelity, but it is an incremental improvement on existing token-based methods.

The paper addresses the joint modeling of semantic and acoustic tokens in speech generation. It proposes Llama-Mimi, a speech language model that achieves state-of-the-art acoustic consistency and can preserve speaker identity.

We propose Llama-Mimi, a speech language model that uses a unified tokenizer and a single Transformer decoder to jointly model sequences of interleaved semantic and acoustic tokens. Comprehensive evaluation shows that Llama-Mimi achieves state-of-the-art performance in acoustic consistency and possesses the ability to preserve speaker identity. Our analysis further demonstrates that increasing the number of quantizers improves acoustic fidelity but degrades linguistic performance, highlighting the inherent challenge of maintaining long-term coherence. We additionally introduce an LLM-as-a-Judge-based evaluation to assess the spoken content quality of generated outputs. Our models, code, and speech samples are publicly available.
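As a rough illustration of what "interleaved semantic and acoustic tokens" can look like in practice, the sketch below flattens per-frame codec codes into a single token sequence that one decoder could model. The function names, the frame-major ordering, the per-level vocabulary offset, and the assumption that quantizer 0 carries the semantic token are all illustrative assumptions, not details confirmed by the abstract.

```python
from typing import List


def interleave(codes: List[List[int]], codebook_size: int) -> List[int]:
    """Flatten per-frame codes, shaped [frames][quantizers], into one sequence.

    Assumption: frame-major order, so each frame contributes its level-0
    (assumed semantic) token first, followed by its acoustic tokens.
    Each quantizer level gets its own ID range via an offset, so a single
    decoder vocabulary can tell the levels apart.
    """
    tokens: List[int] = []
    for frame in codes:
        for level, code in enumerate(frame):
            if not 0 <= code < codebook_size:
                raise ValueError(f"code {code} outside codebook of size {codebook_size}")
            tokens.append(level * codebook_size + code)
    return tokens


def deinterleave(tokens: List[int], num_quantizers: int, codebook_size: int) -> List[List[int]]:
    """Invert interleave(): recover per-frame codes from the flat sequence."""
    if len(tokens) % num_quantizers != 0:
        raise ValueError("sequence length is not a multiple of num_quantizers")
    frames: List[List[int]] = []
    for start in range(0, len(tokens), num_quantizers):
        frame: List[int] = []
        for level in range(num_quantizers):
            found_level, code = divmod(tokens[start + level], codebook_size)
            if found_level != level:
                raise ValueError(f"token at position {start + level} has the wrong level")
            frame.append(code)
        frames.append(frame)
    return frames


# Two frames, two quantizers each, toy codebook of size 10.
flat = interleave([[1, 2], [3, 4]], codebook_size=10)
print(flat)  # [1, 12, 3, 14]
```

Under this layout the flat sequence has length frames × quantizers, so every added quantizer lengthens the sequence the decoder must model for the same audio duration. That is consistent with the trade-off the abstract reports between acoustic fidelity and linguistic performance, though the abstract itself does not attribute the effect to sequence length alone.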
