CL · AI · LG · SD · Jan 30

DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion

arXiv:2601.22889v1
h-index: 9
Originality: Highly original
AI Analysis

The paper addresses errors in generated speech that cannot be corrected once the audio is produced, and it proposes a new paradigm rather than an incremental improvement.

Speech language models that respond without explicit reasoning make errors that cannot be corrected after the audio is produced. The paper introduces a 'Silent Thought, Spoken Answer' paradigm, in which the model generates internal text reasoning alongside its spoken response. It reports state-of-the-art speech-to-speech QA accuracy, up to 9 points above the best baseline, together with a TTS word error rate (WER) of 6.2%.

Current speech language models generate responses directly without explicit reasoning, leading to errors that cannot be corrected once audio is produced. We introduce "Silent Thought, Spoken Answer", a paradigm where speech LLMs generate internal text reasoning alongside spoken responses, with thinking traces informing speech quality. To realize this, we present DiffuSpeech, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, DiffuSpeech jointly generates reasoning traces and speech tokens through iterative denoising, with modality-specific masking schedules. We also construct \dataset{}, the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show DiffuSpeech achieves state-of-the-art speech-to-speech QA accuracy, outperforming the best baseline by up to 9 points, while attaining the best TTS quality among generative models (6.2% WER) and preserving language understanding (66.2% MMLU). Ablations confirm that both the diffusion architecture and thinking traces contribute to these gains.
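
The phrase "modality-specific masking schedules" can be made concrete with a small sketch of the forward (corruption) step of a masked diffusion model over a mixed text and speech token sequence. This is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the schedules, so the linear schedule for text and the quadratic schedule for speech below are hypothetical, as are the names `mask_ratio`, `corrupt`, and the `<mask>` token.

```python
import random

MASK = "<mask>"  # hypothetical mask token shared by both modalities


def mask_ratio(t, modality):
    """Probability of masking a token at noise level t in [0, 1].

    Assumed schedules (not taken from the paper): text is masked
    linearly in t, speech quadratically, so speech tokens are masked
    less aggressively than text at intermediate noise levels.
    """
    if modality == "text":
        return t
    return t ** 2


def corrupt(tokens, modalities, t, rng):
    """Forward process: mask each token independently, using the
    schedule that belongs to that token's modality."""
    return [
        MASK if rng.random() < mask_ratio(t, m) else tok
        for tok, m in zip(tokens, modalities)
    ]


# Toy sequence: three text (reasoning) tokens followed by three
# speech-codec tokens. The token strings are placeholders.
tokens = ["the", "answer", "is", "s12", "s7", "s93"]
modalities = ["text"] * 3 + ["speech"] * 3

rng = random.Random(0)
noisy = corrupt(tokens, modalities, 0.5, rng)
```

At `t = 0` nothing is masked and at `t = 1` every token is masked. A denoiser would be trained to recover the masked positions, and generation would run the process in reverse by unmasking a few positions per step.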

