CL May 20

Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation

arXiv:2605.20946 51.2
AI Analysis

For AI speech systems, this method enables real-time reasoning during speech generation, improving both accuracy and naturalness over prior approaches.

The paper introduces InterRS, a method for real-time speech generation that interleaves reasoning steps with speech, achieving 13% better performance on math and logic benchmarks while maintaining fluent, instant responses.

The thinking-while-speaking paradigm aims to make AI communication more human-like. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during natural speech generation. This requires high-quality data in which reasoning and speech are precisely aligned and their length ratio is controlled. We introduce a novel pipeline to generate such seamlessly interleaved audio data. To train our model, we combine interleaved SFT on refined data with reinforcement learning using two new rewards: a TA-Balance Reward to manage timing and the thinking-answer ratio, and a Linguistic Quality Reward to refine expression. Experiments show that our approach achieves 13% better performance on mathematical and logic benchmarks while responding instantly, like a spoken-language instruct model that outputs a fast CoT response. Furthermore, our method generates more natural and fluent answers than prior methods.
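The abstract names a TA-Balance Reward but does not define it. As a rough illustration only, the sketch below shows one way a reward could score both the thinking-answer length ratio and how soon speech starts. The `Segment` type, the `balance_reward` function, the equal weighting, and the `target_ratio` and `max_delay` parameters are all assumptions made for this example, not the paper's formulation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    kind: str       # "think" or "speak" (hypothetical labels)
    n_tokens: int   # length of the segment in tokens


def ta_ratio(segments: List[Segment]) -> float:
    """Ratio of thinking tokens to spoken (answer) tokens."""
    think = sum(s.n_tokens for s in segments if s.kind == "think")
    speak = sum(s.n_tokens for s in segments if s.kind == "speak")
    return think / speak if speak else float("inf")


def first_speech_delay(segments: List[Segment]) -> int:
    """Number of tokens generated before the first spoken segment."""
    delay = 0
    for s in segments:
        if s.kind == "speak":
            return delay
        delay += s.n_tokens
    return delay


def balance_reward(segments: List[Segment],
                   target_ratio: float = 1.0,
                   max_delay: int = 8) -> float:
    """Hypothetical reward in [0, 1]: assumed, not taken from the paper."""
    ratio = ta_ratio(segments)
    if ratio == float("inf"):
        return 0.0  # no speech at all
    # Penalise deviation of the thinking-answer ratio from the target.
    ratio_term = max(0.0, 1.0 - abs(ratio - target_ratio) / target_ratio)
    # Penalise long silent thinking before the first spoken token.
    delay = first_speech_delay(segments)
    delay_term = 1.0 if delay <= max_delay else max_delay / delay
    return 0.5 * ratio_term + 0.5 * delay_term


interleaved = [Segment("speak", 10), Segment("think", 10),
               Segment("speak", 10), Segment("think", 10)]
front_loaded = [Segment("think", 40), Segment("speak", 10)]
```

Under these assumed settings, `balance_reward(interleaved)` scores higher than `balance_reward(front_loaded)`, because the interleaved sequence starts speaking immediately and keeps thinking and speech at equal length.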

