CL, May 20

Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation

Xuan Du, Qiangyu Yan, Wenshuo Li, Borui Jiang, Changming Xiao, Han Shu, Xinghao Chen

arXiv:2605.20946

AI Analysis

For AI speech systems, this method enables real-time reasoning during speech generation, improving both accuracy and naturalness over prior approaches.

The paper introduces InterRS, a method for real-time speech generation that interleaves reasoning steps with speech, achieving 13% better performance on math and logic benchmarks while maintaining fluent, instant responses.

The thinking-while-speaking paradigm aims to make AI communication more human. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during natural speech generation. This requires high-quality data in which reasoning and speech are precisely aligned and their length ratio is controlled. We introduce a novel pipeline to generate such seamlessly interleaved audio data. To train our model, we combine interleaved SFT on refined data with reinforcement learning using two new rewards: a TA-Balance Reward to manage timing and the thinking-answer ratio, and a Linguistic Quality Reward to refine expression. Experiments show that our approach achieves 13% better performance on mathematical and logic benchmarks while responding instantly, like a spoken-language instruct model that outputs fast CoT responses. Furthermore, our method generates more natural and fluent answers than prior methods.
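
The abstract does not define the TA-Balance Reward, so the following is only a toy sketch of what a thinking-answer ratio term might look like, not the authors' formulation. The `Segment` type, the `"think"`/`"speak"` labels, the `target` and `tolerance` parameters, and the linear falloff are all assumptions made for illustration. The sketch covers only the ratio and ignores the timing component the abstract also mentions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    kind: str    # hypothetical label: "think" or "speak"
    tokens: int  # number of tokens in this segment


def ta_ratio(segments: List[Segment]) -> float:
    """Ratio of thinking tokens to spoken (answer) tokens."""
    think = sum(s.tokens for s in segments if s.kind == "think")
    speak = sum(s.tokens for s in segments if s.kind == "speak")
    if speak == 0:
        return float("inf")  # nothing was spoken at all
    return think / speak


def toy_balance_reward(
    segments: List[Segment], target: float = 1.0, tolerance: float = 0.5
) -> float:
    """Toy reward in [0, 1]: 1 at the target ratio, falling linearly to 0.

    `target` and `tolerance` are illustrative values, not from the paper.
    """
    ratio = ta_ratio(segments)
    if ratio == float("inf"):
        return 0.0
    deviation = abs(ratio - target)
    return max(0.0, 1.0 - deviation / tolerance)


balanced = [Segment("speak", 10), Segment("think", 10),
            Segment("speak", 10), Segment("think", 10)]
overthinking = [Segment("think", 30), Segment("speak", 10)]

print(toy_balance_reward(balanced))
print(toy_balance_reward(overthinking))
```

Under these assumptions, a response that interleaves equal amounts of thinking and speech scores highest, while one that thinks three times as long as it speaks scores zero.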
