Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation
For AI speech systems, this method enables real-time reasoning during speech generation, improving both accuracy and naturalness over prior approaches.
The paper introduces InterRS, a method for real-time speech generation that interleaves reasoning steps with speech, achieving 13% better performance on math and logic benchmarks while maintaining fluent, instant responses.
The thinking-while-speaking paradigm aims to make AI communication more human-like. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during natural speech generation. This requires high-quality data in which reasoning and speech are precisely aligned and their length ratio is controlled. We introduce a novel pipeline to generate such seamlessly interleaved audio data. To train our model, we combine interleaved SFT on the refined data with reinforcement learning using two new rewards: a TA-Balance Reward to manage timing and the thinking-answer ratio, and a Linguistic Quality Reward to refine expression. Experiments show that our approach achieves 13% better performance on mathematical and logic benchmarks while responding instantly, like a spoken-language instruct model that outputs a fast CoT response. Furthermore, our method generates more natural and fluent answers than prior methods.
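To make the idea of a reward that balances timing and the thinking-answer ratio more concrete, here is a minimal toy sketch. It is not the TA-Balance Reward defined in the paper, whose formulation is not given in this abstract. The `Segment` type, the `"think"`/`"speak"` labels, the `target_ratio` and `max_lead_think` parameters, and the equal weighting of the two terms are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    kind: str      # hypothetical labels: "think" or "speak"
    n_tokens: int  # number of tokens in this segment


def toy_balance_reward(segments: List[Segment],
                       target_ratio: float = 1.0,
                       max_lead_think: int = 20) -> float:
    """Illustrative reward in [0, 1]; NOT the paper's TA-Balance Reward.

    Combines two assumed terms:
      - a ratio term that is 1 when think/speak tokens match target_ratio
        and decays linearly with the relative deviation;
      - a timing term that penalizes thinking tokens emitted before the
        first spoken segment once they exceed max_lead_think.
    """
    think = sum(s.n_tokens for s in segments if s.kind == "think")
    speak = sum(s.n_tokens for s in segments if s.kind == "speak")
    if speak == 0:
        return 0.0  # nothing was spoken, so there is no answer to reward

    ratio = think / speak
    ratio_term = max(0.0, 1.0 - abs(ratio - target_ratio) / target_ratio)

    # Count thinking tokens that delay the first spoken segment.
    lead = 0
    for s in segments:
        if s.kind == "speak":
            break
        lead += s.n_tokens
    timing_term = 1.0 if lead <= max_lead_think else max_lead_think / lead

    # Equal weighting is an arbitrary assumption.
    return 0.5 * ratio_term + 0.5 * timing_term
```

Under these assumptions, a response that opens with a long silent reasoning span scores lower than one with the same total thinking spread between spoken segments, which is the kind of behaviour a timing-aware reward would be expected to encourage.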