CL IT LGOct 24, 2025

Multi-turn Training with Basic Human Feedback Helps Little on LLM Reasoning

Qiang Liu, Wuganjing Song, Zhenzhou Lin, Feifan Chen, Qiaolong Cai, Chen Li, Yongduo Sui

arXiv:2510.21339v22.7h-index: 14

Originality Incremental advance

AI Analysis

This work addresses the mismatch between training and deployment for LLMs in reasoning tasks, showing that multi-turn training is not beneficial and can be harmful, which is incremental as it challenges prior assumptions.

The study investigated whether multi-turn training with human feedback improves LLM reasoning, finding that single-turn training generalizes better to both single- and multi-turn evaluations, while multi-turn strategies degrade single-turn performance.

The reasoning capabilities of Large Language Models (LLMs) are typically developed through the single-turn reinforcement learning, whereas real-world applications often involve multi-turn interactions with human feedback, leading to a potential mismatch between training and deployment conditions. In this work, we study whether multi-turn training with human feedback is necessary for reasoning tasks. We compare conventional single-turn training with three multi-turn strategies and reach contrary conclusions to previous research. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations, while models trained with multi-turn strategies exhibit a significant degradation in single-turn reasoning performance. These results suggest that for tasks with complete information, robust single-turn training remains more effective and reliable, as multi-turn training with basic feedback provides limited benefits and can even degrade reasoning capabilities.

View on arXiv PDF

Similar