CL · IT · LG · Oct 24, 2025

Multi-turn Training with Basic Human Feedback Helps Little on LLM Reasoning

arXiv:2510.21339v2
h-index: 14
Originality: Incremental advance
AI Analysis

This work addresses the mismatch between training and deployment conditions for LLMs on reasoning tasks. It shows that multi-turn training is not beneficial and can even be harmful, an incremental advance that nonetheless challenges prior assumptions.

The study investigates whether multi-turn training with human feedback improves LLM reasoning. It finds that single-turn training generalizes better to both single- and multi-turn evaluations, while multi-turn training strategies degrade single-turn performance.

The reasoning capabilities of Large Language Models (LLMs) are typically developed through single-turn reinforcement learning, whereas real-world applications often involve multi-turn interactions with human feedback, leading to a potential mismatch between training and deployment conditions. In this work, we study whether multi-turn training with human feedback is necessary for reasoning tasks. We compare conventional single-turn training with three multi-turn strategies and reach conclusions contrary to previous research. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations, while models trained with multi-turn strategies exhibit a significant degradation in single-turn reasoning performance. These results suggest that for tasks with complete information, robust single-turn training remains more effective and reliable, as multi-turn training with basic feedback provides limited benefits and can even degrade reasoning capabilities.
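
To make the single-turn versus multi-turn distinction concrete, here is a minimal sketch of how the two evaluation settings could differ. It is an illustration under stated assumptions, not the paper's evaluation code: the feedback string, the exact-match check, the function names, and the toy policy are all hypothetical. Single-turn evaluation is treated as the special case max_turns=1.

```python
from typing import Callable, List, Tuple

# Hypothetical feedback text; the abstract only says the feedback is "basic".
BASIC_FEEDBACK = "Your answer is incorrect. Please try again."

Message = Tuple[str, str]                # (role, text)
Policy = Callable[[List[Message]], str]  # maps a conversation to an answer


def solved_within(policy: Policy, question: str, reference: str,
                  max_turns: int) -> bool:
    """Return True if the policy gives the reference answer within max_turns."""
    history: List[Message] = [("user", question)]
    for _ in range(max_turns):
        answer = policy(history)
        # Assumption: correctness is judged by exact string match.
        if answer.strip() == reference.strip():
            return True
        # On a wrong answer, append it plus the basic feedback and retry.
        history.append(("assistant", answer))
        history.append(("user", BASIC_FEEDBACK))
    return False


def accuracy(policy: Policy, dataset: List[Tuple[str, str]],
             max_turns: int) -> float:
    """Fraction of problems solved within max_turns (1 = single-turn)."""
    if not dataset:
        return 0.0
    hits = sum(solved_within(policy, q, ref, max_turns) for q, ref in dataset)
    return hits / len(dataset)


def toy_policy(history: List[Message]) -> str:
    """Stand-in for a model: wrong at first, correct once it has seen feedback."""
    return "4" if len(history) > 1 else "5"


if __name__ == "__main__":
    data = [("What is 2 + 2?", "4")]
    print("single-turn accuracy:", accuracy(toy_policy, data, max_turns=1))
    print("multi-turn accuracy:", accuracy(toy_policy, data, max_turns=2))
```

Under this framing, a model can be scored in both settings with the same loop, which is the kind of comparison the abstract describes between single-turn and multi-turn evaluations.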