LG | AI | Dec 16, 2025

Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections

Niklas Lauffer, Xiang Deng, Srivatsa Kundurthy, Brad Kenstler, Jeff Da

arXiv:2512.14895v1 | 5 citations | h-index: 6

Originality: Incremental advance

AI Analysis

The paper addresses training inefficiencies for multi-turn LM agents in domains like software engineering, though it is an incremental improvement over existing methods like DAgger.

The paper tackles covariate shift in imitation learning for multi-turn language model agents by proposing on-policy expert corrections (OECs), a data generation method that combines student rollouts with expert interventions. Results show relative improvements of 14% and 13% over traditional imitation learning on software engineering tasks for the 7B and 32B models, respectively.

A popular paradigm for training LM agents relies on imitation learning, fine-tuning on expert trajectories. However, we show that the off-policy nature of imitation learning for multi-turn LM agents suffers from the fundamental limitation known as covariate shift: as the student policy's behavior diverges from the expert's, it encounters states not present in the training data, reducing the effectiveness of fine-tuning. Taking inspiration from the classic DAgger algorithm, we propose a novel data generation methodology for addressing covariate shift in multi-turn LLM training. We introduce on-policy expert corrections (OECs), partially on-policy data generated by starting rollouts with a student model and then switching to an expert model partway through the trajectory. We explore the effectiveness of our data generation technique in the domain of software engineering (SWE) tasks, a multi-turn setting where LLM agents must interact with a development environment to fix software bugs. Our experiments compare OEC data against various other on-policy and imitation learning approaches on SWE agent problems, and train models using a common technique that combines rejection sampling (i.e., using environment reward) with supervised fine-tuning. Experiments find that OEC trajectories show a relative 14% and 13% improvement over traditional imitation learning in the 7B and 32B settings, respectively, on SWE-bench Verified. Our results demonstrate the need for combining expert demonstrations with on-policy data for effective multi-turn LM agent training.
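The OEC idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`rollout_with_oec`, `build_sft_examples`, `switch_turn`, the `env_reset`/`env_step` callables) is hypothetical, the switch point is passed in as a fixed turn index, and training only on the expert-generated steps of accepted trajectories is an assumption the abstract does not confirm.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interfaces: a policy maps the current state (e.g. the
# conversation so far) to the next action; env_step returns
# (next_state, reward, done).
Policy = Callable[[str], str]
EnvStep = Callable[[str, str], Tuple[str, float, bool]]


@dataclass
class Step:
    state: str
    action: str
    actor: str  # "student" or "expert"


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    reward: float = 0.0


def rollout_with_oec(student: Policy, expert: Policy,
                     env_reset: Callable[[], str], env_step: EnvStep,
                     max_turns: int, switch_turn: int) -> Trajectory:
    """Start the rollout with the student, then hand over to the expert."""
    state = env_reset()
    traj = Trajectory()
    for turn in range(max_turns):
        # Student acts before the switch point, expert acts from it onward.
        if turn < switch_turn:
            actor, policy = "student", student
        else:
            actor, policy = "expert", expert
        action = policy(state)
        traj.steps.append(Step(state, action, actor))
        state, reward, done = env_step(state, action)
        traj.reward = reward  # keep the most recent environment reward
        if done:
            break
    return traj


def build_sft_examples(trajs: List[Trajectory],
                       min_reward: float = 1.0) -> List[Tuple[str, str]]:
    """Rejection sampling followed by extraction of fine-tuning pairs."""
    examples = []
    for traj in trajs:
        if traj.reward < min_reward:
            continue  # reject trajectories the environment did not reward
        for step in traj.steps:
            # Assumption: only expert actions are used as training targets,
            # but the states they respond to were reached by the student.
            if step.actor == "expert":
                examples.append((step.state, step.action))
    return examples
```

The point of the sketch is that the expert's actions are conditioned on states the student actually reached, which is what makes the data partially on-policy. In practice the switch point would presumably be varied across rollouts rather than fixed.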
