CL AI May 28

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

arXiv:2605.3025178.9
AI Analysis

Addresses multi-turn inconsistency in LLMs, a practical problem for conversational AI; method is novel and shows strong cross-domain generalization.

LLMs often fail when instructions are spread across turns, a gap the authors attribute to self-anchored drift. CCOPD aligns the multi-turn student's behavior with a full-context teacher, achieving a 32% average relative improvement in RAW-SHARDED performance across math and five zero-shot out-of-domain task families while largely preserving full-context accuracy.

Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the same complete user evidence, the model should still arrive at the same answer. We argue that a key reason for this gap is self-anchored drift: responses produced under partial information introduce unsupported assumptions, and those assumptions later distort the final answer. To reduce this effect, we propose Canonical-Context On-Policy Distillation (CCOPD). During training, the same base model is used in two roles: a frozen teacher conditioned on the clean FULL prompt and a trainable student that receives the same evidence incrementally through a multi-turn conversation; CCOPD aligns the student's behavior on its own trajectories with the teacher's canonical full-context behavior. Trained only on math problem conversations, CCOPD yields a 32% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Further analyses suggest that CCOPD strengthens grounding in user evidence and reduces sensitivity to contamination from earlier assistant turns.
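The teacher/student setup described above can be illustrated with a minimal sketch. The abstract does not state which divergence CCOPD minimizes, so the per-token reverse KL used here is an assumption (it is a common choice in on-policy distillation), and the function names `log_softmax`, `reverse_kl`, and `on_policy_distill_loss` are hypothetical. The logits are toy values standing in for model outputs.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position.

    Assumption: reverse KL is used only for illustration; the abstract
    does not specify the actual training objective.
    """
    log_p = log_softmax(student_logits)   # student: sharded multi-turn context
    log_q = log_softmax(teacher_logits)   # teacher: clean FULL prompt
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def on_policy_distill_loss(student_steps, teacher_steps):
    """Mean per-token divergence along a response sampled from the student.

    student_steps[t]: student logits at position t of its own trajectory.
    teacher_steps[t]: frozen-teacher logits at the same position, scored
                      with the FULL prompt in place of the sharded history.
    """
    if len(student_steps) != len(teacher_steps):
        raise ValueError("student and teacher must score the same tokens")
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)

# Toy example: two token positions over a three-token vocabulary.
student = [[2.0, 0.5, 0.1], [0.3, 1.2, 0.0]]
teacher = [[2.0, 0.5, 0.1], [1.5, 0.2, 0.0]]
loss = on_policy_distill_loss(student, teacher)
```

In this sketch the first position contributes nothing because student and teacher agree, while the second contributes a positive penalty. In an actual training loop the teacher logits would be computed without gradients and only the student would be updated.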
