CL May 31

On the Generalization Gap in Self-Evolving Language Model Reasoning

arXiv:2606.01075 63.6
AI Analysis

For researchers developing self-improving LLMs, this work characterizes the limitations of closed-loop self-evolution, showing that internally generated supervision is insufficient to match oracle supervision under minimal formulations.

The paper investigates how close self-evolution (using only the model's own outputs) can get to oracle-supervised training for LLM reasoning. Experiments on logical reasoning tasks show self-evolution improves over the base model but plateaus, leaving a non-trivial gap to oracle supervision, though multi-turn critic-revision with large models can nearly match oracle performance.

Recent work suggests that large language models (LLMs) can improve through self-evolution (SE), using supervision signals generated by the model itself. In this work, we ask: under a strict closed-loop setup, where the self-evolution algorithm has access only to an unlabeled prompt set and a base model, how close can internally generated supervision come to oracle-supervised training? We analyze four representative strategies in a unified offline self-evolution framework: single-round verification, multi-turn revision with feedback, iterative training, and curriculum learning. Our primary experiments use Knights and Knaves (KK) logical reasoning tasks, which provide deterministic solutions, controlled difficulty levels, and a clean testbed for easy-to-hard generalization. We first show that self-evolution consistently improves over the base model but plateaus as more training compute is invested, and it still leaves a non-trivial gap to oracle supervision. We then find that multi-turn critic-revision with large models can reach strong self-evolution performance, with Gemma 12B nearly matching oracle-supervised training. Beyond Knights and Knaves, we also evaluate self-evolution on real-world reasoning benchmarks, where gains are modest. Overall, our results characterize when closed-loop self-evolution can help and show that internally generated supervision remains insufficient under this minimal formulation.
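To make the "deterministic solutions" property of Knights and Knaves concrete, here is a minimal sketch of a brute-force solver that could stand in for an oracle on small puzzles. It is not the paper's code. The function name `solve_kk`, the statement encoding, and the two-person puzzle are all hypothetical and chosen only for illustration. It relies on the standard rule of the puzzle family: knights always tell the truth and knaves always lie.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# An assignment maps each name to True (knight) or False (knave).
Assignment = Dict[str, bool]
# A statement is (speaker, claim); the claim is evaluated on an assignment.
# This encoding is an assumption made for this sketch.
Statement = Tuple[str, Callable[[Assignment], bool]]


def solve_kk(names: List[str], statements: List[Statement]) -> List[Assignment]:
    """Return every role assignment consistent with all statements."""
    solutions = []
    for roles in product([True, False], repeat=len(names)):
        world = dict(zip(names, roles))
        # A knight's claim must be true and a knave's claim must be false,
        # so each claim's truth value must equal its speaker's role.
        if all(claim(world) == world[speaker] for speaker, claim in statements):
            solutions.append(world)
    return solutions


# Hypothetical two-person puzzle:
#   A says: "B is a knave."
#   B says: "A and I are both knights."
puzzle = [
    ("A", lambda w: not w["B"]),
    ("B", lambda w: w["A"] and w["B"]),
]
solutions = solve_kk(["A", "B"], puzzle)
print(solutions)
```

Because the search enumerates all 2^n role assignments, a puzzle with a unique consistent assignment yields an unambiguous label, which is what makes exact comparison against oracle supervision possible. In this sketch, difficulty would scale with the number of speakers, though the paper's exact difficulty definition is not given here.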

