Learn by Surprise, Commit by Proof
This work addresses hallucination and inefficient knowledge acquisition in large language models. It offers a method to improve semantic learning and protect existing knowledge, though the contribution is incremental, building on existing fine-tuning and self-verification techniques.
The paper tackles the problem of models memorizing rather than learning semantically during fine-tuning. It proposes LSCP, a self-gated post-training framework that learns only what the model does not already know, verified against its own knowledge, and reports a perturbation gap of 2.7--3.0x for LSCP compared to 11.6x for standard fine-tuning.
We propose LSCP, a self-gated post-training framework for autonomous knowledge acquisition: learning only what a model does not already know, verified against what it does know, at a strength proportional to conviction, with no external oracle. When a passage produces anomalously high per-token loss, LSCP flags it, generates a Q&A chain that forces the model to articulate its own knowledge and identify gaps, then adjusts AdamW's $\beta_2$ according to the conviction depth $k$ (the number of self-verification steps the passage survives) via $\beta_2 = 0.999 \cdot r^k$. The entire learning intensity is governed by a single parameter $r$. Beyond acquiring new knowledge, this process sharpens weakly encoded existing knowledge, which is a primary source of hallucination. The framework is self-extinguishing: as the model learns, per-token loss on learned passages decreases toward the surprisal threshold and the system progressively converges to standard AdamW. This mirrors biological memory consolidation: temporary information in the context window is selectively consolidated into parametric weights, the model's long-term memory. Experiments on the reference model (Qwen3-14B) and across six models (8B--32B, four families) show that standard fine-tuning produces rote memorization, with a perturbation gap (the ratio of paraphrase to original perplexity) of 11.6 ± 0.2x baseline, while all LSCP conditions learn semantically (2.7--3.0x). The $r = 1.0$ condition (identical optimizer, nearly identical data, only Q&A format differs) confirms that the training data format, not $\beta_2$ gating, is the primary mechanism preventing memorization; gating instead protects neighboring knowledge from contamination by corrupt content (93 ± 7% accuracy on adjacent questions at $r = 0.98$ vs. 90% baseline).
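
To make the gating rule concrete, the following minimal sketch evaluates the schedule $\beta_2 = 0.999 \cdot r^k$ stated in the abstract for a few conviction depths. The function name `lscp_beta2`, the restriction of $r$ to $(0, 1]$, and the printed averaging horizon $1/(1 - \beta_2)$ (a standard rule of thumb for exponential moving averages) are illustrative assumptions, not details taken from the paper.

```python
def lscp_beta2(k: int, r: float, base_beta2: float = 0.999) -> float:
    """Return beta_2 = base_beta2 * r**k for conviction depth k.

    Hypothetical helper: only the formula itself comes from the abstract.
    The range check on r is an assumption made so the result stays in (0, 0.999].
    """
    if k < 0:
        raise ValueError("conviction depth k must be non-negative")
    if not 0.0 < r <= 1.0:
        raise ValueError("r is assumed to lie in (0, 1]")
    return base_beta2 * r ** k


if __name__ == "__main__":
    # r = 0.98 is the gated setting quoted in the abstract; r = 1.0 leaves
    # beta_2 at the AdamW default of 0.999 for every depth.
    for r in (1.0, 0.98):
        for k in range(4):
            beta2 = lscp_beta2(k, r)
            # Approximate number of steps the second-moment average spans.
            horizon = 1.0 / (1.0 - beta2)
            print(f"r={r:.2f} k={k} beta2={beta2:.5f} horizon~{horizon:.1f}")
```

With $r = 1.0$ the helper returns 0.999 for every $k$, which matches the abstract's description of that condition as using an identical optimizer. In a PyTorch training loop the returned value would typically be written into the optimizer's `betas` entry for the relevant parameter group before the update step; that wiring is not shown here because the abstract does not specify it.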