CL, AI | Mar 2

Surgical Post-Training: Cutting Errors, Keeping Knowledge

arXiv:2603.01683v1 | 1 citation | h-index: 1 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the trade-off between training efficiency and catastrophic forgetting when enhancing LLM reasoning, though it appears incremental because it builds on existing methods such as DPO.

The paper tackles the problem of enhancing reasoning in Large Language Models via post-training while avoiding catastrophic forgetting, achieving a 6.2% average accuracy improvement on Qwen3-8B with only 4k data pairs and 28 minutes of training.

Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover--and validate both theoretically and empirically--an overlooked yet critical mechanism: the implicit regularization inherent in Direct Preference Optimization's (DPO) reward estimate. This motivates our Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving learned prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and OOD tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: https://github.com/Visual-AI/SPoT
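The abstract contrasts DPO's relative ranking with a reward-based binary cross-entropy objective but does not state the loss itself. The following is a minimal sketch of that contrast, not the authors' implementation. It assumes the standard DPO implicit reward, beta * (log pi_theta(y|x) - log pi_ref(y|x)), and assumes that "binary classification" means scoring each response's reward independently through a sigmoid. The function names and the default `beta` are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO-style implicit reward: beta times the policy/reference log-ratio.

    beta=0.1 is an illustrative default, not a value taken from the paper.
    """
    return beta * (logp_policy - logp_ref)


def dpo_loss(r_chosen: float, r_rejected: float) -> float:
    """Standard DPO loss: depends only on the reward margin (relative ranking)."""
    return -log_sigmoid(r_chosen - r_rejected)


def reward_bce_loss(r_correct: float, r_incorrect: float) -> float:
    """Assumed form of a reward-based binary cross-entropy objective.

    The correct response is labeled 1 and the incorrect one 0, so each
    reward is supervised on its own rather than through their difference.
    """
    return -log_sigmoid(r_correct) - log_sigmoid(-r_incorrect)


if __name__ == "__main__":
    # Two reward pairs with the same margin but different absolute levels.
    high, low = (2.0, 1.0), (-3.0, -4.0)
    print("DPO:", dpo_loss(*high), dpo_loss(*low))
    print("BCE:", reward_bce_loss(*high), reward_bce_loss(*low))
```

Under these assumptions, shifting both rewards by the same constant leaves the DPO loss unchanged, while the binary cross-entropy loss changes because it penalizes a correct response with negative reward and an incorrect response with positive reward separately. This is one reading of the "decoupled supervision signals" the abstract mentions.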
