LG · AI · CL · May 21

Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

arXiv:2605.28860 · 56.8 · h-index: 3 · Has Code
Predicted impact: top 26% in LG · last 90 days · Originality: Synthesis-oriented
AI Analysis

For researchers studying catastrophic forgetting in LLMs, this work provides a mechanistic explanation for RL's robustness, though it is an incremental extension of existing behavioral observations.

The paper investigates why reinforcement learning (RL) preserves prior capabilities better than supervised fine-tuning (SFT) when fine-tuning LLMs. Using a head-level measure of circuit degradation, the authors find that SFT adapts faster but causes more circuit disruption and forgetting, while RL preserves more of the base circuit at the cost of slower adaptation.

Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT), attributing this to policy-gradient updates remaining closer to the base policy \cite{shenfeld2025rl}. We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning, and use it to compare RL and SFT on Qwen2.5-3B-Instruct adapted to scientific question-answering. We find a clear mechanistic trade-off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. We released our code here: https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability.
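
To make the idea of a head-level degradation measure concrete, here is a minimal sketch. It is not the paper's definition of differential circuit vulnerability, which the abstract does not spell out. It assumes each attention head already has a non-negative importance score for some prior capability, measured on the base model and again after fine-tuning. The names `head_degradation` and `preserved_fraction`, the thresholds, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def head_degradation(base: Dict[Head, float], tuned: Dict[Head, float]) -> Dict[Head, float]:
    """Relative drop in each head's importance score after fine-tuning.

    ASSUMPTION: this relative-drop formula is an illustrative stand-in,
    not the measure defined in the paper.
    """
    out: Dict[Head, float] = {}
    for head, b in base.items():
        if b <= 0.0:
            continue  # skip heads with no measurable base contribution
        t = tuned.get(head, 0.0)
        out[head] = max(0.0, (b - t) / b)  # clamp: a gain counts as no degradation
    return out


def preserved_fraction(
    base: Dict[Head, float],
    tuned: Dict[Head, float],
    circuit_threshold: float = 0.1,  # hypothetical cutoff for "in the base circuit"
    tolerance: float = 0.5,          # hypothetical max relative drop to count as preserved
) -> float:
    """Fraction of base-circuit heads whose degradation stays within tolerance."""
    if circuit_threshold <= 0.0:
        raise ValueError("circuit_threshold must be positive")
    circuit = {h for h, s in base.items() if s >= circuit_threshold}
    if not circuit:
        return 0.0
    deg = head_degradation(base, tuned)
    kept = sum(1 for h in circuit if deg[h] <= tolerance)
    return kept / len(circuit)


# Toy scores, made up for illustration only (NOT results from the paper).
toy_base = {(0, 0): 0.8, (0, 1): 0.05, (1, 0): 0.4, (1, 1): 0.2}
toy_sft = {(0, 0): 0.2, (0, 1): 0.05, (1, 0): 0.1, (1, 1): 0.18}
toy_rl = {(0, 0): 0.7, (0, 1): 0.04, (1, 0): 0.3, (1, 1): 0.2}

sft_kept = preserved_fraction(toy_base, toy_sft)
rl_kept = preserved_fraction(toy_base, toy_rl)
```

With these toy scores, the comparison would amount to checking whether `rl_kept` exceeds `sft_kept`. Producing real per-head scores would need a separate attribution step (for example, head ablation or activation patching), which this sketch deliberately leaves out.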

