CL · May 7

Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

Hang Chen, Jiaying Zhu, Hongyang Chen, Hongxu Liu, Xinyu Yang, Wenya Wang

arXiv:2605.06076

AI Analysis

For researchers in mechanistic interpretability and LLM post-training, this work reveals a fundamental flaw in the 'Locate-then-Update' paradigm, highlighting the need for dynamic localization methods.

The paper challenges the assumption that static mechanistic interpretability can guide dynamic parameter updates in LLM post-training, showing through empirical analysis that circuits undergo 'Free Evolution' during fine-tuning, making static localization unreliable. It introduces metrics to quantify circuit dynamics and proposes a predictive framework for future research.

The "Locate-then-Update" paradigm has become a predominant approach in the post-training of large language models (LLMs): it identifies critical components via mechanistic interpretability and then applies targeted parameter updates to them. However, this paradigm rests on a fundamental yet unverified assumption, namely that mechanisms derived from current static parameters can reliably guide future dynamic parameter updates. To test this assumption, we systematically track the structural evolution of Transformer circuits throughout the supervised fine-tuning (SFT) process, revealing the underlying dynamics of task mechanisms. We introduce three novel metrics (Circuit Distance, Circuit Stability, and Circuit Conflict) to analyze circuit evolution along three dimensions: neural migration, semantic stability, and cross-task interference. Our empirical results reveal that circuits inherently exhibit "Free Evolution" during parameter updates. Consequently, static mechanisms extracted from current states inevitably suffer from temporal latency, making them fundamentally inadequate for guiding future states. Moreover, by deconstructing the "illusion of effectiveness" in existing methods, this work underscores the necessity of "foresight" in mechanistic localization and proposes a predictive framework for future research.
