A Mechanistic Investigation of Supervised Fine Tuning
For researchers studying LLM fine-tuning, this work offers a mechanistic account of how representations change during SFT, though the contribution is incremental: it applies existing SAE methods to a new analysis.
The paper investigates how Supervised Fine-Tuning (SFT) alters a large language model's internal representations, finding that while activation similarity remains high, sparse autoencoder latents diverge significantly, revealing task- and layer-specific semantic feature changes.
The cosine similarity between a large language model's hidden activations before and after Supervised Fine-Tuning (SFT) remains very high, which at first glance suggests that SFT leaves the model's activation geometry largely undisturbed. However, projecting both sets of activations through a Sparse Autoencoder (SAE) pretrained on the base model reveals that the underlying sparse latents diverge significantly. We introduce a novel investigative pipeline that uses these pretrained SAEs as a high-resolution diagnostic tool to mechanistically investigate what drives this representational divergence. With this pipeline, we identify task-specific and layer-specific distributions of the semantic features that are systematically altered during SFT. We additionally identify a layer-wise update profile specific to safety alignment. All code, experimental scripts, and analysis files associated with this work are publicly available at https://github.com/ruhzi/sae-investigation.
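The comparison described in the abstract can be sketched in a few lines. This is only an illustrative sketch, not the authors' pipeline: it assumes a standard ReLU SAE encoder, and the weights, activations, dimensions, and the helper names `cosine_similarity`, `sae_encode`, and `active_set_jaccard` are all hypothetical, with random data standing in for real model activations. It shows how one might measure similarity in raw activation space and in SAE latent space for the same pair of activations.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (n, d) arrays."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den

def sae_encode(x, W_enc, b_enc, b_dec):
    """Assumed ReLU SAE encoder: f(x) = ReLU((x - b_dec) @ W_enc + b_enc)."""
    return np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)

def active_set_jaccard(z_a, z_b):
    """Row-wise Jaccard overlap of the sets of active (nonzero) latents."""
    on_a, on_b = z_a > 0, z_b > 0
    inter = np.sum(on_a & on_b, axis=-1)
    union = np.sum(on_a | on_b, axis=-1)
    # Rows where neither side has an active latent count as full overlap.
    return np.where(union > 0, inter / np.maximum(union, 1), 1.0)

rng = np.random.default_rng(0)
n_tokens, d_model, d_sae = 64, 32, 256  # hypothetical sizes

# Synthetic stand-in for an SAE pretrained on the base model.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.full(d_sae, -1.0)  # negative bias keeps the latents sparse
b_dec = np.zeros(d_model)

# Synthetic stand-ins for hidden activations at one layer,
# before and after fine-tuning (here: a small random perturbation).
h_base = rng.normal(size=(n_tokens, d_model))
h_sft = h_base + 0.1 * rng.normal(size=(n_tokens, d_model))

# Both activation sets go through the same base-model SAE.
z_base = sae_encode(h_base, W_enc, b_enc, b_dec)
z_sft = sae_encode(h_sft, W_enc, b_enc, b_dec)

act_cos = cosine_similarity(h_base, h_sft)
lat_cos = cosine_similarity(z_base, z_sft)
jaccard = active_set_jaccard(z_base, z_sft)

print(f"mean activation cosine: {act_cos.mean():.4f}")
print(f"mean SAE-latent cosine: {lat_cos.mean():.4f}")
print(f"mean active-set Jaccard: {jaccard.mean():.4f}")
```

With real models, `h_base` and `h_sft` would be hidden states collected from the base and fine-tuned checkpoints on the same prompts, and the encoder parameters would come from the pretrained SAE. The numbers printed for random data say nothing about the paper's results.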