Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability
This work addresses the problem of understanding and mitigating social biases in LLMs, which matters for AI safety and fairness, though it is incremental in that it applies mechanistic interpretability to a known bias.
The study investigated how the Knobe effect, a moral bias in intentionality judgments, emerges in finetuned LLMs. It found that the bias is localized in specific layers and that patching activations from the pretrained models into these layers eliminates it without retraining.
Large language models (LLMs) have been shown to internalize human-like biases during finetuning, yet the mechanisms by which these biases manifest remain unclear. In this work, we investigated whether the well-known Knobe effect, a moral bias in intentionality judgments, emerges in finetuned LLMs and whether it can be traced back to specific components of the model. We conducted a layer-patching analysis across three open-weight LLMs and demonstrated that the bias is not only learned during finetuning but also localized in a specific set of layers. Surprisingly, we found that patching activations from the corresponding pretrained model into just a few critical layers is sufficient to eliminate the effect. Our findings offer new evidence that social biases in LLMs can be interpreted, localized, and mitigated through targeted interventions, without the need for model retraining.
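To make the layer-patching idea concrete, the following is a minimal toy sketch under stated assumptions, not the procedure used in this work. It assumes a residual-style model in which each layer adds an update to a running state, and it patches a layer by swapping in the update produced by the pretrained model on the same prompt. All names (`layer_updates`, `run_patched`, `knobe_gap`), the two-element state, and the constants are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Vector = List[float]
# Each toy "layer" maps the current state to an additive update (residual style).
Layer = Callable[[Vector], Vector]


def layer_updates(layers: Sequence[Layer], x: Vector) -> Tuple[Vector, List[Vector]]:
    """Run the model and record the additive update written by each layer."""
    h = list(x)
    updates: List[Vector] = []
    for layer in layers:
        delta = layer(h)
        updates.append(delta)
        h = [a + b for a, b in zip(h, delta)]
    return h, updates


def run_patched(layers: Sequence[Layer], x: Vector,
                donor_updates: Sequence[Vector], patch_at: Set[int]) -> Vector:
    """Run the model, replacing the update of every layer in `patch_at`
    with the donor (pretrained) update recorded for the same prompt."""
    h = list(x)
    for i, layer in enumerate(layers):
        delta = donor_updates[i] if i in patch_at else layer(h)
        h = [a + b for a, b in zip(h, delta)]
    return h


# Hypothetical state: h[0] is the side-effect valence (-1 harm, +1 help),
# h[1] is an "intentionality" score read out at the end.
def neutral(h: Vector) -> Vector:
    return [0.0, 0.25]


def biased(h: Vector) -> Vector:
    # Assumed stand-in for a finetuned layer: extra intentionality for harm only.
    return [0.0, 0.25 + (0.5 if h[0] < 0 else 0.0)]


PRETRAINED: List[Layer] = [neutral, neutral, neutral, neutral]
FINETUNED: List[Layer] = [neutral, neutral, biased, neutral]
HARM: Vector = [-1.0, 0.0]
HELP: Vector = [1.0, 0.0]


def knobe_gap(patch_at: Set[int]) -> float:
    """Intentionality score for the harm prompt minus the help prompt,
    computed on the finetuned model with the given layers patched."""
    scores = []
    for prompt in (HARM, HELP):
        _, donor = layer_updates(PRETRAINED, prompt)
        scores.append(run_patched(FINETUNED, prompt, donor, patch_at)[1])
    return scores[0] - scores[1]


baseline_gap = knobe_gap(set())
gaps: Dict[int, float] = {i: knobe_gap({i}) for i in range(len(FINETUNED))}
print("unpatched gap:", baseline_gap)
print("gap after patching each layer:", gaps)
```

In this toy setup the harm/help gap disappears only when layer 2 is patched, which is the sense in which a bias can be called localized. With a real transformer, the same loop would typically be implemented by caching and overwriting layer outputs through framework hooks, and the gap would be measured on model responses to Knobe-style vignettes.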