Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI
This work addresses the safety and controllability of large language models, showing that direct model editing through mechanistic interpretability has limitations. The contribution is incremental in that it builds on existing localization hypotheses.
The researchers asked whether undesirable behaviors such as deception in language models are localized and removable. They found that deception was highly resilient, and that attempts to remove it caused a gradual decay in general linguistic performance, reflected in a consistent rise in perplexity.
Safety and controllability are critical for large language models. A central question is whether undesirable behaviors such as deception are localized functions that can be removed, or whether they are deeply intertwined with a model's core cognitive abilities. We introduce "cyclic ablation," an iterative method to test this. By combining sparse autoencoders, targeted ablation, and adversarial training on DistilGPT-2, we attempted to eliminate the concept of deception. We found that, contrary to the localization hypothesis, deception was highly resilient. After each ablation cycle, the model consistently recovered its deceptive behavior through adversarial training, a process we term functional regeneration. Crucially, every attempt at this "neurosurgery" caused a gradual but measurable decay in general linguistic performance, reflected in a consistent rise in perplexity. These findings are consistent with the view that complex concepts are distributed and entangled, underscoring the limitations of direct model editing through mechanistic interpretability.
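The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`perplexity`, `ablate_features`, `cyclic_ablation`) and all callables passed in (`find_features`, `ablate`, `retrain`, `deception_score`, `eval_perplexity`) are hypothetical placeholders, and the ordering of steps within a cycle (locate features, ablate, adversarially retrain, measure) is inferred from the abstract. The only fixed fact used is the standard definition of perplexity as the exponential of the mean negative log-likelihood per token.

```python
import math
from typing import Any, Callable, Dict, List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def ablate_features(latents: Sequence[float], feature_ids: Sequence[int]) -> List[float]:
    """Return a copy of sparse-autoencoder latents with the chosen features zeroed.

    Zeroing is one common way to ablate; the source does not say which
    ablation operator was used, so this is an assumption.
    """
    ablated = list(latents)
    for idx in feature_ids:
        ablated[idx] = 0.0
    return ablated


def cyclic_ablation(
    model: Any,
    n_cycles: int,
    find_features: Callable[[Any], Sequence[int]],   # hypothetical: SAE-based feature search
    ablate: Callable[[Any, Sequence[int]], Any],     # hypothetical: returns the edited model
    retrain: Callable[[Any], Any],                   # hypothetical: adversarial training step
    deception_score: Callable[[Any], float],         # hypothetical: behavioral metric
    eval_perplexity: Callable[[Any], float],         # hypothetical: held-out perplexity
) -> List[Dict[str, Any]]:
    """Run ablate-then-retrain cycles and record metrics after each one."""
    history: List[Dict[str, Any]] = []
    for cycle in range(n_cycles):
        features = find_features(model)      # locate candidate "deception" features
        model = ablate(model, features)      # remove them from the model
        score_after_ablation = deception_score(model)
        model = retrain(model)               # give the behavior a chance to regenerate
        history.append({
            "cycle": cycle,
            "deception_after_ablation": score_after_ablation,
            "deception_after_retraining": deception_score(model),
            "perplexity": eval_perplexity(model),
        })
    return history
```

In this framing, functional regeneration would show up as `deception_after_retraining` returning to a high value in every cycle, while the reported side effect would show up as `perplexity` rising from one cycle to the next.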