CL, LG · Sep 12, 2023

Circuit Breaking: Removing Model Behaviors with Targeted Ablation

arXiv:2309.05973v2 · 35 citations · h-index: 5
AI Analysis

This work addresses harmful model outputs faced by users of language models, offering a targeted way to mitigate specific bad behaviors.

The paper tackles undesirable behaviors in language models, such as toxic language generation, with a targeted ablation approach that removes specific causal pathways. It reports that ablating just 12 of roughly 11,600 (11.6K) causal edges in GPT-2 reduces toxic generation with minimal performance degradation on other inputs.

Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the intention of disabling the computational circuit responsible for the bad behavior. Given a small dataset of inputs where the model behaves poorly, we learn to ablate a small number of important causal pathways. In the setting of reducing GPT-2 toxic language generation, we find ablating just 12 of the 11.6K causal edges mitigates toxic generation with minimal degradation of performance on other inputs.
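To make the idea of learning which edges to ablate concrete, here is a minimal sketch on a toy model. It is not the authors' implementation. The toy scoring function, the zero-ablation choice, the sigmoid mask parametrization, the loss terms, and all names (`EDGE_WEIGHTS`, `BAD_INPUTS`, `CLEAN_INPUTS`, `learn_mask`) are assumptions made for illustration only. The sketch learns a soft mask over edges that lowers a "bad behavior" score on a small set of bad inputs, while penalizing both the number of ablated edges and any change in output on clean inputs.

```python
import math

# Toy "model" (hypothetical): a scalar score that sums per-edge contributions.
# Edge e contributes EDGE_WEIGHTS[e] * x[e]; a mask value of 0 zero-ablates it.
EDGE_WEIGHTS = [0.5, -0.3, 2.0, 0.4, -0.2, 1.5, 0.3, -0.4]

# Hypothetical data: edges 2 and 5 are active only on the "bad" inputs.
BAD_INPUTS = [
    [1.0, 1.0, 3.0, 1.0, 1.0, 3.0, 1.0, 1.0],
    [1.0, 0.5, 2.5, 1.0, 1.0, 3.5, 0.5, 1.0],
]
CLEAN_INPUTS = [
    [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0],
    [0.5, 1.0, 0.0, 1.0, 0.5, 0.0, 1.0, 1.0],
]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def score(x, mask):
    """Model output with each edge scaled by its mask value."""
    return sum(m * w * xi for m, w, xi in zip(mask, EDGE_WEIGHTS, x))


def learn_mask(steps=500, lr=0.5, sparsity=1.0, clean_weight=1.0):
    """Gradient descent on mask logits for the assumed loss:
    mean bad score + sparsity * sum(1 - m) + clean_weight * mean clean drift^2.
    """
    n = len(EDGE_WEIGHTS)
    logits = [3.0] * n  # start with every edge (almost) fully kept
    ones = [1.0] * n
    for _ in range(steps):
        mask = [sigmoid(t) for t in logits]
        # Gradient of the sparsity term sparsity * sum(1 - m) w.r.t. each m.
        grad = [-sparsity] * n
        # Gradient of the mean score on bad inputs.
        for x in BAD_INPUTS:
            for e in range(n):
                grad[e] += EDGE_WEIGHTS[e] * x[e] / len(BAD_INPUTS)
        # Gradient of the squared output drift on clean inputs.
        for x in CLEAN_INPUTS:
            drift = score(x, mask) - score(x, ones)
            for e in range(n):
                grad[e] += (2.0 * clean_weight * drift
                            * EDGE_WEIGHTS[e] * x[e] / len(CLEAN_INPUTS))
        # Chain rule through the sigmoid: dm/dlogit = m * (1 - m).
        logits = [t - lr * g * m * (1.0 - m)
                  for t, g, m in zip(logits, grad, mask)]
    return [sigmoid(t) for t in logits]


def ablated_edges(mask, threshold=0.5):
    """Indices of edges whose learned mask fell below the threshold."""
    return [e for e, m in enumerate(mask) if m < threshold]
```

Calling `ablated_edges(learn_mask())` returns the indices of the edges the optimizer chose to remove. The sparsity penalty is what keeps the ablated set small: an edge is only dropped when its contribution to the bad-input score outweighs the per-edge cost, and the clean-input term discourages removing edges that matter elsewhere.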

Code Implementations: 1 repo
