Precise In-Parameter Concept Erasure in Large Language Models
This work addresses the need for precise concept removal in LLMs, which helps mitigate risks such as leakage of sensitive information, and offers an incremental improvement over existing erasure methods.
The paper tackles the removal of undesirable knowledge from large language models by proposing PISCES, a framework that erases concepts by editing model parameters directly. It reduces accuracy on target concepts to as low as 7.7% and improves erasure specificity and robustness by up to 31% and 38%, respectively.
Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES (Precise In-parameter Suppression for Concept EraSure), a novel framework for precisely erasing entire concepts from model parameters by directly editing directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept using automated interpretability techniques, and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest gains in efficacy over leading erasure methods, reducing accuracy on the target concept to as low as 7.7%, while dramatically improving erasure specificity (by up to 31%) and robustness (by up to 38%). Overall, these results demonstrate that feature-based in-parameter editing enables a more precise and reliable approach for removing conceptual knowledge in language models.
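The decompose, identify, and remove steps described above can be illustrated with a toy sketch. It is not the PISCES implementation. It assumes that each feature is a unit direction in the same space as the MLP vectors, that the concept feature has already been identified, and that "removal" means orthogonal projection, chosen here only as the simplest stand-in. All names (`erase_directions`, `concept_dir`, `other_dir`, `threshold`) and all numbers are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def erase_directions(vector, directions):
    """Remove from `vector` its component along each unit direction.

    Assumption: the directions are mutually orthogonal. Otherwise they
    would need to be orthogonalized first.
    """
    out = list(vector)
    for d in directions:
        coeff = dot(out, d)
        out = [a - coeff * b for a, b in zip(out, d)]
    return out


# Hypothetical toy setup: 4-dimensional "MLP vectors" and two features.
concept_dir = normalize([1.0, 1.0, 0.0, 0.0])  # feature tied to the target concept
other_dir = normalize([0.0, 0.0, 1.0, 0.0])    # unrelated feature to preserve

mlp_vectors = [
    [2.0, 0.0, 3.0, 1.0],
    [0.5, 0.5, -1.0, 0.0],
]

threshold = 0.1  # assumed cutoff for "this vector expresses the concept"
edited = []
for v in mlp_vectors:
    # Decompose: measure how strongly the vector activates the concept feature.
    activation = dot(v, concept_dir)
    # Identify and remove: edit only vectors that express the concept.
    if abs(activation) > threshold:
        edited.append(erase_directions(v, [concept_dir]))
    else:
        edited.append(list(v))
```

After the edit, each vector has no component along `concept_dir`, while its component along the orthogonal `other_dir` is unchanged. This is the toy analogue of erasing a concept without degrading unrelated knowledge.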