Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective
This work addresses the precise removal of information from LLMs, with applications such as privacy and content moderation, and improves incrementally on existing unlearning methods.
The paper tackles targeted unlearning in large language models, where only the information about a specific target (e.g., a person) is removed from the model. It introduces a causal intervention framework that achieves competitive performance across datasets without explicitly optimizing for criteria such as avoiding gibberish or factual fabrication.
This paper investigates Who's Harry Potter (WHP), a pioneering yet insufficiently understood method for LLM unlearning. We explore it in two steps. First, we introduce a new task of LLM targeted unlearning: given an unlearning target (e.g., a person) and some unlearning documents, we aim to unlearn only the information about the target, rather than everything in the unlearning documents. We further argue that successful unlearning should satisfy criteria such as not outputting gibberish, not fabricating facts about the unlearning target, and not releasing factual information under jailbreak attacks. Second, we construct a causal intervention framework for targeted unlearning, in which the knowledge of the unlearning target is modeled as a confounder between LLM input and output, and the unlearning process is modeled as a deconfounding process. This framework justifies and extends WHP, deriving a simple unlearning algorithm that includes WHP as a special case. Experiments on existing and new datasets show that our approach, without explicitly optimizing for the aforementioned criteria, achieves competitive performance on all of them. Our code is available at https://github.com/UCSB-NLP-Chang/causal_unlearn.git.
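To make the notion of deconfounding concrete, the following is a minimal sketch of the standard backdoor adjustment on a toy discrete model. It is a generic illustration and not the paper's algorithm: the abstract does not specify the estimator, and the binary variables C (confounder), X (input), Y (output), all probability values, and the function names are assumptions introduced here for illustration only.

```python
# Toy illustration of deconfounding via backdoor adjustment.
# Everything here is hypothetical: C stands in for a confounder,
# X for the input, Y for the output. None of the numbers come from the paper.

# P(C = c)
P_C = {0: 0.5, 1: 0.5}
# P(X = 1 | C = c)
P_X1_GIVEN_C = {0: 0.2, 1: 0.8}
# P(Y = 1 | X = x, C = c), keyed by (x, c)
P_Y1_GIVEN_XC = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.7}


def p_x_given_c(x: int, c: int) -> float:
    """Return P(X = x | C = c) for binary x."""
    p1 = P_X1_GIVEN_C[c]
    return p1 if x == 1 else 1.0 - p1


def observational(x: int) -> float:
    """P(Y = 1 | X = x): weights each c by the posterior P(c | x)."""
    joint = {c: p_x_given_c(x, c) * P_C[c] for c in P_C}
    norm = sum(joint.values())
    return sum(P_Y1_GIVEN_XC[(x, c)] * joint[c] / norm for c in P_C)


def interventional(x: int) -> float:
    """P(Y = 1 | do(X = x)): weights each c by the prior P(c)."""
    return sum(P_Y1_GIVEN_XC[(x, c)] * P_C[c] for c in P_C)
```

In this toy model, observational(1) is approximately 0.62 while interventional(1) is 0.5. The gap arises because conditioning on X also shifts the distribution of the confounder, whereas the intervention averages over the confounder's prior and so removes its influence on the input-output relationship.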