De-attribute to Forget for LLM Unlearning
This work is significant for researchers and practitioners in LLM development and deployment, offering an incremental improvement to existing unlearning methods by addressing the trade-off between forgetting and model utility.
This paper addresses LLM unlearning, where methods that optimize prediction loss often suffer from over-forgetting and poor model utility. The authors propose DareU, a reinforcement learning framework that frames unlearning as zeroing out data attribution: the LLM is updated to lower the score that attributes its generated responses to the owners of the forget data. DareU outperforms existing baselines, achieving effective unlearning while balancing forget quality and model utility.
The rapid development of large language models (LLMs) has raised concerns about the use of inappropriate data for training, leading to growing interest in LLM unlearning. Many existing LLM unlearning approaches optimize prediction losses, for example by maximizing the loss on the forget set, but often suffer from over-forgetting and poor model utility. To address these issues, this paper instead frames the optimization objective for LLM unlearning as zeroing out data attribution. In particular, we propose DareU, the first LLM unlearning framework based on data attribution rewards, which uses reinforcement learning to update the LLM so that the attribution score of its generated responses to the forget data owners is reduced (i.e., de-attributing). Empirical evaluation, using an LLM classifier as an efficient approximation of attribution, shows that DareU outperforms existing baselines, achieving effective unlearning while balancing forget quality and model utility well.
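The abstract does not spell out the reward, so the following is only a minimal sketch of how an attribution-based reward might look, under stated assumptions: a classifier returns a probability for each data owner given a generated response, and the reward grows as less probability mass lands on the forget data owners. The names `deattribution_reward` and `centered_advantages`, the owner labels, and the mean baseline are hypothetical illustrations, not DareU's actual formulation.

```python
from typing import Dict, Iterable, List


def deattribution_reward(owner_probs: Dict[str, float],
                         forget_owners: Iterable[str]) -> float:
    """Hypothetical reward: 1 minus the attribution mass on forget owners.

    owner_probs stands in for the output of an attribution classifier
    (assumed here to map each data owner to a non-negative score).
    """
    total = sum(owner_probs.values())
    if total <= 0:
        raise ValueError("owner_probs must have positive total mass")
    # Normalize so the scores behave like a probability distribution.
    forget_mass = sum(owner_probs.get(o, 0.0) for o in set(forget_owners)) / total
    return 1.0 - forget_mass


def centered_advantages(rewards: List[float]) -> List[float]:
    """Subtract the batch mean, a common baseline in policy-gradient methods."""
    if not rewards:
        return []
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


if __name__ == "__main__":
    # Two sampled responses scored by the (assumed) attribution classifier.
    batch = [
        {"alice": 0.8, "bob": 0.2},  # still strongly attributed to alice
        {"alice": 0.1, "bob": 0.9},  # mostly de-attributed from alice
    ]
    rewards = [deattribution_reward(p, ["alice"]) for p in batch]
    print(rewards, centered_advantages(rewards))
```

In a reinforcement learning loop, responses with a positive centered advantage would be reinforced and the others discouraged, which pushes attribution to the forget owners toward zero without directly maximizing prediction loss on the forget set.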