CL LG Nov 18, 2024

Does Unlearning Truly Unlearn? A Black Box Evaluation of LLM Unlearning Methods

arXiv:2411.12103v323 citations h-index: 11 Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses a problem central to AI safety: whether unlearning methods for large language models genuinely remove harmful information. Its contribution is incremental, since it primarily evaluates existing methods.

The paper investigates the effectiveness of LLM unlearning methods (LLMU and RMU) by evaluating their impact on model performance and robustness, revealing that these methods significantly degrade general capabilities and can be easily circumvented, failing to achieve true unlearning.

Large language model unlearning aims to remove harmful information that LLMs have learnt to prevent their use for malicious purposes. LLMU and RMU have been proposed as two methods for LLM unlearning, achieving impressive results on unlearning benchmarks. We study in detail the impact of unlearning on LLM performance metrics using the WMDP dataset as well as a new biology dataset we create. We show that unlearning has a notable impact on general model capabilities, with the performance degradation being more significant in general for LLMU. We further test the robustness of the two methods and find that doing 5-shot prompting or rephrasing the question in simple ways can lead to an over ten-fold increase in accuracy on unlearning benchmarks. Finally, we show that training on unrelated data can almost completely recover pre-unlearning performance, demonstrating that these methods fail at truly unlearning. Our methodology serves as an evaluation framework for LLM unlearning methods. The code is available at: https://github.com/JaiDoshi/Knowledge-Erasure.
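The 5-shot robustness check mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' harness (their code is in the linked repository); the function names, the prompt layout, and the `predict` callable standing in for a model are all assumptions made for illustration, and the example items are benign placeholders rather than WMDP questions.

```python
from typing import Callable, Optional, Sequence, Tuple

# One multiple-choice item: (question, choices, correct letter). Hypothetical layout.
Item = Tuple[str, Sequence[str], str]
LETTERS = "ABCD"


def format_question(question: str, choices: Sequence[str],
                    answer: Optional[str] = None) -> str:
    """Render one item; demonstrations include the answer letter, the query does not."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:" + (f" {answer}" if answer is not None else ""))
    return "\n".join(lines)


def build_k_shot_prompt(demos: Sequence[Item], question: str,
                        choices: Sequence[str]) -> str:
    """Prepend k solved demonstrations to the unanswered query (k = len(demos))."""
    blocks = [format_question(q, c, a) for q, c, a in demos]
    blocks.append(format_question(question, choices))
    return "\n\n".join(blocks)


def accuracy(predict: Callable[[str], str], items: Sequence[Item],
             demos: Sequence[Item] = ()) -> float:
    """Fraction of items where the predicted letter matches the gold letter."""
    correct = sum(
        predict(build_k_shot_prompt(demos, q, c)) == a for q, c, a in items
    )
    return correct / len(items)


# Placeholder data and a dummy predictor that always answers "A".
demos = [("2 + 2 = ?", ["4", "5", "6", "7"], "A")] * 5
items = [
    ("Capital of France?", ["Paris", "Rome", "Oslo", "Bern"], "A"),
    ("Largest planet?", ["Mars", "Jupiter", "Venus", "Earth"], "B"),
]
zero_shot = accuracy(lambda prompt: "A", items)
five_shot = accuracy(lambda prompt: "A", items, demos)
```

Comparing `accuracy` with and without `demos` for the same unlearned model is the kind of before/after measurement the abstract describes; with a real model, `predict` would wrap a generation or log-likelihood call.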
