Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods
Simple prompt attacks can recover knowledge from some supposedly unlearned models, which challenges assumptions about knowledge removal in AI safety and calls for better evaluation frameworks.
The study shows that some machine unlearning methods fail under simple prompt attacks: ELM is vulnerable (prepending Hindi filler text to the original prompt recovers 57.3% accuracy), while RMU and TAR remain robust.
In this work, we demonstrate that certain machine unlearning methods may fail under straightforward prompt attacks. We systematically evaluate eight unlearning techniques across three model families using output-based, logit-based, and probe analyses to assess the extent to which supposedly unlearned knowledge can be retrieved. While methods like RMU and TAR exhibit robust unlearning, ELM remains vulnerable to specific prompt attacks (e.g., prepending Hindi filler text to the original prompt recovers 57.3% accuracy). Our logit analysis further indicates that unlearned models are unlikely to hide knowledge through changes in answer formatting, given the strong correlation between output and logit accuracy. These findings challenge prevailing assumptions about unlearning effectiveness and highlight the need for evaluation frameworks that can reliably distinguish between genuine knowledge removal and superficial output suppression. To facilitate further research, we publicly release our evaluation framework, which makes it easy to test prompting techniques for retrieving unlearned knowledge.
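To make the distinction between output-based and logit-based accuracy concrete, the following is a minimal sketch, not the released evaluation framework. It assumes a four-option multiple-choice setting; the function names (`prepend_filler`, `parse_answer`, `output_accuracy`, `logit_accuracy`) and the toy data are hypothetical and chosen only for illustration. Output accuracy scores the answer letter parsed from generated text, while logit accuracy scores the highest-scoring answer letter, so a large gap between the two would suggest knowledge hidden behind a change in answer formatting.

```python
import re
from typing import Dict, Optional, Sequence

CHOICES = "ABCD"  # assumed answer labels for a multiple-choice benchmark


def prepend_filler(prompt: str, filler: str) -> str:
    """Prompt attack sketch: place unrelated filler text before the original prompt."""
    return f"{filler}\n\n{prompt}"


def parse_answer(generation: str) -> Optional[str]:
    """Return the first standalone uppercase A-D letter in the text, or None.

    Deliberately naive: a sentence starting with the article "A" would be
    misread as an answer, so a real parser would need stricter rules.
    """
    match = re.search(r"\b([ABCD])\b", generation)
    return match.group(1) if match else None


def output_accuracy(generations: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions whose parsed generated answer matches the key."""
    correct = sum(parse_answer(g) == a for g, a in zip(generations, answers))
    return correct / len(answers)


def logit_accuracy(choice_logits: Sequence[Dict[str, float]], answers: Sequence[str]) -> float:
    """Fraction of questions where the highest-logit answer letter matches the key."""
    correct = sum(
        max(CHOICES, key=lambda c: logits[c]) == a
        for logits, a in zip(choice_logits, answers)
    )
    return correct / len(answers)


# Toy data (made up for illustration, not results from the paper).
answers = ["B", "A", "C"]
generations = ["The answer is B.", "I cannot help with that.", "C"]
choice_logits = [
    {"A": 0.1, "B": 2.0, "C": 0.3, "D": -1.0},
    {"A": 1.5, "B": 0.2, "C": 0.1, "D": 0.0},
    {"A": 0.0, "B": 0.1, "C": 3.0, "D": 0.2},
]

out_acc = output_accuracy(generations, answers)
log_acc = logit_accuracy(choice_logits, answers)
attacked_prompt = prepend_filler("Question: ...", "<filler text>")
```

In this toy case the second generation is a refusal, so it counts as wrong under output accuracy even though the correct letter has the highest logit. Comparing the two metrics across many questions is one way to check whether low output accuracy reflects missing knowledge or only a changed answer format.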