AI · Mar 11

The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning

Raj Sanjay Shah, Jing Huang, Keerthiram Murugesan, Nathalie Baracaldo, Diyi Yang

arXiv:2603.11266v117.11 citations · h-index: 27

Predicted impact: top 21% in AI · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses the need for more robust evaluation of unlearning methods in LLMs, which matters for ensuring safety, mitigating biases, and complying with legal mandates such as the right to be forgotten. It represents an incremental improvement over existing benchmarks.

The paper argues that existing unlearning methods for Large Language Models are brittle and that current evaluation metrics fail to detect their vulnerabilities. To address this, it proposes a dynamic framework that stress tests unlearning robustness with complex structured queries. The framework shows comparable coverage to existing benchmarks, aligns with prior evaluations, and uncovers unlearning failures that other benchmarks miss, particularly in multi-hop settings.

Unlearning in Large Language Models (LLMs) aims to enhance safety, mitigate biases, and comply with legal mandates, such as the right to be forgotten. However, existing unlearning methods are brittle: minor query modifications, such as multi-hop reasoning and entity aliasing, can recover supposedly forgotten information. As a result, current evaluation metrics often create an illusion of effectiveness, failing to detect these vulnerabilities due to reliance on static, unstructured benchmarks. We propose a dynamic framework that stress tests unlearning robustness using complex structured queries. Our approach first elicits knowledge from the target model (pre-unlearning) and constructs targeted probes, ranging from simple queries to multi-hop chains, allowing precise control over query difficulty. Our experiments show that the framework (1) shows comparable coverage to existing benchmarks by automatically generating semantically equivalent Q&A probes, (2) aligns with prior evaluations, and (3) uncovers new unlearning failures missed by other benchmarks, particularly in multi-hop settings. Furthermore, activation analyses show that single-hop queries typically follow dominant computation pathways, which are more likely to be disrupted by unlearning methods. In contrast, multi-hop queries tend to use alternative pathways that often remain intact, explaining the brittleness of unlearning techniques in multi-hop settings. Our framework enables practical and scalable evaluation of unlearning methods without the need for manual construction of forget test sets, enabling easier adoption for real-world applications. We release the pip package and the code at https://sites.google.com/view/unlearningmirage/home.
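To make the probe-construction idea in the abstract concrete, here is a minimal sketch of how single-hop and multi-hop probes with entity aliases could be generated from elicited knowledge. It is not the API of the released pip package. The triples, the alias table, the question template, and the names `build_chains`, `make_probes`, and `leak_rate_by_hops` are all hypothetical, and the substring-match leak check is a simplifying assumption.

```python
from dataclasses import dataclass

# Hypothetical toy triples standing in for knowledge elicited from the target
# model before unlearning. All entities and relations are invented.
Triple = tuple[str, str, str]

TRIPLES: list[Triple] = [
    ("Alice Example", "employer", "Acme Labs"),
    ("Acme Labs", "headquarters", "Springfield"),
    ("Springfield", "country", "Freedonia"),
]

# Hypothetical alias table used to test entity aliasing.
ALIASES = {"Alice Example": ["A. Example"]}


@dataclass(frozen=True)
class Probe:
    question: str
    answer: str
    hops: int  # number of relations chained; used as a difficulty knob


def build_chains(triples, start, max_hops):
    """Enumerate relation chains of length 1..max_hops starting at `start`."""
    by_subject = {}
    for subj, rel, obj in triples:
        by_subject.setdefault(subj, []).append((rel, obj))
    chains = []
    frontier = [((), start)]
    for _ in range(max_hops):
        next_frontier = []
        for relations, node in frontier:
            for rel, obj in by_subject.get(node, []):
                chain = (relations + (rel,), obj)
                chains.append(chain)
                next_frontier.append(chain)
        frontier = next_frontier
    return chains


def render(entity, relations):
    """Nest relations into one question, innermost relation first."""
    phrase = entity
    for rel in relations:
        phrase = f"the {rel} of {phrase}"
    return f"What is {phrase}?"


def make_probes(triples, entity, aliases, max_hops=3):
    """Build probes for the entity and each alias, at every hop depth."""
    names = [entity] + aliases.get(entity, [])
    probes = []
    for relations, answer in build_chains(triples, entity, max_hops):
        for name in names:
            probes.append(Probe(render(name, relations), answer, len(relations)))
    return probes


def leak_rate_by_hops(probes, answer_fn):
    """Fraction of probes per hop count whose gold answer appears in the reply."""
    totals, leaks = {}, {}
    for probe in probes:
        totals[probe.hops] = totals.get(probe.hops, 0) + 1
        if probe.answer.lower() in answer_fn(probe.question).lower():
            leaks[probe.hops] = leaks.get(probe.hops, 0) + 1
    return {hops: leaks.get(hops, 0) / n for hops, n in totals.items()}
```

In use, `answer_fn` would wrap a call to the unlearned model. A leak rate that stays high for larger hop counts while dropping to zero for single-hop probes would correspond to the multi-hop failure mode the abstract describes.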

