CL · AI · Feb 28

A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction

arXiv:2603.00823v11.11 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses a critical gap for LLM developers and users by showing that current unlearning methods may not be robust in real-world interactive use. The advance is incremental because it builds on prior static evaluations.

The paper tackles the problem of evaluating machine unlearning for large language models (LLMs) in interactive settings. It finds that knowledge that appears forgotten in static tests can often be recovered through multi-turn interactions such as self-correction and dialogue-conditioned querying.

Machine unlearning aims to remove the influence of specific training data from pre-trained models without retraining from scratch, and is increasingly important for large language models (LLMs) due to safety, privacy, and legal concerns. Although prior work primarily evaluates unlearning in static, single-turn settings, forgetting robustness under realistic interactive use remains underexplored. In this paper, we study whether unlearning remains stable in interactive environments by examining two common interaction patterns: self-correction and dialogue-conditioned querying. We find that knowledge appearing forgotten in static evaluation can often be recovered through interaction. Although stronger unlearning improves apparent robustness, it often results in behavioral rigidity rather than genuine knowledge erasure. Our findings suggest that static evaluation may overestimate real-world effectiveness and highlight the need for ensuring stable forgetting under interactive settings.
