LG · AI · CV · May 28, 2025

From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization

arXiv:2505.22310v1 · 12 citations · h-index: 27
Originality: Incremental advance
AI Analysis

The paper addresses a security problem for machine learning practitioners by making unlearning more tamper-resistant. The advance is incremental because it builds on existing unlearning methods.

The paper tackles the vulnerability of unlearning methods to relearning attacks, in which supposedly unlearned knowledge re-emerges after fine-tuning. It finds that resistance to these attacks can be predicted from weight-space properties, and it uses this finding to propose new methods that achieve state-of-the-art resistance.

Recent unlearning methods for LLMs are vulnerable to relearning attacks: knowledge believed-to-be-unlearned re-emerges by fine-tuning on a small set of (even seemingly-unrelated) examples. We study this phenomenon in a controlled setting for example-level unlearning in vision classifiers. We make the surprising discovery that forget-set accuracy can recover from around 50% post-unlearning to nearly 100% with fine-tuning on just the retain set -- i.e., zero examples of the forget set. We observe this effect across a wide variety of unlearning methods, whereas for a model retrained from scratch excluding the forget set (gold standard), the accuracy remains at 50%. We observe that resistance to relearning attacks can be predicted by weight-space properties, specifically, $L_2$-distance and linear mode connectivity between the original and the unlearned model. Leveraging this insight, we propose a new class of methods that achieve state-of-the-art resistance to relearning attacks.
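
The two weight-space quantities named in the abstract can be illustrated with a minimal sketch. It assumes model weights are stored as a dict of NumPy arrays and measures linear mode connectivity with a commonly used "loss barrier": the largest excess of the loss along the straight line between two models over the linear interpolation of the endpoint losses. The helper names (`flatten`, `l2_distance`, `interpolate`, `loss_barrier`) and the toy loss are hypothetical, and the paper's exact definitions may differ.

```python
import numpy as np

def flatten(weights):
    """Concatenate a dict of parameter arrays into one vector (sorted keys give a stable order)."""
    return np.concatenate([np.ravel(weights[k]) for k in sorted(weights)])

def l2_distance(w_a, w_b):
    """Euclidean distance between two models in weight space."""
    return float(np.linalg.norm(flatten(w_a) - flatten(w_b)))

def interpolate(w_a, w_b, alpha):
    """Point on the line segment between two models: (1 - alpha) * w_a + alpha * w_b."""
    return {k: (1.0 - alpha) * w_a[k] + alpha * w_b[k] for k in w_a}

def loss_barrier(w_a, w_b, loss_fn, steps=11):
    """Largest excess of the loss along the path over the linear baseline between endpoints.

    A barrier near zero suggests the two models are linearly mode connected
    (assumed definition, not necessarily the one used in the paper).
    """
    alphas = np.linspace(0.0, 1.0, steps)
    path = np.array([loss_fn(interpolate(w_a, w_b, a)) for a in alphas])
    baseline = (1.0 - alphas) * path[0] + alphas * path[-1]
    return float(np.max(path - baseline))

# Toy stand-ins for the original and the unlearned model (hypothetical values).
original = {"w": np.array([-1.0])}
unlearned = {"w": np.array([1.0])}

def toy_loss(weights):
    """Double-well loss with minima at w = -1 and w = +1 and a bump at w = 0."""
    w = weights["w"]
    return float(np.sum((w ** 2 - 1.0) ** 2))

distance = l2_distance(original, unlearned)
barrier = loss_barrier(original, unlearned, toy_loss)
```

In this toy case the two models sit in separate minima, so the barrier is large. With a real classifier, `loss_fn` would evaluate the network on a held-out batch, and the abstract's claim is that quantities of this kind predict how easily forget-set accuracy recovers under fine-tuning.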

