CL · Jan 31, 2025

Improving LLM Unlearning Robustness via Random Perturbations

arXiv:2501.19202v4 · 5 citations · h-index: 6
Originality: Incremental advance
AI Analysis

The paper addresses a critical vulnerability in LLM unlearning that matters for AI safety and reliability, though the advance is incremental because it builds on existing unlearning methods.

The paper tackles the problem that current LLM unlearning methods reduce model robustness, causing misbehavior when a forget-token is present in retain-queries, and proposes Random Noise Augmentation (RNA) to improve robustness while preserving performance.

Here, we show that current state-of-the-art LLM unlearning methods inherently reduce models' robustness, causing them to misbehave even when a single non-adversarial forget-token is present in the retain-query. To understand the underlying causes, we propose a novel theoretical framework that reframes the unlearning process as backdoor attacks and defenses: forget-tokens act as backdoor triggers that, when activated in retain-queries, disrupt unlearned models' behaviors, much as a successful backdoor attack would. In this sense, LLM unlearning methods themselves poison the model, make it more vulnerable to forget-tokens, and hide rather than erase target knowledge; this describes their true mechanism. To mitigate the vulnerability caused by the forgetting process, we reinterpret the retaining process as a backdoor defense and propose Random Noise Augmentation (RNA), a lightweight, model- and method-agnostic approach with theoretical guarantees for improving the robustness of models. Extensive experiments demonstrate that RNA significantly improves the robustness of unlearned models while preserving forget and retain performance. This backdoor attack-defense framework offers insights into the mechanism of unlearning that can shed light on future research directions for improving unlearning robustness.
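The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how forget-tokens are placed in retain-queries, what noise distribution RNA uses, or where the noise is injected. The helper names `insert_forget_token` and `add_gaussian_noise`, the random insertion position, the Gaussian noise, and the choice to perturb a plain list of floats standing in for a retain-query representation are all assumptions made for illustration.

```python
import random


def insert_forget_token(retain_query: str, forget_token: str, rng: random.Random) -> str:
    """Build a perturbed retain-query containing exactly one extra forget-token.

    Hypothetical helper: the insertion position is chosen uniformly at random
    over word boundaries, which is an assumption, not the paper's protocol.
    """
    words = retain_query.split()
    pos = rng.randint(0, len(words))  # inclusive bounds: may prepend or append
    return " ".join(words[:pos] + [forget_token] + words[pos:])


def add_gaussian_noise(representation: list[float], sigma: float, rng: random.Random) -> list[float]:
    """Add independent zero-mean Gaussian noise to each coordinate.

    Hypothetical stand-in for a random perturbation of a retain-query
    representation; the Gaussian form and the scale `sigma` are assumptions.
    """
    return [x + rng.gauss(0.0, sigma) for x in representation]


if __name__ == "__main__":
    rng = random.Random(0)
    print(insert_forget_token("What is the capital of France", "FORGET_TOKEN", rng))
    print(add_gaussian_noise([0.5, -1.0, 2.0], sigma=0.01, rng=rng))
```

A robustness check in this spirit would compare a model's answers on the original and the perturbed retain-queries; the noise helper only shows the general shape of a random perturbation applied during the retaining step.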

Code Implementations: 1 repo
