LG · AI · May 16

Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

Puning Yang, Junchi Yu, Qizhou Wang, Philip Torr, Bo Han, Xiuying Chen

arXiv:2605.16776 · 96.1 · Has Code

Predicted impact top 4% in LG · last 90 days · Originality: Highly original

AI Analysis

For LLM safety researchers, this work addresses the fundamental limitations of existing unlearning paradigms (biased deletion vs. knowledge re-emergence) by unifying erasure and refusal in a principled manner.

The paper proposes Distinguishable Deletion (D^2), a new paradigm for LLM unlearning that restricts response distributions in latent space rather than specific tokens, enabling both knowledge erasure and refusal. The method, implemented via Energy-based Unlearning Alignment (EUA), significantly outperforms prior approaches across multiple benchmarks.

Mitigating sensitive and harmful outputs is fundamental to the safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning suffers from biased deletion because it suppresses specific token sequences as a substitute for complete knowledge removal, whereas DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion ($\mathrm{D^2}$), a paradigm that erases undesirable knowledge by restricting the response distribution in the latent representation rather than over specific tokens, while keeping it distinguishable from retained knowledge so that a refusal mechanism can handle unlearned inputs safely and coherently. To implement $\mathrm{D^2}$, we introduce an energy index that quantifies both the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference. Extensive experiments demonstrate that EUA significantly outperforms previous methods, indicating the superiority of $\mathrm{D^2}$. Our code is available at https://github.com/Puning97/EUA-for-LLM-Unlearning.
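The abstract does not spell out how the energy index is computed, so the following is only an illustrative sketch of what an energy-thresholded refusal could look like. It assumes the free-energy score common in energy-based out-of-distribution detection, E = -T · logsumexp(logits / T), and assumes that higher energy signals absent (unlearned) knowledge. The function names, the threshold value, and the refusal string are hypothetical and are not taken from the EUA code.

```python
import math

def energy_score(logits, temperature=1.0):
    """Free energy E = -T * logsumexp(logits / T).

    Assumption: this is the standard energy score from the OOD-detection
    literature, used here as a stand-in for the paper's energy index.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse

def should_refuse(logits, threshold, temperature=1.0):
    """Refuse when energy exceeds the threshold.

    Assumption: unlearned inputs sit at higher energy than retained ones.
    """
    return energy_score(logits, temperature) > threshold

def respond(logits, threshold, answer, refusal="I can't help with that."):
    """Return the model's answer, or a fixed refusal for high-energy inputs."""
    return refusal if should_refuse(logits, threshold) else answer

# Hypothetical logits: a peaked distribution (confident, low energy)
# versus a flat one (uninformative, high energy).
peaked = [10.0, 0.0, 0.0]
flat = [0.0, 0.0, 0.0]
print(respond(peaked, threshold=-5.0, answer="<model answer>"))
print(respond(flat, threshold=-5.0, answer="<model answer>"))
```

In this toy setup the peaked logits fall below the threshold and are answered, while the flat logits exceed it and are refused. In practice the threshold would have to be calibrated on held-out retained and unlearned data.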
