CL · AI · Feb 15, 2024

UNDIAL: Self-Distillation with Adjusted Logits for Robust Unlearning in Large Language Models

arXiv:2402.10052v2 · 44 citations · h-index: 10 · NAACL
Originality: Incremental advance
AI Analysis

This work addresses privacy and safety concerns in large language models by providing a more stable unlearning method. The advance is incremental because it builds on existing unlearning techniques.

The paper tackles the problem of unstable unlearning in large language models when removing sensitive information, and introduces UnDIAL, a method that achieves robust unlearning with smooth convergence and scalability.

Mitigating the retention of sensitive or private information in large language models is essential for enhancing privacy and safety. Existing unlearning methods, like Gradient Ascent and Negative Preference Optimization, directly tune models to remove unwanted information. However, these methods often become unstable because they fine-tune by maximizing cross-entropy loss, which is the opposite of traditional loss minimization in learning. This reversal creates instability, especially on larger datasets, as the model struggles to balance unlearning with maintaining language capacity, leading to over-unlearning. In this paper, we introduce UnDIAL (Unlearning via Self-Distillation on Adjusted Logits), a novel and robust unlearning method. Our approach leverages self-distillation to adjust logits and selectively reduce the influence of targeted tokens. This technique ensures smooth convergence and avoids catastrophic forgetting, even in challenging unlearning tasks with large datasets and sequential unlearning requests. Extensive experiments show that UnDIAL can achieve both robustness in unlearning and scalability while maintaining stable training dynamics and resilience to hyperparameter tuning.
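The abstract does not spell out the adjustment itself, so the following is only a minimal sketch of the idea, not the authors' implementation. It assumes that a frozen copy of the model supplies teacher logits, that the logit of the token to be forgotten is lowered by a constant `gamma`, and that the student is trained to match the resulting soft distribution. The names `adjusted_target`, `distillation_loss`, `gradient_ascent_loss`, and the value of `gamma` are illustrative assumptions. The Gradient Ascent objective is included only for contrast.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adjusted_target(teacher_logits, target_id, gamma):
    """Lower the logit of the token to forget by gamma, then renormalise.

    Assumption: a constant subtraction on one logit; the abstract only says
    that logits are adjusted to reduce the influence of targeted tokens.
    """
    adjusted = list(teacher_logits)
    adjusted[target_id] -= gamma
    return softmax(adjusted)


def distillation_loss(student_logits, target_probs):
    """Cross-entropy of the student against the fixed soft target.

    This is minimised, and it is bounded below by the entropy of target_probs.
    """
    student_probs = softmax(student_logits)
    return -sum(p * math.log(q) for p, q in zip(target_probs, student_probs))


def gradient_ascent_loss(student_logits, target_id):
    """Negated cross-entropy on the token to forget (baseline, for contrast).

    Minimising log p(target) has no lower bound: it keeps decreasing as the
    probability of the target token goes to zero.
    """
    return math.log(softmax(student_logits)[target_id])


# Toy vocabulary of four tokens; token 0 is the one to forget.
teacher_logits = [4.0, 1.0, 0.5, 0.0]
target_id = 0
gamma = 3.0  # hypothetical adjustment strength

original = softmax(teacher_logits)
target = adjusted_target(teacher_logits, target_id, gamma)

# Loss before any update, and loss once the student matches the target.
loss_start = distillation_loss(teacher_logits, target)
loss_matched = distillation_loss([1.0, 1.0, 0.5, 0.0], target)
target_entropy = -sum(p * math.log(p) for p in target)
```

In this toy setting the soft target lowers the probability of token 0 while leaving the relative odds of the other tokens unchanged, and the distillation loss reaches its minimum (the entropy of the target) once the student matches it. The Gradient Ascent objective has no such minimum, which is one way to read the instability the abstract describes.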

Code Implementations: 1 repo
