CL CR LG ML

Jun 12, 2025

UCD: Unlearning in LLMs via Contrastive Decoding

Vinith M. Suriyakumar, Ayush Sekhari, Ashia Wilson

arXiv:2506.12097v1

12.08 citations

h-index: 14

Originality: Incremental advance

AI Analysis

This work addresses the need to remove sensitive or undesirable content from LLMs. It appears incremental: it builds on existing unlearning methods, and its new contribution is an inference-time technique.

The paper tackles the problem of removing specific information from large language models while preserving overall performance. It proposes an inference-time unlearning algorithm that applies contrastive decoding with two smaller auxiliary models. On the TOFU and MUSE benchmarks, the authors report gains in both forget quality and retained performance over prior approaches.

Machine unlearning aims to remove specific information, e.g. sensitive or undesirable content, from large language models (LLMs) while preserving overall performance. We propose an inference-time unlearning algorithm that uses contrastive decoding, leveraging two auxiliary smaller models, one trained without the forget set and one trained with it, to guide the outputs of the original model using their difference during inference. Our strategy substantially improves the tradeoff between unlearning effectiveness and model utility. We evaluate our approach on two unlearning benchmarks, TOFU and MUSE. Results show notable gains in both forget quality and retained performance in comparison to prior approaches, suggesting that incorporating contrastive decoding can offer an efficient, practical avenue for unlearning concepts in large-scale models.
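The abstract describes steering the original model with the difference between two auxiliary models but does not give the combination rule. The sketch below shows one common way such a contrastive adjustment can be written: add to the original model's logits the scaled gap between the auxiliary model trained without the forget set and the one trained with it. The additive form, the `alpha` weight, the function names, and the toy logits are all assumptions made for illustration, not the paper's stated method.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(base, clean_aux, forget_aux, alpha=1.0):
    """Shift the original model's next-token logits by the auxiliary difference.

    base       : logits from the original (large) model
    clean_aux  : logits from the small model trained WITHOUT the forget set
    forget_aux : logits from the small model trained WITH the forget set
    alpha      : hypothetical strength of the adjustment (an assumption)
    """
    return [b + alpha * (c - f) for b, c, f in zip(base, clean_aux, forget_aux)]


# Toy next-token distribution over a three-word vocabulary (made-up numbers).
vocab = ["Paris", "London", "unknown"]
base_logits = [3.0, 1.0, 0.5]        # original model favors the forget-set answer
clean_aux_logits = [0.5, 0.6, 1.0]   # auxiliary model that never saw the forget set
forget_aux_logits = [2.5, 0.5, 0.2]  # auxiliary model that did see it

before = softmax(base_logits)
adjusted = contrastive_logits(base_logits, clean_aux_logits, forget_aux_logits)
after = softmax(adjusted)

for token, p_before, p_after in zip(vocab, before, after):
    print(f"{token:8s} before={p_before:.3f} after={p_after:.3f}")
```

In this toy case the token that only the forget-set auxiliary model favors loses probability, while tokens on which the two auxiliary models agree are left almost unchanged. When both auxiliary models produce identical logits, the adjustment vanishes and the original model's output is returned as-is.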
