CL · Feb 2

CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment

arXiv:2602.02824v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work addresses safety and privacy issues in LLMs by improving unlearning techniques. The advance is incremental because it builds on existing negative preference alignment methods.

The paper tackles the problem of selectively removing undesirable knowledge from large language models (LLMs) to address safety and privacy concerns. It introduces CATNIP, a method that reports a better tradeoff between forgetting undesirable knowledge and preserving general knowledge than state-of-the-art methods on the MUSE and WMDP benchmarks.

Pretrained knowledge memorized in LLMs raises critical concerns over safety and privacy, which has motivated LLM Unlearning as a technique for selectively removing the influence of undesirable knowledge. Existing approaches, rooted in Gradient Ascent (GA), often degrade general domain knowledge while relying on retention data or curated contrastive pairs, which can be impractical or prohibitive in data and compute. Negative Preference Alignment has been explored for unlearning as a way around the limitations of GA, but it remains constrained by its choice of reference model and underperforms in realistic data settings. These limitations raise two key questions: i) Can we achieve effective unlearning that quantifies model confidence in undesirable knowledge and uses it to calibrate gradient updates more precisely, thus reducing catastrophic forgetting? ii) Can we make unlearning robust to data scarcity and length variation? We answer both questions affirmatively with CATNIP (Calibrated and Tokenized Negative Preference Alignment), a principled method that rescales unlearning effects in proportion to the model's token-level confidence, thus ensuring fine-grained control over forgetting. Extensive evaluations on the MUSE and WMDP benchmarks demonstrate that our work enables effective unlearning without requiring retention data or contrastive unlearning response pairs, with stronger knowledge forgetting and preservation tradeoffs than state-of-the-art methods.
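The abstract does not state the CATNIP objective, so the following is only a rough sketch of the two ideas it names, not the paper's method. `npo_sequence_loss` assumes the commonly used sequence-level negative preference form, which needs a reference model and sums log-probabilities over the whole sequence. `tokenized_loss` and `token_grad_weights` are hypothetical names for an illustrative reference-free, per-token variant: its gradient with respect to each token's log-probability is 2·p^β / (1 + p^β), so confidently predicted tokens receive larger updates, and averaging over tokens keeps the loss from growing with sequence length. The value of `beta` and the example probabilities are assumptions chosen for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def npo_sequence_loss(logp_model, logp_ref, beta=0.1):
    """Sequence-level negative-preference loss for one forget sample.

    Assumed form: -(2 / beta) * log sigmoid(-beta * log(pi_model / pi_ref)).
    It depends on a reference model and on the summed (length-dependent)
    sequence log-probability.
    """
    log_ratio = sum(logp_model) - sum(logp_ref)
    return -(2.0 / beta) * log_sigmoid(-beta * log_ratio)


def token_grad_weights(logp_model, beta=1.0):
    """Hypothetical per-token gradient scale: 2 * p^beta / (1 + p^beta).

    Close to 0 for tokens the model already finds unlikely, and equal to 1
    for a token predicted with probability 1.
    """
    weights = []
    for lp in logp_model:
        p_beta = math.exp(beta * lp)  # lp <= 0, so this never overflows
        weights.append(2.0 * p_beta / (1.0 + p_beta))
    return weights


def tokenized_loss(logp_model, beta=1.0):
    """Hypothetical reference-free token-level loss (illustration only).

    mean over tokens of (2 / beta) * log(1 + p_t^beta). Its derivative with
    respect to each token log-prob, before the 1/T averaging factor, is the
    matching entry of token_grad_weights.
    """
    per_token = [
        (2.0 / beta) * math.log1p(math.exp(beta * lp)) for lp in logp_model
    ]
    return sum(per_token) / len(per_token)


# Example forget sequence: one confident, one uncertain, one unlikely token.
forget_logps = [math.log(0.9), math.log(0.5), math.log(0.01)]
weights = token_grad_weights(forget_logps)
loss = tokenized_loss(forget_logps)
```

In this sketch the confident token dominates the update while the already-unlikely token is left nearly untouched, which is one plausible reading of "rescales unlearning effects in proportion to the model's token-level confidence". The paper's actual calibration may differ.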

