Temper-Then-Tilt: Principled Unlearning for Generative Models through Tempering and Classifier Guidance
This work addresses the challenge of efficiently removing specific data from generative models. The contribution is incremental: it builds on existing unlearning methods.
The paper tackles machine unlearning in large generative models when the forget set follows a concentrated data distribution. It introduces T3-Unlearning, which improves forget quality and generative utility on the TOFU benchmark while training only a fraction of the parameters with minimal runtime.
We study machine unlearning in large generative models by framing the task as density ratio estimation to a target distribution rather than supervised fine-tuning. While classifier guidance is a standard approach for approximating this ratio and can succeed in general, we show it can fail to faithfully unlearn with finite samples when the forget set represents a sharp, concentrated data distribution. To address this, we introduce Temper-Then-Tilt Unlearning (T3-Unlearning), which freezes the base model and applies a two-step inference procedure: (i) tempering the base distribution to flatten high-confidence spikes, and (ii) tilting the tempered distribution using a lightweight classifier trained to distinguish retain from forget samples. Our theoretical analysis provides finite-sample guarantees linking the surrogate classifier's risk to unlearning error, proving that tempering is necessary for successful unlearning of concentrated distributions. Empirical evaluations on the TOFU benchmark show that T3-Unlearning improves forget quality and generative utility over existing baselines, while training only a fraction of the parameters with minimal runtime.
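The two-step inference procedure described above can be illustrated with a toy next-token distribution. The sketch below is not the paper's implementation. It assumes tempering means raising the base probabilities to the power 1/T and renormalizing, and that tilting means multiplying by a classifier-derived density ratio, here supplied as hypothetical per-token log-odds of "retain" versus "forget". The function names, the temperature values, and the toy logits are all illustrative assumptions.

```python
import numpy as np

def temper(log_probs, temperature):
    """Flatten a distribution (temperature > 1) given as log-probabilities."""
    scaled = log_probs / temperature
    scaled = scaled - scaled.max()          # subtract max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

def tilt(probs, retain_log_odds):
    """Reweight probs by exp(retain_log_odds) and renormalize.

    retain_log_odds is assumed to come from a classifier that separates
    retain samples from forget samples (log c / (1 - c)).
    """
    log_w = np.log(probs) + retain_log_odds
    log_w = log_w - log_w.max()
    w = np.exp(log_w)
    return w / w.sum()

def temper_then_tilt(base_logits, retain_log_odds, temperature=2.0):
    """Frozen base model logits -> tempered -> tilted distribution."""
    log_probs = base_logits - np.logaddexp.reduce(base_logits)
    return tilt(temper(log_probs, temperature), retain_log_odds)

# Toy setup (assumed values): token 0 is a memorized "forget" answer with a
# sharp spike, and the classifier gives it negative retain log-odds.
base_logits = np.array([8.0, 1.0, 0.5, 0.0])
retain_log_odds = np.array([-3.0, 1.0, 1.0, 1.0])

tilt_only = temper_then_tilt(base_logits, retain_log_odds, temperature=1.0)
t3 = temper_then_tilt(base_logits, retain_log_odds, temperature=4.0)

print("tilt only      :", np.round(tilt_only, 3))
print("temper + tilt  :", np.round(t3, 3))
```

In this toy case, tilting alone leaves most of the probability mass on the spiked token because the bounded classifier correction cannot offset the base model's confidence. Tempering first shrinks the spike, so the same correction is enough to suppress it. This mirrors the abstract's claim qualitatively; it says nothing about the paper's finite-sample guarantees.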