LG, CL · Apr 18, 2025

A mean teacher algorithm for unlearning of language models

arXiv:2504.13388v1 · 2 citations · h-index: 7
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of unlearning in language models for privacy or compliance reasons, but the advance is incremental because it builds on existing methods.

The paper tackles the problem of reducing memorization of selected text instances in language models without degrading utility. It applies the mean teacher algorithm together with a new negative log-unlikelihood (NLUL) loss and reports improvements on some metrics of the MUSE benchmarks.

One goal of language model unlearning is to reduce memorization of selected text instances while retaining the model's general abilities. Despite the variety of proposed methods, reducing memorization of large datasets without noticeable degradation in model utility remains challenging. In this paper, we investigate the mean teacher algorithm (Tarvainen & Valpola, 2017), a simple proximal optimization method from the continual learning literature that gradually modifies the teacher model. We show that the mean teacher can approximate a trajectory of slow natural gradient descent (NGD), which inherently seeks low-curvature updates that are less likely to degrade model utility. Because slow NGD can suffer from vanishing gradients, we introduce a new unlearning loss, "negative log-unlikelihood" (NLUL), that avoids this problem. We show that combining the mean teacher with NLUL improves some metrics on the MUSE benchmarks (Shi et al., 2024).
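
To make the mechanics in the abstract more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes the "model" is just a vector of logits over four tokens, that the unlearning loss has the unlikelihood-style form -log(1 - p_target) (the paper's exact NLUL definition may differ), and that the student is tied to the teacher by a quadratic proximal penalty. The function names and the hyperparameters `lr`, `prox`, and `ema` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def target_prob(logits, target):
    """Probability the toy model assigns to the token to be unlearned."""
    return softmax(logits)[target]


def unlikelihood_loss(logits, target):
    """Assumed unlikelihood-style loss: -log(1 - p_target)."""
    return -math.log1p(-target_prob(logits, target))


def unlikelihood_grad(logits, target):
    """Gradient of -log(1 - p_target) with respect to each logit.

    Uses dp_t/dz_j = p_t * (delta_tj - p_j), so the gradient is
    p_t * (delta_tj - p_j) / (1 - p_t).
    """
    p = softmax(logits)
    pt = p[target]
    return [
        pt * ((1.0 if j == target else 0.0) - p[j]) / (1.0 - pt)
        for j in range(len(logits))
    ]


def mean_teacher_unlearn(logits, target, steps=200, lr=0.5, prox=1.0, ema=0.9):
    """Toy mean teacher loop (hypothetical hyperparameters).

    The student takes a gradient step on the unlearning loss plus a
    proximal penalty (prox / 2) * ||student - teacher||^2. The teacher
    then moves slowly toward the student via an exponential moving average.
    """
    teacher = list(logits)
    student = list(logits)
    for _ in range(steps):
        grad = unlikelihood_grad(student, target)
        student = [
            s - lr * (g + prox * (s - t))
            for s, g, t in zip(student, grad, teacher)
        ]
        teacher = [ema * t + (1.0 - ema) * s for t, s in zip(teacher, student)]
    return teacher, student


if __name__ == "__main__":
    start = [4.0, 0.0, 0.0, 0.0]  # token 0 is strongly "memorized"
    teacher, student = mean_teacher_unlearn(start, target=0)
    print("initial p(target):", target_prob(start, 0))
    print("teacher p(target):", target_prob(teacher, 0))
    print("student p(target):", target_prob(student, 0))
```

In this sketch the teacher lags behind the student, so the probability of the target token falls gradually rather than in one large step. A real language model would replace the logit vector with network weights and average the loss over the tokens of the forget set.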

Code Implementations: 1 repo