CL · Jun 13, 2024

REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space

arXiv:2406.09325v6 · 16 citations
Originality: Incremental advance
AI Analysis

This work addresses privacy concerns for users of language models by providing a method to unlearn sensitive information. It is an incremental advance, building on existing unlearning and model editing approaches.

The paper tackles the problem of language models inadvertently memorizing and divulging sensitive information. It proposes REVS, a non-gradient-based method that modifies a small subset of neurons relevant to the constituent tokens of that information. The authors report superior unlearning performance and robustness to extraction attacks compared with other methods, while retaining model integrity.

Language models (LMs) risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in training data, causing privacy concerns. Current approaches to address this issue involve costly dataset scrubbing, or model filtering through unlearning and model editing, which can be bypassed through extraction attacks. We propose REVS, a novel non-gradient-based method for unlearning sensitive information from LMs. REVS identifies and modifies a small subset of neurons relevant for constituent tokens that form sensitive information. To adequately evaluate our method on truly sensitive information, we curate three datasets: email and URL datasets naturally memorized by the models, and a synthetic social security number dataset that we tune the models to memorize. Compared to other methods, REVS demonstrates superior performance in unlearning sensitive information and robustness to extraction attacks, while retaining underlying model integrity.
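
As a rough illustration of what editing a token's rank in the vocabulary space could look like, the sketch below projects a single neuron vector through an unembedding matrix, measures where a target token ranks among all vocabulary tokens, and nudges the neuron until that token drops to a chosen rank. This is not the REVS implementation: the function names (`project`, `token_rank`, `demote_token`), the toy identity unembedding matrix, and the fixed-step update rule are all assumptions made for illustration only.

```python
def project(neuron, unembed):
    """Score every vocabulary token: dot product of its unembedding row with the neuron."""
    return [sum(w * x for w, x in zip(row, neuron)) for row in unembed]


def token_rank(logits, token_id):
    """Rank of a token by score; 0 means it is the top-scoring token."""
    target = logits[token_id]
    return sum(1 for v in logits if v > target)


def demote_token(neuron, unembed, token_id, min_rank, step=0.5, max_iters=1000):
    """Hypothetical update rule: subtract the token's unembedding direction
    from the neuron until the token's rank is at least min_rank."""
    row = unembed[token_id]
    edited = list(neuron)
    for _ in range(max_iters):
        if token_rank(project(edited, unembed), token_id) >= min_rank:
            break
        edited = [x - step * w for x, w in zip(edited, row)]
    return edited


# Toy setup (assumed): 5 tokens, 5 dimensions, identity unembedding,
# so each token's score equals the matching neuron coordinate.
unembed = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
neuron = [5.0, 4.0, 3.0, 2.0, 1.0]

rank_before = token_rank(project(neuron, unembed), 0)
edited = demote_token(neuron, unembed, token_id=0, min_rank=3)
rank_after = token_rank(project(edited, unembed), 0)
```

In this toy setting only the coordinate tied to the target token changes, which mirrors the idea of touching a small part of the model while leaving the scores of other tokens intact. A real model would have a dense unembedding matrix, so an edit along one token's direction would also shift other tokens' scores.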

