LG · AI · CL · Jun 17, 2024

Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs

arXiv:2406.11780v1 · 22 citations
Originality: Incremental advance
AI Analysis

This work addresses safety risks in LLMs for users and society, offering an incremental enhancement to existing unlearning techniques.

The paper addresses the removal of harmful behaviors and knowledge from large language models (LLMs) to improve safety. It proposes the SPUNGE framework, which splits the unlearning data by attribute, unlearns each subset separately, and merges the resulting models. The authors report significant performance improvements for two unlearning methods while maintaining general capabilities.

Large language models (LLMs) have been shown to pose social and ethical risks such as generating toxic language or facilitating malicious use of hazardous knowledge. Machine unlearning is a promising approach to improve LLM safety by directly removing harmful behaviors and knowledge. In this paper, we propose "SPlit, UNlearn, MerGE" (SPUNGE), a framework that can be used with any unlearning method to amplify its effectiveness. SPUNGE leverages data attributes during unlearning by splitting unlearning data into subsets based on specific attribute values, unlearning each subset separately, and merging the unlearned models. We empirically demonstrate that SPUNGE significantly improves the performance of two recent unlearning methods on state-of-the-art LLMs while maintaining their general capabilities on standard academic benchmarks.
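
The split, unlearn, merge loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: model weights are stood in for by a plain dictionary of floats, `toy_unlearn` is a hypothetical placeholder for whatever unlearning method is plugged in, and the merge step uses simple parameter averaging as an assumed stand-in, which may differ from the merging method used in the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List

Params = Dict[str, float]      # toy stand-in for model weights (assumption)
Example = Dict[str, object]    # one unlearning example with attribute fields


def split_by_attribute(data: List[Example], attribute: str) -> Dict[Hashable, List[Example]]:
    """Split: group unlearning examples by the value of one attribute."""
    subsets = defaultdict(list)
    for example in data:
        subsets[example[attribute]].append(example)
    return dict(subsets)


def merge_by_averaging(models: List[Params]) -> Params:
    """Merge: average each parameter across models (illustrative choice)."""
    if not models:
        raise ValueError("need at least one model to merge")
    return {k: sum(m[k] for m in models) / len(models) for k in models[0]}


def spunge(
    base: Params,
    data: List[Example],
    attribute: str,
    unlearn: Callable[[Params, List[Example]], Params],
) -> Params:
    """Split the data, unlearn each subset from a copy of the base, then merge."""
    subsets = split_by_attribute(data, attribute)
    # Each subset is unlearned independently, starting from the same base model.
    unlearned = [unlearn(dict(base), subset) for subset in subsets.values()]
    return merge_by_averaging(unlearned)


def toy_unlearn(params: Params, subset: List[Example]) -> Params:
    """Hypothetical unlearning step: shrink weight 'w' by 0.1 per example."""
    params["w"] -= 0.1 * len(subset)
    return params


base_model = {"w": 1.0, "b": 0.5}
unlearning_data = [
    {"text": "example 1", "group": "a"},
    {"text": "example 2", "group": "a"},
    {"text": "example 3", "group": "b"},
]
merged_model = spunge(base_model, unlearning_data, "group", toy_unlearn)
```

Because `spunge` only takes `unlearn` as a callable, any unlearning routine with the same signature could be swapped in, which mirrors the abstract's claim that the framework is method-agnostic.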

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
