CL | AI | Feb 26, 2025

FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge

arXiv:2502.19207v2 | 6 citations | h-index: 12 | EMNLP
Originality: Incremental advance
AI Analysis

This work targets superficial unlearning in language models, a concern for applications that require privacy and security. It is an incremental improvement over prior methods, distinguished by its focus on interconnected knowledge.

The paper addresses faithful knowledge removal in language models, where knowledge interconnected with the forgotten facts must also be handled. It introduces a benchmark, FaithUn, and a method, KLUE, that updates only knowledge-related neurons. In the reported experiments, KLUE is significantly more effective than existing methods at real-world QA unlearning.

Various studies have attempted to remove sensitive or private knowledge from a language model to prevent its unauthorized exposure. However, prior studies have overlooked the complex and interconnected nature of knowledge, where related knowledge must be carefully examined. Specifically, they have failed to evaluate whether an unlearning method faithfully erases interconnected knowledge that should be removed while retaining knowledge that appears relevant but exists in a completely different context. To resolve this problem, we first define a new concept called superficial unlearning, which refers to the phenomenon where an unlearning method either fails to erase the interconnected knowledge it should remove or unintentionally erases irrelevant knowledge. Based on this definition, we introduce a new benchmark, FaithUn, to analyze and evaluate the faithfulness of unlearning in real-world knowledge QA settings. Furthermore, we propose a novel unlearning method, KLUE, which updates only knowledge-related neurons to achieve faithful unlearning. KLUE identifies knowledge neurons using an explainability method and updates only those neurons using selected unforgotten samples. Experimental results demonstrate that widely used unlearning methods fail to ensure faithful unlearning, while our method shows significant effectiveness in real-world QA unlearning.
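
The idea of updating only knowledge-related neurons can be illustrated with a toy sketch. This is not the paper's KLUE implementation. It assumes a one-hidden-layer model in pure Python, and it uses gradient-times-activation as a stand-in for the unspecified explainability method. It also omits the selection of unforgotten samples. Every function name and number below is hypothetical.

```python
# Toy sketch of attribution-masked unlearning (hypothetical names throughout;
# not the paper's KLUE code).

def forward(x, w_in, w_out):
    """Return ReLU hidden activations and the scalar output."""
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w_in]
    output = sum(h * wo for h, wo in zip(hidden, w_out))
    return hidden, output

def neuron_attributions(x, w_in, w_out):
    """Score each hidden neuron by |activation * d(output)/d(activation)|.

    Assumption: gradient-times-activation stands in for the explainability
    method. With a linear readout, d(output)/d(hidden[j]) equals w_out[j].
    """
    hidden, _ = forward(x, w_in, w_out)
    return [abs(h * wo) for h, wo in zip(hidden, w_out)]

def top_k_mask(scores, k):
    """Mark the k highest-scoring neurons as 'knowledge neurons'."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    chosen = set(ranked[:k])
    return [j in chosen for j in range(len(scores))]

def masked_unlearning_step(x, w_in, w_out, mask, lr=0.1):
    """One gradient step on loss = 0.5 * output**2 for the forget input x.

    Only readout weights of masked neurons change; all others are frozen.
    d(loss)/d(w_out[j]) = output * hidden[j].
    """
    hidden, output = forward(x, w_in, w_out)
    return [wo - lr * output * h if m else wo
            for wo, h, m in zip(w_out, hidden, mask)]

# Hypothetical toy model: 2 inputs, 3 hidden neurons, 1 output.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_out = [2.0, 0.5, -0.1]
x_forget = [1.0, 0.0]  # stands in for a fact to be forgotten

scores = neuron_attributions(x_forget, w_in, w_out)
mask = top_k_mask(scores, k=1)
new_w_out = masked_unlearning_step(x_forget, w_in, w_out, mask)

_, before = forward(x_forget, w_in, w_out)
_, after = forward(x_forget, w_in, new_w_out)
print(mask, abs(before), abs(after))
```

In this toy setup the model's response to the forget input shrinks while the weights of unselected neurons stay untouched, which is the intuition behind restricting updates to a small set of neurons. Whether such a restriction avoids superficial unlearning in a real language model is what a benchmark like FaithUn is meant to measure.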
