CR · AI · CL · LG · Jun 19, 2024

Textual Unlearning Gives a False Sense of Unlearning

arXiv:2406.13348v3 · 13 citations
Originality: Highly original
AI Analysis

This work addresses privacy risks for users of language models by revealing that existing unlearning mechanisms are unreliable and potentially harmful, highlighting an incremental but critical gap in machine unlearning security.

The paper examines textual unlearning in language models and shows that current methods are ineffective and can even increase privacy risk: the authors' attacks detect unlearned texts with high confidence and expose them to membership inference and data reconstruction.

Language Models (LMs) are prone to "memorizing" training data, including substantial sensitive user information. To mitigate privacy risks and safeguard the right to be forgotten, machine unlearning has emerged as a promising approach for enabling LMs to efficiently "forget" specific texts. However, despite the good intentions, is textual unlearning really as effective and reliable as expected? To address the concern, we first propose Unlearning Likelihood Ratio Attack+ (U-LiRA+), a rigorous textual unlearning auditing method, and find that unlearned texts can still be detected with very high confidence after unlearning. Further, we conduct an in-depth investigation on the privacy risks of textual unlearning mechanisms in deployment and present the Textual Unlearning Leakage Attack (TULA), along with its variants in both black- and white-box scenarios. We show that textual unlearning mechanisms could instead reveal more about the unlearned texts, exposing them to significant membership inference and data reconstruction risks. Our findings highlight that existing textual unlearning actually gives a false sense of unlearning, underscoring the need for more robust and secure unlearning mechanisms.
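
To make the likelihood-ratio idea behind an audit such as U-LiRA+ more concrete, here is a minimal Python sketch of a generic likelihood-ratio test. It is not the paper's procedure: the abstract gives no algorithmic details, so the function names, the Gaussian fit, and the toy loss values below are all assumptions made for illustration. The sketch assumes an auditor has per-text losses from shadow models where the text was trained on and then unlearned, and from shadow models that never saw the text, and asks which of the two the target model's loss resembles more.

```python
import math
from statistics import mean, stdev


def gaussian_logpdf(x, mu, sigma):
    """Log-density of a normal distribution N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def unlearning_llr(observed_loss, unlearned_losses, unseen_losses, min_sigma=1e-6):
    """Hypothetical helper: log-likelihood ratio for one audited text.

    Fits one Gaussian to losses from shadow models that learned and then
    unlearned the text, and another to losses from shadow models that never
    saw it. A positive score means the observed loss looks more like the
    "unlearned" world, i.e. the text is still distinguishable from unseen data.
    """
    mu_u = mean(unlearned_losses)
    sd_u = max(stdev(unlearned_losses), min_sigma)  # guard against zero variance
    mu_n = mean(unseen_losses)
    sd_n = max(stdev(unseen_losses), min_sigma)
    return gaussian_logpdf(observed_loss, mu_u, sd_u) - gaussian_logpdf(observed_loss, mu_n, sd_n)


# Toy, made-up losses (assumption: unlearned texts keep a lower loss than unseen ones).
unlearned_losses = [1.0, 1.2, 0.8, 1.1, 0.9]
unseen_losses = [3.0, 3.2, 2.8, 3.1, 2.9]

score = unlearning_llr(1.05, unlearned_losses, unseen_losses)
print("flagged as unlearned text:", score > 0)
```

If unlearning were perfect in this simplified picture, the two loss distributions would coincide and the score would hover around zero for every text; a consistently large positive score is the kind of signal the abstract describes as detecting unlearned texts with very high confidence.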

