CR AI CL LG Jun 19, 2024

Textual Unlearning Gives a False Sense of Unlearning

Jiacheng Du, Zhibo Wang, Jie Zhang, Xiaoyi Pang, Jiahui Hu, Kui Ren

arXiv:2406.13348v315.413 citations

Originality Highly original

AI Analysis

This work reveals that existing unlearning mechanisms are unreliable and potentially harmful to the privacy of language model users, highlighting an incremental but critical gap in machine unlearning security.

The paper examines textual unlearning in language models and shows that current methods are ineffective and can even increase privacy risks: the authors' attacks detect unlearned texts with high confidence and expose them to membership inference and data reconstruction.

Language Models (LMs) are prone to "memorizing" training data, including substantial sensitive user information. To mitigate privacy risks and safeguard the right to be forgotten, machine unlearning has emerged as a promising approach for enabling LMs to efficiently "forget" specific texts. However, despite the good intentions, is textual unlearning really as effective and reliable as expected? To address the concern, we first propose Unlearning Likelihood Ratio Attack+ (U-LiRA+), a rigorous textual unlearning auditing method, and find that unlearned texts can still be detected with very high confidence after unlearning. Further, we conduct an in-depth investigation on the privacy risks of textual unlearning mechanisms in deployment and present the Textual Unlearning Leakage Attack (TULA), along with its variants in both black- and white-box scenarios. We show that textual unlearning mechanisms could instead reveal more about the unlearned texts, exposing them to significant membership inference and data reconstruction risks. Our findings highlight that existing textual unlearning actually gives a false sense of unlearning, underscoring the need for more robust and secure unlearning mechanisms.
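
The abstract does not spell out how U-LiRA+ works, so the following is only a minimal sketch of the general likelihood-ratio idea that attacks of this family build on, not the authors' method. It assumes an auditor already has per-text losses from reference models in two hypothetical worlds (the text was used and then unlearned, or the text was never seen) and fits a Gaussian to each. The function names and all numbers are illustrative assumptions.

```python
import math
from statistics import mean, stdev


def gaussian_log_pdf(x, mu, sigma):
    """Log-density of a normal distribution N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def likelihood_ratio_score(observed_loss, unlearned_losses, unseen_losses):
    """Generic likelihood-ratio test (illustrative, not U-LiRA+ itself).

    Returns log p(loss | unlearned world) - log p(loss | never-seen world).
    A positive score means the observed loss looks more like the text was
    trained on and then unlearned than like it was never in the training set.
    """
    # Fit one Gaussian per hypothesis; floor sigma to avoid division by zero.
    mu_u, sd_u = mean(unlearned_losses), max(stdev(unlearned_losses), 1e-6)
    mu_n, sd_n = mean(unseen_losses), max(stdev(unseen_losses), 1e-6)
    return gaussian_log_pdf(observed_loss, mu_u, sd_u) - gaussian_log_pdf(observed_loss, mu_n, sd_n)


# Hypothetical reference losses for a single target text (made-up numbers).
unlearned_losses = [1.0, 1.2, 0.9, 1.1]
unseen_losses = [2.9, 3.1, 3.0, 3.2]

score_low_loss = likelihood_ratio_score(1.05, unlearned_losses, unseen_losses)
score_high_loss = likelihood_ratio_score(3.05, unlearned_losses, unseen_losses)
```

In this toy setup a low observed loss yields a positive score and a high observed loss a negative one. If unlearning were fully effective, the two loss distributions would coincide and the score would carry no signal.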
