CL · AI · Feb 20, 2025

Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models

arXiv:2502.15836v2 · 2 citations · h-index: 31 · Has Code · EMNLP
Originality: Incremental advance
AI Analysis

This work highlights a critical flaw in current methods for auditing unlearning, which matters for privacy and compliance in AI systems. The advance is incremental, since it builds on prior research on soft token attack (STA) vulnerabilities.

The paper addresses the problem of auditing machine unlearning in large language models. It shows that soft token attacks (STAs) are unreliable for this purpose: in a strong auditor setting they can elicit any information, regardless of the unlearning algorithm or whether the queried content was in the original training data. Attacks using as few as 1-10 soft tokens can even elicit random strings over 400 characters long.

Large language models (LLMs) are trained using massive datasets, which often contain undesirable content such as harmful texts, personal information, and copyrighted material. To address this, machine unlearning aims to remove information from trained models. Recent work has shown that soft token attacks (STA) can successfully extract unlearned information from LLMs, but in this work we show that STAs can be an inadequate tool for auditing unlearning. Using common benchmarks such as Who Is Harry Potter? and TOFU, we demonstrate that in a strong auditor setting such attacks can elicit any information from the LLM, regardless of the deployed unlearning algorithm or whether the queried content was originally present in the training corpus. We further show that STA with just a few soft tokens (1-10) can elicit random strings over 400 characters long, indicating that STAs must be used carefully to effectively audit unlearning. Example code can be found at: https://github.com/IntelLabs/LLMart/tree/main/examples/unlearning
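
To make the core idea concrete, here is a minimal toy sketch of why optimizing a soft token is such a powerful attack. It is not the paper's code (that lives in the LLMart repository linked above) and it does not use an LLM. As a loudly stated assumption, the "frozen model" is a hypothetical random linear map from one continuous input embedding to vocabulary logits, and all function names are invented for illustration. The sketch runs gradient descent on the input embedding alone, leaving the weights untouched, until the model emits an arbitrary target token.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def make_frozen_model(vocab_size, dim, seed=0):
    """Hypothetical stand-in for a frozen model: a random vocab_size x dim matrix.

    Assumption: dim >= vocab_size, so the rows are (almost surely) linearly
    independent and every token is reachable by some input embedding.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def logits_for(weights, soft_token):
    """Logits produced by the frozen weights for one continuous input embedding."""
    return [sum(w * x for w, x in zip(row, soft_token)) for row in weights]


def predict(weights, soft_token):
    """Greedy decoding: index of the largest logit."""
    logits = logits_for(weights, soft_token)
    return max(range(len(logits)), key=logits.__getitem__)


def optimize_soft_token(weights, target_id, steps=500, lr=0.02):
    """Gradient descent on the soft token only; the model weights never change."""
    dim = len(weights[0])
    soft_token = [0.0] * dim
    for _ in range(steps):
        probs = softmax(logits_for(weights, soft_token))
        # Gradient of cross-entropy w.r.t. the logits is (probs - one_hot(target)).
        err = list(probs)
        err[target_id] -= 1.0
        # Chain rule through the linear map: grad_e = W^T (probs - one_hot).
        grad = [
            sum(err[v] * weights[v][j] for v in range(len(weights)))
            for j in range(dim)
        ]
        soft_token = [x - lr * g for x, g in zip(soft_token, grad)]
    return soft_token


if __name__ == "__main__":
    model = make_frozen_model(vocab_size=10, dim=16, seed=0)
    for target in range(10):
        token = optimize_soft_token(model, target)
        print(target, predict(model, token))
```

In this toy setting the attacker can steer the output to any target regardless of what the weights encode. This mirrors, in a very simplified form, the abstract's concern: if an unconstrained continuous input can elicit arbitrary output, a successful elicitation says little about whether the content was ever learned or has been unlearned.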

