WaterDrum: Watermarking for Data-centric Unlearning Metric
This work addresses the need for accurate assessment of unlearning in LLMs for applications involving private or harmful data, though the contribution is incremental: it improves the evaluation of unlearning rather than unlearning itself.
The paper tackles the problem of evaluating unlearning in large language models by introducing WaterDrum, a data-centric metric that uses text watermarking to overcome limitations of utility-centric metrics, and it includes new benchmark datasets for rigorous evaluation.
Large language model (LLM) unlearning is critical in real-world applications where it is necessary to efficiently remove the influence of private, copyrighted, or harmful data from some users. However, existing utility-centric unlearning metrics (based on model utility) may fail to accurately evaluate the extent of unlearning in realistic settings, such as when (a) the forget and retain sets have semantically similar content, (b) retraining the model from scratch on the retain set is impractical, and/or (c) the model owner can improve the unlearning metric without directly performing unlearning on the LLM. This paper presents WaterDrum, the first data-centric unlearning metric for LLMs, which exploits robust text watermarking to overcome these limitations. We also introduce new benchmark datasets for LLM unlearning that contain varying levels of similar data points and can be used to rigorously evaluate unlearning algorithms using WaterDrum. Our code is available at https://github.com/lululu008/WaterDrum and our new benchmark datasets are released at https://huggingface.co/datasets/Glow-AI/WaterDrum-Ax.
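To make the idea of a watermark-based, data-centric signal more concrete, the following is a minimal toy sketch. It is not WaterDrum's actual watermarking scheme or metric definition, neither of which is specified in the abstract. It assumes a generic keyed "green-list" test over adjacent word pairs, and all names (`is_green`, `watermark_z_score`, `embed_watermark`, the key `"owner-1"`) are hypothetical. The intuition it illustrates: if each data owner's text carries a keyed watermark, then the strength of that watermark in text produced by a model could indicate how much of the owner's data still influences the model, without relying on model utility.

```python
import hashlib
import math


def is_green(prev_token, token, key, gamma=0.5):
    """Keyed pseudo-random test: is `token` 'green' given the previous token?

    Assumption: a generic hash-based green-list rule, not WaterDrum's scheme.
    """
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return digest[0] / 256.0 < gamma


def watermark_z_score(text, key, gamma=0.5):
    """z-score of the green-pair count against the no-watermark expectation."""
    tokens = text.split()
    n = len(tokens) - 1  # number of adjacent token pairs
    if n < 1:
        return 0.0
    hits = sum(is_green(p, t, key, gamma) for p, t in zip(tokens, tokens[1:]))
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


def embed_watermark(length, vocab, key, gamma=0.5):
    """Toy 'watermarked text': always pick a green word when one exists."""
    tokens = [vocab[0]]
    for i in range(1, length):
        greens = [w for w in vocab if is_green(tokens[-1], w, key, gamma)]
        # Fall back to the full vocabulary if no green candidate exists.
        candidates = greens or vocab
        tokens.append(candidates[i % len(candidates)])
    return " ".join(tokens)


# Hypothetical usage: score a text against one owner's key.
vocab = [chr(ord("a") + i) for i in range(26)]
owner_text = embed_watermark(50, vocab, key="owner-1")
print(watermark_z_score(owner_text, key="owner-1"))
```

In this toy setting, a high score under an owner's key suggests that owner's watermark is present in the text. An evaluation in this spirit would compare such scores on model outputs before and after unlearning, expecting the score for the forget set's key to drop while scores for retained owners' keys stay high.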