Beyond Single-Value Metrics: Evaluating and Enhancing LLM Unlearning with Cognitive Diagnosis
This work addresses the need for better evaluation of LLM unlearning, which matters for ethical and safe AI deployment. The contribution is incremental: it builds on existing unlearning methods and adds a novel evaluation approach.
The paper tackles the problem of evaluating LLM unlearning methods. Existing evaluations often rely on single-value metrics that fail to capture nuanced retention of harmful knowledge. The authors propose UNCD, a framework that uses Cognitive Diagnosis Modeling for fine-grained evaluation and targeted unlearning, and report enhanced removal of harmful abilities across eight unlearning methods and two base models.
Due to the widespread use of LLMs and rising ethical and safety concerns, LLM unlearning methods have been developed to remove harmful knowledge and undesirable capabilities. In this context, evaluations are mostly based on single-value metrics such as QA accuracy. However, these metrics often fail to capture the nuanced retention of harmful knowledge components, making it difficult to assess the true effectiveness of unlearning. To address this issue, we propose UNCD (UNlearning evaluation via Cognitive Diagnosis), a novel framework that leverages Cognitive Diagnosis Modeling for fine-grained evaluation of LLM unlearning. Our dedicated benchmark, UNCD-Cyber, provides a detailed assessment of the removal of dangerous capabilities. Moreover, we introduce UNCD-Agent, which refines unlearning by diagnosing knowledge remnants and generating targeted unlearning data. Extensive experiments across eight unlearning methods and two base models demonstrate that UNCD not only enhances evaluation but also effectively facilitates the removal of harmful LLM abilities.
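To make the limitation of single-value metrics concrete, the following is a minimal sketch in Python, not the UNCD method itself. It assumes a hypothetical mapping from questions to knowledge concepts (a Q-matrix, as used in cognitive diagnosis) and made-up response data; the concept names and the helper functions `overall_accuracy` and `per_concept_accuracy` are illustrative assumptions. An actual Cognitive Diagnosis Model would fit latent mastery parameters rather than tally per-concept correct rates as done here.

```python
from collections import defaultdict


def overall_accuracy(responses):
    """Single-value metric: fraction of questions answered correctly."""
    return sum(responses.values()) / len(responses)


def per_concept_accuracy(responses, q_matrix):
    """Correct rate per knowledge concept (a crude proxy for mastery)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for question, is_correct in responses.items():
        for concept in q_matrix[question]:
            total[concept] += 1
            correct[concept] += int(is_correct)
    return {concept: correct[concept] / total[concept] for concept in total}


# Hypothetical Q-matrix: which concept(s) each question probes.
q_matrix = {
    "q1": ["phishing"],
    "q2": ["phishing"],
    "q3": ["privilege_escalation"],
    "q4": ["privilege_escalation"],
}

# Hypothetical responses of an unlearned model (1 = answered correctly).
after_unlearning = {"q1": 0, "q2": 0, "q3": 1, "q4": 1}

overall_after = overall_accuracy(after_unlearning)
concepts_after = per_concept_accuracy(after_unlearning, q_matrix)

print("overall accuracy:", overall_after)
print("per-concept accuracy:", concepts_after)
```

In this toy data the overall accuracy is halved, which could be read as partial forgetting, while the per-concept view shows that one concept is fully retained. That kind of remnant is what a fine-grained diagnosis is meant to surface.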