CL · AI · May 4, 2024

Has this Fact been Edited? Detecting Knowledge Edits in Language Models

arXiv:2405.02765v3 · 15 citations · h-index: 13 · NAACL
Originality: Incremental advance
AI Analysis

This work addresses malicious knowledge editing in language models, a problem that matters for user trust and transparency in generative AI. It is rated incremental because it builds on existing knowledge editing methods.

The paper asks whether a fact generated by a language model comes from edited knowledge or from original pre-training knowledge. It frames this as a new detection task and establishes a baseline: AdaBoost classifiers over hidden-state and output-probability features, which keep their performance in cross-domain settings.

Knowledge editing methods (KEs) can update language models' obsolete or inaccurate knowledge learned from pre-training. However, KEs can be used for malicious applications, e.g., inserting misinformation and toxic content. Knowing whether a generated output is based on edited knowledge or first-hand knowledge from pre-training can increase users' trust in generative models and provide more transparency. Driven by this, we propose a novel task: detecting edited knowledge in language models. Given an edited model and a fact retrieved by a prompt from that model, the objective is to classify the knowledge as either unedited (based on the pre-training) or edited (based on subsequent editing). We instantiate the task with four KEs, two LLMs, and two datasets. Additionally, we propose using the hidden state representations and the probability distributions as features for the detection. Our results reveal that using these features as inputs to a simple AdaBoost classifier establishes a strong baseline. This classifier requires only a limited amount of data and maintains its performance even in cross-domain settings. Finally, we find it more challenging to distinguish edited knowledge from unedited but related knowledge, highlighting the need for further research. Our work lays the groundwork for addressing malicious model editing, which is a critical challenge associated with the strong generative capabilities of LLMs.
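To make the detection setup in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes each fact is represented by a hidden-state vector concatenated with two summaries of the output distribution (maximum probability and entropy), and trains a small from-scratch AdaBoost over decision stumps. The function names, the choice of summary statistics, and the synthetic data are all hypothetical. In particular, the toy generator simply makes "edited" examples separable so the classifier has something to learn; it says nothing about how real edited models behave.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_features(hidden_state, logits):
    """Hidden-state vector plus two summaries of the output distribution.

    Assumption: max probability and entropy stand in for the
    'probability distribution' features; the paper may use others.
    """
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return list(hidden_state) + [max(probs), entropy]


def stump_predict(x, feature, threshold, polarity):
    """Decision stump: +polarity if x[feature] > threshold, else -polarity."""
    return polarity if x[feature] > threshold else -polarity


def fit_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, thr, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best


def fit_adaboost(X, y, rounds=5):
    """Discrete AdaBoost with labels in {-1, +1}; returns weighted stumps."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, thr, pol = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, j, thr, pol))
        # Up-weight misclassified examples, down-weight correct ones.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, thr, pol))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote: +1 = edited, -1 = unedited."""
    score = sum(a * stump_predict(x, j, thr, pol) for a, j, thr, pol in ensemble)
    return 1 if score >= 0 else -1


def make_toy_data(n, seed=0):
    """Synthetic stand-in for real features (hypothetical, for illustration)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        edited = i % 2 == 0
        hidden = [rng.uniform(-0.5, 0.5) + (2.0 if edited else -2.0)]
        hidden += [rng.uniform(-1.0, 1.0) for _ in range(3)]
        logits = [rng.uniform(-1.0, 1.0) for _ in range(5)]
        if edited:
            logits[0] += 4.0  # toy assumption: a more peaked distribution
        X.append(build_features(hidden, logits))
        y.append(1 if edited else -1)
    return X, y


X_train, y_train = make_toy_data(40)
model = fit_adaboost(X_train, y_train)
train_acc = sum(predict(model, x) == t
                for x, t in zip(X_train, y_train)) / len(y_train)
print(f"training accuracy on toy data: {train_acc:.2f}")
```

In a real experiment the hidden states and logits would be read from the edited model at the position where the fact is generated, and the hand-rolled booster could be replaced by a library implementation such as scikit-learn's `AdaBoostClassifier`. The stdlib-only version is used here only to keep the sketch self-contained.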

