CL | Feb 16, 2025

The Mirage of Model Editing: Revisiting Evaluation in the Wild

Wanli Yang, Fei Sun, Jiajun Tan, Xinyu Ma, Qi Cao, Dawei Yin, Huawei Shen, Xueqi Cheng

arXiv:2502.11177v521.824 citations | h-index: 32 | Has Code | ACL

Originality: Incremental advance

AI Analysis

This work addresses the gap between synthetic and real-world evaluation in model editing, which is crucial for researchers and practitioners aiming to reliably update knowledge in LLMs for practical applications.

The paper tackles overestimated performance in model editing for large language models (LLMs). It introduces a new benchmark and an evaluation framework, and shows that current methods perform substantially worse under realistic evaluation than previously reported (38.5% vs. 96.8%) and fail drastically under sequential editing.

Despite near-perfect results reported in the literature, the effectiveness of model editing in real-world applications remains unclear. To bridge this gap, we introduce QAEdit, a new benchmark aligned with widely used question answering (QA) datasets, and WILD, a task-agnostic evaluation framework designed to better reflect real-world usage of model editing. Our single-edit experiments show that current editing methods perform substantially worse than previously reported (38.5% vs. 96.8%). We demonstrate that this gap stems from issues in the synthetic evaluation practices of prior work. Among them, the most severe is the use of teacher forcing during testing, which leaks both the content and the length of the ground truth, leading to overestimated performance. Furthermore, we simulate practical deployment through sequential editing, revealing that current approaches fail drastically with only 1000 edits. This work calls for a shift in model editing research toward rigorous evaluation and the development of robust, scalable methods that can reliably update knowledge in LLMs for real-world use.
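The teacher-forcing issue named in the abstract can be illustrated with a small sketch. This is not the paper's code or the WILD framework: the lookup-table "model", the function names, and the use of exact match as the free-generation metric are all assumptions made for illustration. Under teacher forcing, each prediction is conditioned on the ground-truth prefix and scoring stops at the ground-truth length, so a model can score well even when its own free-running output would be wrong or would not stop.

```python
from typing import Callable, List, Sequence

Token = str
NextToken = Callable[[Sequence[Token]], Token]
EOS = "<eos>"


def teacher_forced_accuracy(next_token: NextToken,
                            prompt: Sequence[Token],
                            target: Sequence[Token]) -> float:
    """Token accuracy when the gold prefix is fed back at every step."""
    hits = 0
    for i, gold in enumerate(target):
        # The context contains gold tokens target[:i], not the model's own output,
        # and the loop runs for exactly len(target) steps (length is leaked).
        context = list(prompt) + list(target[:i])
        hits += next_token(context) == gold
    return hits / len(target)


def autoregressive_generate(next_token: NextToken,
                            prompt: Sequence[Token],
                            max_new_tokens: int = 8) -> List[Token]:
    """Free generation: the model conditions on its own previous outputs."""
    out: List[Token] = []
    for _ in range(max_new_tokens):
        tok = next_token(list(prompt) + out)
        if tok == EOS:
            break
        out.append(tok)
    return out


def make_toy_model(first_token: Token) -> NextToken:
    """Hypothetical stand-in for an edited LLM: predicts from the last token only."""
    table = {
        "?": first_token,
        "William": "Shakespeare",
        "Christopher": "Marlowe",
        "Shakespeare": "and",  # this toy model keeps talking past the answer
        "and": "others",
    }
    return lambda context: table.get(context[-1], EOS)


prompt = ["Who", "wrote", "Hamlet", "?"]
target = ["William", "Shakespeare"]

# Content leakage: the first token is wrong, but the gold prefix rescues step 2.
wrong_start = make_toy_model("Christopher")
tf_wrong = teacher_forced_accuracy(wrong_start, prompt, target)
gen_wrong = autoregressive_generate(wrong_start, prompt)

# Length leakage: every scored token is right, but the model never stops in time.
no_stop = make_toy_model("William")
tf_no_stop = teacher_forced_accuracy(no_stop, prompt, target)
gen_no_stop = autoregressive_generate(no_stop, prompt)

print(tf_wrong, gen_wrong, gen_wrong == target)
print(tf_no_stop, gen_no_stop, gen_no_stop == target)
```

In both toy cases the teacher-forced score gives partial or full credit while the freely generated answer does not match the target, which is the kind of gap the abstract attributes to teacher forcing at test time.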
