CL · Feb 25, 2025

RefuteBench 2.0 -- Agentic Benchmark for Dynamic Evaluation of LLM Responses to Refutation Instruction

Tencent, Tsinghua
arXiv:2502.18308v1 · 4 citations · h-index: 12 · Has Code · IEEE Transactions on Audio, Speech, and Language Processing
Originality: Incremental advance
AI Analysis

This work addresses the challenge of dynamically evaluating LLM responses to refutation. The contribution is incremental: it extends an existing benchmark with agent-based methods.

The authors evaluate how well LLMs incorporate user refutation feedback in multi-turn interactions. They introduce RefuteBench 2.0, which uses LLM agents as refuters and evaluators. They find that current models satisfy refutations effectively but fail to memorize the refutation information, and that performance on the initial task decreases as the number of refutations increases.

In the multi-turn interaction schema, large language models (LLMs) can leverage user feedback to enhance the quality and relevance of their responses. However, evaluating an LLM's ability to incorporate user refutation feedback is crucial yet challenging. In this study, we introduce RefuteBench 2.0, which significantly extends the original RefuteBench by incorporating LLM agents as refuters and evaluators, allowing for flexible and comprehensive assessment. We design both transient and persistent refutation instructions with different validity periods. Meta-evaluation shows that the LLM-based refuter can generate more human-like refutations and that the evaluators assign scores that correlate highly with human judgments. Experimental results on various LLMs show that current models can effectively satisfy a refutation but fail to memorize the refutation information. Interestingly, we also observe that performance on the initial task decreases as the number of refutations increases. Analysis of the attention scores further reveals a potential weakness of current LLMs: they struggle to retain and correctly use previous information during long-context dialogues. Code: https://github.com/ElliottYan/RefuteBench-2.0
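One way to picture the difference between transient and persistent refutation instructions is the minimal sketch below. It is not taken from the RefuteBench 2.0 repository: the `Refutation` class, the `active_refutations` and `score_dialogue` helpers, and the rule-based `check` functions are all hypothetical, and simple string checks stand in for the LLM-based evaluators the abstract describes. The assumed validity rule is that a transient instruction applies only to the next response, while a persistent one applies to every later response.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Refutation:
    text: str                     # feedback given to the model
    issued_after: int             # index of the response the feedback follows
    persistent: bool              # True: all later turns; False: next turn only
    check: Callable[[str], bool]  # toy stand-in for an LLM evaluator


def active_refutations(refutations: List[Refutation], turn: int) -> List[Refutation]:
    """Return the refutations that the response at `turn` should satisfy."""
    active = []
    for r in refutations:
        if r.persistent and turn > r.issued_after:
            active.append(r)
        elif not r.persistent and turn == r.issued_after + 1:
            active.append(r)
    return active


def score_dialogue(
    responses: List[str], refutations: List[Refutation]
) -> Dict[str, Optional[float]]:
    """Pass rate over (response, active refutation) pairs, split by type."""
    counts = {"transient": [0, 0], "persistent": [0, 0]}  # [passed, total]
    for turn, response in enumerate(responses):
        for r in active_refutations(refutations, turn):
            key = "persistent" if r.persistent else "transient"
            counts[key][0] += int(r.check(response))
            counts[key][1] += 1
    return {k: (p / n if n else None) for k, (p, n) in counts.items()}


# Made-up dialogue: the model follows the persistent instruction for two
# turns, then reverts to the old spelling in the last turn.
refutations = [
    Refutation("Use the spelling 'color'.", 0, True,
               lambda s: "colour" not in s.lower()),
    Refutation("Answer in upper case this time.", 1, False,
               lambda s: s.isupper()),
]
responses = [
    "The colour is red.",
    "The color is red.",
    "THE COLOR IS RED.",
    "The colour is blue.",
]
scores = score_dialogue(responses, refutations)
print(scores)
```

In this toy dialogue the transient instruction is checked once and passes, while the persistent instruction is checked on three responses and fails on the last one, which mirrors the kind of forgetting the abstract reports.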

Code Implementations: 1 repo
