CL · Feb 18, 2024

Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation

arXiv:2402.11443v1 · 79 citations · h-index: 22 · Has Code · COLING
Originality: Incremental advance
AI Analysis

This work addresses the need among researchers and practitioners for more accurate and scalable LLM evaluation, though it is an incremental advance because it builds on existing benchmark methods.

The paper introduces a benchmark self-evolving framework for dynamically evaluating Large Language Models (LLMs): a multi-agent system reframes instances from existing benchmarks to extend them. Experiments show that most LLMs perform worse on the evolved instances than on the originals and that performance discrepancies widen, which gives a more accurate picture of model capabilities.
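The two effects summarized above, a per-model decline and a wider gap between models, can be expressed as simple score differences. The sketch below is illustrative only: the function names, model names, and accuracy values are invented for this example and are not results from the paper.

```python
from typing import Dict


def performance_decline(original: Dict[str, float],
                        evolved: Dict[str, float]) -> Dict[str, float]:
    """Per-model drop in accuracy from the original to the evolved benchmark."""
    return {model: original[model] - evolved[model] for model in original}


def spread(scores: Dict[str, float]) -> float:
    """Gap between the best and worst model, a crude measure of discrepancy."""
    return max(scores.values()) - min(scores.values())


# Toy accuracies, made up for illustration (NOT taken from the paper).
original_acc = {"model_a": 0.90, "model_b": 0.88, "model_c": 0.85}
evolved_acc = {"model_a": 0.84, "model_b": 0.75, "model_c": 0.62}

decline = performance_decline(original_acc, evolved_acc)
gap_before = spread(original_acc)
gap_after = spread(evolved_acc)
```

With these toy numbers every model declines and `gap_after` exceeds `gap_before`, which is the pattern the summary describes. A real analysis would report per-task scores as well.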

This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. We utilize a multi-agent system to manipulate the context or question of original instances, reframing them with high confidence into new evolving instances that dynamically extend existing benchmarks. Towards a more scalable, robust and fine-grained evaluation, we implement six reframing operations to construct evolving instances that test LLMs against diverse queries and data noise and probe their problem-solving sub-abilities. With this framework, we extend benchmark datasets of four tasks. Experimental results show a general performance decline in most LLMs against their original results. This decline under our scalable and robust evaluations, alongside our fine-grained evaluation, more accurately reflects models' capabilities. In addition, our framework widens performance discrepancies both between different models and within the same model across various tasks, facilitating more informed model selection for specific tasks. Code and data are available at https://github.com/NanshineLoong/Self-Evolving-Benchmark.
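The abstract describes reframing an instance's context or question and keeping only the new instances that pass a confidence check. A minimal sketch of that generate-then-verify loop follows. It is not the authors' implementation: `Instance`, `add_context_noise`, `keyword_verifier`, and `evolve` are hypothetical names, the noise operation is only one plausible example of a reframing operation, and the rule-based verifier stands in for what would be an LLM agent in a multi-agent system.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Instance:
    context: str
    question: str
    answer: str


def add_context_noise(inst: Instance) -> Instance:
    """Hypothetical reframing operation: append an irrelevant sentence to the context."""
    distractor = "Unrelated note: the library closes at 6 pm."
    return replace(inst, context=f"{inst.context} {distractor}")


def keyword_verifier(original: Instance, candidate: Instance) -> bool:
    """Stand-in for a verifier agent: accept only if the gold answer is
    unchanged and still appears in the reframed context."""
    return (candidate.answer == original.answer
            and original.answer in candidate.context)


def evolve(instances: List[Instance],
           operation: Callable[[Instance], Instance],
           verifier: Callable[[Instance, Instance], bool]) -> List[Instance]:
    """Apply one reframing operation and keep only the verified candidates."""
    evolved = []
    for inst in instances:
        candidate = operation(inst)
        if verifier(inst, candidate):
            evolved.append(candidate)
    return evolved


seed = Instance(
    context="Paris is the capital of France.",
    question="What is the capital of France?",
    answer="Paris",
)
evolved = evolve([seed], add_context_noise, keyword_verifier)
```

Passing the operation and verifier as plain callables keeps the loop independent of any particular model API, so either could be swapped for an LLM-backed agent.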

Code Implementations: 1 repo
