CL · AI · Jun 30, 2025

AutoEvoEval: An Automated Framework for Evolving Close-Ended LLM Evaluation Data

arXiv:2506.23735v1 · 2 citations · h-index: 5 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for more robust and realistic evaluation of LLMs and is aimed mainly at researchers and developers. It is an incremental advance, since it builds on prior evolutionary data augmentation methods.

The authors address static, insufficient evaluation benchmarks for large language models (LLMs) by proposing AutoEvoEval, an automated framework for evolving close-ended evaluation data. In their experiments, atomic evolution operations lower LLM accuracy by 7.283% on average, and multi-step compositions amplify adversarial effects by up to 52.932%.

Large language models (LLMs) have shown remarkable performance on various tasks, but existing evaluation benchmarks are often static and insufficient to fully assess their robustness and generalization in realistic scenarios. Prior work using evolutionary or adversarial data augmentation has improved evaluation diversity but lacks systematic control over perturbation types and multi-step complexity, limiting comprehensive robustness analysis. To address these gaps, we propose AutoEvoEval, an evolution-based evaluation framework for close-ended tasks such as multi-choice question answering. AutoEvoEval introduces 22 interpretable atomic evolution operations and supports multi-round compositions, enabling controlled generation of diverse, challenging, and realistic test samples. We conduct extensive experiments addressing four research questions on a broad set of open- and closed-source LLMs. Our results show that atomic operations cause an average accuracy drop of 7.283%, with structure-disrupting or misleading semantic edits causing the largest declines. Model sensitivities vary significantly for the same perturbation, and combining multiple evolution steps amplifies adversarial effects by up to 52.932%. These findings suggest current benchmarks may overestimate true model generalization and emphasize the need for evolution-aware robustness evaluation. Code and resources are available at: https://github.com/SYSUSELab/AutoEvoEval.
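
The idea of atomic operations composed over multiple rounds can be illustrated with a minimal Python sketch. It is not taken from the AutoEvoEval repository: the `MCQ` type, the three operations (`shuffle_options`, `add_irrelevant_note`, `add_none_option`), and the `evolve` helper are hypothetical names chosen for illustration, and they are not claimed to be among the paper's 22 operations. The sketch only shows how small, answer-preserving edits to a multiple-choice item might be chained.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class MCQ:
    """A multiple-choice item (hypothetical type, not the paper's schema)."""
    question: str
    options: Tuple[str, ...]
    answer: int  # index of the correct option


# An atomic operation maps one item to a perturbed item.
Op = Callable[[MCQ, random.Random], MCQ]


def shuffle_options(item: MCQ, rng: random.Random) -> MCQ:
    """Structural edit: reorder the options and track the correct index."""
    order = list(range(len(item.options)))
    rng.shuffle(order)
    new_options = tuple(item.options[i] for i in order)
    return replace(item, options=new_options, answer=order.index(item.answer))


def add_irrelevant_note(item: MCQ, rng: random.Random) -> MCQ:
    """Semantic edit: append a sentence that does not change the answer."""
    note = "Note: the order of the options carries no meaning."
    return replace(item, question=f"{item.question} {note}")


def add_none_option(item: MCQ, rng: random.Random) -> MCQ:
    """Structural edit: append an extra incorrect option."""
    return replace(item, options=item.options + ("None of the above",))


def evolve(item: MCQ, ops: Sequence[Op], seed: int = 0) -> MCQ:
    """Apply atomic operations in sequence (one multi-step composition)."""
    rng = random.Random(seed)
    for op in ops:
        item = op(item, rng)
    return item


original = MCQ(
    question="Which planet is closest to the Sun?",
    options=("Venus", "Mercury", "Earth", "Mars"),
    answer=1,
)
evolved = evolve(original, [shuffle_options, add_irrelevant_note, add_none_option])
```

Every operation in this sketch preserves the correct answer, so a model's accuracy on `evolved` items can be compared directly with its accuracy on the originals. Chaining operations in `evolve` corresponds to the multi-step setting, where the abstract reports that adversarial effects are amplified.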
