CL
Feb 2

A2Eval: Agentic and Automated Evaluation for Embodied Brain

Shuai Zhang, Jiayu Hu, Zijie Chen, Zeyuan Ding, Yi Zhang, Yingji Zhang, Ziyi Zhou, Junwei Liao, Shengjie Zhou, Yong Dai, Zhenzhong Lan, Xiaozhu Ju

arXiv:2602.01640v1
0.61 citations
h-index: 8

Originality: Highly original

AI Analysis

The paper addresses costly and distorted model evaluation in embodied AI, and it proposes a novel method rather than an incremental improvement.

The paper tackles labor-intensive and biased evaluation of embodied vision-language models by proposing A2Eval, an agentic framework that automates benchmark curation and evaluation. Across 10 benchmarks and 13 models, the authors report that it compresses evaluation suites by 85%, reduces computational costs by 77%, and achieves a 4.6x speedup, while improving human alignment to Spearman's rho=0.85 and maintaining high ranking fidelity (Kendall's tau=0.81).

Current embodied VLM evaluation relies on static, expert-defined, manually annotated benchmarks that exhibit severe redundancy and coverage imbalance. This labor-intensive paradigm drains computational and annotation resources, inflates costs, and distorts model rankings, ultimately stifling iterative development. To address this, we propose Agentic Automatic Evaluation (A2Eval), the first agentic framework that automates benchmark curation and evaluation through two collaborative agents. The Data Agent autonomously induces capability dimensions and assembles a balanced, compact evaluation suite, while the Eval Agent synthesizes and validates executable evaluation pipelines, enabling fully autonomous, high-fidelity assessment. Evaluated across 10 benchmarks and 13 models, A2Eval compresses evaluation suites by 85%, reduces overall computational costs by 77%, and delivers a 4.6x speedup while preserving evaluation quality. Crucially, A2Eval corrects systematic ranking biases, improves human alignment to Spearman's rho=0.85, and maintains high ranking fidelity (Kendall's tau=0.81), establishing a new standard for high-fidelity, low-cost embodied assessment. Our code and data will be public soon.
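The abstract reports agreement as rank correlations (Spearman's rho and Kendall's tau). As a rough illustration of what such numbers measure, the sketch below compares a model ranking from a full benchmark suite with the ranking from a compact suite. The model scores, the variable names `full_suite` and `compact_suite`, and the choice of the no-ties Spearman formula and Kendall's tau-a are assumptions made for illustration; the listing does not say exactly how the authors compute these statistics.

```python
from itertools import combinations


def ranks(scores):
    """Return 1-based ranks (1 = highest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via the no-ties formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    return s / (n * (n - 1) / 2)


# Hypothetical scores for five models (illustrative only, not from the paper).
full_suite = [71.2, 64.5, 58.9, 55.0, 49.3]
compact_suite = [69.8, 60.1, 61.4, 52.7, 47.0]

rho = spearman_rho(full_suite, compact_suite)
tau = kendall_tau(full_suite, compact_suite)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```

In this toy data only one pair of models swaps places between the two suites, so both statistics stay close to 1. A value near 1 means the compact suite ranks models almost the same way as the reference ranking.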
