CL · Feb 2

A2Eval: Agentic and Automated Evaluation for Embodied Brain

arXiv:2602.01640v1 · h-index: 8
AI Analysis

This paper addresses costly and distorted model evaluation, a problem for researchers in embodied AI, and presents a novel method rather than an incremental improvement.

The paper tackles labor-intensive and biased evaluation of embodied vision-language models by proposing A2Eval, an agentic framework that automates benchmark curation and evaluation. The reported results show that it compresses evaluation suites by 85%, reduces computational costs by 77%, achieves a 4.6x speedup, and raises human alignment to Spearman's rho=0.85 while maintaining high ranking fidelity.

Current embodied VLM evaluation relies on static, expert-defined, manually annotated benchmarks that exhibit severe redundancy and coverage imbalance. This labor-intensive paradigm drains computational and annotation resources, inflates costs, and distorts model rankings, ultimately stifling iterative development. To address this, we propose Agentic Automatic Evaluation (A2Eval), the first agentic framework that automates benchmark curation and evaluation through two collaborative agents. The Data Agent autonomously induces capability dimensions and assembles a balanced, compact evaluation suite, while the Eval Agent synthesizes and validates executable evaluation pipelines, enabling fully autonomous, high-fidelity assessment. Evaluated across 10 benchmarks and 13 models, A2Eval compresses evaluation suites by 85%, reduces overall computational costs by 77%, and delivers a 4.6x speedup while preserving evaluation quality. Crucially, A2Eval corrects systematic ranking biases, improves human alignment to Spearman's rho=0.85, and maintains high ranking fidelity (Kendall's tau=0.81), establishing a new standard for high-fidelity, low-cost embodied assessment. Our code and data will be public soon.
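
The abstract reports ranking agreement with Spearman's rho and Kendall's tau. As a minimal sketch of what those two statistics measure, the Python below compares a model ranking from a full evaluation suite with the ranking from a compressed one. The model names and scores are hypothetical placeholders, not results from the paper, and the helper functions are illustrative rather than part of A2Eval. The formulas used assume no tied scores.

```python
from itertools import combinations

def ranks(scores):
    # Rank 1 goes to the highest score; assumes no ties.
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(order)}

def spearman_rho(a, b):
    # Tie-free formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
    # where d is the per-model difference between the two ranks.
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    d2 = sum((ra[k] - rb[k]) ** 2 for k in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(a, b):
    # tau = (concordant pairs - discordant pairs) / total pairs.
    concordant = discordant = 0
    for x, y in combinations(a, 2):
        s = (a[x] - a[y]) * (b[x] - b[y])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores for five models (illustrative only).
full_suite = {"m1": 0.71, "m2": 0.64, "m3": 0.58, "m4": 0.52, "m5": 0.40}
compact_suite = {"m1": 0.69, "m2": 0.57, "m3": 0.61, "m4": 0.50, "m5": 0.43}

rho = spearman_rho(full_suite, compact_suite)
tau = kendall_tau(full_suite, compact_suite)
print(round(rho, 2), round(tau, 2))  # prints: 0.9 0.8
```

In this toy data the compact suite swaps only m2 and m3, so one of the ten model pairs is discordant. Both statistics equal 1.0 when the two rankings are identical, which is why values near 1 are read as high ranking fidelity.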

