AI · Mar 4, 2025

AutoEval: A Practical Framework for Autonomous Evaluation of Mobile Agents

arXiv:2503.02403v2 · 7 citations · h-index: 9
AI Analysis

AutoEval addresses the scalability and practicality problems of mobile agent evaluation for developers and researchers, though the contribution is incremental because it builds on existing evaluation concepts.

The paper tackles the problem of evaluating mobile agents by proposing AutoEval, a framework that automatically generates task reward signals and conducts evaluations without manual effort, achieving up to 94% accuracy, comparable to human evaluation.

Comprehensive evaluation of mobile agents can significantly advance their development and real-world applicability. However, existing benchmarks lack practicality and scalability due to the extensive manual effort in defining task reward signals and implementing evaluation codes. We propose AutoEval, an evaluation framework which tests mobile agents without any manual effort. Our approach designs a UI state change representation which can be used to automatically generate task reward signals, and employs a Judge System for autonomous evaluation. Evaluation shows AutoEval can automatically generate reward signals with high correlation to human-annotated signals, and achieve high accuracy (up to 94%) in autonomous evaluation comparable to human evaluation. Finally, we evaluate state-of-the-art mobile agents using our framework, providing insights into their performance and limitations.
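The abstract does not spell out the UI state change representation or the Judge System, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a reward signal can be written as an ordered list of expected UI changes, and that judging means checking whether an agent's observed changes contain them in order. `StateChange`, `judge`, and `agreement` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateChange:
    """Hypothetical record of one UI change: which element changed, and to what."""
    element: str
    new_value: str


def judge(expected: list[StateChange], observed: list[StateChange]) -> bool:
    """Return True if every expected change appears in `observed`, in order."""
    remaining = iter(observed)
    # `change in remaining` consumes the iterator, which enforces ordering.
    return all(change in remaining for change in expected)


def agreement(auto_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of tasks where the automatic verdict matches the human one."""
    if len(auto_labels) != len(human_labels) or not auto_labels:
        raise ValueError("label lists must be non-empty and equal in length")
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)


if __name__ == "__main__":
    # Assumed toy task: "turn on Wi-Fi", expressed as one expected change.
    expected = [StateChange("wifi_toggle", "on")]
    observed = [StateChange("settings_page", "opened"),
                StateChange("wifi_toggle", "on")]
    print(judge(expected, observed))
    print(agreement([True, False, True, True], [True, False, False, True]))
```

Under these assumptions, the reported accuracy figure corresponds to something like `agreement` computed between automatic verdicts and human verdicts over a set of tasks. A real judge would need to handle noisy UI observations rather than rely on exact matches.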

