HC · AI · LG · Apr 25, 2024

Benchmarking Mobile Device Control Agents across Diverse Configurations

arXiv:2404.16660v3 · 37 citations · h-index: 5
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of quantifying progress in mobile device automation, which matters to both researchers and developers. The contribution is incremental: it builds on existing agent methods and contributes a new benchmark.

The authors tackle the lack of standardized evaluation for mobile device control agents by introducing B-MoCA, a benchmark of 131 common daily tasks on Android. The benchmark randomizes device configurations, including user interface layouts and language settings, to test generalization. Existing agents perform well on straightforward tasks but poorly on complex ones, which highlights open research gaps.

Mobile device control agents can largely enhance user interactions and productivity by automating daily tasks. However, despite growing interest in developing practical agents, the absence of a commonly adopted benchmark in this area makes it challenging to quantify scientific progress. In this work, we introduce B-MoCA: a novel benchmark with interactive environments for evaluating and developing mobile device control agents. To create a realistic benchmark, we develop B-MoCA based on the Android operating system and define 131 common daily tasks. Importantly, we incorporate a randomization feature that changes the configurations of mobile devices, including user interface layouts and language settings, to assess generalization performance. We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs as well as agents trained with imitation learning using human expert demonstrations. While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to improve effectiveness. Our source code is publicly available at https://b-moca.github.io.
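The randomization idea in the abstract can be illustrated with a minimal sketch. This is not B-MoCA's actual API: the option pools, the `DeviceConfig` fields, and the `run_episode` callback are all hypothetical names chosen for illustration. The sketch only shows the general pattern of sampling device configurations reproducibly and reporting a success rate across them.

```python
import random
from dataclasses import dataclass

# Hypothetical option pools; the benchmark's real options are not listed here.
LANGUAGES = ["en", "ko", "fr"]
ICON_LAYOUTS = ["grid_4x5", "grid_5x6"]


@dataclass(frozen=True)
class DeviceConfig:
    language: str
    icon_layout: str


def sample_configs(n, seed):
    """Draw n device configurations reproducibly from the option pools."""
    rng = random.Random(seed)
    return [
        DeviceConfig(rng.choice(LANGUAGES), rng.choice(ICON_LAYOUTS))
        for _ in range(n)
    ]


def success_rate(run_episode, task, configs):
    """Fraction of configurations on which the agent completes the task.

    run_episode(task, config) -> bool is a placeholder for a real rollout.
    """
    if not configs:
        raise ValueError("need at least one configuration")
    successes = sum(1 for cfg in configs if run_episode(task, cfg))
    return successes / len(configs)


def toy_agent(task, config):
    # Stand-in agent that only succeeds when the device language is English.
    return config.language == "en"


configs = sample_configs(n=10, seed=0)
print(success_rate(toy_agent, "open the alarm app", configs))
```

Fixing the seed keeps the sampled configurations identical across agents, so differences in success rate reflect the agents and not the draw.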

