AI · CL · CV · May 29, 2025

ZeroGUI: Automating Online GUI Learning at Zero Human Cost

arXiv:2505.23762v10.1526 citations · h-index: 46 · Has Code
AI Analysis: 85

ZeroGUI addresses the high annotation costs and limited adaptability that developers of GUI automation systems face. It is a novel approach rather than an incremental improvement.

The paper tackles the limitations of offline learning for GUI agents by proposing ZeroGUI, an online framework that automates training without human annotations, resulting in significant performance boosts on OSWorld and AndroidLab environments.

The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUI) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at Zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.
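The loop described in the abstract (generate tasks from the current state, let the agent act, estimate a reward for the result) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the ZeroGUI implementation: `propose_tasks`, `rollout`, `estimate_reward`, and `collect_experience` are hypothetical names, the two VLM components are replaced by plain Python stand-ins, and the reinforcement learning update and its two stages are left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    task: str
    actions: List[str]
    reward: float = 0.0


def propose_tasks(screen_state: str, n: int) -> List[str]:
    # Hypothetical stand-in for VLM-based task generation:
    # a real system would prompt a VLM with the current screen.
    return [f"task {i} for screen '{screen_state}'" for i in range(n)]


def rollout(task: str, policy: Callable[[str], str], max_steps: int = 3) -> Trajectory:
    # Toy environment: the "observation" is just a string describing the step.
    actions = [policy(f"{task} | step {t}") for t in range(max_steps)]
    return Trajectory(task=task, actions=actions)


def estimate_reward(traj: Trajectory, judge: Callable[[Trajectory], bool]) -> float:
    # Hypothetical stand-in for VLM-based reward estimation:
    # the judge decides success, with no hand-crafted evaluation function per task.
    return 1.0 if judge(traj) else 0.0


def collect_experience(
    screen_state: str,
    policy: Callable[[str], str],
    judge: Callable[[Trajectory], bool],
    n_tasks: int = 4,
) -> List[Trajectory]:
    # One round of online data collection; a policy update would consume this batch.
    batch = []
    for task in propose_tasks(screen_state, n_tasks):
        traj = rollout(task, policy)
        traj.reward = estimate_reward(traj, judge)
        batch.append(traj)
    return batch


if __name__ == "__main__":
    demo = collect_experience(
        "settings page",
        policy=lambda obs: "click",
        judge=lambda t: "click" in t.actions,
    )
    for traj in demo:
        print(traj.task, traj.reward)
```

The point of the sketch is the data flow: no step requires a human-written label or a per-task checker, because both the task list and the success signal come from model calls (stubbed here).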

Code Implementations: 1 repo