CL | Mar 14, 2025

DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents

Yibin Xu, Liang Yang, Hao Chen, Hua Wang, Zhi Chen, Yaohua Tang

arXiv:2503.11170v1 | 4 citations | h-index: 3 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the data scarcity problem facing researchers and developers who build GUI agents for desktop scenarios. The advance is incremental, since it builds on existing data generation and model training approaches.

The authors tackle the lack of graphical user interface (GUI) data for desktop agents with an automated pipeline, AutoCaptioner, which they use to generate a large-scale dataset, DeskVision, and a test benchmark, DeskVision-Eval. A model trained on DeskVision, GUIExplorer, is reported to achieve state-of-the-art performance in GUI understanding and grounding without complex architectural designs.

The limitation of graphical user interface (GUI) data has been a significant barrier to the development of GUI agents today, especially for the desktop / computer use scenarios. To address this, we propose an automated GUI data generation pipeline, AutoCaptioner, which generates data with rich descriptions while minimizing human effort. Using AutoCaptioner, we created a novel large-scale desktop GUI dataset, DeskVision, along with the largest desktop test benchmark, DeskVision-Eval, which reflects daily usage and covers diverse systems and UI elements, each with rich descriptions. With DeskVision, we train a new GUI understanding model, GUIExplorer. Results show that GUIExplorer achieves state-of-the-art (SOTA) performance in understanding/grounding visual elements without the need for complex architectural designs. We further validated the effectiveness of the DeskVision dataset through ablation studies on various large visual language models (LVLMs). We believe that AutoCaptioner and DeskVision will significantly advance the development of GUI agents, and will open-source them for the community.
