CV · Sep 18, 2025

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

Zhaoyang Liu, Jingjing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li

arXiv:2509.15221v2 · 29.244 citations · h-index: 19 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the data bottleneck facing researchers who develop general-purpose computer use agents, though it is an incremental advance because it builds on existing vision-language model approaches.

The paper tackles the shortage of large-scale open-source data for computer use agents (CUAs) by introducing ScaleCUA, a dataset spanning 6 operating systems and 3 task domains. Models trained on this data achieve strong gains over baselines (e.g., +26.6 on WebArena-Lite-v2) and set new state-of-the-art results (e.g., 94.4% on MMBench-GUI L1-Hard).

Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.
