WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point
WorldGUI addresses a practical bottleneck in real-world GUI automation, for both developers and end users, by providing a more robust benchmark and agent framework, although its improvement over existing methods is incremental.
The paper tackles the sensitivity of GUI automation planning to the initial state of the environment (for example, the target software not being open). It introduces WorldGUI, a benchmark with diverse initial states across ten applications, and WorldGUI-Agent, a framework that outperforms Claude-3.5 Computer Use by 12.4% in success rate on WorldGUI and reaches a 31.2% overall success rate on WindowsAgentArena, surpassing the prior state of the art by 11.7%.
GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, largely because it is sensitive to the initial state of the environment. Specifically, slight differences in the initial state, such as the target software not being open or the interface not being in its default state, often lead to planning errors. This issue is widespread in real application scenarios, but existing benchmarks fail to evaluate it. To address this gap, we introduce WorldGUI, a comprehensive GUI benchmark containing tasks across ten widely used desktop and web applications (e.g., PowerPoint, VSCode, Acrobat), each instantiated with diverse initial states to simulate authentic human-computer interactions. Complementing this, we propose WorldGUI-Agent, a universal framework that unifies three core modules: Planner-Critic for high-level plan refinement, Step-Check for intermediate verification, and Actor-Critic for action-level optimization. Together, these modules proactively detect and correct errors. Experimental evaluation shows that WorldGUI-Agent outperforms a strong existing model (Claude-3.5 Computer Use) by 12.4% in success rate on WorldGUI, and achieves a 31.2% overall success rate on WindowsAgentArena, surpassing the prior state of the art by 11.7%. Our analysis further reveals that dynamic augmentation tasks and desktop environments pose substantial hurdles, underscoring the necessity of adaptive planning and feedback-driven execution for advancing real-world GUI automation. The code and data are available at https://github.com/showlab/WorldGUI.
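The interplay of the three modules named in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the function names (run_task, refine_plan, step_done, act, action_ok), the RunLog record, and the set-of-facts environment are all hypothetical stand-ins, and the real modules would presumably rely on screenshots and model calls rather than lookups in a dictionary. The sketch only shows how plan refinement, per-step checks, and action-level verification could make a fixed plan tolerate a non-default initial state.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List

State = FrozenSet[str]  # toy environment: the set of facts that currently hold


@dataclass
class RunLog:
    executed: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)
    succeeded: bool = True


def run_task(
    plan: List[str],
    state: State,
    refine_plan: Callable[[List[str], State], List[str]],
    step_done: Callable[[str, State], bool],
    act: Callable[[str, State], State],
    action_ok: Callable[[str, State], bool],
    max_retries: int = 2,
):
    """Hypothetical control loop; every callable is an assumed stand-in."""
    log = RunLog()
    # Plan-level critique (stand-in for Planner-Critic): adapt the plan
    # to the initial state that was actually observed.
    plan = refine_plan(list(plan), state)
    for step in plan:
        # Intermediate verification (stand-in for Step-Check): skip a step
        # whose effect already holds in the current state.
        if step_done(step, state):
            log.skipped.append(step)
            continue
        # Action-level verification (stand-in for Actor-Critic): retry the
        # action until its effect is observed or the retries run out.
        for _ in range(max_retries + 1):
            state = act(step, state)
            if action_ok(step, state):
                log.executed.append(step)
                break
        else:
            log.succeeded = False
            break
    return state, log


# Toy environment (assumed): each step adds one fact; close_dialog removes one.
EFFECT = {"open_app": "app_open", "select_title": "title_selected",
          "click_bold": "title_bold", "close_dialog": None}


def refine_plan(plan, state):
    # Prepend a recovery step when a blocking dialog is open.
    return ["close_dialog"] + plan if "dialog_open" in state else plan


def step_done(step, state):
    return EFFECT[step] is not None and EFFECT[step] in state


def act(step, state):
    if step == "close_dialog":
        return state - {"dialog_open"}
    return state | {EFFECT[step]}


def action_ok(step, state):
    if step == "close_dialog":
        return "dialog_open" not in state
    return EFFECT[step] in state


# Non-default start: the app is already open and a dialog blocks the interface.
start = frozenset({"app_open", "dialog_open"})
final, log = run_task(["open_app", "select_title", "click_bold"],
                      start, refine_plan, step_done, act, action_ok)
print(log.executed, log.skipped, log.succeeded)
```

In this toy run the plan written for a default start is repaired in two ways: the refinement step prepends close_dialog, and the per-step check skips open_app because the application is already open. An actor that never changes the state would exhaust its retries and leave succeeded set to False.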