AI · Sep 22, 2025

Orcust: Stepwise-Feedback Reinforcement Learning for GUI Agent

Junyu Lu, Songxin Zhang, Zejian Xie, Zhuoyang Song, Jiaxing Zhang

arXiv:2509.17917v1 · 5.82 citations · h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses the challenge of improving reasoning reliability and data efficiency for GUI agents in interactive tasks. It represents a strong but specific gain rather than a broad breakthrough.

The paper tackles unreliable reward signals and limited online trajectory generation in GUI agents. It introduces Orcust, a framework that integrates Principle-Constrained Reward Modeling (PCRM) and Online VM-Grounded Trajectory Construction (OVTC), and reports state-of-the-art performance, with improvements of 22.2% on ScreenSpot and 23.9% on ScreenSpot-Pro over the base model, Qwen2.5-VL-7B.

Recent advances in GUI agents have achieved remarkable grounding and action-prediction performance, yet existing models struggle with unreliable reward signals and limited online trajectory generation. In this paper, we introduce Orcust, a framework that integrates Principle-Constrained Reward Modeling (PCRM) and Online VM-Grounded Trajectory Construction (OVTC) to enhance reasoning reliability and data efficiency in interactive GUI tasks. We leverage environment-verifiable and LLM-derived principles to enforce interpretable reward signals that constrain long chain-of-thought reasoning and rule-based feedback. OVTC spins up instrumented virtual machines to autonomously collect structured GUI interaction trajectories with explicit procedural and structural objectives, enabling the training of a stepwise reward model that robustly captures human preferences and adheres to task-specific constraints. Extensive experiments on standard GUI benchmarks covering perceptual grounding, foundational operations, and end-to-end task execution show that Orcust achieves state-of-the-art performance, improving by 22.2% on ScreenSpot and 23.9% on ScreenSpot-Pro over the base model (i.e., Qwen2.5-VL-7B). The results demonstrate Orcust's effectiveness in enhancing the reasoning, adaptability, and scalability of GUI agents across various environments and task complexities.
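The abstract does not give PCRM's reward formula. As a rough illustration of the general idea it describes, namely environment-verifiable checks combined with an LLM-derived principle score that constrains the reasoning, the sketch below assigns a reward to each step of a trajectory. Everything here is an assumption rather than the authors' method: the `Action`/`StepTarget` types and action vocabulary, the bounding-box and exact-text rules, gating the principle score on the verifiable check, the weight `alpha`, and `toy_principle_scorer`, which is a hypothetical stand-in for an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


# Hypothetical step representation; the paper's action space is not given in the abstract.
@dataclass(frozen=True)
class Action:
    kind: str                   # e.g. "click" or "type" (assumed vocabulary)
    x: Optional[float] = None   # click position in screen pixels
    y: Optional[float] = None
    text: Optional[str] = None  # text to type, if any


@dataclass(frozen=True)
class StepTarget:
    kind: str
    bbox: Optional[Tuple[float, float, float, float]] = None  # (x0, y0, x1, y1)
    text: Optional[str] = None


def verifiable_reward(pred: Action, target: StepTarget) -> float:
    """Rule-based check an instrumented environment could verify (assumed rules)."""
    if pred.kind != target.kind:
        return 0.0
    if target.bbox is not None:
        if pred.x is None or pred.y is None:
            return 0.0
        x0, y0, x1, y1 = target.bbox
        return 1.0 if x0 <= pred.x <= x1 and y0 <= pred.y <= y1 else 0.0
    if target.text is not None:
        return 1.0 if (pred.text or "").strip() == target.text.strip() else 0.0
    return 1.0  # action type matches and nothing else is checkable


# Stand-in signature for an LLM judge scoring reasoning against a principle, in [0, 1].
PrincipleScorer = Callable[[str, Action], float]


def toy_principle_scorer(reasoning: str, pred: Action) -> float:
    # Hypothetical principle: the reasoning should name the action it commits to.
    return 1.0 if pred.kind in reasoning.lower() else 0.5


def step_reward(pred: Action, target: StepTarget, reasoning: str,
                scorer: PrincipleScorer, alpha: float = 0.5) -> float:
    r_env = verifiable_reward(pred, target)
    if r_env == 0.0:
        return 0.0  # the verifiable check acts as a hard constraint
    r_principle = min(max(scorer(reasoning, pred), 0.0), 1.0)  # clamp to [0, 1]
    return (1.0 - alpha) * r_env + alpha * r_principle


def trajectory_rewards(preds: Sequence[Action], targets: Sequence[StepTarget],
                       reasonings: Sequence[str], scorer: PrincipleScorer,
                       alpha: float = 0.5) -> List[float]:
    """Stepwise rewards: one score per step instead of one per episode."""
    if not (len(preds) == len(targets) == len(reasonings)):
        raise ValueError("preds, targets and reasonings must have equal length")
    return [step_reward(p, t, r, scorer, alpha)
            for p, t, r in zip(preds, targets, reasonings)]


if __name__ == "__main__":
    preds = [Action("click", x=120, y=45), Action("type", text="report.txt"),
             Action("click", x=10, y=10)]
    targets = [StepTarget("click", bbox=(100, 30, 200, 60)),
               StepTarget("type", text="report.txt"),
               StepTarget("click", bbox=(100, 30, 200, 60))]
    reasonings = ["Click the Save button.", "Enter the file name.", "Click Save again."]
    print(trajectory_rewards(preds, targets, reasonings, toy_principle_scorer))
    # prints [1.0, 0.75, 0.0]
```

In this example the second step is verifiably correct but only half-satisfies the toy principle, so it earns 0.75; the third step misses the target box and earns nothing regardless of its reasoning. Gating the principle score on the verifiable check is one way to keep a fluent but wrong step from being rewarded; an ungated weighted sum is an equally plausible reading of the abstract.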

