AI | Nov 11, 2025

An Efficient Training Pipeline for Reasoning Graphical User Interface Agents

arXiv:2511.08172v33.3 | h-index: 1

Originality: Incremental advance

AI Analysis

This paper addresses the need for efficient training of reasoning-capable GUI agents, though it appears incremental: it builds on existing methods with improved data curation and adaptation.

The paper tackles the problem of training visual grounding models for GUI agents by introducing an efficient pipeline that filters 4.8M synthetic examples down to 12K clean instances and applies lightweight training strategies. The resulting 3B-parameter model matches or surpasses larger baselines on benchmarks like ScreenSpot, Multimodal-Mind2Web, and AndroidControl.

Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning-capable Graphical User Interface agents. Many existing methods rely on massive, noisy synthetic datasets. This work introduces an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. From 4.8M synthetic examples, 12K clean and diverse instances are curated by first identifying challenging cases, then removing misaligned ones, and finally selecting a diverse set of multimodal instances. On this data, a 3B-parameter Vision-Language Model is trained under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimization. Models trained with the filtered data and lightweight training strategies match or surpass larger baselines on benchmarks such as ScreenSpot, Multimodal-Mind2Web, and AndroidControl. These results demonstrate that principled data curation and robust adaptation can rival large-scale training, enabling compact yet capable multimodal reasoning agents.
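
The three curation stages named in the abstract (keep challenging cases, drop misaligned ones, select a diverse subset) can be illustrated with a minimal Python sketch. The abstract does not say how each stage is implemented, so everything concrete here is an assumption made for illustration: the `Example` fields, the "reference model clicks outside the target box" test for difficulty, the `alignment_score` threshold, and the greedy farthest-point selection over embeddings are hypothetical stand-ins, not the authors' method.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    instruction: str
    box: Tuple[float, float, float, float]   # ground-truth region (x1, y1, x2, y2)
    predicted_point: Tuple[float, float]     # click from a hypothetical reference model
    alignment_score: float                   # hypothetical instruction/box agreement in [0, 1]
    embedding: Tuple[float, ...]             # hypothetical multimodal embedding


def is_challenging(ex: Example) -> bool:
    """Assumed criterion: the reference model's click misses the target box."""
    x, y = ex.predicted_point
    x1, y1, x2, y2 = ex.box
    return not (x1 <= x <= x2 and y1 <= y <= y2)


def is_aligned(ex: Example, threshold: float = 0.5) -> bool:
    """Assumed criterion: the alignment score clears a fixed threshold."""
    return ex.alignment_score >= threshold


def select_diverse(examples: List[Example], k: int) -> List[Example]:
    """Greedy farthest-point selection: repeatedly pick the example whose
    embedding is farthest from everything already chosen."""
    if not examples or k <= 0:
        return []
    chosen = [examples[0]]
    # Distance from each example to its nearest chosen example.
    min_dist = [math.dist(ex.embedding, examples[0].embedding) for ex in examples]
    min_dist[0] = -1.0  # mark as already chosen so it is never picked again
    while len(chosen) < min(k, len(examples)):
        idx = max(range(len(examples)), key=min_dist.__getitem__)
        chosen.append(examples[idx])
        for i, ex in enumerate(examples):
            d = math.dist(ex.embedding, examples[idx].embedding)
            min_dist[i] = min(min_dist[i], d)
        min_dist[idx] = -1.0
    return chosen


def curate(examples: List[Example], k: int, threshold: float = 0.5) -> List[Example]:
    """Run the three stages in the order the abstract lists them."""
    hard = [ex for ex in examples if is_challenging(ex)]
    clean = [ex for ex in hard if is_aligned(ex, threshold)]
    return select_diverse(clean, k)


if __name__ == "__main__":
    box = (0.0, 0.0, 10.0, 10.0)
    pool = [
        Example("a", box, (5.0, 5.0), 0.9, (1.0, 1.0)),    # easy: click is inside the box
        Example("b", box, (20.0, 20.0), 0.9, (0.0, 0.0)),
        Example("c", box, (20.0, 20.0), 0.1, (9.0, 9.0)),  # misaligned: low score
        Example("d", box, (20.0, 20.0), 0.8, (0.1, 0.0)),
        Example("e", box, (20.0, 20.0), 0.7, (5.0, 5.0)),
    ]
    print([ex.instruction for ex in curate(pool, k=2)])
```

In this toy pool, `a` is dropped as too easy and `c` as misaligned; of the remaining three, the diversity step keeps the two whose embeddings are farthest apart. In a real pipeline the scores and embeddings would come from models, which is where the "model-based" part of the filtering would enter.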
