CL · AI · CV · HC · Mar 1, 2025

Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks

arXiv:2503.00401v2 · 11 citations · h-index: 11 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in GUI agent efficiency for resource-limited applications, offering an incremental improvement over existing grounding techniques.

The paper tackles the format discrepancy between coordinate-oriented grounding and action-oriented reasoning in GUI agents for resource-constrained scenarios. It proposes a query-oriented pivot task called query inference, which infers user queries from a screenshot and its element coordinates to improve coordinate understanding and alignment with reasoning tasks. The approach achieves performance comparable to or better than large-scale methods with less than 0.1% of the training data.

Perception-enhanced pre-training, particularly through grounding techniques, is widely adopted to enhance the performance of graphical user interface (GUI) agents. However, in resource-constrained scenarios, the format discrepancy between coordinate-oriented grounding and action-oriented reasoning limits the effectiveness of grounding for reasoning tasks. To address this challenge, we propose a query-oriented pivot approach called query inference, which serves as a bridge between GUI grounding and reasoning. By inferring potential user queries from a screenshot and its associated element coordinates, query inference improves the understanding of coordinates while aligning more closely with reasoning tasks. Experimental results show that query inference outperforms previous grounding techniques under the same training data scale. Notably, query inference achieves performance comparable to or even better than the large-scale grounding-enhanced OS-Atlas with less than 0.1% of the training data. Furthermore, we explore the impact of reasoning formats and demonstrate that integrating additional semantic information into the input further boosts reasoning performance. The code is publicly available at https://github.com/ZrW00/GUIPivot.
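
To make the contrast between the two task formats concrete, here is a minimal sketch of how one annotated element could be turned into either a grounding sample (query in, coordinates out) or a query inference sample (coordinates in, query out). Every name here, including `AnnotatedElement`, the dictionary keys, the prompt wording, and the normalized coordinate format, is an assumption made for illustration and is not taken from the paper or the GUIPivot repository.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class AnnotatedElement:
    """One labeled GUI element (hypothetical schema, not the paper's)."""
    screenshot_path: str
    query: str                      # user intent that targets this element
    point: Tuple[float, float]      # assumed normalized (x, y) in [0, 1]


def format_point(point: Tuple[float, float]) -> str:
    """Render a coordinate pair as text with two decimals."""
    x, y = point
    return f"({x:.2f}, {y:.2f})"


def to_grounding_sample(el: AnnotatedElement) -> Dict[str, str]:
    """Coordinate-oriented grounding: the model predicts coordinates."""
    return {
        "image": el.screenshot_path,
        "prompt": f"Locate the element for this query: {el.query}",
        "target": format_point(el.point),
    }


def to_query_inference_sample(el: AnnotatedElement) -> Dict[str, str]:
    """Query inference: the model predicts the query from coordinates."""
    return {
        "image": el.screenshot_path,
        "prompt": (
            "What user query would lead to acting at "
            f"{format_point(el.point)}?"
        ),
        "target": el.query,
    }


element = AnnotatedElement(
    screenshot_path="screens/settings.png",  # placeholder path
    query="Turn on dark mode",
    point=(0.5, 0.25),
)
grounding = to_grounding_sample(element)
query_inference = to_query_inference_sample(element)
```

In this sketch the same annotation feeds both formats; only the direction of prediction changes. The query inference target is natural language, which is closer in form to the action-oriented output of a reasoning task than a bare coordinate string is.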

