Improved GUI Grounding via Iterative Narrowing
This work addresses GUI grounding for VLM agents, offering incremental improvements in a domain-specific task.
The paper tackles suboptimal GUI grounding in Vision-Language Models by introducing an iterative narrowing visual prompting framework, which improves the performance of both baseline and fine-tuned models on a comprehensive UI benchmark.
Graphical User Interface (GUI) grounding plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in GUI grounding remains suboptimal. Recent studies have focused on fine-tuning these models specifically for zero-shot GUI grounding, yielding significant improvements over baseline performance. We introduce a visual prompting framework that employs an iterative narrowing mechanism to further improve the performance of both general and fine-tuned models in GUI grounding. We evaluate our method on a comprehensive benchmark comprising various UI platforms and provide code to reproduce our results.
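The abstract does not spell out how the narrowing step works, so the following is only a minimal sketch of one plausible reading: the model predicts a point, the view is cropped around that point, and the prediction is repeated on the smaller region. The `predict` callable, the `shrink` factor, the `iterations` count, and the clamping rule are all illustrative assumptions and are not taken from the paper or its released code.

```python
from typing import Callable, Tuple

# (left, top, width, height) of the current crop, in full-image pixels.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def iterative_narrowing(
    predict: Callable[[Box], Point],
    image_size: Tuple[float, float],
    iterations: int = 3,   # assumed value, not from the paper
    shrink: float = 0.5,   # assumed crop ratio per step, not from the paper
) -> Point:
    """Refine a point prediction by repeatedly cropping around it.

    `predict` is a hypothetical stand-in for a VLM call: given the crop
    it is shown, it returns a point normalized to [0, 1] within that crop.
    """
    width, height = float(image_size[0]), float(image_size[1])
    box: Box = (0.0, 0.0, width, height)
    x, y = width / 2.0, height / 2.0

    for _ in range(iterations):
        left, top, w, h = box
        nx, ny = predict(box)

        # Map the crop-relative prediction back to full-image pixels.
        x = left + nx * w
        y = top + ny * h

        # Centre a smaller crop on the prediction, clamped to the image.
        new_w, new_h = w * shrink, h * shrink
        new_left = min(max(x - new_w / 2.0, 0.0), width - new_w)
        new_top = min(max(y - new_h / 2.0, 0.0), height - new_h)
        box = (new_left, new_top, new_w, new_h)

    return x, y
```

In practice `predict` would render the crop, query the model with the instruction, and parse the returned coordinates. Expressing the crop as a box rather than an image keeps the coordinate bookkeeping separate from any particular model or imaging library.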