CV, AI, CL | Nov 18, 2024

Improved GUI Grounding via Iterative Narrowing

arXiv:2411.13591v712 citations | h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses GUI grounding for VLM agents, offering incremental improvements in a domain-specific task.

The paper tackles the problem of suboptimal GUI grounding in Vision-Language Models by introducing an iterative narrowing visual prompting framework, resulting in improved performance over baseline and fine-tuned models as tested on a comprehensive UI benchmark.

Graphical User Interface (GUI) grounding plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in GUI grounding remains suboptimal. Recent studies have focused on fine-tuning these models specifically for zero-shot GUI grounding, yielding significant improvements over baseline performance. We introduce a visual prompting framework that employs an iterative narrowing mechanism to further improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provided the code to reproduce our results.
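The abstract does not spell out the narrowing procedure, so the following is only a minimal sketch of one plausible reading: the model predicts a point, the view is cropped to a smaller window around that point, and the model is queried again on the crop. Everything here is an assumption made for illustration rather than the paper's implementation: the names `narrow_box`, `iterative_narrowing`, and `make_toy_predictor`, the `shrink` factor, the iteration count, and the convention that the predictor returns coordinates normalized to the current crop. The toy predictor stands in for a real VLM call and simulates a model that can only localize to a coarse grid.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]              # (x, y) in pixels


def narrow_box(center: Point, size: Tuple[float, float], bounds: Box) -> Box:
    """Return a box of `size` centered on `center`, shifted to stay inside `bounds`."""
    cx, cy = center
    w, h = size
    b_left, b_top, b_right, b_bottom = bounds
    left = min(max(cx - w / 2, b_left), b_right - w)
    top = min(max(cy - h / 2, b_top), b_bottom - h)
    return (left, top, left + w, top + h)


def iterative_narrowing(
    image_size: Tuple[int, int],
    predict: Callable[[Box], Point],
    iterations: int = 3,   # assumed value, not taken from the paper
    shrink: float = 0.5,   # assumed value, not taken from the paper
) -> Point:
    """Refine a predicted point by repeatedly cropping around the last prediction.

    `predict` receives the current crop box and returns (x, y) normalized to
    [0, 1] within that crop. In a real system it would crop the screenshot
    and query a VLM; here it is a hypothetical callable.
    """
    full: Box = (0.0, 0.0, float(image_size[0]), float(image_size[1]))
    box = full
    point: Point = (full[2] / 2, full[3] / 2)
    for _ in range(iterations):
        nx, ny = predict(box)
        left, top, right, bottom = box
        # Map the crop-relative prediction back to full-image pixel coordinates.
        point = (left + nx * (right - left), top + ny * (bottom - top))
        # Shrink the window around the new estimate for the next query.
        new_size = ((right - left) * shrink, (bottom - top) * shrink)
        box = narrow_box(point, new_size, full)
    return point


def make_toy_predictor(target: Point, step: int = 1) -> Callable[[Box], Point]:
    """Hypothetical stand-in for a VLM: locates `target` in the crop, but only
    to `step` decimal places of the normalized coordinates."""
    def predict(box: Box) -> Point:
        left, top, right, bottom = box
        nx = (target[0] - left) / (right - left)
        ny = (target[1] - top) / (bottom - top)
        return (round(nx, step), round(ny, step))
    return predict


if __name__ == "__main__":
    predictor = make_toy_predictor(target=(1500.0, 200.0))
    print(iterative_narrowing((1920, 1080), predictor, iterations=1))
    print(iterative_narrowing((1920, 1080), predictor, iterations=3))
```

Because the toy predictor's resolution is relative to the crop, each narrowing step makes the same coarse grid correspond to fewer pixels, which is the intuition behind refining a prediction on progressively smaller regions. A real model can also mislocalize badly in an early step and leave the target outside the crop; this sketch does not handle that case.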

Code Implementations: 1 repo