CV, AI, LG | Feb 12, 2025

TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents

arXiv:2502.08226v2 | 6 citations | h-index: 2 | 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Originality: Highly original
AI Analysis

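This work addresses two limitations of existing GUI agents: training-based agents generalize poorly across datasets and platforms, and generalist LVLMs that rely on Set-of-Marks need metadata such as HTML source that is not always available. Its training-free approach to GUI understanding is relevant to developers and users of LVLM-based GUI agents.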

The authors introduce TRISHUL, a training-free framework that enhances generalist LVLMs for GUI comprehension. They report superior action-grounding performance on four datasets (ScreenSpot, VisualWebBench, AITW, and Mind2Web) and, for GUI referring, results that surpass the ToL agent on the ScreenPR benchmark. The framework is designed to provide multi-granular, spatially and semantically enriched representations of GUI elements.

Recent advancements in Large Vision Language Models (LVLMs) have enabled the development of LVLM-based Graphical User Interface (GUI) agents under various paradigms. Training-based approaches, such as CogAgent and SeeClick, struggle with cross-dataset and cross-platform generalization due to their reliance on dataset-specific training. Generalist LVLMs, such as GPT-4V, employ Set-of-Marks (SoM) for action grounding, but obtaining SoM labels requires metadata like HTML source, which is not consistently available across platforms. Moreover, existing methods often specialize in a single GUI task rather than achieving comprehensive GUI understanding. To address these limitations, we introduce TRISHUL, a novel, training-free agentic framework that enhances generalist LVLMs for holistic GUI comprehension. Unlike prior works that focus on either action grounding (mapping instructions to GUI elements) or GUI referring (describing GUI elements given a location), TRISHUL integrates both. At its core, TRISHUL employs Hierarchical Screen Parsing (HSP) and the Spatially Enhanced Element Description (SEED) module, which work together to provide multi-granular, spatially and semantically enriched representations of GUI elements. Our results demonstrate TRISHUL's superior performance in action grounding across the ScreenSpot, VisualWebBench, AITW, and Mind2Web datasets. Additionally, for GUI referring, TRISHUL surpasses the ToL agent on the ScreenPR benchmark, setting a new standard for robust and adaptable GUI comprehension.
