CVDec 14, 2024

Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning

arXiv:2412.10840v112 citationsh-index: 7Has CodeAAAI
Originality Incremental advance
AI Analysis

This addresses the challenge of accurate GUI component identification for autonomous interaction systems, offering a novel tuning-free approach that is incremental but potentially reduces data and computational costs.

The paper tackles the problem of grounding GUI components like text or icons from images and text queries without fine-tuning, proposing a tuning-free attention-driven method that achieves performance comparable to tuning-based approaches, with notable success in text localization.

Recent advancements in Multimodal Large Language Models (MLLMs) have generated significant interest in their ability to autonomously interact with and interpret Graphical User Interfaces (GUIs). A major challenge in these systems is grounding-accurately identifying critical GUI components such as text or icons based on a GUI image and a corresponding text query. Traditionally, this task has relied on fine-tuning MLLMs with specialized training data to predict component locations directly. However, in this paper, we propose a novel Tuning-free Attention-driven Grounding (TAG) method that leverages the inherent attention patterns in pretrained MLLMs to accomplish this task without the need for additional fine-tuning. Our method involves identifying and aggregating attention maps from specific tokens within a carefully constructed query prompt. Applied to MiniCPM-Llama3-V 2.5, a state-of-the-art MLLM, our tuning-free approach achieves performance comparable to tuning-based methods, with notable success in text localization. Additionally, we demonstrate that our attention map-based grounding technique significantly outperforms direct localization predictions from MiniCPM-Llama3-V 2.5, highlighting the potential of using attention maps from pretrained MLLMs and paving the way for future innovations in this domain.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes