CV | Dec 2, 2024

Ponder & Press: Advancing Visual GUI Agent towards General Computer Control

arXiv:2412.01268v1 | 22 citations | h-index: 7
Originality: Highly original
AI Analysis

The paper addresses the need for flexible, human-like GUI automation across diverse software platforms, and it represents a novel method rather than an incremental improvement.

The paper tackles the problem of GUI agents relying on non-visual inputs by introducing Ponder & Press, a visual-only framework that uses multimodal large language models to interpret instructions and locate GUI elements. Its locator improves on existing models by +22.5% on the ScreenSpot GUI grounding benchmark, and the framework achieves state-of-the-art performance across web, desktop, and mobile environments.

Most existing GUI agents depend on non-vision inputs such as HTML source code or accessibility trees, which limits their flexibility across diverse software environments and platforms. Current multimodal large language models (MLLMs), which excel at using vision to ground real-world objects, offer a potential alternative. However, they often struggle to localize GUI elements accurately -- a critical requirement for effective GUI automation -- because of the semantic gap between real-world objects and GUI elements. In this work, we introduce Ponder & Press, a divide-and-conquer framework for general computer control using only visual input. Our approach combines a general-purpose MLLM as an 'interpreter', responsible for translating high-level user instructions into detailed action descriptions, with a GUI-specific MLLM as a 'locator' that precisely locates GUI elements for action placement. By relying purely on visual input, our agent offers a versatile, human-like interaction paradigm applicable to a wide range of applications. The Ponder & Press locator outperforms existing models by +22.5% on the ScreenSpot GUI grounding benchmark. Both offline and interactive agent benchmarks across various GUI environments -- including web pages, desktop software, and mobile UIs -- demonstrate that the Ponder & Press framework achieves state-of-the-art performance, highlighting the potential of visual GUI agents. For more details, see the project homepage: https://invinciblewyq.github.io/ponder-press-page/
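
The interpreter/locator split described in the abstract can be pictured with a minimal Python sketch. This is an illustration only, not the authors' implementation: the names `ActionDescription`, `GroundedAction`, `ponder_and_press_step`, `stub_interpreter`, and `stub_locator` are all hypothetical, and the two stub functions stand in for the general-purpose MLLM and the GUI-specific MLLM, whose real interfaces are not given in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ActionDescription:
    """Hypothetical interpreter output: what to do and on which element."""
    action: str      # e.g. "click" or "type"
    target: str      # natural-language description of the GUI element
    text: str = ""   # text to enter, if the action needs any


@dataclass
class GroundedAction:
    """Hypothetical executable action: the description plus screen coordinates."""
    action: str
    x: int
    y: int
    text: str = ""


# Assumed interfaces: both stages see only the instruction/description and a screenshot.
Interpreter = Callable[[str, bytes], List[ActionDescription]]
Locator = Callable[[str, bytes], Tuple[int, int]]


def ponder_and_press_step(
    instruction: str,
    screenshot: bytes,
    interpreter: Interpreter,
    locator: Locator,
) -> List[GroundedAction]:
    """Divide and conquer: interpret the instruction first, then ground each action."""
    grounded = []
    for desc in interpreter(instruction, screenshot):
        x, y = locator(desc.target, screenshot)
        grounded.append(GroundedAction(desc.action, x, y, desc.text))
    return grounded


def stub_interpreter(instruction: str, screenshot: bytes) -> List[ActionDescription]:
    # Stand-in for the general-purpose MLLM; returns a fixed plan for demonstration.
    return [
        ActionDescription("click", "the search box"),
        ActionDescription("type", "the search box", text=instruction),
    ]


def stub_locator(target: str, screenshot: bytes) -> Tuple[int, int]:
    # Stand-in for the GUI-specific MLLM; returns fixed coordinates for demonstration.
    return (120, 48)


if __name__ == "__main__":
    for step in ponder_and_press_step("weather", b"", stub_interpreter, stub_locator):
        print(step)
```

Passing the two stages in as plain callables keeps the sketch independent of any particular model API. In practice each stub would be replaced by a call to the corresponding model, with the screenshot as its only view of the interface.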

