Visual Grounding Methods for Efficient Interaction with Desktop Graphical User Interfaces
This work addresses limited automation in software testing, accessibility, and human-computer interaction by enabling AI agents to interact with GUIs. Its contribution is incremental, since it adapts existing visual grounding techniques to a synthetic image domain.
The paper tackles visual grounding for Graphical User Interfaces (GUIs). It proposes two Instruction Visual Grounding (IVG) methods, IVGocr and IVGdirect, which locate GUI elements from natural language instructions. It also introduces a dedicated dataset for each method and a new evaluation metric, Central Point Validation (CPV).
Most visual grounding solutions primarily focus on realistic images. However, applications involving synthetic images, such as Graphical User Interfaces (GUIs), remain limited. This restricts the development of autonomous computer vision-powered artificial intelligence (AI) agents for automatic application interaction. Enabling AI to effectively understand and interact with GUIs is crucial to advancing automation in software testing, accessibility, and human-computer interaction. In this work, we explore Instruction Visual Grounding (IVG), a multi-modal approach to object identification within a GUI. More precisely, given a natural language instruction and a GUI screen, IVG locates the coordinates of the element on the screen where the instruction should be executed. We propose two main methods: (1) IVGocr, which combines a Large Language Model (LLM), an object detection model, and an Optical Character Recognition (OCR) module; and (2) IVGdirect, which uses a multimodal architecture for end-to-end grounding. For each method, we introduce a dedicated dataset. In addition, we propose the Central Point Validation (CPV) metric, a relaxed variant of the classical Central Proximity Score (CPS) metric. Our final test dataset is publicly released to support future research.
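The abstract names the CPV metric but does not define it. The sketch below illustrates one plausible reading, under the assumption that a prediction counts as correct when the center of the predicted bounding box falls inside the ground-truth box of the target element. The function names, the box layout (x_min, y_min, x_max, y_max), and the averaging over samples are illustrative choices, not taken from the paper.

```python
from typing import Iterable, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def cpv_hit(pred_box: Box, gt_box: Box) -> bool:
    """Assumed CPV rule: the predicted box's center lies inside the ground-truth box."""
    cx, cy = box_center(pred_box)
    x_min, y_min, x_max, y_max = gt_box
    return x_min <= cx <= x_max and y_min <= cy <= y_max


def cpv_score(pairs: Iterable[Tuple[Box, Box]]) -> float:
    """Fraction of (prediction, ground truth) pairs that satisfy the assumed CPV rule."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(cpv_hit(pred, gt) for pred, gt in pairs) / len(pairs)


if __name__ == "__main__":
    samples = [
        ((10, 10, 30, 30), (0, 0, 25, 25)),   # center (20, 20) is inside the target
        ((40, 40, 60, 60), (0, 0, 25, 25)),   # center (50, 50) is outside the target
    ]
    print(cpv_score(samples))
```

Under this reading, the metric rewards any prediction on which a click at the center would land on the intended element, regardless of how tightly the predicted box matches the ground truth. That fits the GUI interaction setting, where the click location matters more than box overlap.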