\textsc{GUI-Spotlight}: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding
GUI-Spotlight addresses a critical bottleneck for GUI automation in real-world applications, namely unreliable visual grounding, and enables more reliable interaction, although it is an incremental improvement over existing methods.
The paper tackles unreliable visual grounding in GUI systems, which limits accurate pointer-level actions such as clicking. It introduces GUI-Spotlight, a model that uses adaptive iterative focus refinement to improve grounding accuracy. With only 18.5K training samples, GUI-Spotlight reaches 52.8\% accuracy on the ScreenSpot-Pro benchmark, outperforming existing models trained on much larger datasets.
Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user-interface (GUI) systems, propelling them beyond controlled simulations into complex, real-world environments across diverse platforms. However, their practical usefulness is still bounded by the reliability of visual grounding, i.e., mapping textual references to exact on-screen elements. This limitation prevents the system from accurately performing pointer-level actions such as clicking or dragging. To address this, we introduce GUI-Spotlight -- a model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy. On the ScreenSpot-Pro benchmark, GUI-Spotlight, trained on only 18.5K samples, achieves 52.8\% accuracy, surpassing V2P-7B (50.6\% with 9.6M training samples) and GTA-1-7B (50.1\% with 1.56M training samples).
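
To put the comparison above in proportion, the training-set sizes quoted in the abstract can be expressed as ratios. This is simple arithmetic on the reported figures, rounded to the nearest integer, and not an additional result:
\[
\frac{9.6 \times 10^{6}}{1.85 \times 10^{4}} \approx 519,
\qquad
\frac{1.56 \times 10^{6}}{1.85 \times 10^{4}} \approx 84 .
\]
In other words, V2P-7B and GTA-1-7B were trained on roughly 519 and 84 times as many samples as GUI-Spotlight, while GUI-Spotlight scores $52.8 - 50.6 = 2.2$ and $52.8 - 50.1 = 2.7$ percentage points higher on ScreenSpot-Pro, respectively.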