CL, AI, LG | Sep 29, 2025

Retrieval-augmented GUI Agents with Generative Guidelines

arXiv:2509.24183v1 | 11 citations | h-index: 17 | EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of automating complex digital tasks with GUI agents and offers a plug-and-play solution, though it appears incremental because it builds on existing VLM-based methods.

The paper tackles two limitations of GUI agents: scarce training data, and complex tasks that require rare knowledge. It proposes RAG-GUI, a lightweight VLM that leverages web tutorials at inference time and outperforms baselines by 2.6% to 13.3% across tasks.

GUI agents powered by vision-language models (VLMs) show promise in automating complex digital tasks. However, their effectiveness in real-world applications is often limited by scarce training data and the inherent complexity of these tasks, which frequently require long-tailed knowledge covering rare, unseen scenarios. We propose RAG-GUI, a lightweight VLM that leverages web tutorials at inference time. RAG-GUI is first warm-started via supervised finetuning (SFT) and further refined through self-guided rejection sampling finetuning (RSF). Designed to be model-agnostic, RAG-GUI functions as a generic plug-in that enhances any VLM-based agent. Evaluated across three distinct tasks, it consistently outperforms baseline agents and surpasses other inference baselines by 2.6% to 13.3% across two model sizes, demonstrating strong generalization and practical plug-and-play capabilities in real-world scenarios.
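To make the plug-in idea concrete, here is a minimal Python sketch of how retrieved tutorials could be turned into guidelines and prepended to an agent's prompt at inference time. It is an illustration under stated assumptions, not the paper's implementation. Every name (`retrieve`, `build_prompt`, `run_with_guidelines`, `keep_accepted`) is hypothetical, the word-overlap retriever is a toy stand-in for a real one, and the acceptance check used for rejection sampling is an assumption, since the abstract does not specify the criterion.

```python
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenization (toy; a real system would use a proper retriever)."""
    return set(text.lower().split())


def retrieve(task: str, tutorials: List[str], k: int = 2) -> List[str]:
    """Return the k tutorials sharing the most words with the task description."""
    task_words = tokenize(task)
    ranked = sorted(
        tutorials,
        key=lambda t: len(task_words & tokenize(t)),
        reverse=True,  # sorted() is stable, so ties keep their original order
    )
    return ranked[:k]


def build_prompt(task: str, guidelines: List[str]) -> str:
    """Prepend the generated guidelines to the task before calling the agent."""
    lines = ["Guidelines:"] + [f"- {g}" for g in guidelines] + [f"Task: {task}"]
    return "\n".join(lines)


def run_with_guidelines(
    task: str,
    tutorials: List[str],
    summarize: Callable[[str, str], str],  # hypothetical guideline generator (the lightweight VLM's role)
    agent: Callable[[str], str],           # hypothetical VLM-based GUI agent, left unchanged
    k: int = 2,
) -> str:
    """Retrieve tutorials, condense each into a guideline, and pass them to the agent."""
    guidelines = [summarize(task, t) for t in retrieve(task, tutorials, k)]
    return agent(build_prompt(task, guidelines))


def keep_accepted(candidates: List[str], is_acceptable: Callable[[str], bool]) -> List[str]:
    """Generic rejection-sampling filter: keep only the sampled candidates that pass a check.

    Assumption: in a finetuning loop, the check might test whether a guideline
    leads the agent to the correct action; the abstract does not specify this.
    """
    return [c for c in candidates if is_acceptable(c)]
```

Because the agent is passed in as a plain callable, the same wrapper works with any agent that accepts a text prompt, which is one way to read the abstract's "model-agnostic" claim.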

