Retrieval-augmented GUI Agents with Generative Guidelines
This work addresses the challenge of automating complex digital tasks with GUI agents and offers a plug-and-play solution, though it appears incremental because it builds on existing VLM-based methods.
The paper targets two limits on GUI agents: scarce training data and complex tasks that require rare knowledge. It proposes RAG-GUI, a lightweight VLM that leverages web tutorials at inference time and surpasses other inference baselines by 2.6% to 13.3% across three tasks and two model sizes.
GUI agents powered by vision-language models (VLMs) show promise in automating complex digital tasks. However, their effectiveness in real-world applications is often limited by scarce training data and the inherent complexity of these tasks, which frequently require long-tailed knowledge covering rare, unseen scenarios. We propose RAG-GUI, a lightweight VLM that leverages web tutorials at inference time. RAG-GUI is first warm-started via supervised finetuning (SFT) and further refined through self-guided rejection sampling finetuning (RSF). Designed to be model-agnostic, RAG-GUI functions as a generic plug-in that enhances any VLM-based agent. Evaluated across three distinct tasks, it consistently outperforms baseline agents and surpasses other inference baselines by 2.6% to 13.3% across two model sizes, demonstrating strong generalization and practical plug-and-play capabilities in real-world scenarios.
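To make the plug-in idea concrete, the following is a minimal sketch of one inference step, not the authors' implementation. It assumes a three-stage flow (retrieve tutorials, turn each into a task-specific guideline, pass the guidelines to an unchanged agent). The names `Tutorial`, `retrieve`, and `guided_step` are hypothetical, the word-overlap retriever is a toy stand-in for whatever retriever the paper uses, and the guideline model and agent are passed in as plain callables so that any VLM-based agent could be substituted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tutorial:
    """A web tutorial; the fields are illustrative assumptions."""
    title: str
    text: str


def _words(s: str) -> set:
    return set(s.lower().split())


def retrieve(task: str, corpus: List[Tutorial], k: int = 2) -> List[Tutorial]:
    """Return the k tutorials sharing the most words with the task.

    Toy lexical scoring only; a real system would likely use a learned
    or search-engine retriever.
    """
    task_words = _words(task)
    ranked = sorted(
        corpus,
        key=lambda t: len(task_words & _words(t.title + " " + t.text)),
        reverse=True,
    )
    return ranked[:k]


def guided_step(
    task: str,
    screen: str,
    corpus: List[Tutorial],
    guideline_model: Callable[[str, str, Tutorial], str],
    agent: Callable[[str, str, List[str]], str],
    k: int = 2,
) -> str:
    """Run one agent step with tutorial-derived guidelines attached."""
    tutorials = retrieve(task, corpus, k)
    # The guideline model (the lightweight VLM in the paper's framing)
    # condenses each tutorial into guidance for the current screen.
    guidelines = [guideline_model(task, screen, t) for t in tutorials]
    # Assumption: an empty string signals an irrelevant tutorial, which is dropped.
    guidelines = [g for g in guidelines if g]
    # The agent itself is untouched; it only receives extra context.
    return agent(task, screen, guidelines)
```

Because the agent is an argument rather than a fixed model, the same wrapper applies to agents of different sizes, which is the sense of "model-agnostic" this sketch is meant to illustrate.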