AI
Apr 29, 2025

A Survey on GUI Agents with Foundation Models Enhanced by Reinforcement Learning

arXiv:2504.20464v2
6 citations
h-index: 2
Originality: Synthesis-oriented
AI Analysis

The paper addresses the problem of enabling intelligent interaction with digital systems, which matters to both users and developers. Its contribution is incremental because it is a survey that summarizes existing work rather than proposing a new method.

This paper surveys recent advances in GUI agents that use Multi-modal Large Language Models enhanced by Reinforcement Learning, highlighting how innovations in multimodal perception and adaptive action generation have improved generalization and robustness in complex environments.

Graphical User Interface (GUI) agents, driven by Multi-modal Large Language Models (MLLMs), have emerged as a promising paradigm for enabling intelligent interaction with digital systems. This paper provides a structured survey of recent advances in GUI agents, focusing on architectures enhanced by Reinforcement Learning (RL). We first formalize GUI agent tasks as Markov Decision Processes and discuss typical execution environments and evaluation metrics. We then review the modular architecture of (M)LLM-based GUI agents, covering Perception, Planning, and Acting modules, and trace their evolution through representative works. Furthermore, we categorize GUI agent training methodologies into Prompt-based, Supervised Fine-Tuning (SFT)-based, and RL-based approaches, highlighting the progression from simple prompt engineering to dynamic policy learning via RL. Our summary illustrates how recent innovations in multimodal perception, decision reasoning, and adaptive action generation have significantly improved the generalization and robustness of GUI agents in complex real-world environments. We conclude by identifying key challenges and future directions for building more capable and reliable GUI agents.
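To make the Markov Decision Process framing mentioned in the abstract concrete, here is a minimal sketch of an agent-environment loop over a toy form-filling task. It is an illustration only, not the formalization used in the paper: the `ToyLoginEnv` class, the `Action` fields, the element ids, the sparse reward, and the scripted policy are all hypothetical stand-ins (a real GUI agent would observe screenshots or accessibility trees and use an (M)LLM as the policy).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str        # "type" or "click" (hypothetical action space)
    target: str      # hypothetical UI element id
    text: str = ""   # text payload for "type" actions


class ToyLoginEnv:
    """Hypothetical two-field form. State = field contents + submitted flag."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.fields = {"user": "", "password": ""}
        self.submitted = False
        return self.observe()

    def observe(self):
        # Observation: a structured view of the screen (stand-in for a screenshot).
        return {"fields": dict(self.fields), "submitted": self.submitted}

    def step(self, action):
        # Transition: apply the action to the state.
        if action.kind == "type" and action.target in self.fields:
            self.fields[action.target] = action.text
        elif action.kind == "click" and action.target == "submit":
            # Submission only succeeds once every field is filled.
            self.submitted = all(self.fields.values())
        # Sparse reward: 1.0 only when the task is complete (an assumption).
        reward = 1.0 if self.submitted else 0.0
        return self.observe(), reward, self.submitted


def scripted_policy(obs):
    """Stand-in for a learned policy: fill the first empty field, then submit."""
    for name, value in obs["fields"].items():
        if not value:
            return Action("type", name, "demo")
    return Action("click", "submit")


def run_episode(env, policy, max_steps=10):
    """Roll out one episode; return (total reward, number of steps taken)."""
    obs = env.reset()
    total = 0.0
    for t in range(max_steps):
        obs, reward, done = env.step(policy(obs))
        total += reward
        if done:
            return total, t + 1
    return total, max_steps


if __name__ == "__main__":
    print(run_episode(ToyLoginEnv(), scripted_policy))
```

In this framing, prompt-based and SFT-based approaches fix the policy ahead of time, whereas an RL-based approach would update it from the rewards collected in rollouts like `run_episode`.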

