CL | Mar 26, 2025

ScreenLLM: Stateful Screen Schema for Efficient Action Understanding and Prediction

arXiv:2503.20978v1 | 3 citations | h-index: 20 | Has Code | WWW
Originality: Incremental advance
AI Analysis

This work addresses the problem of building scalable and intelligent GUI agents for user assistance and automation, representing an incremental advancement in the field.

The paper tackles the challenge of training GUI agents by proposing a stateful screen schema representation and ScreenLLM multimodal models, achieving accurate user behavior modeling and action prediction in experiments.

Graphical User Interface (GUI) agents are autonomous systems that interpret and generate actions, enabling intelligent user assistance and automation. Effective training of these agents presents unique challenges, such as sparsity in supervision signals, scalability for large datasets, and the need for nuanced user understanding. We propose stateful screen schema, an efficient representation of GUI interactions that captures key user actions and intentions over time. Building on this foundation, we introduce ScreenLLM, a set of multimodal large language models (MLLMs) tailored for advanced UI understanding and action prediction. Extensive experiments on both open-source and proprietary models show that ScreenLLM accurately models user behavior and predicts actions. Our work lays the foundation for scalable, robust, and intelligent GUI agents that enhance user interaction in diverse software environments.
