LG CL CV HC · May 31, 2023

From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces

arXiv:2306.00245v2 · 84 citations
Originality: Highly original
AI Analysis

This work addresses the challenge of building more human-like agents for GUI interaction that do not rely on structured data. The advance is significant but domain-specific.

The paper tackles the problem of creating digital agents that interact with graphical user interfaces using pixel-based screenshots and generic keyboard/mouse actions, achieving performance that surpasses human crowdworkers on the MiniWob++ benchmark.

Much of the previous work towards digital agents for graphical user interfaces (GUIs) has relied on text-based representations (derived from HTML or other structured data sources), which are not always readily available. These input representations have often been coupled with custom, task-specific action spaces. This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use: pixel-based screenshots and a generic action space corresponding to keyboard and mouse actions. Building upon recent progress in pixel-based pretraining, we show, for the first time, that it is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction following tasks.

Code Implementations: 1 repo