GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks
This work supports developers and researchers working on GUI automation by providing a structured framework for assessing GUI knowledge. Its contribution is incremental in that it focuses on benchmarking rather than proposing new methods.
The paper attributes the performance gap between large vision-language models (VLMs) and humans in GUI task automation to missing core GUI knowledge. It introduces GUI Knowledge Bench, a benchmark spanning six platforms and 292 applications, and shows that VLMs struggle with perceiving system states, predicting actions, and verifying task completion.
Large vision-language models (VLMs) have advanced graphical user interface (GUI) task automation but still lag behind humans. We hypothesize that this gap stems from missing core GUI knowledge, which existing training schemes (such as supervised fine-tuning and reinforcement learning) alone cannot fully address. By analyzing common failure patterns in GUI task execution, we distill GUI knowledge into three dimensions: (1) interface perception, knowledge about recognizing widgets and system states; (2) interaction prediction, knowledge about reasoning over the state transitions that actions cause; and (3) instruction understanding, knowledge about planning, verifying, and assessing task completion progress. We further introduce GUI Knowledge Bench, a benchmark of multiple-choice and yes/no questions across six platforms (Web, Android, macOS, Windows, Linux, iOS) and 292 applications. Our evaluation shows that current VLMs can identify widget functions but struggle with perceiving system states, predicting actions, and verifying task completion. Experiments on real-world GUI tasks further validate the close link between GUI knowledge and task success. By providing a structured framework for assessing GUI knowledge, our work supports the selection of VLMs with greater potential prior to downstream training and provides insights for building more capable GUI agents.
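As an illustration of how answers to multiple-choice and yes/no questions could be scored separately for each of the three knowledge dimensions, the following is a minimal sketch. The `Item` record, its field names, and the sample data are hypothetical and are not taken from the benchmark's actual data format or evaluation code; exact-match scoring after whitespace and case normalization is likewise an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical record for one benchmark question (not the real schema)."""
    dimension: str   # e.g. "interface perception"
    answer: str      # gold label: an option letter or "yes"/"no"
    prediction: str  # the model's answer for this question


def normalize(text: str) -> str:
    """Strip whitespace and lowercase so that 'B' matches ' b '."""
    return text.strip().lower()


def accuracy_by_dimension(items):
    """Return exact-match accuracy for each knowledge dimension."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.dimension] += 1
        if normalize(item.prediction) == normalize(item.answer):
            correct[item.dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Made-up sample data, purely to show the call pattern.
items = [
    Item("interface perception", "B", "b"),
    Item("interface perception", "yes", "no"),
    Item("interaction prediction", "A", "A"),
    Item("instruction understanding", "no", " No "),
]
scores = accuracy_by_dimension(items)
for dimension, score in scores.items():
    print(f"{dimension}: {score:.2f}")
```

Reporting accuracy per dimension rather than as a single aggregate keeps weaknesses visible, such as a model that recognizes widgets well but fails to verify task completion.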