AI · CL · CV · HC · Jun 17, 2024

GUICourse: From General Vision Language Models to Versatile GUI Agents

arXiv:2406.11317v2 · 117 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the problem of developing practical GUI agents for human-computer interaction, representing an incremental advancement by enhancing existing VLMs with specialized datasets.

The paper adapts general Vision Language Models (VLMs) to GUI navigation by addressing their deficiencies in OCR, grounding, and GUI knowledge. The resulting agents outperform their baseline VLMs on common GUI tasks, and even a small 3.1B-parameter agent handles single-step and multi-step tasks well.

Utilizing Graphical User Interfaces (GUIs) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents that help humans complete GUI navigation tasks. However, current VLMs fall short in fundamental abilities (OCR and grounding) and in GUI knowledge (the functions and control methods of GUI elements), which prevents them from becoming practical GUI agents. To address these challenges, we contribute GUICourse, a suite of datasets for training visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents perform better on common GUI tasks than their baseline VLMs. Even the small GUI agent (with 3.1B parameters) works well on single-step and multi-step GUI tasks. Finally, we analyze different factors in the training stage of this agent through an ablation study. Our source code and datasets are released at https://github.com/yiye3/GUICourse.

Code Implementations: 1 repo
