CV | Dec 14, 2023

CogAgent: A Visual Language Model for GUI Agents

Tsinghua
arXiv:2312.08914v3 | 752 citations | h-index: 36 | Has Code | CVPR
Originality: Incremental advance
AI Analysis

The paper addresses the problem of automating GUI interactions for users of digital devices, a domain-specific advance.

The paper introduces CogAgent, an 18-billion-parameter visual language model designed to understand and navigate graphical user interfaces (GUIs). It achieves state-of-the-art results on nine VQA benchmarks and outperforms LLM-based methods on the GUI navigation benchmarks Mind2Web and AITW.

People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. CogAgent, using only screenshots as input, outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks (Mind2Web and AITW), advancing the state of the art. The model and code are available at https://github.com/THUDM/CogVLM, with a new version, CogAgent-9B-20241220, available at https://github.com/THUDM/CogAgent.

Code Implementations: 3 repos