CL · Feb 19, 2024

CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation

arXiv:2402.11941v3 · 24.776 citations · h-index: 34 · Has Code · ACL

Originality: Highly original

AI Analysis

This work addresses the problem of enhancing autonomous agents for smartphone GUI automation. It is an incremental improvement that applies novel methods to known bottlenecks.

The paper tackles the challenge of improving GUI automation performance by proposing CoCo-Agent, which combines comprehensive environment perception (CEP) with conditional action prediction (CAP) and achieves new state-of-the-art results on the AITW and META-GUI benchmarks.

Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments, especially for graphical user interface (GUI) automation. However, those GUI agents require comprehensive cognition ability including exhaustive perception and reliable action response. We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP), to systematically improve the GUI automation performance. First, CEP facilitates the GUI perception through different aspects and granularity, including screenshots and complementary detailed layouts for the visual channel and historical actions for the textual channel. Second, CAP decomposes the action prediction into sub-problems: action type prediction and action target conditioned on the action type. With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios. Code is available at https://github.com/xbmxb/CoCo-Agent.
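To make the two ideas in the abstract more concrete, here is a minimal Python sketch of how a CEP-style prompt and a CAP-style action string might be assembled. It is not the authors' implementation: the names `LayoutItem`, `build_cep_prompt`, `format_cap_action`, and `parse_cap_action`, the action vocabulary, and the `action_type: ...; target: ...` text format are all assumptions made for illustration. The sketch also assumes the layout is already available as text items; the screenshot itself would go to the model's vision encoder and is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical action vocabulary; the abstract does not list the action types.
TYPES_WITH_TARGET = {"click", "scroll", "type"}
TYPES_WITHOUT_TARGET = {"press_back", "press_home", "complete"}


@dataclass
class LayoutItem:
    """One on-screen element (illustrative; assumed to come from a layout parser)."""
    label: str
    x: int
    y: int


def build_cep_prompt(goal: str, layout: List[LayoutItem], history: List[str]) -> str:
    """Combine the goal, the detailed layout, and past actions into one text prompt."""
    layout_lines = [
        f"{i}: {item.label} at ({item.x}, {item.y})" for i, item in enumerate(layout)
    ]
    history_lines = history if history else ["(none)"]
    return "\n".join(
        ["Goal: " + goal, "Screen layout:"]
        + layout_lines
        + ["Previous actions:"]
        + history_lines
    )


def format_cap_action(action_type: str, target: Optional[str] = None) -> str:
    """Emit the action type first, then a target only if that type needs one."""
    if action_type in TYPES_WITH_TARGET:
        if target is None:
            raise ValueError(f"action type {action_type!r} requires a target")
        return f"action_type: {action_type}; target: {target}"
    if action_type in TYPES_WITHOUT_TARGET:
        return f"action_type: {action_type}"
    raise ValueError(f"unknown action type {action_type!r}")


def parse_cap_action(text: str) -> Tuple[str, Optional[str]]:
    """Read the action type first, then read the target conditioned on that type."""
    head, _, rest = text.partition(";")
    action_type = head.split(":", 1)[1].strip()
    if action_type in TYPES_WITH_TARGET:
        return action_type, rest.split(":", 1)[1].strip()
    return action_type, None
```

In this sketch the action type is always decided before any target is read, and target-free actions such as `press_home` never carry a target field. That ordering is the decomposition the abstract describes for CAP, expressed here as a plain string format.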
