HCAI
Aug 5, 2024

AppAgent v2: Advanced Agent for Flexible Mobile Interactions

arXiv:2408.11824v4
68 citations
h-index: 11
Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of automating complex, multi-step operations on mobile applications for users and developers, representing an incremental improvement in agent adaptability.

The paper aims to let LLM-driven visual agents interact flexibly with mobile device interfaces. It introduces a multimodal agent framework that constructs a flexible action space and operates in two phases, exploration and deployment, using retrieval-augmented generation (RAG) to query the knowledge base built during exploration. The authors report superior performance across various benchmarks and present this as confirmation of the framework's effectiveness in real-world scenarios.

With the advancement of Multimodal Large Language Models (MLLMs), LLM-driven visual agents are increasingly impacting software interfaces, particularly those with graphical user interfaces. This work introduces a novel LLM-based multimodal agent framework for mobile devices that navigates the device and emulates human-like interactions. Our agent constructs a flexible action space, drawing on parser, text, and vision descriptions, that enhances adaptability across various applications. The agent operates through two main phases: exploration and deployment. During the exploration phase, the functionalities of user interface elements are documented, through either agent-driven or manual exploration, in a customized structured knowledge base. In the deployment phase, RAG technology enables efficient retrieval from and updates to this knowledge base, empowering the agent to perform tasks effectively and accurately. This includes complex, multi-step operations across various applications, demonstrating the framework's adaptability and precision in handling customized task workflows. Our experimental results across various benchmarks demonstrate the framework's superior performance, confirming its effectiveness in real-world scenarios. Our code will be open-sourced soon.
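The two-phase flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `KnowledgeBase` class, its `record` and `retrieve` methods, the element identifiers, and the bag-of-words cosine scoring are all assumptions made for illustration, standing in for whatever structured store and retriever the framework actually uses.

```python
from collections import Counter
from dataclasses import dataclass, field
import math


def _tokens(text: str) -> Counter:
    """Lowercased whitespace tokens with counts (a stand-in for a real embedding)."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class KnowledgeBase:
    """Hypothetical store mapping a UI element id to a description of what it does."""

    docs: dict[str, str] = field(default_factory=dict)

    def record(self, element_id: str, description: str) -> None:
        # Exploration phase: document an element's function.
        # Calling this again for the same id overwrites the entry,
        # which is how this sketch models an update during deployment.
        self.docs[element_id] = description

    def retrieve(self, task: str, k: int = 1) -> list[tuple[str, str]]:
        # Deployment phase: return the k entries most similar to the task text.
        query = _tokens(task)
        ranked = sorted(
            self.docs.items(),
            key=lambda item: _cosine(query, _tokens(item[1])),
            reverse=True,
        )
        return ranked[:k]


# Exploration: populate the knowledge base (hypothetical element ids).
kb = KnowledgeBase()
kb.record("btn_send", "sends the typed message to the current chat")
kb.record("btn_attach", "opens the file picker to attach a photo")

# Deployment: look up the element most relevant to the current task.
best_id, best_doc = kb.retrieve("attach a photo to the chat")[0]
print(best_id, "->", best_doc)
```

In a real agent the retrieved description would be placed in the model's prompt before it chooses an action; the sketch stops at retrieval because the abstract does not specify the prompting or action-selection details.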

