Android in the Zoo: Chain-of-Action-Thought for GUI Agents
This paper addresses the challenge of improving GUI agent efficiency for smartphone automation, though the contribution appears incremental because it builds on existing context modeling approaches.
The paper introduces Chain-of-Action-Thought (CoAT) for smartphone GUI agents, which incorporates action thinking and action outcomes into context modeling. CoAT significantly improves action prediction in zero-shot settings, and fine-tuning on the new AitZ dataset enables a 1B model to match the performance of an 18B model.
Large language models (LLMs) have led to a surge of autonomous GUI agents for smartphones, which complete a task triggered by natural language by predicting a sequence of API actions. Even though the task relies heavily on past actions and visual observations, existing studies typically consider little of the semantic information carried by intermediate screenshots and screen operations. To address this, this work presents Chain-of-Action-Thought (dubbed CoAT), which takes the description of the previous actions, the current screen, and, more importantly, the action thinking about what actions should be performed and the outcomes led to by the chosen action. We demonstrate that, in a zero-shot setting with three off-the-shelf LMMs, CoAT significantly improves action prediction compared to previously proposed context modeling. To further facilitate research in this line, we construct a dataset, Android-In-The-Zoo (AitZ), which contains 18,643 screen-action pairs together with chain-of-action-thought annotations. Experiments show that fine-tuning a 1B model (i.e., AUTO-UI-base) on our AitZ dataset achieves on-par performance with CogAgent-Chat-18B.
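To make the context composition described in the abstract more concrete, the following is a minimal sketch of how previous actions, action thinking, action outcomes, and the current screen might be assembled into a single text context. It is an illustration only: the class `CoATStep`, the function `build_coat_context`, all field names, and the prompt wording are hypothetical and are not taken from the paper's prompt template or the AitZ annotation schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CoATStep:
    """One past step of an episode. Field names are hypothetical."""
    action_think: str   # reasoning about which action should be performed
    action: str         # textual description of the chosen action
    action_result: str  # outcome observed after the action was performed


def build_coat_context(goal: str, history: List[CoATStep], current_screen: str) -> str:
    """Join the goal, the past steps, and the current screen into one text context."""
    lines = [f"Goal: {goal}"]
    for i, step in enumerate(history, start=1):
        lines.append(f"Step {i} thought: {step.action_think}")
        lines.append(f"Step {i} action: {step.action}")
        lines.append(f"Step {i} result: {step.action_result}")
    lines.append(f"Current screen: {current_screen}")
    # Assumed instruction wording; a real prompt would be model-specific.
    lines.append("Think about the next action, then predict it and its expected outcome.")
    return "\n".join(lines)


# Example usage with made-up episode data.
history = [
    CoATStep(
        action_think="The Clock app must be opened before an alarm can be set.",
        action="tap the Clock app icon",
        action_result="The Clock app opens on the Alarm tab.",
    )
]
prompt = build_coat_context(
    goal="Set an alarm for 7 am",
    history=history,
    current_screen="Clock app, Alarm tab, with an add-alarm button at the bottom.",
)
print(prompt)
```

Keeping each step's thought, action, and result as separate fields means a simpler context (for example, previous actions only) can be produced by dropping lines from the loop, which is one plausible way to compare context modeling variants in a zero-shot setting.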