CV · HC · Jun 20, 2024

E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion

arXiv:2406.14250v3 · 8 citations
Originality: Synthesis-oriented
AI Analysis

This paper provides a new benchmark for improving GUI navigation and decision-making in MLLMs, but the contribution is incremental: it focuses on dataset creation rather than novel methods.

The authors tackled the lack of high-quality data for multimodal large language models (MLLMs) in GUI navigation by developing E-ANT, a large-scale Chinese dataset with nearly 40,000 human traces and 5,000+ apps, and evaluated various MLLMs on it.

Online GUI navigation on mobile devices has attracted considerable attention in recent years because it supports many real-world applications. With the rapid development of large language models (LLMs), multimodal large language models (MLLMs) show tremendous potential for this task. However, existing MLLMs need high-quality data to improve their ability to make correct navigation decisions from human user inputs. In this paper, we develop a novel and highly valuable dataset, named E-ANT, the first Chinese GUI navigation dataset that contains real human behaviour and high-quality annotated screenshots, with nearly 40,000 real human traces over 5,000+ different tinyAPPs. Furthermore, we evaluate various powerful MLLMs on E-ANT and report their experimental results with ablation studies. We believe that our proposed dataset will be beneficial for both the evaluation and development of GUI navigation and LLM/MLLM decision-making capabilities.

