CV, AI | Nov 13, 2023

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

Microsoft
arXiv:2311.07562v1 | 157 citations | h-index: 52 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the problem of automating smartphone interactions for users and developers. It is rated incremental because it applies an existing large multimodal model to a new task.

The researchers tackled smartphone GUI navigation by developing MM-Navigator, a GPT-4V-based agent. In human assessments on iOS, it achieved 91% accuracy in generating reasonable action descriptions and 75% accuracy in executing correct actions for single-step instructions. On a subset of an Android screen navigation dataset, it outperformed previous GUI navigators in a zero-shot setting.

We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as human users do and determine subsequent actions to fulfill given instructions. Our findings demonstrate that large multimodal models (LMMs), specifically GPT-4V, excel in zero-shot GUI navigation through their advanced screen interpretation, action reasoning, and precise action localization capabilities. We first benchmark MM-Navigator on our collected iOS screen dataset. According to human assessments, the system exhibited a 91% accuracy rate in generating reasonable action descriptions and a 75% accuracy rate in executing the correct actions for single-step instructions on iOS. Additionally, we evaluate the model on a subset of an Android screen navigation dataset, where the model outperforms previous GUI navigators in a zero-shot fashion. Our benchmark and detailed analyses aim to lay a robust groundwork for future research into the GUI navigation task. The project page is at https://github.com/zzxslp/MM-Navigator.
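
The abstract describes an agent that looks at the current screen and decides the next action until the instruction is fulfilled. A minimal sketch of such a loop is shown below. It is not the authors' implementation: the `Action` fields, the `navigate` function, and the `get_screenshot`, `predict`, and `execute` callables are all hypothetical names introduced for illustration, and `predict` stands in for whatever call is made to a multimodal model such as GPT-4V.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """Hypothetical action record; the field names are assumptions."""
    description: str    # natural-language description of the intended action
    x: float            # assumed normalized horizontal tap location in [0, 1]
    y: float            # assumed normalized vertical tap location in [0, 1]
    done: bool = False  # True when the model judges the instruction fulfilled


def navigate(
    instruction: str,
    get_screenshot: Callable[[], bytes],
    predict: Callable[[bytes, str, List[str]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Run a screenshot -> predict -> execute loop until done or max_steps.

    `predict` is a placeholder for a zero-shot multimodal model call that
    receives the screen image, the instruction, and the action history.
    """
    history: List[str] = []
    actions: List[Action] = []
    for _ in range(max_steps):
        screen = get_screenshot()
        action = predict(screen, instruction, history)
        actions.append(action)
        if action.done:
            break  # stop without executing the terminating action
        execute(action)
        history.append(action.description)
    return actions
```

Keeping the model call behind a plain callable means the loop can be exercised with a stub predictor before any real model or device is attached.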

Code Implementations: 2 repos