CV CL LG | Oct 24, 2024

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, Zhe Gan

Apple, Georgia Tech

arXiv:2410.18967v2 | 27.552 citations | h-index: 20 | ICLR

Originality: Incremental advance

AI Analysis

This work addresses universal UI understanding for developers and users across platforms such as iPhone, Android, and Web. It is an incremental advance over prior models.

The paper tackles the challenge of building a generalist model for user interface understanding across diverse platforms by introducing Ferret-UI 2, a multimodal large language model that significantly outperforms its predecessor and demonstrates strong cross-platform transfer capabilities on multiple benchmarks.

Building a generalist model for user interface (UI) understanding is challenging due to various foundational issues, such as platform diversity, resolution variation, and data limitation. In this paper, we introduce Ferret-UI 2, a multimodal large language model (MLLM) designed for universal UI understanding across a wide range of platforms, including iPhone, Android, iPad, Webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting. These advancements enable Ferret-UI 2 to perform complex, user-centered interactions, making it highly versatile and adaptable for the expanding diversity of platform ecosystems. Extensive empirical experiments on referring, grounding, user-centric advanced tasks (comprising 9 subtasks $\times$ 5 platforms), GUIDE next-action prediction dataset, and GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI, and also shows strong cross-platform transfer capabilities.
