AI · Feb 10, 2025

AppVLM: A Lightweight Vision Language Model for Online App Control

arXiv:2502.06395v1 · 15 citations · h-index: 12
Originality: Highly original
AI Analysis

This work addresses the problem of efficient and adaptable app control for smartphone users, providing a practical solution for real-world deployment.

The authors set out to build a lightweight vision-language model for online app control. Their model, AppVLM, achieves the highest action prediction accuracy among the evaluated baselines in offline evaluation on the AndroidControl dataset. In the AndroidWorld environment, it matches GPT-4o's online task completion success rate while being up to ten times faster.

The utilisation of foundation models as smartphone assistants, termed app agents, is a critical research challenge. These agents aim to execute human instructions on smartphones by interpreting textual instructions and performing actions via the device's interface. While promising, current approaches face significant limitations. Methods that use large proprietary models, such as GPT-4o, are computationally expensive, while those that use smaller fine-tuned models often lack adaptability to out-of-distribution tasks. In this work, we introduce AppVLM, a lightweight Vision-Language Model (VLM). First, we fine-tune it offline on the AndroidControl dataset. Then, we refine its policy by collecting data from the AndroidWorld environment and performing further training iterations. Our results indicate that AppVLM achieves the highest action prediction accuracy in offline evaluation on the AndroidControl dataset, compared to all evaluated baselines, and matches GPT-4o in online task completion success rate in the AndroidWorld environment, while being up to ten times faster. This makes AppVLM a practical and efficient solution for real-world deployment.
