CV · Nov 30, 2025

AFRAgent: An Adaptive Feature Renormalization Based High Resolution Aware GUI agent

arXiv:2512.00846v1 · 1 citation · h-index: 5
Originality: Incremental advance
AI Analysis

This work advances mobile GUI automation, though the contribution appears incremental because it builds on existing InstructBLIP-based architectures.

The paper addresses two limitations of visual language models in mobile GUI automation: inaccurate widget identification and large model size. It introduces AFRAgent, which is less than one-fourth the size of its nearest competitor and reports a new state of the art on the Meta-GUI and AITW benchmarks.

There is a growing demand for mobile user interface (UI) automation, driven by its broad applications across industries. With the advent of visual language models (VLMs), GUI automation has progressed from generating text-based instructions for humans to autonomously executing tasks, thus optimizing automation workflows. Recent approaches leverage VLMs for this problem due to their ability to 1) process on-screen content directly, 2) remain independent of device-specific APIs by utilizing human actions (e.g., clicks, typing), and 3) apply real-world contextual knowledge for task understanding. However, these models often have trouble accurately identifying widgets and determining actions because vision encoder features carry limited spatial information. Additionally, top-performing models are often large, requiring extensive training and resulting in inference delays. In this work, we introduce AFRAgent, an InstructBLIP-based multimodal architecture that achieves superior performance in GUI automation while being less than one-fourth the size of its nearest competitor. To enhance image embeddings in the large language model (LLM) pipeline, we propose an adaptive feature renormalization technique (a token-level affine transformation) that effectively enriches low-resolution image embeddings and fuses high-resolution details. We evaluate AFRAgent on Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
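The abstract describes adaptive feature renormalization only as "a token-level affine transformation", so the following is a rough sketch of what such an operation might look like, not the authors' implementation. It assumes a FiLM-style design: each target token is normalized, then scaled and shifted by parameters predicted from a conditioning token at the same index. The function names, the use of layer normalization, the `1 + gamma` scaling, and the one-to-one token alignment are all assumptions made for illustration.

```python
def linear(x, weight, bias):
    """Apply y = W x + b to one token vector (plain Python lists)."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def affine_renormalize(target_tokens, cond_tokens,
                       w_scale, b_scale, w_shift, b_shift):
    """Hypothetical token-level affine transform.

    For each (target, conditioning) token pair, predict a scale (gamma)
    and a shift (beta) from the conditioning token, then return
    (1 + gamma) * layer_norm(target) + beta.
    Assumes both token lists have the same length and feature size.
    """
    out = []
    for target, cond in zip(target_tokens, cond_tokens):
        gamma = linear(cond, w_scale, b_scale)  # assumed scale predictor
        beta = linear(cond, w_shift, b_shift)   # assumed shift predictor
        normed = layer_norm(target)
        out.append([(1.0 + g) * n + b
                    for g, n, b in zip(gamma, normed, beta)])
    return out


# Toy usage: two 4-dimensional tokens per stream (made-up values).
low_res = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 1.5, -1.5]]
high_res = [[0.1, 0.2, 0.3, 0.4], [1.0, 0.0, -1.0, 0.0]]
zero_w = [[0.0] * 4 for _ in range(4)]
zero_b = [0.0] * 4

# With zero projections the transform reduces to plain normalization.
enriched = affine_renormalize(low_res, high_res,
                              zero_w, zero_b, zero_w, zero_b)
```

In this sketch the conditioning stream plays the role the abstract gives to high-resolution details and the target stream plays the role of the low-resolution embeddings. A real model would learn the projection weights and would need some scheme for aligning the two token sets, which the abstract does not specify.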

