CVAIJun 30, 2025

ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding

arXiv:2506.23491v36 citationsh-index: 1Has Code
Originality Incremental advance
AI Analysis

This work addresses GUI grounding for cross-resolution applications, offering a practical solution for developers with limited resources, though it is incremental in improving efficiency and adaptability.

The paper tackles GUI grounding tasks by introducing ZonUI-3B, a lightweight vision-language model that achieves 84.9% accuracy on ScreenSpot and 86.4% on ScreenSpot-v2, comparable to larger models while being trainable on a single consumer GPU.

In this paper, we present ZonUI-3B, a lightweight Vision-Language Model (VLM) that can be fully trained on a single consumer-grade GPU (RTX 4090) while delivering performance comparable to significantly larger models on GUI grounding tasks. The model incorporates several key innovations: (i) combine cross-platform, multi-resolution dataset of 24K examples from diverse sources including mobile, desktop, and web GUI screenshots to effectively address data scarcity in high-resolution desktop environments; (ii) a two-stage fine-tuning strategy, where initial cross-platform training establishes robust GUI understanding, followed by specialized fine-tuning on high-resolution data to significantly enhance model adaptability; and (iii) data curation and redundancy reduction strategies, demonstrating that randomly sampling a smaller subset with reduced redundancy achieves performance comparable to larger datasets, emphasizing data diversity over sheer volume. Empirical evaluation on standard GUI grounding benchmarks, including ScreenSpot, ScreenSpot-v2, and the challenging ScreenSpot-Pro, highlights ZonUI-3B's exceptional accuracy, achieving 84.9% on ScreenSpot and 86.4% on ScreenSpot-v2, surpassing prior models under 4B parameters. Ablation studies validate the critical role of balanced sampling and two-stage fine-tuning in enhancing robustness, particularly in high-resolution desktop scenarios. The ZonUI-3B is available at: https://github.com/Han1018/ZonUI-3B

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes