CV HC | Feb 6, 2025

RWKV-UI: UI Understanding with Enhanced Perception and Reasoning

arXiv:2502.03971v1 | 3.6 | ICME

Originality: Incremental advance

AI Analysis

This work addresses challenges in webpage layout comprehension and multi-step interactive reasoning for applications in UI automation and accessibility, representing an incremental advancement with specific enhancements.

The paper tackles the problem of information loss and limited reasoning in Visual Language Models when processing high-resolution web interfaces, proposing RWKV-UI with layout detection and Chain-of-Thought visual prompts, resulting in significant performance improvements in UI understanding and interactive reasoning tasks.

Existing Visual Language Models often struggle with information loss and limited reasoning abilities when handling high-resolution web interfaces that combine complex visual, textual, and interactive elements. These challenges are particularly evident in tasks requiring webpage layout comprehension and multi-step interactive reasoning. To address them, we propose RWKV-UI, a Visual Language Model based on the RWKV architecture and specifically designed to handle high-resolution UI images. During model training, we introduce layout detection as a visual prompt to help the model better understand webpage layout structures. Additionally, we design a visual prompt based on the Chain-of-Thought (CoT) mechanism, which enhances the model's ability to understand and reason about webpage content through reasoning chains. Experimental results show that RWKV-UI achieves significant performance improvements in high-resolution UI understanding and interactive reasoning tasks.
