AI · Mar 21, 2025

Does Chain-of-Thought Reasoning Help Mobile GUI Agent? An Empirical Study

arXiv:2503.16788v118.111 citations · h-index: 7 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the unclear impact of reasoning capabilities on real-world applications such as mobile GUI agents and offers insights for future improvements, though its mixed results make the contribution incremental.

This paper empirically studies whether chain-of-thought reasoning improves mobile GUI agents, finding that a reasoning-enhanced model achieves state-of-the-art performance in an interactive environment, while reasoning models offer only marginal gains on static benchmarks and even degrade performance in some agent setups.

Reasoning capabilities have significantly improved the performance of vision-language models (VLMs) in domains such as mathematical problem-solving, coding, and visual question-answering. However, their impact on real-world applications remains unclear. This paper presents the first empirical study on the effectiveness of reasoning-enabled VLMs in mobile GUI agents, a domain that requires interpreting complex screen layouts, understanding user instructions, and executing multi-turn interactions. We evaluate two pairs of commercial models, Gemini 2.0 Flash and Claude 3.7 Sonnet, comparing their base and reasoning-enhanced versions across two static benchmarks (ScreenSpot and AndroidControl) and one interactive environment (AndroidWorld). Surprisingly, we find that the Claude 3.7 Sonnet reasoning model achieves state-of-the-art performance on AndroidWorld. However, reasoning VLMs generally offer only marginal improvements over non-reasoning models on static benchmarks and even degrade performance in some agent setups. Notably, reasoning and non-reasoning VLMs fail on different sets of tasks, suggesting that reasoning does have an impact, but its benefits and drawbacks counterbalance each other. We attribute these inconsistencies to the limitations of benchmarks and VLMs. Based on the findings, we provide insights for further enhancing mobile GUI agents in terms of benchmarks, VLMs, and their adaptability in dynamically invoking reasoning VLMs. The experimental data are publicly available at https://github.com/LlamaTouch/VLM-Reasoning-Traces.
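
The observation that reasoning and non-reasoning models fail on different tasks can be illustrated with a small sketch. It is not the authors' analysis code: the function name `compare_failures`, the result format (task id mapped to a success flag), and the task outcomes are all assumptions made up for illustration. It shows how two models with the same success rate can still fail on different tasks.

```python
def compare_failures(base_results, reasoning_results):
    """Split failures into shared and model-specific sets.

    Each argument maps a task id to True (success) or False (failure).
    This input format is a hypothetical choice, not the paper's.
    """
    # Only compare tasks that both models were evaluated on.
    shared = base_results.keys() & reasoning_results.keys()
    base_fail = {t for t in shared if not base_results[t]}
    reasoning_fail = {t for t in shared if not reasoning_results[t]}
    return {
        "both_fail": base_fail & reasoning_fail,
        "only_base_fails": base_fail - reasoning_fail,
        "only_reasoning_fails": reasoning_fail - base_fail,
    }


# Made-up outcomes for illustration only; these are not results from the paper.
base = {"task_a": True, "task_b": False, "task_c": True, "task_d": False}
reasoning = {"task_a": False, "task_b": True, "task_c": True, "task_d": False}

summary = compare_failures(base, reasoning)
for name, tasks in summary.items():
    print(name, sorted(tasks))
```

In this toy case both models succeed on two of four tasks, yet each fails one task the other solves. An aggregate score alone would hide that difference.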
