CV, CL · Jun 3

Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms

Jiashu Yao, Heyan Huang, Daiqing Wu, Wangke Chen, Huaxi Ai, Haoyu Wen, Zeming Liu, Yuhang Guo

arXiv:2606.04701 · 86.3 · Has Code

AI Analysis

For GUI agent researchers, this work highlights a missing capability axis (observation control) in dynamic environments, but it is incremental in that it extends existing agent frameworks to a new domain.

The paper formalizes the task of Living-Screen-Native GUI agents for dynamic interfaces like short-video platforms, introduces the LivingScreen benchmark, and finds that current models fail to match human cost-accuracy performance due to over- and under-observation.

GUI agents today assume a static screen, where the world is frozen between two actions. However, real interfaces such as short-video applications violate this assumption: their content keeps playing, and a competent user must decide what to watch and for how long. We formalize this setting as the task of Living-Screen-Native GUI agents and introduce LivingScreen, the first benchmark instantiating it on short-video platforms, with a faithful browser-based environment, a three-tier task suite, and metrics that jointly score accuracy and information efficiency. Evaluating a wide range of frontier models, we find that none reaches human cost-accuracy performance and that their dominant failure modes are over- and under-observation, pointing to observation control as a missing capability axis for future GUI agents. All data and code will be available at https://github.com/BITHLP/LivingScreen.
