Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus
This work addresses the problem of mobile UI automation and accessibility for developers and users, offering a scalable solution that overcomes the limitations of view hierarchies, though it is incremental in that it builds on existing vision-language models.
The paper tackles mobile UI understanding by proposing Spotlight, a vision-only approach that uses a vision-language model with a focus region on screenshots, establishing state-of-the-art results on several UI tasks and outperforming previous methods that rely on view hierarchies.
Mobile UI understanding is important for enabling various interaction tasks such as UI automation and accessibility. Previous mobile UI modeling often depends on the view hierarchy information of a screen, which directly provides the structural data of the UI, in the hope of bypassing the challenging task of visual modeling from screen pixels. However, view hierarchies are not always available, and are often corrupted with missing object descriptions or misaligned structure information. As a result, although the use of view hierarchies could offer short-term gains, it may ultimately hinder the applicability and performance of the model. In this paper, we propose Spotlight, a vision-only approach for mobile UI understanding. Specifically, we enhance a vision-language model that takes only the screenshot of the UI and a region of interest on the screen -- the focus -- as the input. This general architecture of Spotlight is easily scalable and capable of performing a range of UI modeling tasks. Our experiments show that our model establishes SoTA results on several representative UI tasks and outperforms previous methods that use both screenshots and view hierarchies as inputs. Furthermore, we explore the multi-task learning and few-shot prompting capacities of the proposed models, demonstrating promising results in the multi-task learning direction.
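
As a rough illustration of the input described above (a screenshot plus a focus region, with no view hierarchy), the following Python sketch shows one way such an input might be packaged. The names `FocusRegion`, `SpotlightStyleInput`, and `task_prompt`, and the choice to normalize the box to the range [0, 1], are assumptions made for illustration; they are not taken from the paper, and the sketch does not reproduce the Spotlight model itself.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FocusRegion:
    """Hypothetical region of interest, in pixel coordinates of the screenshot."""
    left: float
    top: float
    right: float
    bottom: float

    def normalized(self, width: int, height: int) -> Tuple[float, float, float, float]:
        """Scale the box to [0, 1] so it is independent of screen resolution."""
        if width <= 0 or height <= 0:
            raise ValueError("screenshot dimensions must be positive")
        inside_x = 0 <= self.left < self.right <= width
        inside_y = 0 <= self.top < self.bottom <= height
        if not (inside_x and inside_y):
            raise ValueError("focus region must lie inside the screenshot")
        return (
            self.left / width,
            self.top / height,
            self.right / width,
            self.bottom / height,
        )


@dataclass(frozen=True)
class SpotlightStyleInput:
    """Hypothetical vision-only model input: pixels and a focus box, no view hierarchy."""
    screenshot_size: Tuple[int, int]  # (width, height) of the screenshot in pixels
    focus: FocusRegion                # the region of interest on the screen
    task_prompt: str                  # assumed free-text task description

    def focus_features(self) -> Tuple[float, float, float, float]:
        """Return the focus box normalized by the screenshot size."""
        width, height = self.screenshot_size
        return self.focus.normalized(width, height)


example = SpotlightStyleInput(
    screenshot_size=(1024, 2048),
    focus=FocusRegion(left=256, top=512, right=768, bottom=1024),
    task_prompt="describe the focused element",
)
print(example.focus_features())
```

Normalizing the box is one simple way to keep the focus representation consistent across devices with different screen sizes; a real model may encode the region differently.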