CV · AI · Feb 7, 2024

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

arXiv:2402.04615v3 · 113 citations · h-index: 11 · IJCAI
Originality: Incremental advance
AI Analysis

This work addresses the challenge of interpreting visual elements in human-machine interaction, for applications such as UI navigation and infographic analysis. It represents a domain-specific advance.

The authors tackle the problem of understanding user interfaces and infographics by introducing ScreenAI, a 5B-parameter vision-language model that achieves new state-of-the-art results on UI- and infographics-based tasks such as Multi-page DocVQA and WebSRC.

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
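
The "flexible patching strategy of pix2struct" mentioned in the abstract can be illustrated with a small sketch. The idea, as commonly described for pix2struct, is to choose a patch grid that follows the image's aspect ratio while staying within a fixed patch budget, instead of resizing every screenshot to a fixed square. The function name `patch_grid` and the default values for `patch_size` and `max_patches` are assumptions made for illustration; this is not the ScreenAI implementation.

```python
import math


def patch_grid(image_h: int, image_w: int,
               patch_size: int = 16, max_patches: int = 1024) -> tuple[int, int]:
    """Return (rows, cols) of a patch grid that roughly preserves the
    image aspect ratio while keeping rows * cols <= max_patches.

    Illustrative sketch only: the defaults are assumed, not taken from the paper.
    """
    if image_h <= 0 or image_w <= 0:
        raise ValueError("image dimensions must be positive")

    # Scale factor that would make the resized image contain about
    # max_patches patches of size patch_size x patch_size.
    scale = math.sqrt(max_patches * (patch_size / image_h) * (patch_size / image_w))

    # Round down so the budget is never exceeded; keep at least one patch per axis.
    rows = max(min(math.floor(scale * image_h / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * image_w / patch_size), max_patches), 1)
    return rows, cols


if __name__ == "__main__":
    # A portrait phone screenshot gets more rows than columns,
    # a landscape desktop screenshot gets the opposite.
    print("portrait 2400x1080:", patch_grid(2400, 1080))
    print("landscape 1080x1920:", patch_grid(1080, 1920))
```

In this sketch the image would then be resized to `rows * patch_size` by `cols * patch_size` pixels before being cut into patches, so a tall mobile screen and a wide desktop page each keep their own shape within the same patch budget.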

Code Implementations: 2 repos