CV | CL | Oct 17, 2024

Harnessing Webpage UIs for Text-Rich Visual Understanding

arXiv:2410.13824v3 | 25 citations | h-index: 22 | ICLR
Originality: Incremental advance
AI Analysis

This work addresses the challenge of processing dense textual content integrated with visuals for multimodal models, with incremental improvements in web UI tasks and generalization to non-UI domains.

The authors tackled the problem of text-rich visual understanding in multimodal large language models by synthesizing multimodal instructions from webpage UIs using text-based LLMs, resulting in up to a 48% improvement on VisualWebBench and a 19.1% boost in element accuracy on Mind2Web.

Text-rich visual understanding, the ability to process environments where dense textual content is integrated with visuals, is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Despite lacking direct visual input, text-based LLMs are able to process structured text representations from webpage accessibility trees. These instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on MultiUI not only excel in web UI tasks (achieving up to a 48% improvement on VisualWebBench and a 19.1% boost in element accuracy on the web agent dataset Mind2Web) but also generalize surprisingly well to non-web UI tasks and even to non-UI domains, such as document understanding, OCR, and chart interpretation. These results highlight the broad applicability of web UI data for advancing text-rich visual understanding across various scenarios.
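
The synthesis step described in the abstract (a text-only LLM reads an accessibility tree, and the resulting instruction is paired with the page screenshot) can be illustrated with a minimal sketch. This is not the authors' pipeline: the node layout (`role`, `name`, `children`), the `Sample` fields, and `stub_text_llm` are all hypothetical names chosen for illustration, and the stub stands in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


def flatten_ax_tree(node: dict, depth: int = 0) -> str:
    """Render a (hypothetical) accessibility-tree dict as indented text lines."""
    line = f"{'  ' * depth}[{node['role']}] {node.get('name', '')}".rstrip()
    child_lines = [flatten_ax_tree(c, depth + 1) for c in node.get("children", [])]
    return "\n".join([line, *child_lines])


@dataclass
class Sample:
    """One training example: a screenshot paired with a text instruction."""
    screenshot_path: str
    instruction: str
    response: str


def stub_text_llm(tree_text: str) -> Tuple[str, str]:
    """Placeholder for a text-only LLM: asks about the first heading it finds."""
    heading = next(
        (ln.partition("] ")[2] for ln in tree_text.splitlines() if "[heading]" in ln),
        "",
    )
    return "What is the main heading on this page?", heading


def synthesize_sample(
    tree: dict,
    screenshot_path: str,
    text_llm: Callable[[str], Tuple[str, str]],
) -> Sample:
    """The LLM sees only the tree text; the screenshot is attached afterwards."""
    instruction, response = text_llm(flatten_ax_tree(tree))
    return Sample(screenshot_path, instruction, response)


# Toy page, assumed for illustration only.
tree = {
    "role": "WebArea",
    "name": "Shop",
    "children": [
        {"role": "heading", "name": "Sale"},
        {"role": "button", "name": "Buy now"},
    ],
}
sample = synthesize_sample(tree, "shop.png", stub_text_llm)
```

The point of the sketch is the separation of inputs: the instruction is generated from text alone, while the multimodal model trained on `sample` would receive the screenshot rather than the tree.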

