CL · Feb 17, 2025

Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval

arXiv:2502.11431v1 · 9 citations · h-index: 10 · ACL
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient multimodal information retrieval across diverse applications, representing an incremental advancement by unifying formats into screenshots.

The paper tackles the problem of retrieving multimodal information by proposing Visualized Information Retrieval (Vis-IR), which unifies texts, images, tables, and charts into screenshots, and shows that their UniSE models achieve substantial improvements over existing methods on the MVRB benchmark.

With the growing popularity of multimodal techniques, there is increasing interest in acquiring useful information in visual forms. In this work, we formally define an emerging IR paradigm called Visualized Information Retrieval, or Vis-IR, where multimodal information, such as texts, images, tables, and charts, is jointly represented by a unified visual format called Screenshots, for various retrieval applications. We further make three key contributions to Vis-IR. First, we create VIRA (Vis-IR Aggregation), a large-scale dataset comprising a vast collection of screenshots from diverse sources, carefully curated into captioned and question-answer formats. Second, we develop UniSE (Universal Screenshot Embeddings), a family of retrieval models that enable screenshots to query or be queried across arbitrary data modalities. Finally, we construct MVRB (Massive Visualized IR Benchmark), a comprehensive benchmark covering a variety of task forms and application scenarios. Through extensive evaluations on MVRB, we highlight the deficiencies of existing multimodal retrievers and the substantial improvements made by UniSE. Our work will be shared with the community, laying a solid foundation for this emerging field.
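
The abstract describes embedding models that let screenshots query or be queried across modalities. A minimal sketch of that general idea, ranking candidates by cosine similarity in a shared embedding space, might look like the following. It is not the UniSE implementation or API: the function names, file names, and three-dimensional vectors are hypothetical stand-ins for embeddings that a real model would produce.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; a real system would obtain these from an
# embedding model applied to the text query and to each screenshot.
query_embedding = [0.9, 0.1, 0.0]
screenshot_embeddings = {
    "news_page.png": [0.8, 0.2, 0.1],
    "sales_chart.png": [0.1, 0.9, 0.2],
    "product_table.png": [0.0, 0.3, 0.9],
}

ranking = rank_candidates(query_embedding, screenshot_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the query and the candidates share one vector space in this sketch, the same ranking function would apply whether the query is text, an image, or another screenshot; only the step that produces the embeddings would change.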

Code Implementations: 1 repo