CV, IR | Jun 2, 2025

Entity Image and Mixed-Modal Image Retrieval Datasets

arXiv:2506.02291v1 | h-index: 28 | Has Code
Originality: Synthesis-oriented
AI Analysis

The paper gives multimodal-learning researchers a new benchmark for evaluating image retrieval models that require cross-modal understanding. The contribution is incremental, however, because it builds on existing datasets such as WIT.

The paper tackles the lack of challenging benchmarks for mixed-modal image retrieval by introducing two new datasets, the Entity Image Dataset and the Mixed-Modal Image Retrieval Dataset, which are validated through human annotations and serve as both training and evaluation resources.

Despite advances in multimodal learning, challenging benchmarks for mixed-modal image retrieval that combines visual and textual information are lacking. This paper introduces a novel benchmark to rigorously evaluate image retrieval that demands deep cross-modal contextual understanding. We present two new datasets: the Entity Image Dataset (EI), providing canonical images for Wikipedia entities, and the Mixed-Modal Image Retrieval Dataset (MMIR), derived from the WIT dataset. The MMIR benchmark features two challenging query types requiring models to ground textual descriptions in the context of provided visual entities: single entity-image queries (one entity image with descriptive text) and multi-entity-image queries (multiple entity images with relational text). We empirically validate the benchmark's utility as both a training corpus and an evaluation set for mixed-modal retrieval. The quality of both datasets is further affirmed through crowd-sourced human annotations. The datasets are accessible through the GitHub page: https://github.com/google-research-datasets/wit-retrieval.
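The two query types described above can be pictured as one record shape that holds either a single entity image or several. The sketch below is illustrative only: the class `MixedModalQuery`, its field names, and the image ids are hypothetical and are not taken from the released dataset schema. Recall@k is used as a common retrieval metric, assumed here rather than confirmed as the paper's evaluation protocol.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class MixedModalQuery:
    """Hypothetical query record; field names are illustrative, not the MMIR schema."""
    entity_image_ids: List[str]  # one id = single-entity query, several = multi-entity
    text: str                    # descriptive (single) or relational (multi) text
    target_image_id: str         # the image a model is expected to retrieve

    @property
    def is_multi_entity(self) -> bool:
        return len(self.entity_image_ids) > 1


def recall_at_k(queries: Sequence[MixedModalQuery],
                rankings: Sequence[Sequence[str]],
                k: int) -> float:
    """Fraction of queries whose target id appears among the top-k ranked ids."""
    if len(queries) != len(rankings):
        raise ValueError("queries and rankings must have the same length")
    if not queries:
        return 0.0
    hits = sum(q.target_image_id in list(r[:k]) for q, r in zip(queries, rankings))
    return hits / len(queries)


# Made-up example data: one query of each type and a ranked result list per query.
single = MixedModalQuery(["ent_a"], "standing on a bridge at night", "img_1")
multi = MixedModalQuery(["ent_a", "ent_b"], "shaking hands at a summit", "img_2")
rankings = [["img_1", "img_9"], ["img_7", "img_2"]]

print(recall_at_k([single, multi], rankings, k=1))
print(recall_at_k([single, multi], rankings, k=2))
```

Keeping both query types in one record, distinguished only by the number of entity images, lets a single evaluation loop cover the whole benchmark. Splitting results by `is_multi_entity` would give per-type scores.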
