CV, CL | Apr 14, 2025

MIEB: Massive Image Embedding Benchmark

arXiv:2504.10471v1 | 9 citations | h-index: 48 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for a comprehensive benchmark for image and image-text embedding models. The contribution is incremental: it builds on existing evaluation frameworks by expanding their scope and task coverage.

The authors address the fragmented evaluation of image embedding models by introducing the Massive Image Embedding Benchmark (MIEB), which spans 130 tasks across 38 languages. They find that no single model dominates across all task categories, and the benchmark reveals both hidden capabilities and limitations of advanced vision models.

Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models, such as their accurate visual representation of texts, as well as their still-limited capabilities in interleaved encodings and in matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.

Code Implementations: 1 repo