CV · AI · CL · Jun 13, 2024

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

arXiv:2406.09411v2 · 154 citations · Has Code
AI Analysis

This work addresses the need for stronger multi-image understanding in multimodal models, a capability that matters for applications such as scene analysis and temporal reasoning. Its contribution is incremental: it provides a new benchmark rather than a new method.

The authors address robust multi-image understanding in multimodal LLMs by introducing MuirBench, a benchmark spanning 12 tasks and 10 categories of multi-image relations. Even the strongest models evaluated struggled: GPT-4o and Gemini Pro reached 68.0% and 49.3% accuracy, respectively, while open-source models trained on single images stayed below 33.3%.

We introduce MuirBench, a comprehensive benchmark that focuses on the robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner: each standard instance is paired with an unanswerable variant that has minimal semantic differences, so that assessment is reliable. We evaluate 20 recent multimodal LLMs and find that even the best-performing models, GPT-4o and Gemini Pro, find MuirBench challenging, achieving 68.0% and 49.3% accuracy, respectively. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, and they suggest potential pathways for future improvements.
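
The pairwise design described in the abstract can be illustrated with a small scoring sketch. This is not the authors' evaluation code: the `Instance` fields, the `pair_id` linkage, and the `paired_accuracy` metric are all hypothetical names introduced here for illustration, and the abstract itself reports only plain accuracy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Instance:
    """One multiple-choice question (hypothetical schema, not the dataset's)."""
    pair_id: str      # assumed key linking a standard item to its unanswerable variant
    answerable: bool  # False for the unanswerable variant
    gold: str         # correct option letter
    pred: str         # option letter chosen by the model


def accuracy(items: List[Instance]) -> Optional[float]:
    """Plain accuracy: share of items whose prediction matches the gold option."""
    if not items:
        return None
    return sum(i.pred == i.gold for i in items) / len(items)


def paired_accuracy(items: List[Instance]) -> Optional[float]:
    """Stricter illustrative metric: a pair counts only if both members are correct."""
    by_pair: Dict[str, List[bool]] = {}
    for i in items:
        by_pair.setdefault(i.pair_id, []).append(i.pred == i.gold)
    # Ignore pairs that are missing one of their two members.
    complete = [flags for flags in by_pair.values() if len(flags) == 2]
    if not complete:
        return None
    return sum(all(flags) for flags in complete) / len(complete)


# Toy predictions (made up): the model keeps answering "B" on q1 even when
# the question has been made unanswerable.
demo = [
    Instance("q1", True, "B", "B"),
    Instance("q1", False, "D", "B"),
    Instance("q2", True, "A", "A"),
    Instance("q2", False, "C", "C"),
]

print("overall:", accuracy(demo))
print("answerable only:", accuracy([i for i in demo if i.answerable]))
print("unanswerable only:", accuracy([i for i in demo if not i.answerable]))
print("paired:", paired_accuracy(demo))
```

Splitting the score by answerable and unanswerable items shows whether a model recognizes when a question cannot be answered or simply picks a plausible-looking option regardless.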
