CV · CL · Jun 13, 2024

ReMI: A Dataset for Reasoning with Multiple Images

arXiv:2406.09175v1 · 29 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for new benchmarks that assess emerging multi-image reasoning capabilities in LLMs. Its contribution is incremental, since it is limited to dataset creation and benchmarking.

The authors address the problem of evaluating large language models' ability to reason with multiple images by introducing the ReMI dataset. Benchmarking several models on it reveals a substantial gap between their performance and human-level proficiency.

With the continuous advancement of large language models (LLMs), it is essential to create new benchmarks to effectively evaluate their expanding capabilities and identify areas for improvement. This work focuses on multi-image reasoning, an emerging capability in state-of-the-art LLMs. We introduce ReMI, a dataset designed to assess LLMs' ability to Reason with Multiple Images. This dataset encompasses a diverse range of tasks, spanning various reasoning domains such as math, physics, logic, code, table/chart understanding, and spatial and temporal reasoning. It also covers a broad spectrum of characteristics found in multi-image reasoning scenarios. We have benchmarked several cutting-edge LLMs using ReMI and found a substantial gap between their performance and human-level proficiency. This highlights the challenges in multi-image reasoning and the need for further research. Our analysis also reveals the strengths and weaknesses of different models, shedding light on the types of reasoning that are currently attainable and areas where future models require improvement. To foster further research in this area, we are releasing ReMI publicly: https://huggingface.co/datasets/mehrankazemi/ReMI.

