CV · Jan 7, 2025

SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning

arXiv:2501.03675v2 · 3 citations · h-index: 7 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the lack of open-source datasets and benchmarks for multi-image reasoning. The advance is incremental, but it fills a specific gap in the field.

The paper tackles multi-image reasoning in vision-language models with two contributions: SMiR, a synthetic data-generation pipeline used to produce 160K training samples, and SMiR-Bench, a benchmark of 200 examples. Fine-tuning open-source models on the synthetic data is reported to improve performance on complex reasoning tasks.

Vision-Language Models (VLMs) excel at understanding single images, aided by high-quality instruction datasets. However, multi-image reasoning remains underexplored in the open-source community due to two key challenges: (1) scaling datasets with correlated images and complex reasoning instructions is resource-intensive, and (2) robust evaluation benchmarks for multi-image tasks are lacking. To address this, we introduce SMiR, a synthetic data-generation pipeline for multi-image reasoning, along with a high-quality dataset generated using this pipeline. SMiR efficiently extracts correlated images via multimodal embeddings, integrates visual and descriptive information, and leverages open-source LLMs to generate quality instructions. Using this approach, we produce 160K synthetic training samples, offering a cost-effective alternative to closed-source solutions. Additionally, we present SMiR-Bench, a multi-image reasoning benchmark comprising 200 diverse examples across seven complex reasoning tasks. SMiR-Bench is multi-turn and employs a VLM judge to evaluate free-form responses, providing a comprehensive assessment of model expressiveness and reasoning capability across modalities. We demonstrate the effectiveness of SMiR by fine-tuning open-source VLMs and evaluating them on SMiR-Bench.
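The abstract says SMiR "extracts correlated images via multimodal embeddings" but does not specify here which embedding model or grouping rule is used. As a rough illustration only, the sketch below shows one way such a step could look: a greedy grouping of images by cosine similarity. The function names, the `group_size` and `min_sim` parameters, and the toy 2-D vectors are all hypothetical and are not taken from the paper or its code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_correlated(embeddings, group_size=3, min_sim=0.8):
    """Greedily group image ids whose embeddings are similar to a seed image.

    Hypothetical helper, not the SMiR algorithm: each unused image in turn
    acts as a seed and collects up to group_size - 1 of its most similar
    unused neighbours with similarity >= min_sim. Seeds with no neighbour
    above the threshold are dropped.
    """
    unused = set(embeddings)
    groups = []
    for seed in sorted(embeddings):  # sorted for deterministic output
        if seed not in unused:
            continue
        unused.discard(seed)
        ranked = sorted(
            ((cosine(embeddings[seed], embeddings[other]), other) for other in unused),
            reverse=True,
        )
        partners = [other for sim, other in ranked if sim >= min_sim]
        members = [seed] + partners[: group_size - 1]
        if len(members) >= 2:
            groups.append(members)
            unused.difference_update(members)
    return groups


# Toy embeddings standing in for real multimodal image embeddings (assumption).
toy = {
    "cat_1": [1.0, 0.0],
    "cat_2": [0.9, 0.1],
    "dog_1": [0.0, 1.0],
    "dog_2": [0.1, 0.9],
    "car_1": [-1.0, 0.0],
}
groups = group_correlated(toy, group_size=2, min_sim=0.8)
print(groups)
```

In a pipeline of the kind the abstract describes, each resulting group would then be paired with image descriptions and handed to an LLM to write a multi-image instruction. That step is omitted here because the abstract gives no further detail.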

Code Implementations: 1 repo
