CV | AI | LG | Dec 19, 2024

PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation

arXiv:2412.15209v1 | 14 citations | h-index: 18
Originality: Highly original
AI Analysis

This work addresses detailed visual reasoning across multiple images for applications that require pixel-level analysis. It represents a novel integration of multi-image reasoning and pixel-level grounding rather than an incremental improvement.

The paper tackles a limitation of existing vision-language models: they cannot perform fine-grained comparisons across multiple images with pixel-level grounding. It introduces PRIMA, a model that integrates multi-image reasoning with pixel grounding, reduces TFLOPs by 25.3%, and outperforms state-of-the-art baselines on a new benchmark.

Despite significant advancements in Large Vision-Language Models (LVLMs), existing pixel-grounding models operate on single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation, and PRIMA, a novel LVLM that integrates pixel-level grounding with robust multi-image reasoning capabilities to produce contextually rich, pixel-grounded explanations. Central to PRIMA is an efficient vision module that queries fine-grained visual representations across multiple images, reducing TFLOPs by $25.3\%$. To support training and evaluation, we curate $M^4Seg$, a new reasoning segmentation benchmark consisting of $\sim$224K question-answer pairs that require fine-grained visual understanding across multiple images. Experimental results demonstrate PRIMA outperforms state-of-the-art baselines.

