CV | Feb 6, 2025

PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?

arXiv:2502.04192v3 | 10.23 citations | h-index: 18 | Has Code

Originality: Incremental advance

AI Analysis

This work examines the limitations of current pixel-level multi-modal large language models and provides benchmarks and analysis for researchers in computer vision and multi-modal AI.

The paper shows that pixel-level multi-modal large language models trained with segmentation supervision perform poorly on visual question answering (VQA), and that models without such supervision can outperform state-of-the-art methods on two new challenging benchmarks for VQA and grounding.

Multiple works have emerged to push the boundaries of multi-modal large language models (MLLMs) towards pixel-level understanding. The current trend in pixel-level MLLMs is to train with pixel-level grounding supervision on large-scale labelled data, using specialized decoders for the segmentation task. However, we show that such MLLMs, when evaluated on recent challenging vision-centric benchmarks, exhibit a weak ability in visual question answering (VQA). Surprisingly, some of these methods even downgrade the grounding ability of MLLMs that were never trained with such pixel-level supervision. In this work, we propose two novel challenging benchmarks with paired evaluation for both VQA and grounding. We show that MLLMs without pixel-level grounding supervision can outperform the state of the art in such tasks. Our paired benchmarks and evaluation enable additional analysis of the reasons for failure with respect to VQA and/or grounding. Furthermore, we propose simple baselines, which we call PixFoundation, that extract the grounding information and can be plugged into any MLLM. More importantly, we study the research question "When does grounding emerge in MLLMs that are not trained with pixel-level grounding supervision?" We show that grounding can coincide with object parts or with the object's location, appearance, context or state, and that 27-45% of the examples in both benchmarks exhibit this phenomenon. Our code and datasets will be made publicly available, and some are in the supplemental.
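
The paired evaluation described in the abstract, where each example is scored on both VQA and grounding so that failures can be attributed to one or the other, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the representation of masks as sets of pixel coordinates, the 0.5 IoU threshold, and the four outcome labels are all assumptions made for illustration.

```python
from collections import Counter


def mask_iou(pred, gt):
    """IoU of two binary masks given as sets of (row, col) pixels.

    Assumption: two empty masks count as a perfect match (IoU = 1.0).
    """
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0


def paired_outcome(answer_correct, iou, iou_threshold=0.5):
    """Classify one example by its joint VQA / grounding result.

    The threshold and the label names are hypothetical, not from the paper.
    """
    grounded = iou >= iou_threshold
    if answer_correct and grounded:
        return "both_correct"
    if answer_correct:
        return "grounding_failure"
    if grounded:
        return "vqa_failure"
    return "both_failed"


def summarize(examples, iou_threshold=0.5):
    """Return the fraction of examples that fall into each outcome."""
    counts = Counter(
        paired_outcome(
            ex["answer_correct"],
            mask_iou(ex["pred_mask"], ex["gt_mask"]),
            iou_threshold,
        )
        for ex in examples
    )
    total = len(examples)
    return {label: n / total for label, n in counts.items()}


# Toy data: two examples with hand-written masks.
examples = [
    {"answer_correct": True, "pred_mask": {(0, 0), (0, 1)}, "gt_mask": {(0, 0), (0, 1)}},
    {"answer_correct": True, "pred_mask": {(5, 5)}, "gt_mask": {(0, 0), (0, 1)}},
]
print(summarize(examples))
```

Splitting outcomes this way is what allows a failure to be traced to the answer, the mask, or both, rather than reporting two unrelated aggregate scores.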

