CV · LG · Dec 31, 2024

Probing Visual Language Priors in VLMs

arXiv:2501.00569v4 · 28 citations · h-index: 9 · Has Code · ICML
AI Analysis

This paper addresses a key limitation of VLMs that matters to AI researchers and developers: it exposes and mitigates their reliance on spurious correlations. The contribution is incremental, however, in that it builds on existing VLM methods.

The paper addresses the tendency of Vision-Language Models (VLMs) to over-rely on text priors instead of visual reasoning. It introduces the ViLP benchmark, built from out-of-distribution images and questions, on which humans achieve near-perfect accuracy but GPT-4 scores only 66.17%. The authors also propose a self-improving framework that generates VQA data and corrupts the images for self-training, boosting the performance of models such as LLaVA-v1.5 and Cambrian.

Despite recent advances in Vision-Language Models (VLMs), they may over-rely on visual language priors present in their training data rather than on true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring deliberately out-of-distribution images synthesized via image generation models and out-of-distribution Q&A pairs. Each question in ViLP is coupled with three potential answers and three corresponding images: one that can be resolved by text priors alone and two that demand visual reasoning. Although humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4 achieves only 66.17% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA data, then apply pixel-level and semantic corruptions to form "good-bad" image pairs for self-training. Our training objectives compel VLMs to focus more on the actual visual inputs, and we demonstrate their effectiveness in boosting the performance of open-source VLMs, including LLaVA-v1.5 and Cambrian.
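
The benchmark layout described in the abstract (one question, three images, one of which is answerable from text priors alone) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the class names `ImageAnswer` and `BenchmarkItem`, the `split_accuracy` helper, the exact-match scoring rule, and the toy banana item are hypothetical and are not taken from the ViLP code or data. The sketch shows why scoring the two kinds of image separately exposes a model that answers from priors.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ImageAnswer:
    image_id: str       # hypothetical identifier for a synthesized image
    answer: str         # ground-truth answer for this image
    needs_vision: bool  # False: text priors suffice; True: the image must be read


@dataclass(frozen=True)
class BenchmarkItem:
    question: str
    variants: List[ImageAnswer]  # three per question in the layout described above


# A "model" here is any callable mapping (question, image_id) to an answer string.
Model = Callable[[str, str], str]


def split_accuracy(model: Model, items: List[BenchmarkItem]) -> Dict[str, float]:
    """Exact-match accuracy, reported separately for the two kinds of image."""
    correct = {"prior": 0, "vision": 0}
    total = {"prior": 0, "vision": 0}
    for item in items:
        for variant in item.variants:
            split = "vision" if variant.needs_vision else "prior"
            total[split] += 1
            prediction = model(item.question, variant.image_id)
            if prediction.strip().lower() == variant.answer.lower():
                correct[split] += 1
    return {s: correct[s] / total[s] for s in total if total[s] > 0}


# Toy item invented for illustration (not taken from ViLP).
items = [
    BenchmarkItem(
        question="What color is the banana?",
        variants=[
            ImageAnswer("banana_yellow", "yellow", needs_vision=False),
            ImageAnswer("banana_blue", "blue", needs_vision=True),
            ImageAnswer("banana_red", "red", needs_vision=True),
        ],
    )
]


def prior_only_model(question: str, image_id: str) -> str:
    """Ignores the image entirely and answers from the text prior."""
    return "yellow"


scores = split_accuracy(prior_only_model, items)
print(scores)  # {'prior': 1.0, 'vision': 0.0}
```

A model that never looks at the image scores perfectly on the prior-answerable image and fails on the two that require visual reasoning, which is the failure mode this kind of benchmark is designed to surface.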

