CV · AI · CL · LG · Feb 12, 2024

Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

Stanford
arXiv:2402.07865v2 · 332 citations · h-index: 66 · ICML
Originality: Incremental advance
AI Analysis

This work addresses inconsistent evaluations and under-explored design decisions in VLMs, both of which matter for applications like visual dialogue and robotic task planning, though its contribution is an incremental improvement over existing models.

The paper tackled the lack of standardized evaluation and understanding of design factors in visually-conditioned language models (VLMs) by compiling a suite of evaluations and investigating key design axes, resulting in a family of VLMs that outperform state-of-the-art open models like InstructBLIP and LLaVa v1.5 at the 7-13B scale.

Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning; adoption that has fueled a wealth of new models such as LLaVa, InstructBLIP, and PaLI-3. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored, making it challenging to understand what factors account for model performance, a challenge further complicated by the lack of objective, consistent evaluations. To address these gaps, we first compile a suite of standardized evaluations spanning visual question answering, object localization, and challenge sets that probe properties such as hallucination; evaluations that provide fine-grained insight into VLM capabilities. Second, we rigorously investigate VLMs along key design axes, including pretrained visual representations and training from base vs. instruct-tuned language models, amongst others. We couple our analysis with three resource contributions: (1) a unified framework for evaluating VLMs, (2) optimized, flexible training code, and (3) checkpoints for all models, including a family of VLMs at the 7-13B scale that strictly outperform InstructBLIP and LLaVa v1.5, the state-of-the-art in open VLMs.

Code Implementations: 3 repos
