CV AI CL LG
Feb 12, 2024

Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, Dorsa Sadigh

Stanford

arXiv:2402.07865v244.9352 citations h-index: 66 Has Code ICML

Originality: Incremental advance

AI Analysis

This work addresses inconsistent evaluations and under-explored design decisions in VLMs, both of which matter for applications like visual dialogue and robotic task planning, though its contribution is an incremental improvement over existing models.

The paper tackles the lack of standardized evaluation and the limited understanding of design factors in visually-conditioned language models (VLMs) by compiling a suite of evaluations and investigating key design axes. The result is a family of VLMs at the 7-13B scale that outperform state-of-the-art open models such as InstructBLIP and LLaVa v1.5.

Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning; adoption that has fueled a wealth of new models such as LLaVa, InstructBLIP, and PaLI-3. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored, making it challenging to understand what factors account for model performance, a challenge further complicated by the lack of objective, consistent evaluations. To address these gaps, we first compile a suite of standardized evaluations spanning visual question answering, object localization, and challenge sets that probe properties such as hallucination; evaluations that provide fine-grained insight into VLM capabilities. Second, we rigorously investigate VLMs along key design axes, including pretrained visual representations and training from base vs. instruct-tuned language models, amongst others. We couple our analysis with three resource contributions: (1) a unified framework for evaluating VLMs, (2) optimized, flexible training code, and (3) checkpoints for all models, including a family of VLMs at the 7-13B scale that strictly outperform InstructBLIP and LLaVa v1.5, the state-of-the-art in open VLMs.

