CV | AI | May 3, 2024

What matters when building vision-language models?

arXiv:2405.02246v1 | 354 citations | h-index: 13 | NIPS
Originality: Incremental advance
AI Analysis

This work addresses unjustified design decisions in vision-language models, which impede progress for researchers and developers in the field. The advance is incremental, since it builds on existing methods.

The paper tackles the problem of unclear design choices in vision-language models by conducting extensive experiments on pre-trained models, architecture, data, and training methods, resulting in Idefics2, an 8-billion-parameter model that achieves state-of-the-art performance in its size category and matches models four times larger on multimodal benchmarks.

The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.
