Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
This work provides the AI community with high-performing, fully open vision-language models and datasets. It addresses a critical gap: the lack of foundational knowledge about how to build such models from scratch without relying on data generated by proprietary models.
The authors address the lack of fully open vision-language models by introducing Molmo, a family of models that is state-of-the-art among open-weight, open-data models. The best model, at 72B parameters, also outperforms larger proprietary models such as Claude 3.5 Sonnet and Gemini 1.5 Pro, ranking second only to GPT-4o on both academic benchmarks and human evaluation.
Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open-weight and open-data models, but also outperforms larger proprietary models including Claude 3.5 Sonnet and Gemini 1.5 Pro and Flash, second only to GPT-4o on both academic benchmarks and a large human evaluation. Our model weights, new datasets, and source code are available at https://molmo.allenai.org/blog.