CV AI CL Aug 16, 2024

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang

Salesforce, Stanford, UW

arXiv:2408.08872v438.1142 citations, h-index: 64, Has Code

Originality: Synthesis-oriented

AI Analysis

BLIP-3 is an incremental open-source contribution for researchers working on multimodal AI, facilitating community development and benchmarking.

The paper introduces BLIP-3, an open framework for developing large multimodal models, and releases 4B and 14B models that achieve competitive performance among open-source LMMs of similar size on both single-image and multi-image benchmarks.

This paper introduces BLIP-3, an open framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. We release 4B and 14B models, including both the pre-trained base model and the instruction fine-tuned ones. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our resulting LMMs demonstrate competitive performance among open-source LMMs with similar model sizes, with the ability to comprehend interleaved image-text inputs. Our training code, models, and all datasets used in this work, including the three large-scale datasets we create and the preprocessed ones, will be open-sourced to better support the research community.
