CV · AI · CL · Aug 16, 2024

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Salesforce · Stanford · UW
arXiv:2408.08872v4 · 142 citations · h-index: 64 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work provides an incremental open-source tool for researchers working on multimodal AI, facilitating community development and benchmarking.

The paper introduces BLIP-3, an open framework for developing large multimodal models (LMMs), and releases 4B and 14B models that achieve competitive performance among open-source LMMs of similar size on both single-image and multi-image benchmarks.

This paper introduces BLIP-3, an open framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. We release 4B and 14B models, including both the pre-trained base model and the instruction fine-tuned ones. Our models undergo rigorous evaluation across a range of tasks, including both single-image and multi-image benchmarks. Our resulting LMMs demonstrate competitive performance among open-source LMMs with similar model sizes, with the ability to comprehend interleaved image-text inputs. Our training code, models, and all datasets used in this work, including the three large-scale datasets we create and the preprocessed ones, will be open-sourced to better support the research community.

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
