CV, AI | Nov 12, 2024

BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

Anas Awadalla, Le Xue, Manli Shu, An Yan, Jun Wang, Senthil Purushwalkam, Sheng Shen, Hannah Lee, Oscar Lo, Jae Sung Park, Etash Guha, Silvio Savarese

arXiv:2411.07461v1 | 13.512 citations | h-index: 27 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for more knowledgeable multimodal models by providing a large-scale dataset. The contribution is incremental, since it builds on existing data-augmentation methods.

The authors address the problem of generating factually grounded image captions by creating BLIP3-KALE, a dataset of 218 million image-text pairs that combines synthetic dense captions with web-scale alt-text. Models trained on it show improved performance on vision-language tasks.

We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text. KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions. Our two-stage approach leverages large vision-language models and language models to create knowledge-augmented captions, which are then used to train a specialized VLM for scaling up the dataset. We train vision-language models on KALE and demonstrate improvements on vision-language tasks. Our experiments show the utility of KALE for training more capable and knowledgeable multimodal models. We release the KALE dataset at https://huggingface.co/datasets/Salesforce/blip3-kale
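
One way to read the two-stage approach described above is sketched below. This is only an illustration of the data flow, not the authors' implementation: the names `Sample`, `stage_one`, `stage_two`, and the toy stand-in models are all hypothetical, the training of the specialized VLM is omitted, and the assumption that the specialized VLM sees both the image and its alt-text is ours rather than something the abstract states.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical callable shapes; real models would sit behind these.
DenseCaptioner = Callable[[str], str]           # image reference -> dense synthetic caption
KnowledgeAugmenter = Callable[[str, str], str]  # (dense caption, alt-text) -> augmented caption
SpecializedVLM = Callable[[str, str], str]      # (image reference, alt-text) -> augmented caption


@dataclass
class Sample:
    image: str         # image path or URL (placeholder for pixel data)
    alt_text: str      # web alt-text paired with the image
    caption: str = ""  # knowledge-augmented caption, filled in by a stage


def stage_one(samples: Iterable[Sample],
              captioner: DenseCaptioner,
              augmenter: KnowledgeAugmenter) -> List[Sample]:
    """Stage 1: a large VLM writes a dense caption, then an LLM merges in the alt-text."""
    out = []
    for s in samples:
        dense = captioner(s.image)
        out.append(Sample(s.image, s.alt_text, augmenter(dense, s.alt_text)))
    return out


def stage_two(samples: Iterable[Sample], vlm: SpecializedVLM) -> List[Sample]:
    """Stage 2: a specialized VLM (assumed already trained on stage-1 output)
    captions further images directly, which is what allows scaling up."""
    return [Sample(s.image, s.alt_text, vlm(s.image, s.alt_text)) for s in samples]


# Toy stand-ins so the sketch runs without any model weights.
def toy_captioner(image: str) -> str:
    return f"A detailed description of {image}."


def toy_augmenter(dense: str, alt_text: str) -> str:
    return f"{dense} Context from alt-text: {alt_text}"


def toy_specialized_vlm(image: str, alt_text: str) -> str:
    # Mimics the combined stage-1 behaviour in a single call.
    return toy_augmenter(toy_captioner(image), alt_text)


seed = [Sample("img_001.jpg", "Eiffel Tower at dusk")]
rest = [Sample("img_002.jpg", "Golden Gate Bridge in fog")]

stage1_out = stage_one(seed, toy_captioner, toy_augmenter)
stage2_out = stage_two(rest, toy_specialized_vlm)
```

In this reading, stage 1 is the expensive path (two large models per image) and is run on a seed subset, while stage 2 replaces both models with a single cheaper one for the remainder of the corpus.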
