CV LG Jul 13, 2023

Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

Yiren Jian, Chongyang Gao, Soroush Vosoughi

arXiv:2307.07063v419.350 citations h-index: 31 Has Code

Originality Incremental advance

AI Analysis

This work addresses the resource cost of vision-language pre-training. It offers a flexible, modality-agnostic method and is an incremental advance over existing approaches.

The paper tackles the efficient use of frozen large language models in vision-language pre-training. Rather than selecting the most relevant visual features, it identifies the prompts that those features should align with, which improves the BLIP-2 baseline and narrows the gap between models trained on 4M and 129M image-text pairs.

We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language (VL) pre-training. The current paradigm uses visual features as prompts to guide language models, with a focus on determining the most relevant visual features for corresponding text. Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features. We introduce the Prompt-Transformer (P-Former), a model that predicts these ideal prompts, which is trained exclusively on linguistic data, bypassing the need for image-text pairings. This strategy subtly bifurcates the end-to-end VL training process into an additional, separate stage. Our experiments reveal that our framework significantly enhances the performance of a robust image-to-text baseline (BLIP-2), and effectively narrows the performance gap between models trained with either 4M or 129M image-text pairs. Importantly, our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task using varied base modules. The code will be made available at https://github.com/yiren-jian/BLIText.
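The core idea in the abstract, finding a prompt that makes a frozen language model produce a given text, can be illustrated with a toy sketch. This is not the authors' implementation: the "language model" here is a hypothetical fixed linear map `FROZEN_W` over a three-token vocabulary, and `optimize_prompt` is an illustrative helper that fits a prompt vector by gradient descent on cross-entropy while the model weights stay untouched. In the paper, a P-Former would then be trained to predict such prompts from text alone; that stage is not shown.

```python
import math

# Hypothetical frozen "language model": a fixed linear map from a 2-d prompt
# vector to logits over a 3-token vocabulary. It is never updated.
FROZEN_W = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, -1.0],
]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_from_prompt(W, prompt):
    # logits[v] = sum_j W[v][j] * prompt[j]
    return [sum(w * p for w, p in zip(row, prompt)) for row in W]

def optimize_prompt(W, target, dim, steps=200, lr=0.5):
    """Fit a prompt so the frozen model assigns high probability to `target`."""
    prompt = [0.0] * dim
    for _ in range(steps):
        probs = softmax(logits_from_prompt(W, prompt))
        # Gradient of cross-entropy w.r.t. logits: probs - one_hot(target).
        grad_logits = [p - (1.0 if v == target else 0.0)
                       for v, p in enumerate(probs)]
        # Chain rule back to the prompt; only the prompt receives an update.
        grad_prompt = [sum(grad_logits[v] * W[v][j] for v in range(len(W)))
                       for j in range(dim)]
        prompt = [p - lr * g for p, g in zip(prompt, grad_prompt)]
    return prompt

def predicted_token(W, prompt):
    logits = logits_from_prompt(W, prompt)
    return max(range(len(logits)), key=lambda v: logits[v])

if __name__ == "__main__":
    for target in range(len(FROZEN_W)):
        prompt = optimize_prompt(FROZEN_W, target, dim=2)
        print(target, predicted_token(FROZEN_W, prompt))
```

Because only language-side quantities appear here (a target token and a frozen decoder), no image is needed to obtain the prompt, which is the sense in which the language stage is decoupled from image-text pairs.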