CL AI CV LG | Jan 31, 2023

Grounding Language Models to Images for Multimodal Inputs and Outputs

Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried

CMU

arXiv:2301.13823v4 | 22.8 | 161 citations | h-index: 119 | Has Code

Originality: Incremental advance

AI Analysis

The paper offers a general approach for using pretrained language models in visually grounded settings. The advance is incremental: it builds on existing models and adds cross-modality layers.

The authors address how to enable text-only language models to process and generate interleaved image-and-text data by grounding them in the visual domain. They report strong zero-shot performance on tasks such as contextual image retrieval and multimodal dialogue.

We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method leverages the abilities of language models learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings.
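The abstract describes a frozen language model with trainable input and output linear layers. The following is a minimal sketch of that idea, not the authors' implementation. The frozen language model is not modeled at all; the class name `GroundedLMSketch`, its methods, the dimensions, and the use of cosine similarity in a shared retrieval space are all hypothetical choices made for illustration.

```python
import math
import random


def linear(weights, vec):
    """Apply a bias-free linear map; `weights` is a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def init_matrix(rows, cols, rng):
    """Small random initial weights for a trainable layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GroundedLMSketch:
    """Hypothetical container for the only trainable parts: linear layers.

    The language model itself would stay frozen and is omitted here.
    """

    def __init__(self, visual_dim, lm_dim, retrieval_dim, seed=0):
        rng = random.Random(seed)
        # Input layer: visual features -> language model input embedding space.
        self.input_proj = init_matrix(lm_dim, visual_dim, rng)
        # Output layers (assumed design): map a language model hidden state
        # and candidate image features into one shared retrieval space.
        self.text_out_proj = init_matrix(retrieval_dim, lm_dim, rng)
        self.image_out_proj = init_matrix(retrieval_dim, visual_dim, rng)

    def interleave(self, segments):
        """Build one input sequence from interleaved text and image segments.

        Each segment is ("text", [token_embedding, ...]) or
        ("image", visual_features). Text embeddings are assumed to come
        from the frozen model's own embedding table.
        """
        sequence = []
        for kind, payload in segments:
            if kind == "text":
                sequence.extend(payload)
            elif kind == "image":
                sequence.append(linear(self.input_proj, payload))
            else:
                raise ValueError(f"unknown segment kind: {kind}")
        return sequence

    def retrieve(self, lm_hidden, candidate_images):
        """Return (index of best image, all scores) for one hidden state."""
        query = linear(self.text_out_proj, lm_hidden)
        scores = [
            cosine(query, linear(self.image_out_proj, features))
            for features in candidate_images
        ]
        best = max(range(len(scores)), key=scores.__getitem__)
        return best, scores
```

In this sketch, training would update only the three projection matrices, which is what keeps the approach cheap relative to finetuning the language model. The sequence returned by `interleave` is what a frozen model would consume in place of ordinary token embeddings.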
