CL, AI, CV, LG | Jan 31, 2023

Grounding Language Models to Images for Multimodal Inputs and Outputs

CMU
arXiv:2301.13823v4 | 161 citations | h-index: 119
Originality: Incremental advance
AI Analysis

The method offers a general way to use pretrained language models in visually grounded settings. It is rated incremental because it builds on existing models and adds only cross-modality layers.

The authors tackled the problem of enabling text-only language models to process and generate interleaved image-and-text data by grounding them to the visual domain, achieving strong zero-shot performance on tasks like contextual image retrieval and multimodal dialogue.

We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method leverages the abilities of language models learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings.
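The core idea in the abstract (a frozen language model plus small trainable input and output linear layers) can be illustrated with a toy sketch. This is not the authors' implementation. The dimensions, the function names (`linear`, `interleave`, `retrieve`), the separate image-side output projection, and the use of cosine similarity for retrieval are all assumptions made for illustration, and the frozen language model itself is omitted.

```python
import math
import random


def linear(weights, x):
    """Apply a bias-free linear layer; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def init_linear(out_dim, in_dim, rng):
    """Randomly initialise a trainable linear layer (hypothetical helper)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def interleave(segments, token_embed, w_in):
    """Build one input sequence for the frozen LM from mixed segments.

    ("text", token_id) uses the frozen embedding table;
    ("image", features) goes through the trainable input layer `w_in`.
    """
    sequence = []
    for kind, value in segments:
        if kind == "text":
            sequence.append(token_embed[value])
        else:
            sequence.append(linear(w_in, value))
    return sequence


def retrieve(hidden, image_feats, w_out_text, w_out_img):
    """Pick the candidate image closest to an LM hidden state.

    Assumption: both sides are projected into a shared retrieval space
    and compared with cosine similarity.
    """
    query = linear(w_out_text, hidden)
    scores = [cosine(query, linear(w_out_img, f)) for f in image_feats]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy dimensions, chosen only for the demo.
VIS_DIM, LM_DIM, RET_DIM = 4, 6, 3
rng = random.Random(0)

token_embed = {i: [rng.gauss(0.0, 1.0) for _ in range(LM_DIM)] for i in range(5)}
w_in = init_linear(LM_DIM, VIS_DIM, rng)         # trainable input layer
w_out_text = init_linear(RET_DIM, LM_DIM, rng)   # trainable output layer
w_out_img = init_linear(RET_DIM, VIS_DIM, rng)   # trainable image projection

image = [rng.gauss(0.0, 1.0) for _ in range(VIS_DIM)]
sequence = interleave([("text", 1), ("image", image), ("text", 3)], token_embed, w_in)

# Stand-in for the frozen LM's final hidden state: here just the last input vector.
hidden = sequence[-1]
candidates = [[rng.gauss(0.0, 1.0) for _ in range(VIS_DIM)] for _ in range(3)]
best_index, scores = retrieve(hidden, candidates, w_out_text, w_out_img)
print(len(sequence), best_index)
```

In a real setup only the linear-layer weights would receive gradient updates, while the language model's parameters stay fixed, which is what keeps the approach cheap to train.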

Code Implementations: 2 repos
