CV · CL · LG · Jun 25, 2021

Multimodal Few-Shot Learning with Frozen Language Models

arXiv:2106.13884v2 · 970 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of enabling multimodal AI systems to adapt quickly with minimal data. The advance is incremental because it builds on the existing few-shot capabilities of language models.

The paper tackles the problem of transferring few-shot learning capabilities from large language models to multimodal (vision and language) settings, resulting in a system that can rapidly learn new tasks like object recognition and visual question-answering with only a few examples, achieving competitive performance on various benchmarks.

When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.
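
The interleaved prompt described in the abstract can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the names `encode_image`, `embed_text`, and `build_prompt`, the sizes `D_MODEL` and `N_PREFIX`, and the use of a single linear map in place of a real vision encoder are all assumptions made for the example. It shows only how each image becomes a short sequence of continuous embeddings that sits in the same sequence as the frozen language model's text embeddings.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]

D_MODEL = 8    # hypothetical embedding width of the language model
N_PREFIX = 2   # hypothetical number of prefix embeddings per image


def embed_text(tokens: Sequence[int], table: List[Vector]) -> List[Vector]:
    """Look up token embeddings in the frozen language model's table."""
    return [table[t] for t in tokens]


def encode_image(pixels: Sequence[float], weights: List[List[float]]) -> List[Vector]:
    """Stand-in for the trainable vision encoder (assumption: one linear map).

    Produces N_PREFIX vectors of width D_MODEL for a single image.
    """
    flat = [sum(w * p for w, p in zip(row, pixels)) for row in weights]
    return [flat[i * D_MODEL:(i + 1) * D_MODEL] for i in range(N_PREFIX)]


def build_prompt(
    examples: Sequence[Tuple[Sequence[float], Sequence[int]]],
    query_image: Sequence[float],
    query_tokens: Sequence[int],
    table: List[Vector],
    weights: List[List[float]],
) -> List[Vector]:
    """Interleave image prefixes and text embeddings into one sequence."""
    seq: List[Vector] = []
    for pixels, tokens in examples:
        seq.extend(encode_image(pixels, weights))   # image as a prefix
        seq.extend(embed_text(tokens, table))       # its caption or answer
    seq.extend(encode_image(query_image, weights))  # the query image
    seq.extend(embed_text(query_tokens, table))     # text the model continues
    return seq


rng = random.Random(0)
VOCAB, N_PIXELS = 16, 4  # toy sizes, chosen only for this example
table = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(VOCAB)]
weights = [[rng.uniform(-1, 1) for _ in range(N_PIXELS)]
           for _ in range(N_PREFIX * D_MODEL)]

# Two support examples (image, text tokens) followed by one query.
examples = [([0.1, 0.2, 0.3, 0.4], [1, 2, 3]),
            ([0.5, 0.6, 0.7, 0.8], [4, 5])]
prompt = build_prompt(examples, [0.9, 0.8, 0.7, 0.6], [6], table, weights)
print(len(prompt), "embeddings of width", len(prompt[0]))
```

In a training setup matching the abstract, only the parameters behind `encode_image` would be updated, using a captioning loss computed through the frozen language model, while `table` and the rest of the language model stay fixed.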

