CL · Jun 2, 2023

MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models

arXiv:2306.01311v1227 citations · h-index: 64
Originality: Incremental advance
AI Analysis

The paper addresses the lack of in-context learning in vision-language models. The advance is incremental because it adapts existing language-model techniques to a multimodal domain.

The paper enables in-context learning in vision-language models by transferring that ability from a language model. The resulting model outperforms the baseline on VQA, OK-VQA, and GQA with 20 times fewer parameters.

Large-scale language models have shown the ability to adapt to a new task via conditioning on a few demonstrations (i.e., in-context learning). However, in the vision-language domain, most large-scale pre-trained vision-language (VL) models do not possess the ability to conduct in-context learning. How can we enable in-context learning for VL models? In this paper, we study an interesting hypothesis: can we transfer the in-context learning ability from the language domain to the VL domain? Specifically, we first meta-train a language model to perform in-context learning on NLP tasks (as in MetaICL); then we transfer this model to perform VL tasks by attaching a visual encoder. Our experiments suggest that in-context learning ability can indeed be transferred across modalities: our model considerably improves the in-context learning capability on VL tasks and can even compensate for the size of the model significantly. On VQA, OK-VQA, and GQA, our method could outperform the baseline model while having 20 times fewer parameters.
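
To make the in-context setup concrete, here is a minimal sketch of how a few-shot VQA episode might be laid out for a language model with an attached visual encoder. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the prompt template or how visual features enter the model, so the `Example` type, the `build_episode` and `render_prompt` helpers, the "Question: ... Answer: ..." template, and the `<image:...>` placeholders are all hypothetical. The sketch assumes each image occupies a slot that a visual encoder would later fill with embeddings.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    """One VQA item. All field names are hypothetical."""
    image_id: str      # handle for an image; a visual encoder would embed it
    question: str
    answer: str = ""   # left empty for the query the model must answer


Segment = Tuple[str, str]  # ("image", image_id) or ("text", string)


def build_episode(demos: List[Example], query: Example) -> List[Segment]:
    """Lay out k demonstrations followed by one unanswered query.

    Each image becomes an ("image", id) slot; in a real system that slot
    would be replaced by visual-encoder outputs (an assumption here).
    """
    segments: List[Segment] = []
    for ex in demos:
        segments.append(("image", ex.image_id))
        segments.append(("text", f"Question: {ex.question} Answer: {ex.answer}"))
    # The query repeats the same template but leaves the answer open.
    segments.append(("image", query.image_id))
    segments.append(("text", f"Question: {query.question} Answer:"))
    return segments


def render_prompt(segments: List[Segment]) -> str:
    """Flatten segments to text, marking image slots with placeholders."""
    lines = []
    for kind, value in segments:
        lines.append(f"<image:{value}>" if kind == "image" else value)
    return "\n".join(lines)


if __name__ == "__main__":
    demos = [
        Example("img_1", "What color is the bus?", "red"),
        Example("img_2", "How many dogs are there?", "two"),
    ]
    query = Example("img_3", "What is on the table?")
    print(render_prompt(build_episode(demos, query)))
```

The model is conditioned on the demonstrations and asked to continue after the final "Answer:", with no gradient updates at test time; that conditioning step is what the abstract calls in-context learning.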
