CV · CL · Dec 9, 2021

MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning

arXiv:2112.05253v2 · 304 citations
Originality: Incremental advance
AI Analysis

This work aims to make multimodal generative modeling simpler and more effective for researchers and practitioners. The advance is incremental, since it builds on prior methods such as Frozen.

The paper targets two limitations of existing vision-language models: their reliance on labeled data and their complex multi-step pretraining objectives. It introduces MAGMA, a method that augments generative language models with visual inputs using adapter-based finetuning. MAGMA achieves state-of-the-art results on the OKVQA benchmark and competitive performance on other benchmarks, while pretraining on 0.2% of the number of samples used to train SimVLM.

Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA - a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen, we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. The pretraining is entirely end-to-end using a single language modeling objective, simplifying optimization compared to previous approaches. Importantly, the language model weights remain unchanged during training, allowing for transfer of encyclopedic knowledge and in-context learning abilities from language pretraining. MAGMA outperforms Frozen on open-ended generative tasks, achieving state-of-the-art results on the OKVQA benchmark and competitive results on a range of other popular VL benchmarks, while pretraining on 0.2% of the number of samples used to train SimVLM.
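The core idea in the abstract (language model weights stay frozen, and only small added modules are trained) can be illustrated with a toy, dependency-free sketch. Everything here is an assumption made for illustration: the names `BottleneckAdapter`, `frozen_layer`, and `build_input`, the residual bottleneck form, the zero-initialized up-projection, and the image-prefix layout are not taken from the paper or its repository.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class BottleneckAdapter:
    """Hypothetical residual bottleneck adapter: h + W_up @ relu(W_down @ h)."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection gets small random weights.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection starts at zero, so the adapter is initially the identity.
        # This is a common choice assumed here, not a detail from the paper.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, h):
        z = [max(0.0, v) for v in matvec(self.w_down, h)]  # ReLU bottleneck
        delta = matvec(self.w_up, z)
        return [hi + di for hi, di in zip(h, delta)]       # residual connection

    def num_params(self):
        return sum(len(r) for r in self.w_down) + sum(len(r) for r in self.w_up)


def frozen_layer(h):
    """Stand-in for a frozen language-model layer: fixed, never updated."""
    return [2.0 * v for v in h]


def build_input(image_embeddings, text_embeddings):
    """Assumed layout: image embeddings are prepended to text token embeddings."""
    return image_embeddings + text_embeddings


adapter = BottleneckAdapter(dim=4, bottleneck=2)
sequence = build_input([[1.0, 0.0, 0.0, 0.0]], [[0.0, 1.0, 0.0, 0.0]])
outputs = [adapter(frozen_layer(h)) for h in sequence]
```

In this sketch only the adapter's weights (`w_down`, `w_up`) would be updated during training; `frozen_layer` never changes, which is the property the abstract credits for preserving knowledge from language pretraining. Because the up-projection starts at zero, the adapter initially passes the frozen layer's output through unchanged.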

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
