MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning
This work aims to simplify and improve multimodal generative modeling for researchers and practitioners. The contribution is incremental, since it builds on prior methods such as Frozen.
The paper addresses limitations of existing vision-language models by introducing MAGMA, a method that augments generative language models with visual inputs through adapter-based finetuning. MAGMA achieves state-of-the-art results on the OKVQA benchmark and competitive performance on other benchmarks, while pretraining on only 0.2% of the number of samples used to train SimVLM.
Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA, a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen, we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. The pretraining is entirely end-to-end using a single language modeling objective, simplifying optimization compared to previous approaches. Importantly, the language model weights remain unchanged during training, allowing for transfer of encyclopedic knowledge and in-context learning abilities from language pretraining. MAGMA outperforms Frozen on open-ended generative tasks, achieving state-of-the-art results on the OKVQA benchmark and competitive results on a range of other popular VL benchmarks, while pretraining on 0.2% of the number of samples used to train SimVLM.
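
To make the idea of adapter-based finetuning with unchanged language model weights concrete, the following is a minimal sketch in plain Python. It is not the MAGMA implementation: the abstract does not specify the adapter architecture, so the residual bottleneck form, the zero initialization of the up-projection, and the names `FrozenLinear` and `BottleneckAdapter` are all assumptions chosen for illustration.

```python
import random


def matvec(W, x):
    """Multiply a matrix (sequence of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    """Elementwise ReLU."""
    return [max(0.0, a) for a in v]


class FrozenLinear:
    """Stand-in for a pretrained language model layer (hypothetical).

    The weights are stored as tuples, so they cannot be modified in place.
    This mimics keeping the language model weights unchanged during training.
    """

    def __init__(self, W):
        self.W = tuple(tuple(row) for row in W)

    def __call__(self, x):
        return matvec(self.W, x)


class BottleneckAdapter:
    """Assumed adapter form: h -> h + W_up @ relu(W_down @ h).

    Only W_down and W_up would be updated during finetuning.
    """

    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = random.Random(seed)
        self.d_model = d_model
        self.d_bottleneck = d_bottleneck
        # Small random down-projection (d_bottleneck x d_model).
        self.W_down = [
            [rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(d_bottleneck)
        ]
        # Zero up-projection (d_model x d_bottleneck): the adapter starts
        # as the identity, so the frozen model's behavior is preserved
        # before any training step. This initialization is an assumption.
        self.W_up = [[0.0] * d_bottleneck for _ in range(d_model)]

    def __call__(self, h):
        delta = matvec(self.W_up, relu(matvec(self.W_down, h)))
        return [a + b for a, b in zip(h, delta)]

    def num_params(self):
        """Number of trainable adapter parameters."""
        return 2 * self.d_model * self.d_bottleneck


def adapted_layer(frozen, adapter, x):
    """Apply the frozen layer, then the trainable adapter on its output."""
    return adapter(frozen(x))


# Usage: a 4-dimensional frozen layer with a 2-dimensional bottleneck.
layer = FrozenLinear([[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)])
adapter_demo = BottleneckAdapter(d_model=4, d_bottleneck=2)
print(adapted_layer(layer, adapter_demo, [1.0, -2.0, 0.5, 3.0]))
print("trainable adapter parameters:", adapter_demo.num_params())
```

In this sketch the trainable parameter count scales with the bottleneck width rather than with the size of the frozen layer, which is the usual motivation for adapters. A real setup would use an autodiff framework and would also train the components that map visual input into the language model's input space.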