LG CL CV

Sep 27, 2023

Jointly Training Large Autoregressive Multimodal Models

Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, Barlas Oguz

Meta AI

arXiv:2309.15564v2

24.835 citations

h-index: 21

Originality: Incremental advance

AI Analysis

This work addresses multimodal generation, that is, producing text and images from a single model. The overall approach is novel, although the way it integrates existing methods is incremental.

The authors tackle the challenge of integrating text and image generation into a single robust model that produces seamless multimodal outputs. They report that the resulting model demonstrates unparalleled performance in generating high-quality multimodal outputs.

In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet, integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models. We also introduce a specialized, data-efficient instruction-tuning strategy, tailored for mixed-modal generation tasks. Our final instruct-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs and represents the first model explicitly designed for this purpose.
