CV · AI · CL · Dec 28, 2023

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

AI2
arXiv:2312.17172v1 · 325 citations · h-index: 49 · CVPR
Originality: Incremental advance
AI Analysis

This paper addresses the need for unified AI systems that can handle diverse tasks across vision, language, audio, and action. The advance is incremental, since it builds on prior multimodal work.

The authors address the problem of building a single model that can understand and generate across multiple modalities (image, text, audio, action). Their model, Unified-IO 2, achieves state-of-the-art performance on the GRIT benchmark and strong results on more than 35 benchmarks.

We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs -- images, text, audio, action, bounding boxes, etc., into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus from diverse sources with a multimodal mixture of denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results in more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.
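To make the idea of tokenizing a structured output such as a bounding box into a shared vocabulary concrete, here is a minimal sketch. It is not the authors' implementation: the bin count (`NUM_BINS`), the text vocabulary size (`TEXT_VOCAB_SIZE`), and the function names `box_to_tokens` and `tokens_to_box` are all assumptions chosen for illustration. The sketch quantizes normalized box coordinates into discrete bins and maps each bin to a token id placed after the text vocabulary, so that boxes and text can share one token sequence.

```python
from typing import List, Tuple

# Assumptions for illustration only; the abstract does not state these values.
NUM_BINS = 1000          # hypothetical number of coordinate bins
TEXT_VOCAB_SIZE = 32000  # hypothetical size of the text vocabulary

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def box_to_tokens(box: Box, width: float, height: float) -> List[int]:
    """Quantize a pixel-space box into four token ids after the text vocab."""
    x0, y0, x1, y1 = box
    normalized = (x0 / width, y0 / height, x1 / width, y1 / height)
    tokens = []
    for value in normalized:
        clipped = min(max(value, 0.0), 1.0)
        # A value of exactly 1.0 falls into the last bin.
        bin_index = min(int(clipped * NUM_BINS), NUM_BINS - 1)
        tokens.append(TEXT_VOCAB_SIZE + bin_index)
    return tokens


def tokens_to_box(tokens: List[int], width: float, height: float) -> Box:
    """Map four coordinate tokens back to pixel space using bin centers."""
    centers = [((t - TEXT_VOCAB_SIZE) + 0.5) / NUM_BINS for t in tokens]
    return (
        centers[0] * width,
        centers[1] * height,
        centers[2] * width,
        centers[3] * height,
    )


if __name__ == "__main__":
    ids = box_to_tokens((0, 0, 100, 50), width=200, height=100)
    print(ids)
    print(tokens_to_box(ids, width=200, height=100))
```

Decoding returns bin centers, so the round trip is lossy by at most one bin width per coordinate. Under this scheme a model only ever sees integer token ids, which is what lets a single sequence model emit text and coordinates from the same output head.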

Code Implementations: 1 repo
