CV
Sep 27, 2024

Emu3: Next-Token Prediction is All You Need

Tsinghua
arXiv:2409.18869v10.45694 citations
h-index: 15
Has Code
AI Analysis: 70

This paper addresses the problem of simplifying multimodal AI for researchers and practitioners by eliminating the need for diffusion or compositional architectures, though its advance over existing next-token prediction methods is incremental.

The paper tackles the challenge of excelling at multimodal tasks with next-token prediction alone. It introduces Emu3, a suite of models trained solely with this objective, which outperforms established models such as SDXL and LLaVA-1.6 in generation and perception tasks.

While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.
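The core idea in the abstract, mapping text and images into one discrete token space and training on next-token prediction, can be illustrated with a minimal sketch. This is not the Emu3 implementation: the vocabulary sizes, the boundary tokens `BOI`/`EOI`, and the helper names `to_unified_sequence` and `next_token_pairs` are all hypothetical, chosen only to show how one sequence and one objective can cover both modalities.

```python
from typing import List, Tuple

# Hypothetical sizes for illustration; not taken from the paper.
TEXT_VOCAB = 1000      # ids 0..999 are text tokens
VISION_VOCAB = 4096    # ids 1000..5095 are discrete image codes
BOI = TEXT_VOCAB + VISION_VOCAB  # hypothetical "begin of image" token
EOI = BOI + 1                    # hypothetical "end of image" token


def to_unified_sequence(text_ids: List[int], image_codes: List[int]) -> List[int]:
    """Place text ids and image codes in one shared id space."""
    if any(not 0 <= t < TEXT_VOCAB for t in text_ids):
        raise ValueError("text id out of range")
    if any(not 0 <= c < VISION_VOCAB for c in image_codes):
        raise ValueError("image code out of range")
    # Offset image codes so they do not collide with text ids.
    shifted = [TEXT_VOCAB + c for c in image_codes]
    return text_ids + [BOI] + shifted + [EOI]


def next_token_pairs(seq: List[int]) -> Tuple[List[int], List[int]]:
    """Shift by one: the model sees seq[:-1] and predicts seq[1:]."""
    return seq[:-1], seq[1:]


seq = to_unified_sequence([5, 7], [0, 4095])
inputs, targets = next_token_pairs(seq)
print(seq)      # [5, 7, 5096, 1000, 5095, 5097]
print(targets)  # [7, 5096, 1000, 5095, 5097]
```

Because every position, text or image, is just an id in the same vocabulary, a single transformer with a single cross-entropy loss over `targets` can in principle be trained on such sequences, which is the simplification the abstract describes.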

Code Implementations: 2 repos
