CV · AI · Jan 5

NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

arXiv:2601.02204v1 · 8 citations · h-index: 19
Originality: Highly original
AI Analysis

This work targets multimodal understanding and generation, with applications such as image editing and video generation. It is assessed as a novel method rather than an incremental improvement.

The paper proposes NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. It generates 1024x1024 images in 5 seconds and achieves state-of-the-art performance among unified models.

We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking abilities of image editing, interleaved content and video generation. Motivated by the distinct nature of modalities - where text is strictly sequential and images are inherently hierarchical - we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
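
To illustrate why next-scale prediction can be much faster than raster-scan decoding, the sketch below counts sequential decoding steps under each scheme. Raster-scan emits one token per step, while next-scale prediction emits every token of a scale in one step. The 64x64 token grid (a 1024x1024 image with an assumed 16x downsampling tokenizer) and the doubling scale schedule are hypothetical values chosen for illustration, not the configuration reported in the paper.

```python
from typing import Sequence

# Hypothetical assumption: a 1024x1024 image tokenized with 16x downsampling
# gives a 64x64 grid at the finest scale. Not taken from the paper.
FINEST_SIDE = 64

# Hypothetical coarse-to-fine schedule of grid side lengths (doubling each scale).
SCALES = [1, 2, 4, 8, 16, 32, 64]


def raster_scan_steps(grid_side: int) -> int:
    """Sequential steps when tokens are emitted one at a time in raster order."""
    return grid_side * grid_side


def next_scale_steps(scales: Sequence[int]) -> int:
    """Sequential steps when each scale's tokens are emitted together."""
    return len(scales)


def total_tokens(scales: Sequence[int]) -> int:
    """Total tokens produced across all scales (each scale is a square grid)."""
    return sum(side * side for side in scales)


if __name__ == "__main__":
    print("raster-scan steps:", raster_scan_steps(FINEST_SIDE))
    print("next-scale steps: ", next_scale_steps(SCALES))
    print("next-scale tokens:", total_tokens(SCALES))
```

Under these assumptions the multi-scale scheme produces more tokens in total but needs far fewer sequential forward passes, which is the source of the speedup. Actual latency also depends on the cost of each pass, which this sketch does not model.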

