OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation
This work addresses the need for efficient, unified multimodal models with an architecture that eliminates external components at inference and reduces decoding steps, though it appears incremental in that it builds on existing autoregressive and Mixture-of-Experts (MoE) techniques.
The paper tackles unified multimodal intelligence by introducing OneCAT, a decoder-only autoregressive model that integrates understanding, generation, and editing. It reports outperforming existing open-source unified multimodal models across benchmarks, with significant efficiency gains for high-resolution inputs.
We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a novel, pure decoder-only transformer architecture. Our framework uniquely eliminates the need for external components such as Vision Transformers (ViT) or vision tokenizers during inference, leading to significant efficiency gains, especially for high-resolution inputs. This is achieved through a modality-specific Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR) objective, which also natively supports dynamic resolutions. Furthermore, we pioneer a multi-scale visual autoregressive mechanism within the Large Language Model (LLM) that drastically reduces decoding steps compared to diffusion-based methods while maintaining state-of-the-art performance. Our findings demonstrate the powerful potential of pure autoregressive modeling as a sufficient and elegant foundation for unified multimodal intelligence. As a result, OneCAT sets a new performance standard, outperforming existing open-source unified multimodal models across benchmarks for multimodal generation, editing, and understanding.
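To make the idea of a modality-specific MoE more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each token carries a modality label and that routing is a deterministic lookup on that label rather than a learned gate; the abstract does not spell out the routing rule. The names `make_scaling_expert`, `route_by_modality`, and the labels `"text"` and `"image"` are hypothetical, and each "expert" is a toy stand-in for a feed-forward block.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_scaling_expert(scale: float) -> Expert:
    """Toy stand-in for an expert FFN: multiplies every feature by `scale`."""
    def expert(x: Vector) -> Vector:
        return [scale * v for v in x]
    return expert


def route_by_modality(
    tokens: Sequence[Tuple[str, Vector]],
    experts: Dict[str, Expert],
) -> List[Vector]:
    """Send each token to the expert registered for its modality label.

    Assumption: routing is a hard lookup on the label, so no gating
    network is involved. Token order is preserved in the output.
    """
    return [experts[modality](x) for modality, x in tokens]


# Hypothetical experts: one per modality, sharing the same interface.
EXPERTS: Dict[str, Expert] = {
    "text": make_scaling_expert(1.0),
    "image": make_scaling_expert(2.0),
}

# A mixed sequence of (modality, hidden state) pairs.
tokens = [
    ("text", [1.0, 2.0]),
    ("image", [1.0, 2.0]),
    ("text", [0.5, 0.5]),
]

outputs = route_by_modality(tokens, EXPERTS)
```

In this sketch the two tokens with identical hidden states produce different outputs solely because of their modality labels, which is the property a modality-specific expert layout is meant to provide. In a real transformer the experts would be learned feed-forward networks and the remaining layers, such as attention, would presumably be shared across modalities.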