CV · Sep 24, 2024

MonoFormer: One Transformer for Both Diffusion and Autoregression

arXiv:2409.16280v1 · 46 citations · h-index: 20
Originality: Incremental advance
AI Analysis

The paper addresses the inefficiency of maintaining separate models for multimodal generation, though the advance is incremental because it builds on existing applications of transformers.

The paper tackles the problem of using separate backbones for autoregressive text generation and diffusion-based visual generation. It proposes MonoFormer, a single transformer shared by both tasks, which achieves image generation performance comparable to state-of-the-art methods while maintaining text generation capability.

Most existing multimodal methods either use separate backbones for autoregression-based discrete text generation and diffusion-based continuous visual generation, or use a single backbone by discretizing the visual data so that autoregression serves both text and visual generation. In this paper, we study a simple idea: share one transformer for both autoregression and diffusion. The feasibility rests on two observations: (i) transformers have been successfully applied to diffusion for visual generation, and (ii) transformer training for autoregression and diffusion is very similar; the only difference is that diffusion uses a bidirectional attention mask while autoregression uses a causal attention mask. Experimental results show that our approach achieves image generation performance comparable to current state-of-the-art methods while maintaining text generation capability. The project is publicly available at https://monoformer.github.io/.
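
The difference between the two attention masks described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_attention_mask`, the `is_image` flag list, and the rule that all flagged positions belong to a single image are assumptions made for illustration. Text positions use a causal mask, and image positions may additionally attend to each other in both directions.

```python
def build_attention_mask(is_image):
    """Return an n x n boolean mask; mask[i][j] is True when token i may attend to token j.

    is_image: list of bools, True for positions holding image tokens.
    Assumption (hypothetical, for illustration): every image position
    belongs to the same image, so all of them may see each other.
    """
    n = len(is_image)
    # Causal base used for autoregression: each token sees itself and earlier tokens.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    # Bidirectional attention used for diffusion: image tokens see all image tokens.
    for i in range(n):
        for j in range(n):
            if is_image[i] and is_image[j]:
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Two text tokens followed by three image tokens.
    layout = [False, False, True, True, True]
    for row in build_attention_mask(layout):
        print("".join("1" if allowed else "0" for allowed in row))
```

In this sketch a single mask covers the whole sequence, so one transformer can process text and image tokens in the same forward pass; only the mask entries differ between the two modes.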

Code Implementations: 1 repo
