CV · Apr 20, 2025

Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens

arXiv:2504.14666v1 · 28 citations · h-index: 14 · CVPR
Originality: Highly original
AI Analysis

This paper addresses the challenge of integrating LLMs and diffusion models for multimodal AI. Its approach to visual token design is novel and could benefit applications in image generation and understanding, although the improvement over existing token designs appears incremental.

The paper tackles the problem of unifying visual comprehension and generation in multimodal large language models by proposing discrete diffusion timestep tokens to create a proper visual language, achieving superior performance in both tasks compared to existing methods.

Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLMs and diffusion models, the state-of-the-art in each task, respectively. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order (e.g., raster scan). However, we show that spatial tokens lack the recursive structure inherent to languages, and hence form an impossible language for LLMs to master. In this paper, we build a proper visual language by leveraging diffusion timesteps to learn discrete, recursive visual tokens. Our proposed tokens recursively compensate for the progressive attribute loss in noisy images as timesteps increase, enabling the diffusion model to reconstruct the original image at any timestep. This approach allows us to effectively integrate the strengths of LLMs in autoregressive reasoning and diffusion models in precise image generation, achieving seamless multimodal comprehension and generation within a unified framework. Extensive experiments show that we achieve superior performance for multimodal comprehension and generation simultaneously compared with other MLLMs. Project Page: https://DDT-LLaMA.github.io/.
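
The "progressive attribute loss ... as timesteps increase" in the abstract can be illustrated with a toy sketch. It is not the paper's implementation. It assumes the standard DDPM forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with a linear beta schedule, and it adds a purely hypothetical prefix rule (`tokens_for_timestep`) in which noisier timesteps receive a longer prefix of the token sequence. All function names and parameter values here are assumptions made for illustration.

```python
import math

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T (linear schedule, assumed values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bar(betas):
    """Cumulative products: alpha_bar_t = prod over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out

def signal_to_noise(alpha_bars):
    """Signal-to-noise ratio of x_t under the DDPM forward process."""
    return [ab / (1.0 - ab) for ab in alpha_bars]

def tokens_for_timestep(tokens, t, num_steps):
    """Hypothetical rule: the prefix length grows linearly with the timestep t."""
    k = math.ceil(len(tokens) * t / num_steps)
    return tokens[:k]

T = 1000
snr = signal_to_noise(alpha_bar(linear_beta_schedule(T)))
tokens = list(range(8))  # stand-in for discrete visual token ids

# Less image signal survives at larger t, so more tokens are supplied.
for t in (1, 250, 500, 1000):
    print(t, snr[t - 1], tokens_for_timestep(tokens, t, T))
```

Because every shorter prefix is contained in every longer one, the sequence has a nested structure, in contrast to a raster-scan ordering of patch tokens. How the actual tokens are learned and assigned to timesteps is described in the paper and is not reproduced here.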

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
