CV AI LG | Oct 15, 2025

End-to-End Multi-Modal Diffusion Mamba

Chunhao Lu, Qiang Lu, Meichen Dong, Jake Luo

arXiv:2510.13253v1 | 14.46 citations | h-index: 4

Originality: Highly original

AI Analysis

The paper addresses unified multi-modal processing for AI applications that must handle high-dimensional data such as images and text at the same time. It represents a novel direction rather than an incremental improvement.

The paper targets a limitation of current end-to-end multi-modal models: separate encoders and decoders hinder joint representation learning. It proposes MDM, a unified architecture that combines a Mamba-based diffusion model with a single variational autoencoder. According to the authors, MDM significantly outperforms existing end-to-end models and is competitive with SOTA models such as GPT-4V on tasks including image generation and visual question answering.

Current end-to-end multi-modal models utilize different encoders and decoders to process input and output information. This separation hinders the joint representation learning of various modalities. To unify multi-modal processing, we propose a novel architecture called MDM (Multi-modal Diffusion Mamba). MDM utilizes a Mamba-based multi-step selection diffusion model to progressively generate and refine modality-specific information through a unified variational autoencoder for both encoding and decoding. This innovative approach allows MDM to achieve superior performance when processing high-dimensional data, particularly in generating high-resolution images and extended text sequences simultaneously. Our evaluations in areas such as image generation, image captioning, visual question answering, text comprehension, and reasoning tasks demonstrate that MDM significantly outperforms existing end-to-end models (e.g., MonoFormer, LlamaGen, and Chameleon) and competes effectively with SOTA models like GPT-4V, Gemini Pro, and Mistral. Our results validate MDM's effectiveness in unifying multi-modal processes while maintaining computational efficiency, establishing a new direction for end-to-end multi-modal architectures.
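
For readers unfamiliar with the "Mamba-based" part of the abstract, the sketch below illustrates the generic selective state-space recurrence that Mamba layers build on: the step size and the input/output gates depend on the current input, so the state can keep or forget information per token. It is not MDM's architecture. The function names, the scalar state, and the scalar weights `w_delta`, `w_b`, and `w_c` are hypothetical simplifications chosen for illustration; real Mamba layers use learned projections and a vector-valued state per channel.

```python
import math


def softplus(v):
    # Numerically stable softplus; keeps the step size non-negative.
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Scalar-state selective SSM scan over a 1-D sequence.

    All weights are hypothetical scalars used only for illustration.
    """
    h = 0.0  # hidden state carried across the sequence
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)  # input-dependent step size
        b = w_b * x                    # input-dependent input gate
        c = w_c * x                    # input-dependent output gate
        a_bar = math.exp(delta * a)    # discretized decay, in (0, 1] when a < 0
        h = a_bar * h + delta * b * x  # state update (simplified discretization)
        ys.append(c * h)               # readout
    return ys


if __name__ == "__main__":
    print(selective_scan([1.0, 0.0, 2.0]))
```

Because `delta`, `b`, and `c` are recomputed from each input, the scan runs in time linear in the sequence length while still letting the content of each token decide how much of the past is retained. That linear cost is the usual motivation for using Mamba-style layers on long sequences.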
