CV · May 4

Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

arXiv:2605.02641 · 98.41 citations · Has Code
AI Analysis

For multimodal AI researchers, this work provides a scalable and efficient unified model for both understanding and generation, with strong empirical results in video editing and generation tasks.

Mamoda2.5 introduces a unified AR-Diffusion framework whose Diffusion Transformer backbone uses a 25B-parameter MoE design (128 experts, Top-8 routing) that activates only 3B parameters. It achieves top-tier generation on VBench 2.0 and state-of-the-art video editing quality on OpenVE-Bench, matching proprietary models such as Kling O1. Joint few-step distillation and reinforcement learning compress the 30-step editing model to 4 steps, giving up to 95.9x faster video editing inference than open-source baselines.

We present Mamoda2.5, a unified AR-Diffusion framework that seamlessly integrates multimodal understanding and generation within a single architecture. To efficiently enhance the model's generation capability, we equip the Diffusion Transformer backbone with a fine-grained Mixture-of-Experts (MoE) design (128 experts, Top-8 routing), yielding a 25B-parameter model that activates only 3B parameters, significantly reducing training costs while scaling up model capacity. Mamoda2.5 achieves top-tier generation performance on VBench 2.0 and sets a new record in video editing quality, surpassing the evaluated open-source models and matching the performance of current top-tier proprietary models, including Kling O1, on OpenVE-Bench. Furthermore, we introduce a joint few-step distillation and reinforcement learning framework that compresses the 30-step editing model into a 4-step model and greatly accelerates inference. Compared to open-source baselines, Mamoda2.5 achieves up to $95.9\times$ faster video editing inference. In real-world applications, Mamoda2.5 has been successfully deployed for content moderation and creative restoration tasks in advertising, achieving a 98% success rate in internal advertising video editing scenarios.
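
To make the "128 experts, Top-8 routing" phrase concrete, here is a minimal sketch of top-k expert routing for a single token. The abstract does not describe the router, so the details are assumptions rather than the authors' implementation: the softmax over all experts followed by renormalization over the selected ones is one common choice, and the names `route_token` and `moe_forward` and the toy experts are hypothetical. Only the two constants come from the abstract.

```python
import math
import random

NUM_EXPERTS = 128  # from the abstract
TOP_K = 8          # from the abstract


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and return (indices, gate weights).

    Assumption: softmax over all experts, keep the top_k, renormalize so the
    kept gates sum to 1. The paper's actual router may differ.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in top)
    return top, [probs[i] / kept for i in top]


def moe_forward(token, experts, router_logits, top_k=TOP_K):
    """Gate-weighted sum of the selected experts' outputs; other experts never run."""
    indices, gates = route_token(router_logits, top_k)
    out = [0.0] * len(token)
    for i, g in zip(indices, gates):
        expert_out = experts[i](token)
        out = [o + g * v for o, v in zip(out, expert_out)]
    return out


# Toy experts (hypothetical): expert i scales the token by (i + 1) / NUM_EXPERTS.
toy_experts = [
    (lambda x, s=(i + 1) / NUM_EXPERTS: [s * v for v in x])
    for i in range(NUM_EXPERTS)
]

random.seed(0)
demo_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
demo_indices, demo_gates = route_token(demo_logits)
print("experts used:", sorted(demo_indices))
print("output:", moe_forward([1.0, -2.0, 0.5], toy_experts, demo_logits))
```

Only 8 of the 128 experts run for each token, which is how a sparse model keeps its active parameter count well below its total. Note that 8/128 is 6.25% of the expert pool, while 3B of 25B is 12% of the parameters. The abstract does not break this down. One plausible reading, stated here as an assumption, is that non-expert parameters such as attention layers are always active.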

