CV · Nov 15, 2025

Mixture of States: Routing Token-Level Dynamics for Multimodal Generation

arXiv:2511.12207v1
h-index: 13
Originality: Highly original
AI Analysis

The paper addresses efficient and effective multimodal interaction in generative models, and it presents a new fusion paradigm rather than an incremental improvement.

The paper tackles multimodal fusion in diffusion models by introducing Mixture of States (MoS), a paradigm that uses a token-wise router to align modalities. It reports state-of-the-art results in text-to-image generation and editing, with models of 3B to 5B parameters matching or surpassing counterparts up to 4× larger.

We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-$k$ hidden states and is trained with an $ε$-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to $4\times$ larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
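The abstract's description of sparse top-$k$ selection with $ε$-greedy training can be illustrated with a small sketch. This is not the paper's implementation: the function name `route_top_k`, the per-token list of scores, and the softmax weighting of the selected states are all assumptions made for illustration. The abstract only states that the router picks the top-$k$ hidden states and is trained with an $ε$-greedy strategy.

```python
import math
import random


def route_top_k(scores, k, epsilon=0.0, rng=None):
    """Select k candidate hidden states for a single token (illustrative only).

    scores:  hypothetical router scores, one per candidate hidden state.
    epsilon: probability of exploring with a random selection.
    Returns (indices, weights); weights are a softmax over the chosen scores.
    """
    if rng is None:
        rng = random.Random()
    if rng.random() < epsilon:
        # Explore: k distinct states chosen uniformly at random.
        indices = rng.sample(range(len(scores)), k)
    else:
        # Exploit: the k highest-scoring states, best first.
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        indices = order[:k]
    # Numerically stable softmax restricted to the selected states (an assumption).
    top = max(scores[i] for i in indices)
    exps = [math.exp(scores[i] - top) for i in indices]
    total = sum(exps)
    return indices, [e / total for e in exps]


if __name__ == "__main__":
    indices, weights = route_top_k([0.1, 2.0, -1.0, 0.5], k=2)
    print(indices, weights)
```

With `epsilon=0.0` the selection is purely greedy, which is how such a router would typically behave at inference time. A nonzero `epsilon` occasionally replaces the greedy choice with a random one, so that states the router currently scores low are still sampled sometimes.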

