CV · May 3

Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

Junyuan Xiao, Dingkang Liang, Xin Zhou, Yixuan Ye, Tongtong Su, Guangmo Yi, Bin Xia, Qiang Lyu, Shurui Shi, Jun Huang, Jianlou Si, Wenming Yang

arXiv:2605.01896 · 93.8

AI Analysis

For researchers in multi-modal video generation, this method addresses the underutilization of foundation model priors, offering a novel alignment approach that yields strong empirical gains.

The paper proposes $M^2$-REPA, the first representation alignment method for multi-modal video generation, which decouples and aligns modality-specific features with expert foundation models, significantly improving visual quality and long-term consistency over baselines.

Emerging multi-modal world models attempt to jointly generate videos across diverse modalities (e.g., RGB, depth, and mask), yet they fail to fully exploit the rich priors of existing foundation models. We propose $M^2$-REPA, the first representation alignment method tailored for multi-modal video generation. Our key insight is that foundation models trained on different modality spaces naturally capture distinct domain-specific priors, acting as complementary "experts." Specifically, we first decouple modality-specific features from the diffusion model's intermediate representations, then align each with its corresponding expert foundation model. To this end, we design two synergistic objectives: a multi-modal representation alignment loss that enforces feature-to-expert matching, and a modality-specific decoupling regularization that encourages complementarity across different modalities. This design enables joint optimization, fully exploiting priors from multiple foundation models. Extensive experiments demonstrate that our method significantly outperforms baselines in visual quality and long-term consistency.
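The two objectives described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the loss formulas, so everything here is an assumption: the alignment term is written as a negative cosine similarity between each modality's decoupled features and its expert's features, and the decoupling term as a squared cosine similarity between features of different modalities. The function names, the `reg_weight` parameter, and the premise that all features are already projected to a shared dimension are hypothetical and not taken from the paper.

```python
import numpy as np


def _normalize(x, eps=1e-8):
    # L2-normalize along the last (feature) axis.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def alignment_loss(features, targets):
    """Assumed alignment term: mean negative cosine similarity between each
    modality's decoupled features and its expert model's features.

    features, targets: dicts mapping a modality name (e.g. "rgb", "depth")
    to arrays of shape (tokens, dim).
    """
    losses = []
    for name, feat in features.items():
        cos = np.sum(_normalize(feat) * _normalize(targets[name]), axis=-1)
        losses.append(-cos.mean())
    return float(np.mean(losses))


def decoupling_penalty(features):
    """Assumed decoupling term: mean squared cosine similarity between the
    features of every pair of distinct modalities (0 when orthogonal)."""
    names = sorted(features)
    penalties = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a = _normalize(features[names[i]])
            b = _normalize(features[names[j]])
            cos = np.sum(a * b, axis=-1)
            penalties.append(np.mean(cos ** 2))
    return float(np.mean(penalties)) if penalties else 0.0


def total_loss(features, targets, reg_weight=0.1):
    # Joint objective: alignment plus weighted decoupling regularization.
    # reg_weight is a hypothetical hyperparameter, not a value from the paper.
    return alignment_loss(features, targets) + reg_weight * decoupling_penalty(features)
```

In this sketch, `features` would hold the per-modality outputs of small projection heads applied to an intermediate diffusion representation, and `targets` the frozen expert features for the same tokens. In training, the combined value would be added to the usual diffusion loss.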

