LG · AI · Sep 9, 2024

M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture

arXiv:2409.05929v6 · 9 citations · h-index: 3
Originality: Incremental advance
AI Analysis

The paper addresses modality collapse in multimodal learning and is aimed at AI researchers. It offers a potentially new basis for self-supervised learning, though the advance appears incremental because it builds on existing architectures such as JEPA and MMoE.

The paper tackles modality collapse in multimodal learning by proposing M3-JEPA, a framework that pairs a Joint-Embedding Predictive Architecture (JEPA) with a Multi-Gate Mixture of Experts (MMoE) predictor to align modalities in latent space. The authors report state-of-the-art performance across modalities and tasks, generalization to unseen datasets, and computational efficiency.

Current multimodal learning strategies primarily optimize in the original token space. Such a framework is easy to combine with the backbone of a pretrained language model, but it might result in modality collapse. To alleviate this issue, we apply the Joint-Embedding Predictive Architecture (JEPA) to multimodal tasks: a predictor maps the input embedding into the output embedding space, and cross-modal alignment is then conducted in the latent space. We implement this predictor with a Multi-Gate Mixture of Experts (MMoE) and accordingly name the framework M3-JEPA. The gating function disentangles modality-specific and shared information and derives information-theoretic optimality. The framework is trained with both contrastive and regularization losses and is solved by alternative gradient descent (AGD) between different multimodal tasks. Through thoroughly designed experiments, we show that M3-JEPA obtains state-of-the-art performance on different modalities and tasks, generalizes to unseen datasets and domains, and is computationally efficient in both training and inference. Our observations suggest that M3-JEPA might become a new basis for self-supervised learning in the open world.
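
To make the abstract's pipeline concrete, the sketch below shows one plausible reading of it in plain Python: a predictor built from linear experts shared across tasks, one softmax gate per task, and an InfoNCE-style contrastive loss computed between predicted and target embeddings in latent space. It is not the authors' implementation. The class name `GatedExpertPredictor`, the task names, the dimensions, the linear experts, and the temperature are all assumptions made for illustration. The regularization loss and the gradient update are omitted, and the loop only alternates which task's loss is evaluated, which is our interpretation of "alternative gradient descent".

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class GatedExpertPredictor:
    """Hypothetical predictor: experts are shared, each task has its own gate."""

    def __init__(self, dim, num_experts, tasks, seed=0):
        rng = random.Random(seed)
        self.experts = [random_matrix(dim, dim, rng) for _ in range(num_experts)]
        self.gates = {task: random_matrix(num_experts, dim, rng) for task in tasks}

    def gate_weights(self, x, task):
        """Task-specific mixing weights over the experts (non-negative, sum to 1)."""
        return softmax(matvec(self.gates[task], x))

    def predict(self, x, task):
        """Map an input embedding to the target space as a gated sum of expert outputs."""
        weights = self.gate_weights(x, task)
        outputs = [matvec(expert, x) for expert in self.experts]
        return [sum(w * out[i] for w, out in zip(weights, outputs))
                for i in range(len(x))]


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)


def info_nce(predicted, targets, temperature=0.1):
    """Contrastive loss: the i-th prediction should be closest to the i-th target."""
    loss = 0.0
    for i, pred in enumerate(predicted):
        probs = softmax([cosine(pred, t) / temperature for t in targets])
        loss -= math.log(probs[i])
    return loss / len(predicted)


# Toy usage: random vectors stand in for frozen encoder outputs.
tasks = ["image_to_text", "text_to_image"]  # assumed task names
predictor = GatedExpertPredictor(dim=4, num_experts=3, tasks=tasks)
rng = random.Random(1)
image_emb = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
text_emb = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]

for step in range(4):
    task = tasks[step % len(tasks)]  # alternate between tasks on each step
    if task == "image_to_text":
        source, target = image_emb, text_emb
    else:
        source, target = text_emb, image_emb
    predicted = [predictor.predict(x, task) for x in source]
    print(step, task, round(info_nce(predicted, target), 4))
```

In this sketch the experts are the only parameters shared between the two directions, while the gates are task-specific, so the gate weights are where modality-specific and shared information could be separated. Whether the paper structures its gates this way is not stated in the abstract.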

Code Implementations: 1 repo
