CV | Jun 6, 2022

Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts

arXiv:2206.02770v1 | 321 citations | h-index: 52
Originality: Incremental advance
AI Analysis

The paper addresses efficient multimodal learning: LIMoE outperforms dense models of equivalent computational cost. The advance is incremental, since it builds on existing MoE and contrastive learning methods.

The paper introduces LIMoE, a sparse mixture-of-experts model that processes images and text in a single backbone and is trained with a contrastive loss. LIMoE-L/16 reaches 78.6% zero-shot ImageNet accuracy versus 76.2% for a comparably trained CLIP-L/14, and the H/14 variant, trained with additional data, reaches 84.1%.
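
For readers unfamiliar with the contrastive objective mentioned above, the following is a minimal sketch of a symmetric image-text contrastive loss in plain Python. It is not the paper's implementation: the function names, the toy embeddings, and the temperature of 0.07 are assumptions chosen for illustration, and the embeddings are assumed to be non-zero vectors.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over one list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def normalize(vec):
    # Scale a non-zero vector to unit L2 norm.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # image_embs[k] and text_embs[k] are assumed to be a matching pair.
    # temperature=0.07 is an illustrative default, not a value from the text.
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each image should pick out its own caption.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: each caption should pick out its own image.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each image embedding matches its paired text embedding and large when the pairs are swapped, which is the behavior the objective rewards.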

Large sparsely-activated models have obtained excellent performance in multiple domains. However, such models are typically trained on a single modality at a time. We present the Language-Image MoE, LIMoE, a sparse mixture of experts model capable of multimodal learning. LIMoE accepts both images and text simultaneously, while being trained using a contrastive loss. MoEs are a natural fit for a multimodal backbone, since expert layers can learn an appropriate partitioning of modalities. However, new challenges arise; in particular, training stability and balanced expert utilization, for which we propose an entropy-based regularization scheme. Across multiple scales, we demonstrate remarkable performance improvement over dense models of equivalent computational cost. LIMoE-L/16 trained comparably to CLIP-L/14 achieves 78.6% zero-shot ImageNet accuracy (vs. 76.2%), and when further scaled to H/14 (with additional data) it achieves 84.1%, comparable to state-of-the-art methods which use larger custom per-modality backbones and pre-training schemes. We analyse the quantitative and qualitative behavior of LIMoE, and demonstrate phenomena such as differing treatment of the modalities and the organic emergence of modality-specific experts.
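
The abstract does not spell out the entropy-based regularization scheme. The sketch below shows one plausible form, under stated assumptions: a local term that rewards each token for routing confidently (low per-token entropy) and a thresholded global term that penalizes a modality only when its average routing distribution uses too few experts (low marginal entropy). The function names, the threshold parameter, and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's expert logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_regularizers(router_logits, threshold):
    # router_logits: one list of expert logits per token, all tokens taken
    # from a single modality (an assumption of this sketch).
    probs = [softmax(row) for row in router_logits]
    n_tokens = len(probs)
    n_experts = len(probs[0])
    # Local term: mean per-token entropy. Minimizing it makes each token
    # commit to a small number of experts.
    local = sum(entropy(p) for p in probs) / n_tokens
    # Marginal distribution over experts, averaged across the tokens.
    marginal = [sum(p[e] for p in probs) / n_tokens for e in range(n_experts)]
    # Global term: non-zero only while the marginal entropy is below the
    # threshold, so the modality is pushed to spread over several experts
    # without being forced into a perfectly uniform split.
    global_term = max(0.0, threshold - entropy(marginal))
    return local, global_term

# Tokens split confidently across two experts: both terms are small.
local_split, global_split = entropy_regularizers(
    [[10.0, 0.0], [0.0, 10.0]], threshold=0.6)
# All tokens collapse onto one expert: the global term becomes active.
local_collapsed, global_collapsed = entropy_regularizers(
    [[10.0, 0.0], [10.0, 0.0]], threshold=0.6)
```

With two experts the maximum marginal entropy is ln 2 (about 0.693), so a threshold of 0.6 leaves the balanced case unpenalized while the collapsed case incurs a penalty. In training, terms like these would be added to the contrastive loss with small weights.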

