CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling
This work addresses the high cost of training multimodal MoE models such as CLIP and offers a practical recipe for building efficient models. The contribution is incremental, since it builds on existing CLIP and MoE techniques.
The paper tackles the challenge of efficiently training Mixture-of-Experts (MoE) CLIP models by proposing CLIP-UP, a strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture, reducing training complexity and cost. The sparse CLIP B/16 model outperforms its dense counterpart by 7.2% and 6.6% on COCO and Flickr30k text-to-image Recall@1, respectively, and surpasses the larger CLIP L/14 model on this task while using only 30% of the inference FLOPs.
Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs. While integrating MoE into multimodal models like CLIP improves performance, training these models is notoriously challenging and expensive. We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture. Through extensive experimentation with various settings and auxiliary losses, we demonstrate that CLIP-UP significantly reduces training complexity and cost. Remarkably, our sparse CLIP B/16 model, trained with CLIP-UP, outperforms its dense counterpart by 7.2% and 6.6% on COCO and Flickr30k text-to-image Recall@1 benchmarks respectively. It even surpasses the larger CLIP L/14 model on this task while using only 30% of the inference FLOPs. We further demonstrate the generalizability of our training recipe across different scales, establishing sparse upcycling as a practical and scalable approach for building efficient, high-performance CLIP models.
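To make the idea of converting a dense model into a sparse one concrete, here is a minimal sketch of sparse upcycling for a single MLP block. The abstract does not specify the details, so everything here is an assumption for illustration: each expert starts as a copy of the dense MLP, the router is newly initialized with small random weights, routing is top-k with gates renormalized over the selected experts, and the names `dense_mlp`, `upcycle`, and `moe_forward` are hypothetical. It is not the authors' implementation, and it omits the auxiliary losses and the contrastive training itself.

```python
import copy
import math
import random


def dense_mlp(x, params):
    """Two-layer ReLU MLP applied to one token (a list of floats)."""
    w1, w2 = params["w1"], params["w2"]  # w1: hidden x d_model, w2: d_model x hidden
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def upcycle(dense_params, num_experts, d_model, rng):
    """Assumed upcycling step: copy the dense MLP into every expert.

    The router has no dense counterpart, so it is freshly initialized
    (small Gaussian weights here, which is an assumption).
    """
    experts = [copy.deepcopy(dense_params) for _ in range(num_experts)]
    router = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
              for _ in range(num_experts)]
    return {"experts": experts, "router": router}


def moe_forward(x, moe_params, top_k=1):
    """Route one token to its top_k experts and mix their outputs."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in moe_params["router"]]
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the selected logits only, so the gates sum to 1.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    gates = [e / total for e in exps]
    out = [0.0] * len(moe_params["experts"][0]["w2"])
    for gate, idx in zip(gates, chosen):
        y = dense_mlp(x, moe_params["experts"][idx])
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen


# Toy sizes, chosen only for illustration.
rng = random.Random(0)
d_model, hidden = 4, 8
dense = {
    "w1": [[rng.gauss(0.0, 0.5) for _ in range(d_model)] for _ in range(hidden)],
    "w2": [[rng.gauss(0.0, 0.5) for _ in range(hidden)] for _ in range(d_model)],
}
moe = upcycle(dense, num_experts=4, d_model=d_model, rng=rng)
token = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
dense_out = dense_mlp(token, dense)
moe_out, chosen = moe_forward(token, moe, top_k=2)
```

Under these assumptions, the upcycled layer reproduces the dense layer's output at initialization, because all experts are identical and the gates sum to 1. Only `top_k` experts run per token, so per-token compute depends on `top_k` rather than on the total number of experts, which is how an MoE layer adds capacity without a proportional increase in inference cost.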