CV · Apr 20, 2022

Residual Mixture of Experts

arXiv:2204.09636v3 · 52 citations · h-index: 62
Originality: Incremental advance
AI Analysis

This work addresses the computational bottleneck for researchers and practitioners using MoE transformers in vision tasks, offering a more efficient training method with significant performance gains, though it is incremental as it builds on existing MoE frameworks.

The paper tackles the high computational cost of training Mixture of Experts (MoE) vision transformers by proposing Residual Mixture of Experts (RMoE), an efficient training pipeline for downstream tasks. RMoE achieves results comparable to upper-bound MoE training while saving over 30% of the training cost. Against non-MoE baselines, it gains up to +1.1 mIoU on ADE20K segmentation and up to +1.6 AP on MS-COCO object detection with less than 3% additional training cost.

Mixture of Experts (MoE) is able to scale up vision transformers effectively. However, training a large MoE transformer requires prohibitive computational resources. In this paper, we propose Residual Mixture of Experts (RMoE), an efficient training pipeline for MoE vision transformers on downstream tasks, such as segmentation and detection. RMoE achieves results comparable to the upper-bound MoE training, while introducing only minor additional training cost over the lower-bound non-MoE training pipelines. The efficiency is supported by our key observation: the weights of an MoE transformer can be factored into an input-independent core and an input-dependent residual. Compared with the weight core, the weight residual can be trained efficiently with far fewer computational resources, e.g., by finetuning on the downstream data. We show that, compared with the current MoE training pipeline, we get comparable results while saving over 30% training cost. When compared with state-of-the-art non-MoE transformers, such as Swin-T / CvT-13 / Swin-L, we get +1.1 / 0.9 / 1.0 mIoU gain on ADE20K segmentation and +1.4 / 1.6 / 0.6 AP gain on MS-COCO object detection with less than 3% additional training cost.
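The core-plus-residual factorization described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the top-1 routing, the zero initialization of the residuals, and the single linear layer are all assumptions made for illustration. Each expert's weight is treated as a shared core plus a per-expert residual, so the layer's output is the core output plus a routed correction.

```python
import numpy as np

def init_residuals(core_weight, num_experts):
    # Assumption: residuals start at zero, so every expert initially
    # equals the shared (input-independent) core.
    return np.zeros((num_experts,) + core_weight.shape)

def residual_moe_forward(x, core_weight, residuals, gate_weight):
    """Hypothetical single-layer forward pass with top-1 routing.

    x:           (tokens, d_in)
    core_weight: (d_in, d_out)     shared, input-independent core
    residuals:   (E, d_in, d_out)  per-expert, input-dependent part
    gate_weight: (d_in, E)         router
    """
    # Pick one expert per token from the router scores.
    expert_ids = np.argmax(x @ gate_weight, axis=1)
    # The core contribution is identical for every token.
    shared = x @ core_weight
    # The residual contribution depends on which expert was selected.
    specific = np.einsum("td,tdo->to", x, residuals[expert_ids])
    return shared + specific, expert_ids

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
core = rng.normal(size=(4, 3))
gate = rng.normal(size=(4, 2))

residuals = init_residuals(core, num_experts=2)
out, ids = residual_moe_forward(x, core, residuals, gate)
```

With zero residuals the layer behaves exactly like the dense (non-MoE) layer built from the core alone. Under this reading, only the residuals and the router would need to be learned during downstream finetuning, which is where the claimed cost saving comes from.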

