Router Upcycling: Leveraging Mixture-of-Routers in Mixture-of-Experts Upcycling
This work addresses the difficulty that simple routers have with complex routing in MoE upcycling, offering deep learning practitioners an incremental improvement over existing methods.
The paper tackles the weakness of simple routers in Mixture-of-Experts (MoE) upcycling by proposing Router Upcycling, a routing technique that initializes multiple routers from attention heads to improve token assignment, achieving state-of-the-art performance among MoE upcycling baselines.
Mixture-of-Experts (MoE) models have gained significant attention in deep learning due to their dynamic resource allocation and superior performance across diverse tasks. However, training these models efficiently remains challenging. MoE upcycling has been proposed to reuse and improve existing model components, thereby minimizing training overhead. Even so, simple routers, such as linear routers, often struggle with complex routing tasks within MoE upcycling. In response, we propose a novel routing technique called Router Upcycling to enhance the performance of MoE upcycling models. Our approach initializes multiple routers from the attention heads of the preceding attention layers during upcycling. These routers collaboratively assign tokens to specialized experts in an attention-like manner: each token is projected into diverse queries, which are aligned with the experts' features (serving as keys). Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance, outperforming other upcycling baselines.
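The attention-like routing described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`init_routers_from_attention`, `route`), the slicing of a single query projection into per-router heads, the random expert keys, the summation of scores across routers, the scaling by the square root of the head dimension, and the top-k gating are all assumptions made for illustration; the abstract specifies only that multiple routers are initialized from attention heads and that token queries are matched against expert keys.

```python
import numpy as np


def init_routers_from_attention(w_q, num_routers):
    """Slice an attention query projection into per-router projections.

    Assumption: w_q has shape (d_model, d_out) and each router takes one
    contiguous block of columns, mirroring how attention heads are laid out.
    Returns an array of shape (num_routers, d_model, d_head).
    """
    d_model, d_out = w_q.shape
    assert d_out % num_routers == 0, "d_out must be divisible by num_routers"
    d_head = d_out // num_routers
    # (d_model, R, d_head) -> (R, d_model, d_head)
    return w_q.reshape(d_model, num_routers, d_head).transpose(1, 0, 2)


def route(tokens, router_w, expert_keys, top_k=2):
    """Assign each token to top_k experts using query/key dot products.

    tokens:      (n_tokens, d_model)
    router_w:    (R, d_model, d_head)   per-router query projections
    expert_keys: (R, n_experts, d_head) per-router expert features (keys)
    """
    d_head = router_w.shape[-1]
    # One query per router for every token: (R, n_tokens, d_head)
    queries = np.einsum("td,rdh->rth", tokens, router_w)
    # Scaled dot-product scores per router: (R, n_tokens, n_experts)
    scores = np.einsum("rth,reh->rte", queries, expert_keys) / np.sqrt(d_head)
    # Assumption: routers collaborate by summing their scores
    logits = scores.sum(axis=0)  # (n_tokens, n_experts)
    # Keep the top_k experts per token
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Softmax over the selected experts only (numerically stabilized)
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(top_logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return top_idx, weights


# Toy usage with random weights standing in for a pretrained attention layer
rng = np.random.default_rng(0)
d_model, num_routers, n_experts, n_tokens = 16, 4, 8, 5
w_q = rng.standard_normal((d_model, d_model)) * 0.1
router_w = init_routers_from_attention(w_q, num_routers)
expert_keys = rng.standard_normal((num_routers, n_experts, d_model // num_routers))
tokens = rng.standard_normal((n_tokens, d_model))
expert_idx, gate_weights = route(tokens, router_w, expert_keys)
```

In this sketch, `expert_idx` holds the selected expert indices for each token and `gate_weights` holds the corresponding mixing weights, which sum to one per token. A single linear router would correspond to `num_routers = 1` with a learned score matrix in place of the query/key product.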