Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts
This work addresses scalability and performance issues in visual generation models and is aimed at AI researchers, though it appears incremental because it builds on existing MoE and diffusion transformer frameworks.
The paper tackles the challenge of scaling diffusion transformers with Mixture of Experts (MoE) by introducing Race-DiT, a novel model with a flexible routing strategy called Expert Race, which achieves significant performance gains on ImageNet.
Diffusion models have emerged as a mainstream framework in visual generation. Building upon this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens. Additionally, we propose per-layer regularization to address challenges in shallow-layer learning, and a router similarity loss to prevent mode collapse, ensuring better expert utilization. Extensive experiments on ImageNet validate the effectiveness of our approach, showing significant performance gains and promising scaling properties.
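The routing idea in the abstract can be illustrated with a small sketch. It assumes that "tokens and experts compete together" means a single top-k taken over the flattened token-by-expert score matrix, in contrast to the common per-token top-k. That reading, along with the function names `expert_race_mask` and `token_choice_mask` and the toy scores, is an assumption made for illustration and is not taken from the paper's implementation.

```python
import numpy as np


def expert_race_mask(scores: np.ndarray, k: int) -> np.ndarray:
    """Global top-k over all (token, expert) pairs (assumed reading of Expert Race).

    scores: array of shape (num_tokens, num_experts) with router scores.
    Returns a boolean mask of the same shape with exactly k True entries.
    """
    flat = scores.ravel()
    if not 0 < k <= flat.size:
        raise ValueError("k must be between 1 and the number of score entries")
    # Indices of the k largest scores, regardless of which token they belong to.
    top = np.argpartition(flat, -k)[-k:]
    mask = np.zeros(flat.size, dtype=bool)
    mask[top] = True
    return mask.reshape(scores.shape)


def token_choice_mask(scores: np.ndarray, k_per_token: int) -> np.ndarray:
    """Baseline for comparison: every token picks its own top-k experts."""
    top = np.argpartition(scores, -k_per_token, axis=1)[:, -k_per_token:]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask


# Toy scores: token 0 scores highly everywhere, token 1 scores low everywhere.
scores = np.array([[0.9, 0.8, 0.7],
                   [0.1, 0.2, 0.3]])

race = expert_race_mask(scores, k=2)         # same total budget of 2 activations
baseline = token_choice_mask(scores, k_per_token=1)

experts_per_token_race = race.sum(axis=1)         # token 0 gets 2 experts, token 1 gets 0
experts_per_token_baseline = baseline.sum(axis=1)  # each token gets exactly 1 expert
```

With the same total budget of two activations, the global selection gives both slots to the high-scoring token, while the per-token baseline spreads them evenly. Under the stated assumption, this is the sense in which experts are assigned dynamically to critical tokens.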