DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
This work addresses the challenge of processing multi-view sensory data and rare maneuvers in autonomous driving, offering a novel approach that could improve robustness and scalability for real-world applications.
The paper tackles the problem of end-to-end autonomous driving by proposing DriveMoE, a Mixture-of-Experts framework that combines Scene-Specialized Vision MoE and Skill-Specialized Action MoE to handle diverse and complex driving scenarios, achieving state-of-the-art performance in Bench2Drive closed-loop evaluation.
End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $π_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$π_0$. Specifically, we add Vision MoE to Drive-$π_0$ by training a router to select relevant cameras according to the driving context dynamically. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from modes averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$π_0$.