CL AI LG | Jan 22, 2025

Autonomy-of-Experts Models

Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan

arXiv:2501.13074v2 | 12.06 citations | h-index: 11 | ICML

Originality: Highly original

AI Analysis

This work addresses a critical bottleneck in MoE models, with the aim of improving efficiency and performance in large-scale language modeling, though it remains an incremental advance within the MoE framework.

The paper tackles suboptimal expert selection in Mixture-of-Experts models by proposing Autonomy-of-Experts, a paradigm in which experts select themselves based on their internal activations, and shows that it outperforms traditional MoE models in pre-trained language models of up to 4B parameters.

Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models with 700M to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
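
The selection step described in the abstract can be illustrated with a small NumPy sketch. It is a rough illustration under stated assumptions, not the authors' implementation: the abstract does not say which layer is factorized or how the selected experts' outputs are combined. Here each expert's first projection is assumed to be split into a low-rank pair `A` (d_model x rank) and `B` (rank x d_ff), the norm of `x @ A` serves as the self-evaluation score, and outputs are mixed with a softmax over the selected norms. The names `make_experts` and `aoe_layer` are hypothetical.

```python
import numpy as np

def make_experts(rng, n_experts, d_model, rank, d_ff):
    """Random toy experts; the A/B/W_out layout is an assumption of this sketch."""
    return [
        {
            "A": rng.normal(size=(d_model, rank)),     # low-rank "down" factor
            "B": rng.normal(size=(rank, d_ff)),        # low-rank "up" factor
            "W_out": rng.normal(size=(d_ff, d_model)), # output projection
        }
        for _ in range(n_experts)
    ]

def aoe_layer(x, experts, top_k):
    """Router-free expert selection. x has shape (n_tokens, d_model)."""
    # 1) Every expert pre-computes a cheap low-rank activation for every token.
    low = np.stack([x @ e["A"] for e in experts])    # (n_experts, n_tokens, rank)
    # 2) Each expert scores itself by the L2 norm of that activation.
    scores = np.linalg.norm(low, axis=-1)            # (n_experts, n_tokens)
    # 3) Per token, keep the top_k experts with the largest norms.
    chosen = np.argsort(-scores, axis=0)[:top_k].T   # (n_tokens, top_k)
    # 4) Only the chosen experts finish the forward pass; the rest abort.
    out = np.zeros((x.shape[0], experts[0]["W_out"].shape[1]))
    for t in range(x.shape[0]):
        idx = chosen[t]
        s = scores[idx, t]
        w = np.exp(s - s.max())
        w /= w.sum()                                 # assumed softmax mixing weights
        for weight, i in zip(w, idx):
            e = experts[i]
            hidden = np.maximum(low[i, t] @ e["B"], 0.0)  # reuse cached activation
            out[t] += weight * (hidden @ e["W_out"])
    return out, chosen

rng = np.random.default_rng(0)
experts = make_experts(rng, n_experts=4, d_model=8, rank=2, d_ff=16)
x = rng.normal(size=(5, 8))
out, chosen = aoe_layer(x, experts, top_k=2)
print(out.shape, chosen.shape)
```

In this sketch the only work done by every expert is the small `x @ A` product, and the cached result is reused by the experts that continue, which reflects how a low-rank factorization can keep the pre-computation overhead low.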
