CL · AI · LG · Jan 22, 2025

Autonomy-of-Experts Models

arXiv:2501.13074v2 · 6 citations · h-index: 11 · ICML
Originality: Highly original
AI Analysis

This work addresses a critical bottleneck in MoE models and could improve efficiency and performance in large-scale language modeling, though it remains an incremental advance within the MoE framework.

The paper tackles suboptimal expert selection in Mixture-of-Experts models by proposing Autonomy-of-Experts, a new paradigm in which experts select themselves based on their internal activations, and shows that it outperforms traditional MoE models when pre-training language models of up to 4B parameters.

Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models having 700M up to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
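The selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class `Expert`, the function `aoe_layer`, the ReLU nonlinearity, the toy dimensions, and the softmax weighting over the selected experts' norms are all assumptions made for illustration. What it takes from the abstract is the overall flow: every expert pre-computes a cheap low-rank activation, experts are ranked by the norm of that activation, and only the top-ranking ones finish the forward pass.

```python
import math
import random


def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))


def rand_matrix(rows, cols, rng):
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class Expert:
    """Hypothetical expert whose first projection is factorized into a
    narrow w_down followed by a wide w_up (the low-rank factorization)."""

    def __init__(self, d_model, d_low, d_ff, rng):
        self.w_down = rand_matrix(d_model, d_low, rng)
        self.w_up = rand_matrix(d_low, d_ff, rng)
        self.w_out = rand_matrix(d_ff, d_model, rng)

    def precompute(self, x):
        # Cheap step that every expert runs: project into the low-rank space.
        return matvec(x, self.w_down)

    def finish(self, low):
        # Expensive step that only the selected experts run.
        # ReLU is an assumption; the abstract does not name the activation.
        hidden = [max(0.0, h) for h in matvec(low, self.w_up)]
        return matvec(hidden, self.w_out)


def aoe_layer(x, experts, top_k):
    """Router-free selection: rank experts by the norm of their own activations."""
    cached = [e.precompute(x) for e in experts]
    norms = [l2_norm(c) for c in cached]
    order = sorted(range(len(experts)), key=lambda i: norms[i], reverse=True)
    chosen = order[:top_k]

    # Assumption: mix the selected experts with a softmax over their norms.
    peak = max(norms[i] for i in chosen)
    exps = [math.exp(norms[i] - peak) for i in chosen]
    total = sum(exps)

    out = [0.0] * len(x)
    for e, i in zip(exps, chosen):
        y = experts[i].finish(cached[i])  # reuse the cached activation
        out = [o + (e / total) * yi for o, yi in zip(out, y)]
    return out, chosen, norms


rng = random.Random(0)
experts = [Expert(d_model=8, d_low=2, d_ff=16, rng=rng) for _ in range(4)]
x = [rng.gauss(0.0, 1.0) for _ in range(8)]
out, chosen, norms = aoe_layer(x, experts, top_k=2)
print("selected experts:", chosen)
```

In this sketch the experts that are not selected have paid only for the narrow `w_down` projection, which is the role the abstract gives to the low-rank factorization; the selected experts reuse their cached activation rather than recomputing it.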

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
