AI · Feb 19

Phase-Aware Mixture of Experts for Agentic Reinforcement Learning

arXiv:2602.17038v1 · 1 citation · h-index: 3
Originality: Incremental advance
AI Analysis

The paper addresses a specific bottleneck in agentic reinforcement learning, namely weak expert specialization in the policy network. The advance appears incremental because it builds on existing Mixture-of-Experts (MoE) architectures.

The paper tackles simplicity bias in reinforcement learning for LLM agents: with a single policy network, simple tasks occupy most parameters and dominate gradient updates. It proposes Phase-Aware Mixture of Experts (PA-MoE), whose phase router gives temporally consistent expert assignments instead of routing each token independently. The authors report that experiments demonstrate the method's effectiveness.

Reinforcement learning (RL) has equipped LLM agents with a strong ability to solve complex tasks. However, existing RL methods normally use a single policy network, causing simplicity bias where simple tasks occupy most parameters and dominate gradient updates, leaving insufficient capacity for complex tasks. A plausible remedy could be employing the Mixture-of-Experts (MoE) architecture in the policy network, as MoE allows different parameters (experts) to specialize in different tasks, preventing simple tasks from dominating all parameters. However, a key limitation of traditional MoE is its token-level routing, where the router assigns each token to specialized experts, which fragments phase-consistent patterns into scattered expert assignments and thus undermines expert specialization. In this paper, we propose Phase-Aware Mixture of Experts (PA-MoE). It first features a lightweight phase router that learns latent phase boundaries directly from the RL objective without pre-defining phase categories. Then, the phase router allocates temporally consistent assignments to the same expert, allowing experts to preserve phase-specific expertise. Experimental results demonstrate the effectiveness of our proposed PA-MoE.
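To make the contrast between token-level routing and temporally consistent routing concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract says the phase router learns latent phase boundaries from the RL objective, whereas this sketch substitutes a hand-written hysteresis rule purely for illustration. The names `token_level_routing`, `sticky_routing`, `switch_margin`, and the toy `scores` are all hypothetical.

```python
from typing import List, Sequence


def token_level_routing(scores: Sequence[Sequence[float]]) -> List[int]:
    """Pick the highest-scoring expert independently at every step."""
    return [max(range(len(s)), key=s.__getitem__) for s in scores]


def sticky_routing(
    scores: Sequence[Sequence[float]], switch_margin: float = 0.3
) -> List[int]:
    """Hypothetical stand-in for a phase router (NOT the learned router
    described in the abstract): keep the current expert until another
    expert beats it by more than `switch_margin`."""
    assignments: List[int] = []
    current = None
    for s in scores:
        best = max(range(len(s)), key=s.__getitem__)
        if current is None or s[best] - s[current] > switch_margin:
            current = best  # treated as a phase boundary in this toy rule
        assignments.append(current)
    return assignments


def count_switches(assignments: Sequence[int]) -> int:
    """Number of adjacent steps that are assigned to different experts."""
    return sum(a != b for a, b in zip(assignments, assignments[1:]))


# Toy router scores for 8 steps and 2 experts (made-up numbers).
scores = [
    [0.60, 0.40],
    [0.45, 0.55],
    [0.60, 0.40],
    [0.48, 0.52],
    [0.10, 0.90],
    [0.45, 0.55],
    [0.55, 0.45],
    [0.20, 0.80],
]

token_assignments = token_level_routing(scores)
phase_assignments = sticky_routing(scores)

print(token_assignments, count_switches(token_assignments))
# [0, 1, 0, 1, 1, 1, 0, 1] 5
print(phase_assignments, count_switches(phase_assignments))
# [0, 0, 0, 0, 1, 1, 1, 1] 1
```

On these toy scores, per-step argmax flips experts five times, while the sticky rule yields two contiguous segments with a single switch. That contiguity is the property the abstract attributes to phase-level assignment; how PA-MoE actually learns the boundaries is not specified in the text above.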

