LG, Feb 13

Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers

arXiv:2602.12587v1, 2 citations, h-index: 8
Originality: Incremental advance
AI Analysis

This paper addresses catastrophic forgetting in continual learning for MoE Transformers. Its contribution is an incremental improvement obtained by increasing routing granularity.

The paper identifies multi-head attention as a pre-routing bottleneck in Mixture-of-Experts (MoE) Transformers: concatenating head-specific signals into a single router input causes routing collisions, which the authors link to catastrophic forgetting. It proposes MH-MoE, which routes head-wise over sub-representations, and reports that on TRACE with Qwen3-0.6B backward transfer (BWT) falls from 11.2% with LoRAMoE to 4.5%.
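To make the difference in routing granularity concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the sub-representations are contiguous, equal-width slices of the router input and that a single router matrix is shared across heads; the function names `token_level_route` and `head_wise_route` and all shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def token_level_route(h, w_router, top_k=1):
    """Conventional routing: one decision per token from the full vector.

    h: (tokens, d_model), w_router: (d_model, n_experts).
    Returns expert indices of shape (tokens, top_k).
    """
    probs = softmax(h @ w_router)
    return np.argsort(-probs, axis=-1)[:, :top_k]


def head_wise_route(h, w_router_head, n_heads, top_k=1):
    """Head-wise routing (assumed form): one decision per sub-representation.

    h: (tokens, d_model) is split into n_heads contiguous slices.
    w_router_head: (d_model // n_heads, n_experts), shared across heads
    (an assumption made for this sketch).
    Returns expert indices of shape (tokens, n_heads, top_k).
    """
    tokens, d_model = h.shape
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    sub = h.reshape(tokens, n_heads, d_model // n_heads)
    probs = softmax(sub @ w_router_head)  # (tokens, n_heads, n_experts)
    return np.argsort(-probs, axis=-1)[..., :top_k]


rng = np.random.default_rng(0)
tokens, d_model, n_heads, n_experts = 5, 16, 4, 8
h = rng.normal(size=(tokens, d_model))
token_routes = token_level_route(h, rng.normal(size=(d_model, n_experts)))
head_routes = head_wise_route(
    h, rng.normal(size=(d_model // n_heads, n_experts)), n_heads
)
```

With token-level routing each token yields one route, so every feature composition in that token shares it. With head-wise routing each token yields `n_heads` routes, which gives distinct compositions more chances to land on different experts.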

Mixture-of-Experts (MoE) architectures are often considered a natural fit for continual learning because sparse routing should localize updates and reduce interference, yet MoE Transformers still forget substantially even with sparse, well-balanced expert utilization. We attribute this gap to a pre-routing bottleneck: multi-head attention concatenates head-specific signals into a single post-attention router input, forcing routing to act on co-occurring feature compositions rather than separable head channels. We show that this router input simultaneously encodes multiple separately decodable semantic and structural factors with uneven head support, and that different feature compositions induce weakly aligned parameter-gradient directions; as a result, routing maps many distinct compositions to the same route. We quantify this collision effect via a route-wise effective composition number $N_{eff}$ and find that higher $N_{eff}$ is associated with larger old-task loss increases after continual training. Motivated by these findings, we propose MH-MoE, which performs head-wise routing over sub-representations to increase routing granularity and reduce composition collisions. On TRACE with Qwen3-0.6B/8B, MH-MoE effectively mitigates forgetting, reducing BWT on Qwen3-0.6B from 11.2% (LoRAMoE) to 4.5%.
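The abstract does not give the formula for $N_{eff}$, so the sketch below is only one plausible reading: for each route, take the empirical distribution of feature compositions among the tokens sent to it and report the exponential of its Shannon entropy (a common "effective number" measure). The function name `effective_composition_number` and the composition labels are hypothetical, and the paper's actual definition may differ.

```python
import numpy as np
from collections import Counter, defaultdict


def effective_composition_number(routes, compositions):
    """Return {route: N_eff} under an assumed entropy-based definition.

    routes: iterable of route ids, one per token.
    compositions: iterable of hashable composition labels, one per token.
    N_eff = exp(H), where H is the Shannon entropy of the composition
    distribution within a route (an assumption, not the paper's formula).
    """
    per_route = defaultdict(Counter)
    for route, comp in zip(routes, compositions):
        per_route[route][comp] += 1

    n_eff = {}
    for route, counts in per_route.items():
        p = np.array(list(counts.values()), dtype=float)
        p /= p.sum()
        entropy = -(p * np.log(p)).sum()
        n_eff[route] = float(np.exp(entropy))
    return n_eff


# Route 0 receives four distinct compositions equally often,
# route 1 receives a single composition.
routes = [0, 0, 0, 0, 1, 1]
compositions = ["a", "b", "c", "d", "a", "a"]
n_eff = effective_composition_number(routes, compositions)
```

Under this definition a route that serves a single composition has $N_{eff} = 1$, and a route that serves $k$ compositions equally often has $N_{eff} = k$, so larger values indicate more collisions on that route.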