LG · AI · May 10

Mixture of Layers with Hybrid Attention

arXiv:2605.09516 · 63.9

Predicted impact top 32% in LG · last 90 days · Originality: Highly original

AI Analysis

This work addresses the computational inefficiency of standard MoE transformers by enabling finer-grained layer-level sparsity, benefiting large-scale model deployment.

The paper introduces Mixture of Layers (MoL), which replaces full-width transformer blocks with parallel thin blocks and uses hybrid attention to address attention coverage issues, achieving improved efficiency and performance in MoE transformers.

Standard Mixture-of-Experts (MoE) transformers route tokens to expert subnetworks within each layer, but the layer structure itself remains monolithic. We introduce Mixture of Layers (MoL), which replaces full-width transformer blocks (d_model) with K parallel thin blocks at reduced dimensionality (d_thin << d_model), connected via learned down/up projections and composed via top-k block routing. Scaling sparse block routing to many blocks creates an attention coverage problem, as each block sees fewer tokens. We address this by introducing hybrid attention, which pairs one shared softmax block for global context with Gated DeltaNet linear attention in routed blocks.
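The block routing described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the parameter names (`router`, `down`, `inner`, `up`), the softmax over the selected router logits, and the residual connection are all assumptions, and each thin block is stood in for by a single tanh map rather than a real transformer block. Hybrid attention (the shared softmax block and Gated DeltaNet in routed blocks) is omitted entirely.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_mol_params(rng, d_model, d_thin, num_blocks, scale=0.02):
    # Hypothetical parameter layout: one router plus per-block
    # down-projection, thin inner map, and up-projection.
    return {
        "router": rng.normal(0, scale, (d_model, num_blocks)),
        "down": rng.normal(0, scale, (num_blocks, d_model, d_thin)),
        "inner": rng.normal(0, scale, (num_blocks, d_thin, d_thin)),
        "up": rng.normal(0, scale, (num_blocks, d_thin, d_model)),
    }

def mol_layer(x, params, top_k):
    """x: (tokens, d_model). Returns (output, chosen block ids)."""
    logits = x @ params["router"]                        # (tokens, K)
    chosen = np.argsort(-logits, axis=-1)[:, :top_k]     # (tokens, top_k)
    chosen_logits = np.take_along_axis(logits, chosen, axis=-1)
    gates = softmax(chosen_logits, axis=-1)              # weights over chosen blocks

    out = np.zeros_like(x)
    for slot in range(top_k):
        for b in np.unique(chosen[:, slot]):
            rows = np.nonzero(chosen[:, slot] == b)[0]   # tokens routed to block b
            h = x[rows] @ params["down"][b]              # d_model -> d_thin
            h = np.tanh(h @ params["inner"][b])          # stand-in for a thin block
            out[rows] += gates[rows, slot][:, None] * (h @ params["up"][b])
    return x + out, chosen                               # assumed residual connection

rng = np.random.default_rng(0)
params = init_mol_params(rng, d_model=64, d_thin=8, num_blocks=16)
x = rng.normal(size=(10, 64))
y, chosen = mol_layer(x, params, top_k=2)
```

In this toy setup each block processes only the tokens routed to it, which makes the coverage problem visible: with 16 blocks and top-2 routing over 10 tokens, most blocks see only one or two tokens, so any attention computed inside a routed block would have very little context to attend over.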
