LG, AI, ET | Mar 7

PolyGLU: State-Conditional Activation Routing in Transformer Feed-Forward Networks

arXiv:2603.13347 | h-index: 1
AI Analysis

This work addresses activation-function rigidity in transformers. It offers a novel, efficient method with potentially broad impact for AI researchers, though it is incremental in that it builds on existing transformer architectures.

The paper tackles the limitation that transformers use a single fixed activation function in their feed-forward networks. It introduces PolyGLU, a drop-in replacement for SwiGLU that lets each neuron route dynamically among multiple activation functions. Routing emerges as near-deterministic with depth-dependent specialization, and the 597M-parameter model reaches 62-89% of Qwen3-0.6B-Base performance on standard benchmarks despite training on roughly 3,600x fewer tokens.

Biological neural systems employ diverse neurotransmitters -- glutamate, GABA, dopamine, acetylcholine -- to implement distinct signal-processing modalities within shared neural circuits. In contrast, modern transformers apply a single fixed activation function across all feed-forward neurons. We introduce PolyGLU (Polychromatic Gated Linear Unit), a drop-in replacement for SwiGLU that enables each FFN neuron to dynamically route among K=4 activation functions via a differentiable mechanism combining learned static preferences with input-conditioned gating, trained end-to-end with Gumbel-Softmax. We train PolychromaticLM, a 597M-parameter transformer, on ~10B tokens using a single NVIDIA A100 GPU. Our key finding is emergent routing behavior: without any explicit sparsity loss or entropy regularization, the routing mechanism converges to near-deterministic activation selections (mean dynamic entropy = 0.030% of maximum), with a striking depth-dependent specialization pattern -- early layers prefer GELU while deep layers strongly favor Tanh. Three layers maintain elevated routing entropy, suggesting computational flexibility points. The routing architecture adds only 0.23% parameter overhead (~1.4M parameters) and proves fully robust to supervised fine-tuning: routing entropy remains constant at ln(4) throughout 13,067 SFT steps. On standard benchmarks, PolychromaticLM achieves 62-89% of Qwen3-0.6B-Base performance despite training on 3,600x fewer tokens. All code, weights, and training infrastructure are released under Apache 2.0.
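The routing idea in the abstract can be illustrated with a minimal single-neuron sketch. This is not the released implementation. Several details are assumptions made for illustration: the abstract names only GELU and Tanh, so SiLU and ReLU are added here to reach K=4; static preferences and input-conditioned gating are combined by simply adding their logits; and all function names (routed_gate, polyglu_neuron, normalized_entropy) are hypothetical.

```python
import math
import random

# Candidate activations. Only GELU and Tanh are named in the abstract;
# SiLU and ReLU are ASSUMED here to bring the count to K = 4.
def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    return x / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

ACTIVATIONS = [gelu, math.tanh, silu, relu]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel_softmax(logits, tau, rng=None):
    """Relaxed one-hot sample over K choices.

    With rng=None no noise is added, so the result is a plain
    temperature-scaled softmax (deterministic).
    """
    if rng is not None:
        # Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1)
        logits = [
            l - math.log(-math.log(max(rng.random(), 1e-12)))
            for l in logits
        ]
    return softmax([l / tau for l in logits])

def routed_gate(gate_pre, static_logits, dynamic_logits, tau=1.0, rng=None):
    """Mix the K activations applied to one neuron's gate pre-activation.

    ASSUMPTION: static preferences and input-conditioned logits are summed.
    """
    logits = [s + d for s, d in zip(static_logits, dynamic_logits)]
    weights = gumbel_softmax(logits, tau, rng)
    out = sum(w * f(gate_pre) for w, f in zip(weights, ACTIVATIONS))
    return out, weights

def polyglu_neuron(gate_pre, up_pre, static_logits, dynamic_logits,
                   tau=1.0, rng=None):
    """GLU-style output: routed activation of the gate times the up value."""
    gated, weights = routed_gate(gate_pre, static_logits, dynamic_logits,
                                 tau, rng)
    return gated * up_pre, weights

def normalized_entropy(weights):
    """Shannon entropy of the routing weights as a fraction of ln(K)."""
    h = -sum(w * math.log(w) for w in weights if w > 0.0)
    return h / math.log(len(weights))
```

Under these assumptions, uniform routing weights give a normalized entropy of 1 (that is, ln(4) in absolute terms), while logits that strongly favor one activation drive it toward 0. The reported mean dynamic entropy of 0.030% of maximum would correspond to a value of about 0.0003 on this scale.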

