Free Energy Mixer

arXiv:2602.07160v1 · 1 citation · h-index: 4
Originality: Highly original
AI Analysis

The paper addresses a read bottleneck in attention mechanisms that matters to machine learning practitioners. It offers a plug-and-play enhancement for standard and linear attention, linear RNNs, and SSMs, though the contribution is incremental in that it builds on existing attention frameworks.

The paper tackles the limitation of standard attention's per-head convex averaging by introducing the Free Energy Mixer (FEM), which uses a free-energy read to enable per-channel selection, resulting in consistent performance improvements over strong baselines in NLP, vision, and time-series tasks at matched parameter budgets.

Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior (e.g., from queries/keys in standard attention) over indices. Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while still preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series at matched parameter budgets.
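The contrast between a convex average and a free-energy read can be illustrated with a small sketch. It assumes the read for channel $c$ takes the form $\frac{1}{\beta}\log\sum_i p_i \exp(\beta v_{i,c})$, where $p$ is the prior over indices and $\beta$ is the inverse temperature. That form is one natural reading of "free-energy (log-sum-exp) read" in the abstract, not necessarily the paper's exact parameterization, and the two-level gating is omitted. The function names `free_energy_read` and `convex_average` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def convex_average(prior, values):
    """Standard attention read: one prior-weighted average per channel."""
    n_channels = len(values[0])
    return [sum(p * v[c] for p, v in zip(prior, values)) for c in range(n_channels)]


def free_energy_read(prior, values, beta):
    """Assumed form: (1/beta) * log sum_i prior[i] * exp(beta * values[i][c]).

    Each channel c tilts the shared prior by its own values, so different
    channels can concentrate on different indices.
    """
    n_channels = len(values[0])
    out = []
    for c in range(n_channels):
        # Log-domain terms; indices with zero prior mass are skipped.
        terms = [math.log(p) + beta * v[c] for p, v in zip(prior, values) if p > 0.0]
        m = max(terms)  # subtract the max for a stable log-sum-exp
        out.append((m + math.log(sum(math.exp(t - m) for t in terms))) / beta)
    return out


# Toy example: uniform prior over three indices, two value channels.
prior = softmax([0.0, 0.0, 0.0])
values = [[1.0, -2.0], [0.0, 3.0], [-1.0, 0.5]]

avg = convex_average(prior, values)
low = free_energy_read(prior, values, beta=1e-4)   # close to the average
high = free_energy_read(prior, values, beta=50.0)  # close to the per-channel max
```

With a small `beta` the read stays close to the convex average, and with a large `beta` each channel approaches the largest value on the prior's support, so channel 0 and channel 1 effectively select different indices. The loop over indices costs the same order of work as the plain average, which is consistent with the abstract's claim of unchanged complexity.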

