CL LG | Apr 1

Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention

arXiv:2604.00754 | 43.6

AI Analysis

This paper addresses the problem of scaling attention mechanisms for large language models and offers a practical enhancement for researchers and practitioners in natural language processing. The contribution is incremental, since it builds on existing linear and sparse attention approaches.

The paper tackles the computational cost of attention in transformers by proposing Stochastic Attention (SA), a method inspired by the fruit fly connectome. SA applies a random permutation to the token sequence before windowed attention and restores the original order afterward, reaching full sequence coverage in O(log_w n) layers versus O(n/w) for standard sliding-window attention (SWA). The authors evaluate two settings. In pre-training from scratch, a gated SA + SWA combination achieved the best average zero-shot accuracy. In training-free inference on Qwen3-8B and Qwen3-30B-A3B, SA consistently outperformed SWA and matched or exceeded Mixture of Block Attention at comparable compute budgets.
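The permute, attend, restore step described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes a single head, a symmetric (non-causal) window of half-width `w // 2`, and a dense masked reference that costs O(n^2) rather than an O(nw) kernel. The function names `windowed_attention` and `stochastic_attention` are hypothetical.

```python
import numpy as np


def windowed_attention(q, k, v, w):
    """Dense reference for sliding-window attention (assumed symmetric window).

    Position i attends to positions j with |i - j| <= w // 2.
    A real kernel would skip masked entries to stay within O(n * w).
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    idx = np.arange(n)
    band = np.abs(idx[:, None] - idx[None, :]) <= w // 2
    scores = np.where(band, scores, -np.inf)
    # Numerically stable softmax; the diagonal is always unmasked.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def stochastic_attention(q, k, v, w, rng):
    """Shuffle tokens, apply windowed attention, then restore the order."""
    n = q.shape[0]
    perm = rng.permutation(n)   # position p of the shuffled sequence holds token perm[p]
    inv = np.argsort(perm)      # inverse permutation
    out = windowed_attention(q[perm], k[perm], v[perm], w)
    return out[inv]             # row i is again the output for token i


rng = np.random.default_rng(0)
n, d, w = 32, 8, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = stochastic_attention(q, k, v, w, rng)
print(out.shape)
```

One sanity check on the design: when the window is wide enough to cover the whole sequence, the shuffle has no effect, because full softmax attention is permutation-equivariant. The randomness only matters when the window is narrower than the sequence.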

The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Despite being highly structured at the circuit level, the network's long-range connections are broadly distributed across brain regions, functioning as stochastic shortcuts that enable efficient global communication. Inspired by this observation, we propose Stochastic Attention (SA), a drop-in enhancement for sliding-window attention (SWA) that applies a random permutation to the token sequence before windowed attention and restores the original order afterward. This transforms the fixed local window into a stochastic global one within the same $O(nw)$ per-layer budget. Through depth, independently sampled permutations yield exponentially growing receptive fields, achieving full sequence coverage in $O(\log_w n)$ layers versus $O(n/w)$ for SWA. We validate SA in two settings: pre-training language models from scratch, where a gated SA + SWA combination achieves the best average zero-shot accuracy, and training-free inference on Qwen3-8B and Qwen3-30B-A3B, where SA consistently outperforms SWA and matches or exceeds Mixture of Block Attention at comparable compute budgets. These results suggest that connectome-inspired stochastic routing is a practical primitive for improving the expressivity of efficient attention, complementary to existing linear and sparse approaches.
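The receptive-field claim in the abstract can be explored with a small reachability simulation. The sketch below is an assumption-laden toy, not the paper's analysis: it uses a symmetric window of half-width `w // 2`, tracks only which input tokens can influence which outputs (ignoring attention weights), and counts layers until every token can reach every other token. The helper name `layers_to_full_coverage` is hypothetical.

```python
import numpy as np


def layers_to_full_coverage(n, w, rng=None, max_layers=10_000):
    """Count layers until every output depends on every input token.

    rng=None models plain sliding-window attention (fixed local band).
    Passing a Generator models stochastic attention: a fresh random
    permutation is sampled independently at each layer.
    """
    idx = np.arange(n)
    band = np.abs(idx[:, None] - idx[None, :]) <= w // 2
    reach = np.eye(n, dtype=bool)  # reach[i, j]: output i depends on input j
    for layer in range(1, max_layers + 1):
        if rng is None:
            adj = band
        else:
            perm = rng.permutation(n)
            inv = np.argsort(perm)
            # Tokens i and j interact iff their shuffled positions fall in one window.
            adj = band[np.ix_(inv, inv)]
        reach = (adj.astype(np.int64) @ reach.astype(np.int64)) > 0
        if reach.all():
            return layer
    return None


n, w = 256, 8
swa_layers = layers_to_full_coverage(n, w)
sa_layers = layers_to_full_coverage(n, w, rng=np.random.default_rng(0))
print("SWA layers:", swa_layers)
print("SA layers:", sa_layers)
```

With a fixed band the reachable span grows by `w // 2` positions per side per layer, so the layer count scales linearly with `n / w`. With fresh permutations each layer, the reachable set is expected to multiply roughly by the window size per layer, which is the intuition behind the $O(\log_w n)$ depth; the exact count for the stochastic case depends on the sampled permutations.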
