Krause Synchronization Transformers
This work is aimed at researchers and practitioners concerned with excessive synchronization in Transformers. It offers a scalable inductive bias that improves both efficiency and performance, although the contribution is incremental in that it builds on existing attention mechanisms.
The paper tackles representation collapse and attention sinks in Transformers by introducing Krause Attention, a mechanism based on bounded-confidence consensus dynamics. It reports consistent performance gains across vision, autoregressive generation, and large language models, while reducing runtime complexity from quadratic to linear in sequence length.
Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics. Krause Attention replaces similarity-based global aggregation with distance-based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded-confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length. Experiments across vision (ViT on CIFAR/ImageNet), autoregressive generation (MNIST/CIFAR-10), and large language models (Llama/Qwen) demonstrate consistent gains with substantially reduced computation, highlighting bounded-confidence dynamics as a scalable and effective inductive bias for attention.
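To make the idea of distance-based, localized interaction concrete, the following is a minimal illustrative sketch of one bounded-confidence update in plain Python. It is not the paper's implementation. The function name `krause_attention_step`, the parameters `radius` and `window`, the kernel `exp(-d^2)`, and the symmetric (non-causal) window are all assumptions made for illustration, and learned query, key, and value projections are omitted.

```python
import math


def krause_attention_step(x, radius, window):
    """One bounded-confidence averaging step over a sequence of token vectors.

    x      : list of equal-length lists of floats, one vector per token.
    radius : confidence bound (assumed hyperparameter); tokens whose feature
             distance exceeds it do not interact.
    window : number of positions on each side a token may look at (assumed).
    """
    n = len(x)
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        weights = []
        for j in range(lo, hi):
            # Squared Euclidean distance replaces dot-product similarity.
            d2 = sum((a - b) ** 2 for a, b in zip(x[i], x[j]))
            # Assumed kernel: exp(-d^2), truncated to zero outside the bound.
            weights.append(math.exp(-d2) if d2 <= radius ** 2 else 0.0)
        # The token itself always contributes exp(0) = 1, so total >= 1.
        total = sum(weights)
        out.append([
            sum(w * x[j][k] for w, j in zip(weights, range(lo, hi))) / total
            for k in range(len(x[i]))
        ])
    return out


# Toy sequence with two well-separated groups of one-dimensional tokens.
tokens = [[0.0], [0.1], [5.0], [5.1]]
updated = krause_attention_step(tokens, radius=1.0, window=3)
```

In this toy case each token averages only with neighbors inside its confidence bound, so the two groups contract internally without being pulled toward a single global mean. For a fixed `window`, the work per token does not grow with sequence length, which is the source of the linear overall cost in this sketch.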