LG · Apr 7

Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating

arXiv:2604.06014 · 19.7
AI Analysis

This work addresses efficiency bottlenecks in vision transformers, but it is incremental: it builds on existing Swin and Retentive Network methods.

The authors aim to improve vision transformer efficiency by combining Swin windowed attention with the spatial decay of Retentive Networks and input-dependent gating. The resulting Gated-SwinRMT variants reach up to 80.22% top-1 accuracy on Mini-ImageNet, a 6.48 percentage point gain over the RMT baseline (73.74%).

We introduce Gated-SwinRMT, a family of hybrid vision transformers that combine the shifted-window attention of the Swin Transformer with the Manhattan-distance spatial decay of Retentive Networks (RMT), augmented by input-dependent gating. Self-attention is decomposed into consecutive width-wise and height-wise retention passes within each shifted window, where per-head exponential decay masks provide a two-dimensional locality prior without learned positional biases. Two variants are proposed. \textbf{Gated-SwinRMT-SWAT} substitutes softmax with sigmoid activation, implements balanced ALiBi slopes with multiplicative post-activation spatial decay, and gates the value projection via SwiGLU; the normalized output implicitly suppresses uninformative attention scores. \textbf{Gated-SwinRMT-Retention} retains softmax-normalized retention with an additive log-space decay bias and incorporates an explicit G1 sigmoid gate -- projected from the block input and applied after local context enhancement (LCE) but prior to the output projection~$W_O$ -- to alleviate the low-rank $W_V \!\cdot\! W_O$ bottleneck and enable input-dependent suppression of attended outputs. We assess both variants on Mini-ImageNet ($224{\times}224$, 100 classes) and CIFAR-10 ($32{\times}32$, 10 classes) under identical training protocols, utilizing a single GPU due to resource limitations. At ${\approx}77$--$79$\,M parameters, Gated-SwinRMT-SWAT achieves $80.22\%$ and Gated-SwinRMT-Retention $78.20\%$ top-1 test accuracy on Mini-ImageNet, compared with $73.74\%$ for the RMT baseline. On CIFAR-10 -- where small feature maps cause the adaptive windowing mechanism to collapse attention to global scope -- the accuracy advantage compresses from $+6.48$\,pp to $+0.56$\,pp.
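
The decay mechanics described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes a single decay rate `gamma` in (0, 1) instead of the per-head values and ALiBi slopes the paper uses, and every function name is hypothetical. It shows that a Manhattan-distance mask factors into a height-wise and a width-wise mask, which is what makes consecutive per-axis passes a natural fit for the decay, and it contrasts a multiplicative post-sigmoid decay with an additive log-space bias before softmax.

```python
import math


def manhattan_decay_mask(height, width, gamma):
    """(H*W) x (H*W) mask with entries gamma ** (|dy| + |dx|); tokens in row-major order."""
    coords = [(y, x) for y in range(height) for x in range(width)]
    return [
        [gamma ** (abs(y1 - y2) + abs(x1 - x2)) for (y2, x2) in coords]
        for (y1, x1) in coords
    ]


def axis_decay_mask(length, gamma):
    """1D decay mask gamma ** |i - j| along a single axis."""
    return [[gamma ** abs(i - j) for j in range(length)] for i in range(length)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_decay_weights(scores, mask):
    """Assumed SWAT-style weighting: sigmoid first, then multiply by the decay."""
    return [
        [sigmoid(s) * d for s, d in zip(score_row, mask_row)]
        for score_row, mask_row in zip(scores, mask)
    ]


def softmax_decay_weights(scores, mask):
    """Assumed Retention-style weighting: add log(decay) as a bias, then softmax."""
    return [
        softmax([s + math.log(d) for s, d in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]


# Usage on a 2 x 3 window with uniform (all-zero) attention scores.
H, W, GAMMA = 2, 3, 0.5
mask = manhattan_decay_mask(H, W, GAMMA)
scores = [[0.0] * (H * W) for _ in range(H * W)]
swat_like = sigmoid_decay_weights(scores, mask)        # rows are not normalized
retention_like = softmax_decay_weights(scores, mask)   # each row sums to 1
```

With uniform scores, the softmax variant reduces to the decay mask normalized per row, while the sigmoid variant leaves each weight at half the decay value. This makes the difference between the two normalization schemes easy to see on a toy window.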

