2-D SSM: A General Spatial Layer for Visual Transformers
This work addresses the need for a stronger 2-D inductive bias in vision models, a problem specific to computer vision researchers and practitioners. The contribution appears incremental: it integrates a new layer into existing transformer architectures.
The paper tackles the problem of designing models with appropriate 2-D inductive bias in computer vision by introducing a general spatial layer based on multidimensional State Space Models (SSMs) for Vision Transformers (ViTs). Empirically, the layer yields significant performance gains across multiple ViT backbones and datasets, with negligible additional parameters and inference time.
A central objective in computer vision is to design models with appropriate 2-D inductive bias. Desiderata for 2-D inductive bias include two-dimensional position awareness, dynamic spatial locality, and translation and permutation invariance. To address these goals, we leverage an expressive variation of the multidimensional State Space Model (SSM). Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme. Empirically, we observe that incorporating our layer at the beginning of each transformer block of Vision Transformers (ViT) significantly enhances performance for multiple ViT backbones and across datasets. The new layer is effective even with a negligible amount of additional parameters and inference time. Ablation studies and visualizations demonstrate that the layer has a strong 2-D inductive bias. For example, vision transformers equipped with our layer exhibit effective performance even without positional encoding.
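To make the idea of a multidimensional SSM concrete, the following is a minimal sketch of a 2-D linear recurrence. The abstract does not specify the recurrence, so this assumes a Roesser-style model with one scalar horizontal state and one scalar vertical state. The function name, the coefficient names, and the numeric values are hypothetical and chosen only for illustration. It is a naive sequential scan, not the accelerated computation the abstract refers to.

```python
def roesser_2d_scan(u, a1, a2, a3, a4, b1, b2, c1, c2):
    """Naive scan of an assumed Roesser-style 2-D state-space recurrence.

    u is an H x W grid (list of lists). Two scalar states are carried:
    xh propagates left-to-right and xv propagates top-to-bottom.
    Boundary states are zero. All coefficients are illustrative scalars.
    """
    H, W = len(u), len(u[0])
    xh = [[0.0] * (W + 1) for _ in range(H)]   # xh[i][j]: horizontal state at (i, j)
    xv = [[0.0] * W for _ in range(H + 1)]     # xv[i][j]: vertical state at (i, j)
    y = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            # Output reads the current states at (i, j).
            y[i][j] = c1 * xh[i][j] + c2 * xv[i][j]
            # Horizontal state moves one step right, vertical state one step down.
            xh[i][j + 1] = a1 * xh[i][j] + a2 * xv[i][j] + b1 * u[i][j]
            xv[i + 1][j] = a3 * xh[i][j] + a4 * xv[i][j] + b2 * u[i][j]
    return y


# Feeding a unit impulse yields the layer's implied 2-D kernel.
impulse = [[0.0] * 4 for _ in range(4)]
impulse[0][0] = 1.0
kernel = roesser_2d_scan(impulse, 0.5, 0.2, 0.3, 0.4, 1.0, 1.0, 1.0, 1.0)
for row in kernel:
    print(["%.3f" % v for v in row])
```

Because the recurrence is linear with zero boundary states, it acts as a causal 2-D convolution: moving the impulse moves the response by the same offset. This is one way such a layer can encode position-aware, translation-consistent spatial structure without explicit positional encoding.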