ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models
ViT-Linearizer addresses the hardware cost of inference for vision models. It is an incremental improvement that narrows the gap between efficiency and accuracy.
The paper tackles the quadratic complexity of Vision Transformers (ViTs) by introducing ViT-Linearizer, a distillation framework that transfers ViT knowledge into a linear-time recurrent model, achieving 84.3% top-1 accuracy on ImageNet with notable speedups for high-resolution tasks.
Vision Transformers (ViTs) have delivered remarkable progress through global self-attention, yet their quadratic complexity can become prohibitive for high-resolution inputs. In this work, we present ViT-Linearizer, a cross-architecture distillation framework that transfers rich ViT representations into a linear-time, recurrent-style model. Our approach leverages 1) activation matching, an intermediate constraint that encourages the student to align its token-wise dependencies with those produced by the teacher, and 2) masked prediction, a contextual reconstruction objective that requires the student to predict the teacher's representations for unseen (masked) tokens. Together, these objectives distill the teacher's quadratic self-attention knowledge into the student while the student retains linear complexity. Empirically, our method provides notable speedups, particularly for high-resolution tasks, easing the hardware demands of inference. It also improves the performance of Mamba-based architectures on standard vision benchmarks, achieving a competitive 84.3% top-1 accuracy on ImageNet with a base-sized model. Our results underscore the potential of RNN-based solutions for large-scale visual tasks, bridging the gap between theoretical efficiency and real-world practice.
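The two objectives named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact loss formulas, so the cosine-similarity form of "token-wise dependencies", the mean-squared-error penalties, the unweighted sum of the two terms, and all function names below are assumptions made for illustration only.

```python
import numpy as np

def token_dependencies(features):
    """Token-to-token cosine similarity: (N, D) features -> (N, N) matrix.

    Assumption: pairwise similarity stands in for the "token-wise
    dependencies" mentioned in the abstract.
    """
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    return normed @ normed.T

def activation_matching_loss(student_feats, teacher_feats):
    """Mean squared gap between student and teacher dependency matrices."""
    diff = token_dependencies(student_feats) - token_dependencies(teacher_feats)
    return float(np.mean(diff ** 2))

def masked_prediction_loss(student_pred, teacher_feats, mask):
    """MSE between student predictions and teacher features, masked tokens only."""
    diff = student_pred[mask] - teacher_feats[mask]
    return float(np.mean(diff ** 2))

# Toy data: random "teacher" features and a noisy "student" copy (hypothetical).
rng = np.random.default_rng(0)
num_tokens, dim = 16, 8
teacher = rng.normal(size=(num_tokens, dim))
student = teacher + 0.1 * rng.normal(size=(num_tokens, dim))

# Mark 4 of the 16 tokens as masked (unseen by the student).
mask = np.zeros(num_tokens, dtype=bool)
mask[rng.choice(num_tokens, size=4, replace=False)] = True

# Assumed combination: a plain unweighted sum of the two terms.
total = (activation_matching_loss(student, teacher)
         + masked_prediction_loss(student, teacher, mask))
print(f"total distillation loss: {total:.4f}")
```

In this sketch the dependency matrix is N x N regardless of feature width, so the activation-matching term can compare a student and teacher with different hidden dimensions. The quadratic cost of building that matrix would be paid only during distillation, not at inference time.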