Self-supervised pretraining for an iterative, image-size-agnostic vision transformer
This work addresses the high computational cost of vision models, which burdens researchers and practitioners alike. It offers an incremental improvement by adapting existing self-supervised methods to a novel architecture.
The paper tackles the computational inefficiency of Vision Transformers and their poor scaling with image size. It introduces a self-supervised pretraining framework for an iterative, image-size agnostic model that achieves competitive performance on ImageNet-1K and downstream tasks while keeping computational cost constant across resolutions.
Vision Transformers (ViTs) dominate self-supervised learning (SSL). Although they have proven highly effective for large-scale pretraining, they are computationally inefficient and scale poorly with image size. Consequently, foundational models such as DINO are constrained to low-resolution processing. A recent foveal-inspired transformer achieves resolution agnosticism by iteratively processing a fixed-size context of multi-zoom patches. This model showed promising results under supervised learning, using a sequential, recurrent-like process without backpropagation through time. To unlock its potential as a foundational backbone, we introduce a novel sequential-to-global SSL framework based on DINO's self-distillation objective. Supported by an efficient integral-image patch extraction method, our approach enables large-scale pretraining of image-size agnostic vision encoders. We achieve competitive performance on ImageNet-1K and downstream classification tasks while maintaining a constant computational budget regardless of input resolution.
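To illustrate why an integral image makes multi-zoom patch extraction cheap, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that a patch at any zoom level is obtained by box-averaging a square image region down to a fixed output size. The function names (`integral_image`, `box_mean`, `extract_patch`) and the `out` parameter are hypothetical. Each output pixel costs four table lookups, however large the source region is.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a leading row and column of zeros.

    img has shape (H, W) or (H, W, C); the result has shape (H+1, W+1[, C]).
    """
    img = np.asarray(img, dtype=np.float64)
    sat = img.cumsum(axis=0).cumsum(axis=1)
    pad = [(1, 0), (1, 0)] + [(0, 0)] * (img.ndim - 2)
    return np.pad(sat, pad)

def box_mean(sat, top, left, height, width):
    """Mean over img[top:top+height, left:left+width] using four lookups."""
    bottom, right = top + height, left + width
    total = sat[bottom, right] - sat[top, right] - sat[bottom, left] + sat[top, left]
    return total / (height * width)

def extract_patch(sat, top, left, size, out=4):
    """Box-average the size x size region at (top, left) down to out x out.

    Assumes size is a multiple of out (an assumption of this sketch).
    """
    cell = size // out
    patch = np.empty((out, out) + sat.shape[2:])
    for i in range(out):
        for j in range(out):
            patch[i, j] = box_mean(sat, top + i * cell, left + j * cell, cell, cell)
    return patch
```

In this sketch, calling `extract_patch` with a larger `size` and the same `out` yields a coarser zoom of the same output shape at identical cost, which is consistent with a fixed-size context whose compute does not depend on input resolution.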