CV | Apr 22

Self-supervised pretraining for an iterative image size agnostic vision transformer

arXiv:2604.20392 | 39.3 | h-index: 31
AI Analysis

This work targets the high computational cost of vision models, a concern for researchers and practitioners alike. It offers an incremental improvement by adapting existing self-supervised methods to a novel architecture.

The paper tackles the computational inefficiency and poor scaling of Vision Transformers with image size by introducing a self-supervised pretraining framework for an iterative, image-size agnostic model, achieving competitive performance on ImageNet-1K and downstream tasks while maintaining constant computational cost across resolutions.

Vision Transformers (ViTs) dominate self-supervised learning (SSL). While they have proven highly effective for large-scale pretraining, they are computationally inefficient and scale poorly with image size. Consequently, foundational models like DINO are constrained to low-resolution processing. A recent foveal-inspired transformer achieves resolution agnosticism by iteratively processing a fixed-size context of multi-zoom patches. This model demonstrated promising results via supervised learning, utilizing a sequential, recurrent-like process without backpropagation through time. To unlock its potential as a foundational backbone, we introduce a novel sequential-to-global SSL framework based on DINO's self-distillation objective. Supported by an efficient integral-image patch extraction method, our approach enables large-scale pretraining for image-size agnostic vision encoders. We achieve competitive performance on ImageNet-1K and downstream classification tasks, maintaining a constant computational budget regardless of input resolution.
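The abstract mentions an integral-image patch extraction method but gives no details here. As a rough illustration of why an integral image (summed-area table) suits multi-zoom patches, the sketch below average-pools a region at any zoom factor using four table lookups per output pixel, so the per-patch cost does not grow with the zoom level. The function names, the use of average pooling, and the integer zoom factor are assumptions made for illustration, not the authors' implementation.

```python
from itertools import accumulate


def integral_image(img):
    """Build a summed-area table with a zero border.

    sat[y][x] holds the sum of all img pixels in rows < y and columns < x.
    """
    width = len(img[0])
    sat = [[0] * (width + 1)]
    for row in img:
        row_prefix = [0, *accumulate(row)]  # running sums along this row
        prev = sat[-1]
        sat.append([p + r for p, r in zip(prev, row_prefix)])
    return sat


def box_sum(sat, y0, x0, y1, x1):
    """Sum of img rows y0..y1-1 and columns x0..x1-1 via four lookups."""
    return sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]


def extract_patch(sat, top, left, zoom, patch_size):
    """Average-pool a (patch_size * zoom)-wide square into a patch_size-wide patch.

    Hypothetical helper: each output pixel is the mean of a zoom x zoom box,
    computed in constant time regardless of zoom.
    """
    area = zoom * zoom
    return [
        [
            box_sum(
                sat,
                top + i * zoom,
                left + j * zoom,
                top + (i + 1) * zoom,
                left + (j + 1) * zoom,
            ) / area
            for j in range(patch_size)
        ]
        for i in range(patch_size)
    ]


# Toy 4x4 single-channel image with pixel values 0..15.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
sat = integral_image(img)
coarse = extract_patch(sat, top=0, left=0, zoom=2, patch_size=2)
fine = extract_patch(sat, top=1, left=1, zoom=1, patch_size=2)
```

Here `coarse` covers the whole image at half resolution while `fine` is a native-resolution crop, and both cost the same number of lookups. The table itself is built once per image in time linear in the pixel count.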

