SPoT: Subpixel Placement of Tokens in Vision Transformers
This work addresses a tokenization bottleneck in Vision Transformers (ViTs). For computer vision researchers, it offers a novel approach to improving efficiency and interpretability, though the contribution appears incremental because it builds on existing ViT frameworks.
The paper tackles the problem that standard tokenization methods in Vision Transformers confine features to discrete patch grids, which limits how well models can exploit sparsity. It proposes SPoT, a subpixel token placement strategy that reduces the number of tokens needed for accurate predictions during inference. An oracle-guided search shows that substantial performance gains are achievable with ideal subpixel token positioning.
Vision Transformers naturally accommodate sparsity, yet standard tokenization methods confine features to discrete patch grids. This constraint prevents models from fully exploiting sparse regimes, forcing awkward compromises. We propose Subpixel Placement of Tokens (SPoT), a novel tokenization strategy that positions tokens continuously within images, effectively sidestepping grid-based limitations. With our proposed oracle-guided search, we uncover substantial performance gains achievable with ideal subpixel token positioning, drastically reducing the number of tokens necessary for accurate predictions during inference. SPoT provides a new direction for flexible, efficient, and interpretable ViT architectures, redefining sparsity as a strategic advantage rather than an imposed limitation.
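To make the idea of continuous token placement concrete, the following is a minimal sketch of extracting a patch centred at a non-integer image location. It assumes bilinear interpolation as the sampling scheme and a single-channel image stored as nested lists; the abstract does not specify how SPoT samples features, so the function names `bilinear_sample` and `extract_patch`, the clamping at image borders, and the interpolation choice are all illustrative assumptions rather than the authors' method.

```python
import math


def bilinear_sample(image, y, x):
    """Sample a 2D grayscale image (list of rows) at a continuous (y, x).

    Assumption: bilinear interpolation with coordinates clamped to the
    image bounds. This is an illustrative choice, not taken from the paper.
    """
    h, w = len(image), len(image[0])
    # Clamp so that samples near the border stay well defined.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - dx) * image[y0][x0] + dx * image[y0][x1]
    bottom = (1 - dx) * image[y1][x0] + dx * image[y1][x1]
    return (1 - dy) * top + dy * bottom


def extract_patch(image, center_y, center_x, patch_size):
    """Extract a patch_size x patch_size patch centred at a continuous point.

    Hypothetical helper: a grid-based tokenizer corresponds to the special
    case where the sample points coincide with pixel centres.
    """
    offset = (patch_size - 1) / 2.0
    return [
        [
            bilinear_sample(image, center_y - offset + i, center_x - offset + j)
            for j in range(patch_size)
        ]
        for i in range(patch_size)
    ]


# Toy 4x4 image whose intensity is 4 * row + col.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]

# Grid-aligned token: sample points coincide with pixels (0..1, 0..1).
grid_patch = extract_patch(image, 0.5, 0.5, 2)

# Subpixel token: sample points fall between pixels.
subpixel_patch = extract_patch(image, 1.25, 1.75, 2)
```

Under these assumptions, the grid-aligned call reproduces the original pixel values, while the subpixel call returns interpolated intensities. In a full model, each extracted patch would be embedded as one token, with its continuous centre coordinates used for the positional encoding.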