CV · Apr 2

SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation

Naomi Kombol, Ivan Martinović, Siniša Šegvić, Giorgos Tolias

arXiv:2604.02252 · 58.9 · Has Code

Predicted impact: top 58% in CV · last 90 days · Originality: Incremental advance

AI Analysis

The work addresses the computational cost of high-resolution dense prediction tasks such as segmentation, and represents an incremental improvement in efficiency.

The paper tackles the inefficiency of high-resolution image processing in Vision Transformers for open-vocabulary segmentation by introducing SPAR, a resolution-agnostic feature extractor that improves single-pass baselines by up to 10.5 mIoU and surpasses a sliding-window teacher.

Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR
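The contrast the abstract draws between a sliding-window teacher and a single-pass student can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`window_starts`, `sliding_window_features`, `feature_regression_loss`), the averaging of overlapping windows, the use of mean squared error as the regression loss, and the scalar "features" produced by a toy encoder are all assumptions made for illustration. Real ViT features are vectors per patch.

```python
def window_starts(size, window, stride):
    """Start offsets of windows covering [0, size); the last one sits flush with the edge."""
    if size <= window:
        return [0]
    starts = list(range(0, size - window + 1, stride))
    if starts[-1] != size - window:
        starts.append(size - window)
    return starts


def sliding_window_features(image, encoder, window, stride):
    """Hypothetical teacher: encode each crop, then average overlapping outputs."""
    h, w = len(image), len(image[0])
    acc = [[0.0] * w for _ in range(h)]
    cnt = [[0] * w for _ in range(h)]
    for top in window_starts(h, window, stride):
        for left in window_starts(w, window, stride):
            crop = [row[left:left + window] for row in image[top:top + window]]
            feats = encoder(crop)  # assumed to return a map the same size as the crop
            for i, feat_row in enumerate(feats):
                for j, f in enumerate(feat_row):
                    acc[top + i][left + j] += f
                    cnt[top + i][left + j] += 1
    return [[acc[i][j] / cnt[i][j] for j in range(w)] for i in range(h)]


def feature_regression_loss(student, teacher):
    """Mean squared error between two equally sized feature maps (an assumed loss form)."""
    total, n = 0.0, 0
    for s_row, t_row in zip(student, teacher):
        for s, t in zip(s_row, t_row):
            total += (s - t) ** 2
            n += 1
    return total / n


def toy_encoder(crop):
    """Stand-in for a ViT: subtract the crop mean, so output depends on the visible context."""
    values = [v for row in crop for v in row]
    mean = sum(values) / len(values)
    return [[v - mean for v in row] for row in crop]


if __name__ == "__main__":
    image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    # Teacher: many overlapping encoder calls at a fine stride (expensive).
    teacher = sliding_window_features(image, toy_encoder, window=2, stride=1)
    # Student: a single encoder call on the full image (cheap).
    student = toy_encoder(image)
    # Distillation would minimize this quantity with respect to the student's weights.
    print(feature_regression_loss(student, teacher))
```

In this toy setup the teacher makes nine encoder calls on the 4×4 input while the student makes one, which is the cost gap that distillation into a single-pass model is meant to close.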

