CV | Feb 22, 2025

FeatSharp: Your Vision Model Features, Sharper

Mike Ranzinger, Greg Heinrich, Pavlo Molchanov, Jan Kautz, Bryan Catanzaro, Andrew Tao

arXiv:2502.16025v2 | 14.46 citations | h-index: 58 | Has Code | ICML

Originality: Incremental advance

AI Analysis

This work addresses a key bottleneck in computer vision for tasks that require high-resolution features, such as semantic segmentation and object detection. It is incremental in that it builds on existing vision transformer backbones.

The paper tackles the problem of low-resolution feature maps in vision encoders such as CLIP, which typically run at 224x224px, by introducing a method that cheaply upsamples these features while capturing fine-grained details. The authors demonstrate its effectiveness on core perception tasks and in agglomerative model training with RADIO, where the upsampled features serve as richer distillation targets.

The feature maps of vision encoders are fundamental to myriad modern AI tasks, ranging from core perception algorithms (e.g. semantic segmentation, object detection, depth perception) to modern multimodal understanding in vision-language models (VLMs). Currently, in computer vision, the frontier of general-purpose vision backbones is Vision Transformers (ViT), typically trained using a contrastive loss (e.g. CLIP). A key problem with most off-the-shelf ViTs, particularly CLIP, is that these models are inflexibly low resolution. Most run at $224 \times 224$px, while the "high-resolution" versions run at around $378$ to $448$px and are still inflexible. We introduce a novel method to coherently and cheaply upsample the feature maps of low-resolution vision encoders while picking up on fine-grained details that would otherwise be lost due to resolution. We demonstrate the effectiveness of this approach on core perception tasks as well as within agglomerative model training using RADIO, as a way of providing richer targets for distillation. Code is available at https://github.com/NVlabs/FeatSharp.
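To make the resolution bottleneck concrete, the short sketch below computes how many feature tokens a ViT produces at the input sizes mentioned in the abstract. The patch size of 14 is an assumption (the abstract does not state one), and `token_grid` and `upsample_nearest` are hypothetical helper names. The nearest-neighbour upsampler is only a naive baseline for comparison and is not the FeatSharp method: it enlarges the grid but cannot add detail that the encoder never saw.

```python
def token_grid(image_px: int, patch_px: int) -> tuple[int, int]:
    """Return (tokens per side, total tokens) for a square input image."""
    if image_px % patch_px != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_px // patch_px
    return side, side * side


def upsample_nearest(grid, factor):
    """Naive nearest-neighbour upsampling of a 2D grid of feature values.

    Baseline only: every output cell copies an existing input cell,
    so no new fine-grained information is introduced.
    """
    rows, cols = len(grid), len(grid[0])
    return [
        [grid[r // factor][c // factor] for c in range(cols * factor)]
        for r in range(rows * factor)
    ]


PATCH_PX = 14  # assumed patch size, not taken from the abstract

for res in (224, 378, 448):
    side, total = token_grid(res, PATCH_PX)
    print(f"{res}px input -> {side}x{side} feature map ({total} tokens)")
```

Under this assumption, a 224px input yields only a 16x16 feature map, which is the kind of coarse grid that dense tasks such as segmentation struggle with.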
