Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling
This work addresses a scalability and generalization bottleneck for pixel-level applications in computer vision. It offers a lightweight solution that applies across architectures and modalities, though the contribution is incremental in that it builds on existing upsampling techniques.
The paper tackles the low-resolution feature representations of Vision Foundation Models, which limit their use in pixel-level applications. It introduces Upsample Anything, a test-time optimization framework that restores high-resolution outputs without training. The method achieves state-of-the-art performance on tasks such as semantic segmentation and depth estimation, with a runtime of approximately 0.419 seconds per 224x224 image.
We present \textbf{Upsample Anything}, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only $\approx0.419 \text{s}$ per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling.
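To make the "spatial and range cues" concrete, the following is a minimal sketch of classic Joint Bilateral Upsampling, one of the two techniques the abstract says the method bridges. It is not the authors' implementation: it uses a fixed isotropic Gaussian, whereas Upsample Anything learns an anisotropic kernel per image at test time. The function name `joint_bilateral_upsample`, the parameters `sigma_s`, `sigma_r`, and `radius`, and the assumption that the guidance image is scaled to $[0, 1]$ are all illustrative choices.

```python
import numpy as np

def joint_bilateral_upsample(lr_feat, guide, sigma_s=1.0, sigma_r=0.1, radius=2):
    """Upsample lr_feat (h, w, C) to the resolution of guide (H, W, G).

    Illustrative sketch only: fixed isotropic Gaussians, no learned parameters.
    guide is assumed to hold values in [0, 1].
    """
    h, w, channels = lr_feat.shape
    H, W, _ = guide.shape
    sy, sx = H / h, W / w

    # Low-res guidance: sample the guide at the centre of each low-res cell.
    cy = np.clip(((np.arange(h) + 0.5) * sy).astype(int), 0, H - 1)
    cx = np.clip(((np.arange(w) + 0.5) * sx).astype(int), 0, W - 1)
    guide_lr = guide[cy][:, cx]  # shape (h, w, G)

    out = np.zeros((H, W, channels))
    for y in range(H):
        for x in range(W):
            # Position of the high-res pixel in low-res coordinates.
            py = (y + 0.5) / sy - 0.5
            px = (x + 0.5) / sx - 0.5
            y0, x0 = int(round(py)), int(round(px))
            ys = np.arange(max(y0 - radius, 0), min(y0 + radius, h - 1) + 1)
            xs = np.arange(max(x0 - radius, 0), min(x0 + radius, w - 1) + 1)

            # Spatial cue: Gaussian on distance in the low-res grid.
            dy = (ys - py)[:, None]
            dx = (xs - px)[None, :]
            spatial = np.exp(-(dy ** 2 + dx ** 2) / (2 * sigma_s ** 2))

            # Range cue: Gaussian on guidance difference (edge awareness).
            diff = guide_lr[ys][:, xs] - guide[y, x]
            rng = np.exp(-(diff ** 2).sum(axis=-1) / (2 * sigma_r ** 2))

            # Normalised weighted average of the neighbouring low-res features.
            weights = spatial * rng
            weights /= weights.sum()
            out[y, x] = np.tensordot(weights, lr_feat[ys][:, xs],
                                     axes=([0, 1], [0, 1]))
    return out
```

For example, calling `joint_bilateral_upsample(feats, image)` with `feats` of shape `(16, 16, C)` and `image` of shape `(224, 224, 3)` would return an array of shape `(224, 224, C)`. The explicit Python loops keep the sketch readable at the cost of speed; a practical version would vectorise the neighbourhood gather or run on a GPU.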