CV · Nov 20, 2025

Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling

arXiv:2511.16301v1 · 8 citations · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses a scalability and generalization bottleneck for pixel-level applications in computer vision. It offers a lightweight solution that applies across architectures and modalities, though the advance is incremental because it builds on existing upsampling techniques.

The paper tackles the low resolution of feature representations in Vision Foundation Models, which limits their use in pixel-level applications. It introduces Upsample Anything, a test-time optimization framework that restores high-resolution outputs without any training. The authors report state-of-the-art performance on tasks such as semantic segmentation and depth estimation, with a runtime of approximately 0.419 seconds per 224x224 image.

We present Upsample Anything, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only ≈0.419 s per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling.
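To make the "spatial and range cues" idea concrete, here is a minimal NumPy sketch of classic Joint Bilateral Upsampling, one of the two techniques the abstract says the method bridges. It is not the paper's method: the paper learns an anisotropic Gaussian kernel per image, whereas this sketch assumes fixed, isotropic widths. The function name `joint_bilateral_upsample` and the parameters `sigma_spatial`, `sigma_range`, and `radius` are illustrative choices, not taken from the paper.

```python
import numpy as np

def joint_bilateral_upsample(feat_lr, guide_hr, sigma_spatial=1.0,
                             sigma_range=0.1, radius=2):
    """Upsample feat_lr (h, w, C) to the size of guide_hr (H, W, G).

    guide_hr is a float guidance image, e.g. RGB scaled to [0, 1].
    Fixed isotropic Gaussians are an assumption of this sketch.
    """
    h, w, C = feat_lr.shape
    H, W = guide_hr.shape[:2]
    sy, sx = H / h, W / w

    # Guidance values sampled at the centre of each low-res cell.
    cy = np.clip(((np.arange(h) + 0.5) * sy).astype(int), 0, H - 1)
    cx = np.clip(((np.arange(w) + 0.5) * sx).astype(int), 0, W - 1)
    guide_lr = guide_hr[cy][:, cx]                      # (h, w, G)

    # High-res pixel centres expressed in low-res coordinates.
    py = (np.arange(H) + 0.5) / sy - 0.5
    px = (np.arange(W) + 0.5) / sx - 0.5
    by = np.rint(py).astype(int)                        # nearest low-res row
    bx = np.rint(px).astype(int)                        # nearest low-res column

    out = np.zeros((H, W, C))
    norm = np.zeros((H, W, 1))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Neighbours are clamped at the border, so edge cells can be
            # counted more than once; acceptable for a sketch.
            qy = np.clip(by + dy, 0, h - 1)
            qx = np.clip(bx + dx, 0, w - 1)
            # Spatial cue: distance in low-res coordinates.
            d2 = (py - qy)[:, None] ** 2 + (px - qx)[None, :] ** 2
            w_s = np.exp(-d2 / (2 * sigma_spatial ** 2))
            # Range cue: guidance difference between pixel and neighbour.
            r2 = np.sum((guide_hr - guide_lr[qy][:, qx]) ** 2, axis=-1)
            w_r = np.exp(-r2 / (2 * sigma_range ** 2))
            wgt = (w_s * w_r)[..., None]
            out += wgt * feat_lr[qy][:, qx]
            norm += wgt
    return out / np.maximum(norm, 1e-12)

if __name__ == "__main__":
    feat = np.random.rand(16, 16, 8)      # 224 / 14 = 16 patch tokens per side
    guide = np.random.rand(224, 224, 3)   # stand-in for the input image
    print(joint_bilateral_upsample(feat, guide).shape)
```

The range term is what makes the operator edge-aware: a low-resolution neighbour whose guidance colour differs from the target pixel contributes almost nothing, so feature boundaries follow image edges. Per the abstract, the per-image optimization replaces the hand-set widths above with a learned anisotropic kernel.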

