CV · Jun 10, 2025

ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction

arXiv:2506.08678v2 · 2 citations · h-index: 1
Originality: Incremental advance
AI Analysis

This paper addresses the limitation of vision-language models in fine-grained dense prediction for applications like object detection and segmentation, but the advance is incremental because it builds on existing CLIP adaptation methods.

The paper tackles CLIP's poor fine-grained, region-level understanding in open-vocabulary dense prediction tasks by proposing ATAS, a self-distillation method that enhances semantic coherence and fine-grained alignment without extra modules or supervised fine-tuning. It reports substantial performance gains over baseline CLIP models on open-vocabulary object detection and semantic segmentation benchmarks.

Vision-language models such as CLIP have recently propelled open-vocabulary dense prediction tasks by enabling recognition of a broad range of visual concepts. However, CLIP still struggles with fine-grained, region-level understanding, hindering its effectiveness on these dense prediction tasks. We identify two pivotal factors required to address this limitation: semantic coherence and fine-grained vision-language alignment. Current adaptation methods often improve fine-grained alignment at the expense of semantic coherence, and often rely on extra modules or supervised fine-tuning. To overcome these issues, we propose Any-to-Any Self-Distillation (ATAS), a novel approach that simultaneously enhances semantic coherence and fine-grained alignment by leveraging a model's own knowledge across all representation levels. Unlike prior methods, ATAS uses only unlabeled images and an internal self-distillation process to refine representations of CLIP vision encoders, preserving local semantic consistency while sharpening local detail recognition. On open-vocabulary object detection and semantic segmentation benchmarks, ATAS achieves substantial performance gains, outperforming baseline CLIP models. These results validate the effectiveness of our approach and underscore the importance of jointly maintaining semantic coherence and fine-grained alignment for advanced open-vocabulary dense prediction.

