CV | Jul 17, 2024

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

arXiv:2407.12442v1 | 116 citations | h-index: 35
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in applying CLIP to open-vocabulary semantic segmentation, offering incremental improvements for computer vision tasks.

The paper tackled the problem of noisy segmentation maps in CLIP-based semantic segmentation by identifying residual connections as a primary source of noise, and proposed ClearCLIP with modifications that improved segmentation accuracy across multiple benchmarks.

Despite the success of large-scale pretrained Vision-Language Models (VLMs), especially CLIP, in various open-vocabulary tasks, their application to semantic segmentation remains challenging, producing noisy segmentation maps with mis-segmented regions. In this paper, we carefully re-investigate the architecture of CLIP and identify residual connections as the primary source of noise that degrades segmentation quality. With a comparative analysis of statistical properties in the residual connection and the attention output across different pretrained models, we discover that CLIP's image-text contrastive training paradigm emphasizes global features at the expense of local discriminability, leading to noisy segmentation results. In response, we propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation. We introduce three simple modifications to the final layer: removing the residual connection, implementing self-self attention, and discarding the feed-forward network. ClearCLIP consistently generates clearer and more accurate segmentation maps and outperforms existing approaches across multiple benchmarks, affirming the significance of our discoveries.
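
The three final-layer modifications named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, a layer norm without learned scale and bias, random weights, ReLU as a stand-in activation, and query-query attention as one reading of "self-self attention". All function and variable names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Simplified layer norm: no learned scale or bias (assumption).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def standard_last_block(x, w_q, w_k, w_v, w_o, w_1, w_2):
    # Conventional pre-norm transformer block, shown for contrast.
    h = layer_norm(x)
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    x = x + (attn @ v) @ w_o                      # residual around attention
    # ReLU stands in for the real activation (assumption).
    x = x + np.maximum(layer_norm(x) @ w_1, 0.0) @ w_2  # residual around FFN
    return x

def modified_last_block(x, w_q, w_v, w_o):
    # Sketch of the three changes described in the abstract:
    #   1. no residual connection,
    #   2. self-self attention (here: queries attend to queries),
    #   3. no feed-forward network.
    h = layer_norm(x)
    q, v = h @ w_q, h @ w_v
    attn = softmax(q @ q.T / np.sqrt(q.shape[-1]))
    return (attn @ v) @ w_o

# Toy example: 6 patch tokens with 8-dimensional features (arbitrary sizes).
rng = np.random.default_rng(0)
n_tokens, dim = 6, 8
x = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v, w_o = (0.1 * rng.normal(size=(dim, dim)) for _ in range(4))
w_1 = 0.1 * rng.normal(size=(dim, 4 * dim))
w_2 = 0.1 * rng.normal(size=(4 * dim, dim))

baseline = standard_last_block(x, w_q, w_k, w_v, w_o, w_1, w_2)
out = modified_last_block(x, w_q, w_v, w_o)
```

Both functions return one feature vector per patch token, so `out` could be compared against text embeddings in the same way as `baseline`; the modified block simply keeps only the attention output.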

