CV Sep 24, 2024

Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation

Yong Xien Chng, Xuchong Qiu, Yizeng Han, Kai Ding, Wan Ding, Gao Huang

arXiv:2409.16278v2 2.0 h-index: 24

Originality Highly original

AI Analysis

This work addresses the challenge of generalizing open-vocabulary segmentation across diverse domains, offering a more efficient adaptation method for vision-language models.

The paper identifies mask classification as the bottleneck in open-vocabulary segmentation and proposes a Fine-grained Semantic Adaptation (FISA) method. It reports state-of-the-art results, with improvements of up to +1.0 PQ and +3.0 mIoU, while reducing training costs by nearly 5x.

Despite extensive research, open-vocabulary segmentation methods still struggle to generalize across diverse domains. To reduce the computational cost of adapting Vision-Language Models (VLMs) while preserving their pre-trained knowledge, most methods freeze the VLMs for mask classification and train only the mask generator. However, our comprehensive analysis reveals a surprising insight: open-vocabulary segmentation is primarily bottlenecked by mask classification, not mask generation. This discovery prompts us to rethink the existing paradigm and explore an alternative approach. Instead of freezing the VLM, we propose to freeze the pre-trained mask generator and focus on optimizing the mask classifier. Building on the observation that VLMs pre-trained on global-pooled image-text features often fail to capture the fine-grained semantics necessary for effective mask classification, we propose a novel Fine-grained Semantic Adaptation (FISA) method to address this limitation. FISA enhances the extracted visual features with fine-grained semantic awareness by explicitly integrating this crucial semantic information early in the visual encoding process. As our method strategically optimizes only a small portion of the VLM's parameters, it enjoys the efficiency of adapting to new data distributions while largely preserving the valuable VLM pre-trained knowledge. Extensive ablation studies confirm the superiority of our approach. Notably, FISA achieves new state-of-the-art results across multiple representative benchmarks, improving performance by up to +1.0 PQ and +3.0 mIoU and reducing training costs by nearly 5x compared to previous best methods. Our code and data will be made public.
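
The abstract does not spell out how a mask is classified. A common setup in open-vocabulary segmentation, assumed here, is to average the VLM's visual features inside each proposed mask and pick the class whose text embedding has the highest cosine similarity. The sketch below illustrates that generic step in plain Python on a toy feature map. The function names, the toy features, and the two class embeddings are all hypothetical, and this is not the FISA method itself.

```python
import math


def mask_pool(features, mask):
    """Average the per-pixel feature vectors that fall inside a binary mask."""
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for i, value in enumerate(vec):
                    total[i] += value
    if count == 0:
        raise ValueError("mask is empty")
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_mask(features, mask, text_embeddings):
    """Return the best-matching class name and the score for every class."""
    pooled = mask_pool(features, mask)
    scores = {name: cosine(pooled, emb) for name, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy 2x2 feature map with 2-dimensional features (hypothetical values).
features = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
mask_top = [[1, 1], [0, 0]]      # hypothetical mask proposal covering the top row
mask_bottom = [[0, 0], [1, 1]]   # hypothetical mask proposal covering the bottom row

# Hypothetical text embeddings for two class names.
text_embeddings = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}

label_top, _ = classify_mask(features, mask_top, text_embeddings)
label_bottom, _ = classify_mask(features, mask_bottom, text_embeddings)
print(label_top, label_bottom)
```

In this picture the masks come from a separate generator, and the quality of the final labels depends on how well the pooled visual features align with the text embeddings, which is the part the abstract argues should be adapted.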

