CVAIJun 16, 2025

HVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment

arXiv:2506.13925v22 citationsh-index: 10
Originality Highly original
AI Analysis

This addresses label efficiency and generalization in segmentation for computer vision applications, representing a novel integration rather than an incremental improvement.

The paper tackles semi-supervised semantic segmentation under domain shift by leveraging vision-language models to generate domain-invariant text queries, achieving state-of-the-art improvements including +9.3% mIoU on COCO with only 232 labeled images.

In this paper, we address Semi-supervised Semantic Segmentation (SSS) under domain shift by leveraging domain-invariant semantic knowledge from text embeddings of Vision-Language Models (VLMs). We propose a unified Hierarchical Vision-Language framework (HVL) that integrates domain-invariant text embeddings as object queries in a transformer-based segmentation network to improve generalization and reduce misclassification under limited supervision. The mentioned textual queries are used for grouping pixels with shared semantics under SSS. HVL is designed to (1) generate textual queries that maximally encode domain-invariant semantics from VLM while capturing intra-class variations; (2) align these queries with spatial visual features to enhance their segmentation ability and improve the semantic clarity of visual features. We also introduce targeted regularization losses that maintain vision--language alignment throughout training to reinforce semantic understanding. HVL establishes a novel state-of-the-art by achieving a +9.3% improvement in mean Intersection over Union (mIoU) on COCO, utilizing 232 labelled images, +3.1% on Pascal VOC employing 92 labels, +4.8% on ADE20 using 316 labels, and +3.4% on Cityscapes with 100 labels, demonstrating superior performance with less than 1% supervision on four benchmark datasets. Our results show that language-guided segmentation bridges the label efficiency gap and enables new levels of fine-grained generalization.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes