CVAIApr 1, 2025

Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation

arXiv:2504.00356v114 citationsh-index: 3Has CodeCVPR
Originality Incremental advance
AI Analysis

This work addresses the problem of improving region-text alignment for researchers and practitioners in computer vision, offering incremental advancements in zero-shot referring image segmentation.

The paper tackles the challenge of extracting precise mask region representations in zero-shot referring image segmentation by introducing a training-free hybrid global-local feature extraction approach and a spatial guidance augmentation strategy, achieving substantial performance gains over existing models on standard benchmarks.

Recent advances in zero-shot referring image segmentation (RIS), driven by models such as the Segment Anything Model (SAM) and CLIP, have made substantial progress in aligning visual and textual information. Despite these successes, the extraction of precise and high-quality mask region representations remains a critical challenge, limiting the full potential of RIS tasks. In this paper, we introduce a training-free, hybrid global-local feature extraction approach that integrates detailed mask-specific features with contextual information from the surrounding area, enhancing mask region representation. To further strengthen alignment between mask regions and referring expressions, we propose a spatial guidance augmentation strategy that improves spatial coherence, which is essential for accurately localizing described areas. By incorporating multiple spatial cues, this approach facilitates more robust and precise referring segmentation. Extensive experiments on standard RIS benchmarks demonstrate that our method significantly outperforms existing zero-shot RIS models, achieving substantial performance gains. We believe our approach advances RIS tasks and establishes a versatile framework for region-text alignment, offering broader implications for cross-modal understanding and interaction. Code is available at https://github.com/fhgyuanshen/HybridGL .

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes