CV · Aug 18, 2023

EAVL: Explicitly Align Vision and Language for Referring Image Segmentation

arXiv:2308.09779v32 citations | h-index: 65
Originality Highly original
AI Analysis

This work addresses the challenge of fine-grained text-to-pixel correlation in referring image segmentation. The contribution is incremental: it builds on existing methods by introducing a new alignment module.

The paper tackles referring image segmentation by explicitly aligning vision and language features to improve text-to-pixel correlation. It reports surpassing previous state-of-the-art methods on the RefCOCO, RefCOCO+, and G-Ref datasets by large margins.

Referring image segmentation (RIS) aims to segment an object mentioned in natural language from an image. The main challenge is fine-grained text-to-pixel correlation. In previous methods, the final results are obtained by convolutions with a fixed kernel, which follows a pattern similar to traditional image segmentation. These methods lack explicit alignment of language and vision features in the segmentation stage, resulting in suboptimal correlation. In this paper, we introduce EAVL, a method that explicitly aligns vision and language features. In contrast to fixed convolution kernels, we introduce a Vision-Language Aligner that aligns features in the segmentation stage using dynamic convolution kernels based on the input image and sentence. Specifically, we generate multiple queries representing different emphases of the language expression. These queries are transformed into a series of query-based convolution kernels, which are applied in the segmentation stage to produce a series of masks. The final result is obtained by aggregating all masks. Our method harnesses the potential of the multi-modal features in the segmentation stage and aligns language features of different emphases with image features to achieve fine-grained text-to-pixel correlation. We surpass previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins. Additionally, our method is designed to be a generic plug-and-play module for cross-modality alignment in the RIS task, making it easy to integrate with other RIS models for substantial performance improvements.
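The query-to-kernel-to-mask pipeline described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the linear projection `proj`, the use of 1x1 kernels, and the softmax-weighted aggregation are all assumptions chosen for illustration, since the abstract does not specify the kernel size or how the masks are aggregated.

```python
import numpy as np

def query_kernels(queries, proj):
    """Turn each language query into a dynamic 1x1 kernel plus a bias.

    queries: (Nq, Dq) query vectors, one per language emphasis.
    proj:    (Dq, C + 1) hypothetical linear map (an assumption, not from the paper).
    """
    params = queries @ proj                  # (Nq, C + 1)
    return params[:, :-1], params[:, -1]     # weights (Nq, C), biases (Nq,)

def aligned_masks(features, weights, biases):
    """Apply every query-based kernel to the multi-modal feature map.

    features: (C, H, W). A 1x1 convolution is a per-pixel dot product
    over channels, so einsum yields one logit map per query: (Nq, H, W).
    """
    return np.einsum("qc,chw->qhw", weights, features) + biases[:, None, None]

def aggregate(mask_logits, scores):
    """Combine per-query masks with softmax weights (assumed aggregation rule)."""
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return np.tensordot(w, mask_logits, axes=1)   # (H, W)

# Toy shapes with random data, for illustration only.
rng = np.random.default_rng(0)
C, H, W, Nq, Dq = 8, 4, 5, 3, 16
features = rng.standard_normal((C, H, W))
queries = rng.standard_normal((Nq, Dq))
proj = rng.standard_normal((Dq, C + 1))

weights, biases = query_kernels(queries, proj)
logits = aligned_masks(features, weights, biases)
final = aggregate(logits, np.zeros(Nq))   # equal scores reduce to a plain mean
mask = final > 0                          # binary segmentation mask
```

The point of the sketch is the contrast with a fixed segmentation head: here the kernel weights are recomputed from the queries for every image and sentence pair, so each query yields its own mask before the masks are combined.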

