CV · Jan 16, 2023

Linguistic Query-Guided Mask Generation for Referring Image Segmentation

arXiv:2301.06429v3 · 5 citations · h-index: 14
Originality: Incremental advance
AI Analysis

This work addresses the challenge of handling diverse text-image pairs in referring image segmentation. The advance is incremental: it builds on existing methods by introducing dynamic prototype generation.

The paper tackles the problem of referring image segmentation by proposing LGFormer, an end-to-end transformer framework that uses linguistic features as queries to generate specialized prototypes for arbitrary image-text pairs, resulting in more consistent segmentation outcomes.

Referring image segmentation, a typical multi-modal task, aims to segment the image region of interest described by a given language expression. Existing methods adopt either a pixel classification-based or a learnable query-based framework for mask generation; both rely on a fixed number of parametric prototypes and are therefore insufficient for handling diverse text-image pairs. In this work, we propose an end-to-end framework built on a transformer to perform Linguistic query-Guided mask generation, dubbed LGFormer. It treats the linguistic features as queries to generate a specialized prototype for an arbitrary input image-text pair, thus producing more consistent segmentation results. Moreover, we design several cross-modal interaction modules (e.g., a vision-language bidirectional attention module, VLBA) in both the encoder and decoder to achieve better cross-modal alignment.
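The idea of a language-conditioned prototype can be illustrated with a toy sketch. This is not the LGFormer implementation. It assumes the sentence query is the mean of the word features, that a single scaled dot-product cross-attention step gathers visual context, and that the mask is obtained by a sigmoid over the dot product between the prototype and each pixel feature. The function names `language_guided_prototype` and `predict_mask` are hypothetical.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def language_guided_prototype(word_feats, pixel_feats):
    """Build one prototype vector for this particular image-text pair.

    Assumption (illustrative only): the query is the mean word feature,
    and it attends over the flattened pixel features once.
    """
    dim = len(word_feats[0])
    query = [sum(w[d] for w in word_feats) / len(word_feats) for d in range(dim)]
    scores = [dot(query, p) / math.sqrt(dim) for p in pixel_feats]
    weights = softmax(scores)
    # The prototype is the attention-weighted sum of pixel features.
    return [sum(wt * p[d] for wt, p in zip(weights, pixel_feats)) for d in range(dim)]


def predict_mask(word_feats, pixel_feats, threshold=0.5):
    """Score every pixel against the prototype and threshold the result."""
    prototype = language_guided_prototype(word_feats, pixel_feats)
    probs = [1.0 / (1.0 + math.exp(-dot(prototype, p))) for p in pixel_feats]
    return [1 if pr > threshold else 0 for pr in probs]


# Two word features and four flattened pixel features (feature dimension 2).
words = [[1.0, 0.0], [3.0, 0.0]]
pixels = [[4.0, 0.0], [-4.0, 0.0], [4.0, 0.0], [-4.0, 0.0]]
print(predict_mask(words, pixels))
```

Because the prototype is recomputed from the text and image on every call, it changes with each input pair, in contrast to a fixed set of learned prototype vectors shared across all inputs.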
