CV | Aug 12, 2021

Vision-Language Transformer and Query Generation for Referring Segmentation

arXiv:2108.05565v1 | 365 citations | Has Code
Originality: Highly original
AI Analysis

This paper addresses the problem of accurately segmenting objects in an image from natural language descriptions, and it represents a strong incremental improvement over prior work.

The paper tackles referring segmentation by reformulating it as a direct attention problem, using a transformer-based network with a query generation module, achieving new state-of-the-art performance on RefCOCO, RefCOCO+, and G-Ref datasets.

In this work, we address the challenging task of referring segmentation. The query expression in referring segmentation typically indicates the target object by describing its relationship with others. Therefore, to find the target one among all instances in the image, the model must have a holistic understanding of the whole image. To achieve this, we reformulate referring segmentation as a direct attention problem: finding the region in the image where the query language expression is most attended to. We introduce transformer and multi-head attention to build a network with an encoder-decoder attention mechanism architecture that "queries" the given image with the language expression. Furthermore, we propose a Query Generation Module, which produces multiple sets of queries with different attention weights that represent the diversified comprehensions of the language expression from different aspects. At the same time, to find the best way from these diversified comprehensions based on visual clues, we further propose a Query Balance Module to adaptively select the output features of these queries for a better mask generation. Without bells and whistles, our approach is light-weight and achieves new state-of-the-art performance consistently on three referring segmentation datasets, RefCOCO, RefCOCO+, and G-Ref. Our code is available at https://github.com/henghuiding/Vision-Language-Transformer.
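The pipeline in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It is a toy under stated assumptions: single-head attention instead of multi-head, random features in place of a real backbone and language encoder, and hypothetical names (`generate_queries`, `decode`, `balance`, `W_v`, `w_b`) chosen only for illustration. It shows language-derived queries attending over image positions, several queries built from different word weightings, and a per-query confidence that scales each query's contribution to a coarse mask heatmap.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def generate_queries(lang, vis, W_v):
    """Toy query generation (assumed form, not the paper's exact module).

    lang: (T, C) word features, vis: (HW, C) flattened image features,
    W_v: (HW, Nq) hypothetical projection giving Nq visual summaries.
    Each summary induces its own attention weights over the words, so
    each query is a differently weighted reading of the expression.
    """
    summaries = W_v.T @ vis                                # (Nq, C)
    scores = summaries @ lang.T / np.sqrt(lang.shape[1])   # (Nq, T)
    word_weights = softmax(scores, axis=-1)                # rows sum to 1
    return word_weights @ lang                             # (Nq, C)


def decode(queries, vis):
    """Single-head cross-attention: each query attends over image positions."""
    scores = queries @ vis.T / np.sqrt(vis.shape[1])       # (Nq, HW)
    attn = softmax(scores, axis=-1)
    return attn, attn @ vis                                # (Nq, HW), (Nq, C)


def balance(queries, w_b):
    """Toy query balance: a sigmoid confidence in (0, 1) per query."""
    return 1.0 / (1.0 + np.exp(-(queries @ w_b)))          # (Nq,)


# Demo with random stand-in features (all sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
H, W, C, T, Nq = 4, 4, 8, 5, 3
vis = rng.normal(size=(H * W, C))    # stand-in for backbone features
lang = rng.normal(size=(T, C))       # stand-in for word embeddings
W_v = rng.normal(size=(H * W, Nq))
w_b = rng.normal(size=C)

queries = generate_queries(lang, vis, W_v)
attn, outputs = decode(queries, vis)
conf = balance(queries, w_b)

# Confidence-weighted sum of attention maps as a coarse mask heatmap.
heat = (conf[:, None] * attn).sum(axis=0).reshape(H, W)
print(heat.shape)
```

In this sketch the heatmap is just a weighted sum of attention maps. The paper describes generating the mask from the balanced output features, so `outputs` would feed a learned mask decoder in a real model.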

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
