CV AI LG

May 10, 2022

Weakly-supervised segmentation of referring expressions

Robin Strudel, Ivan Laptev, Cordelia Schmid

arXiv:2205.04725v215.630 citations

h-index: 151

Originality: Incremental advance

AI Analysis

This work addresses the cost of manual annotation, which limits the scalability of visual grounding, by enabling segmentation from natural language under weak supervision. The advance is incremental because it adapts existing methods to a new task setting.

The paper tackles image segmentation from referring expressions without pixel-level annotations. It introduces a weakly-supervised approach, TSEG, that learns segmentation masks directly from image-level referring expressions. The authors report promising results on the PhraseCut and RefCOCO datasets and competitive zero-shot performance on Pascal VOC.

Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions. In this work we address image segmentation from referring expressions, a problem that has so far only been addressed in a fully-supervised setting. A fully-supervised setup, however, requires pixel-wise supervision and is hard to scale given the expense of manual annotation. We therefore introduce a new task of weakly-supervised image segmentation from referring expressions and propose Text grounded semantic SEGmentation (TSEG), which learns segmentation masks directly from image-level referring expressions without pixel-level annotations. Our transformer-based method computes patch-text similarities and guides the classification objective during training with a new multi-label patch assignment mechanism. The resulting visual grounding model segments image regions corresponding to given natural language expressions. Our approach TSEG demonstrates promising results for weakly-supervised referring expression segmentation on the challenging PhraseCut and RefCOCO datasets. TSEG also shows competitive performance when evaluated in a zero-shot setting for semantic segmentation on Pascal VOC.
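
The abstract names two ingredients, patch-text similarities and a patch assignment that feeds an image-level classification objective, without giving their exact form. The following is only an illustrative sketch of that general idea, not the authors' implementation. All function names are hypothetical, and the choice of cosine similarity, a temperature softmax for soft assignment, and weighted pooling are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_text_similarities(patches, texts):
    """Return a matrix sim[i][j]: similarity of patch i to expression j."""
    return [[cosine(p, t) for t in texts] for p in patches]


def softmax(values, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    top = max(values)
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def image_level_scores(sim, temperature=0.1):
    """Pool patch similarities into one score per expression.

    Assumption: each patch is softly assigned across expressions, and the
    assignment weights are used to average that expression's similarities.
    Such scores could feed an image-level classification loss.
    """
    assign = [softmax(row, temperature) for row in sim]
    scores = []
    for j in range(len(sim[0])):
        weights = [assign[i][j] for i in range(len(sim))]
        weighted = sum(w * sim[i][j] for i, w in enumerate(weights))
        scores.append(weighted / sum(weights))
    return scores


def segment(sim):
    """Label each patch with the index of its most similar expression."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


# Toy example: three 2-D patch embeddings and two expression embeddings.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
sim = patch_text_similarities(patches, texts)
labels = segment(sim)
scores = image_level_scores(sim)
```

In this toy setup only the image-level scores would need supervision (which expressions describe the image), while the per-patch labels from `segment` come out as a by-product, which is the sense in which such a scheme is weakly supervised.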
