CV | CL | LG | Nov 22, 2024

Instance-Aware Generalized Referring Expression Segmentation

arXiv:2411.15087v1 | 2 citations | h-index: 14
Originality: Incremental advance
AI Analysis

The paper addresses precise segmentation of multiple objects in an image from a text query. The advance is incremental: it builds on existing GRES methods by adding instance-level differentiation.

Existing Generalized Referring Expression Segmentation (GRES) methods struggle with complex expressions that refer to multiple objects. The paper proposes InstAlign, which incorporates object-level reasoning into the segmentation process and reports state-of-the-art performance on the gRefCOCO and Ref-ZOM benchmarks.

Recent works on Generalized Referring Expression Segmentation (GRES) struggle with handling complex expressions referring to multiple distinct objects. This is because these methods typically employ an end-to-end foreground-background segmentation and lack a mechanism to explicitly differentiate and associate different object instances to the text query. To this end, we propose InstAlign, a method that incorporates object-level reasoning into the segmentation process. Our model leverages both text and image inputs to extract a set of object-level tokens that capture both the semantic information in the input prompt and the objects within the image. By modeling the text-object alignment via instance-level supervision, each token uniquely represents an object segment in the image, while also aligning with relevant semantic information from the text. Extensive experiments on the gRefCOCO and Ref-ZOM benchmarks demonstrate that our method significantly advances state-of-the-art performance, setting a new standard for precise and flexible GRES.
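To make the instance-level idea concrete, here is a minimal sketch of how per-object predictions could be combined into one output mask. It is not the authors' implementation: the abstract does not specify the merging rule, so the function name `merge_instance_masks`, the two thresholds, and the union-based merge are all assumptions chosen for illustration. Each object token is assumed to yield a per-pixel probability mask plus a single text-relevance score.

```python
from typing import List

Mask = List[List[float]]  # H x W grid of per-pixel probabilities in [0, 1]


def merge_instance_masks(
    masks: List[Mask],
    relevance: List[float],
    score_thr: float = 0.5,   # assumed cutoff on text-object alignment
    pixel_thr: float = 0.5,   # assumed cutoff on per-pixel probability
) -> List[List[int]]:
    """Union the masks of all instances judged relevant to the text.

    Hypothetical merging rule: an instance contributes to the output
    only if its relevance score reaches score_thr. If no instance
    qualifies, the result is an all-zero mask.
    """
    if len(masks) != len(relevance):
        raise ValueError("need exactly one relevance score per mask")
    if not masks:
        return []

    height, width = len(masks[0]), len(masks[0][0])
    merged = [[0] * width for _ in range(height)]

    for mask, score in zip(masks, relevance):
        if score < score_thr:
            continue  # instance not referred to by the expression
        for y in range(height):
            for x in range(width):
                if mask[y][x] >= pixel_thr:
                    merged[y][x] = 1
    return merged


# Toy 2 x 3 image with three candidate instances (made-up numbers).
left = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]
middle = [[0.0, 0.9, 0.0], [0.0, 0.9, 0.0]]
right = [[0.0, 0.1, 0.9], [0.0, 0.2, 0.7]]

# An expression that refers to the left and right objects only.
multi_target = merge_instance_masks(
    [left, middle, right], relevance=[0.95, 0.10, 0.80]
)

# An expression that matches nothing in the image.
no_target = merge_instance_masks(
    [left, middle, right], relevance=[0.10, 0.20, 0.05]
)
```

Under these assumptions, keeping one mask and one score per instance lets the same model cover multi-target expressions (several instances pass the relevance cutoff) and expressions with no matching object (none pass), which a single foreground-background map cannot separate as directly.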

