CV · Nov 2, 2023

Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic Narrative Grounding

arXiv:2311.01091v2 · 8 citations · h-index: 24
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of visual-linguistic interaction in multimodal tasks, a problem of interest to computer vision researchers, and represents an incremental improvement over existing methods.

The paper tackles the problem of panoptic narrative grounding by proposing a method that enriches phrases with both pixel and object contexts, achieving new state-of-the-art performance with large margins on the PNG benchmark.

Panoptic narrative grounding (PNG) aims to segment things and stuff objects in an image described by noun phrases of a narrative caption. As a multimodal task, an essential aspect of PNG is the visual-linguistic interaction between image and caption. The previous two-stage method aggregates visual contexts from offline-generated mask proposals to phrase features, and these proposals tend to be noisy and fragmentary. The recent one-stage method aggregates only pixel contexts from image features to phrase features, which may incur semantic misalignment due to the lack of object priors. To realize more comprehensive visual-linguistic interaction, we propose to enrich phrases with coupled pixel and object contexts by designing a Phrase-Pixel-Object Transformer Decoder (PPO-TD), where both fine-grained part details and coarse-grained entity clues are aggregated to phrase features. We also propose a Phrase-Object Contrastive Loss (POCL) that pulls matched phrase-object pairs closer and pushes unmatched ones apart, so that more precise object contexts are aggregated from more phrase-relevant object tokens. Extensive experiments on the PNG benchmark show our method achieves new state-of-the-art performance with large margins.
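The idea of pulling matched phrase-object pairs together and pushing unmatched ones apart can be illustrated with a small sketch. The abstract does not give the exact form of POCL, so the code below swaps in a generic InfoNCE-style contrastive loss over cosine similarities; the function names, the `matches` format, and the `temperature` value are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def phrase_object_contrastive_loss(phrases, objects, matches, temperature=0.1):
    """InfoNCE-style stand-in for a phrase-object contrastive loss.

    phrases: list of phrase feature vectors.
    objects: list of object token feature vectors.
    matches: matches[i] lists the indices of objects matched to phrase i
             (assumed format; phrases with no match are skipped).
    """
    losses = []
    for i, phrase in enumerate(phrases):
        logits = [cosine(phrase, obj) / temperature for obj in objects]
        positives = [logits[j] for j in matches[i]]
        if not positives:
            continue
        # -log( sum over matched objects / sum over all objects )
        losses.append(log_sum_exp(logits) - log_sum_exp(positives))
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    phrases = [[1.0, 0.0], [0.0, 1.0]]
    objects = [[1.0, 0.0], [0.0, 1.0]]
    aligned = phrase_object_contrastive_loss(phrases, objects, [[0], [1]])
    swapped = phrase_object_contrastive_loss(phrases, objects, [[1], [0]])
    print(aligned < swapped)
```

Under these assumptions the loss is small when each phrase is most similar to its matched object tokens and grows when the matched tokens are dissimilar, which is the behaviour the abstract describes for POCL.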
