CV · Apr 29, 2021

Segmentation-grounded Scene Graph Generation

arXiv:2104.14207v1 · 32 citations
Originality: Incremental advance
AI Analysis

This work addresses the lack of pixel-level grounding in scene graphs for computer vision applications, offering an incremental improvement by integrating segmentation from auxiliary datasets.

The paper tackles the problem of scene graph generation by introducing a framework that grounds objects and relations at the pixel level using segmentation masks, improving relation prediction through a novel Gaussian attention mechanism.

Scene graph generation has emerged as an important problem in computer vision. While scene graphs provide a grounded representation of objects, their locations and relations in an image, they do so only at the granularity of proposal bounding boxes. In this work, we propose the first, to our knowledge, framework for pixel-level segmentation-grounded scene graph generation. Our framework is agnostic to the underlying scene graph generation method and addresses the lack of segmentation annotations in target scene graph datasets (e.g., Visual Genome) through transfer and multi-task learning from, and with, an auxiliary dataset (e.g., MS COCO). Specifically, each target object being detected is endowed with a segmentation mask, which is expressed as a lingual-similarity weighted linear combination over categories that have annotations present in an auxiliary dataset. These inferred masks, along with a novel Gaussian attention mechanism which grounds the relations at a pixel-level within the image, allow for improved relation prediction. The entire framework is end-to-end trainable and is learned in a multi-task manner with both target and auxiliary datasets.
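The "lingual-similarity weighted linear combination" of masks can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. It assumes that each category has a word embedding, that similarity is cosine similarity, that the weights are softmax-normalized, and that per-category mask predictions for the auxiliary classes are already available. The names `cosine_similarity`, `transfer_mask`, and `temperature` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between vector a (D,) and each row of b (K, D)."""
    a_unit = a / np.linalg.norm(a)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b_unit @ a_unit  # shape (K,)

def transfer_mask(target_emb, aux_embs, aux_masks, temperature=1.0):
    """Build a mask for a target category from auxiliary-category masks.

    target_emb: (D,) embedding of the target class name (assumed input).
    aux_embs:   (K, D) embeddings of the K auxiliary class names.
    aux_masks:  (K, H, W) per-class mask predictions for one object.
    Softmax normalization of the similarities is an assumption here.
    """
    sims = cosine_similarity(target_emb, aux_embs)
    logits = sims / temperature
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    # Weighted linear combination over the auxiliary-class axis.
    mask = np.tensordot(weights, aux_masks, axes=1)  # shape (H, W)
    return mask, weights

# Toy example: the target class is closest to auxiliary class 0.
target_emb = np.array([1.0, 0.0])
aux_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
aux_masks = np.stack([np.ones((2, 2)), np.zeros((2, 2))])
mask, weights = transfer_mask(target_emb, aux_embs, aux_masks)
```

In this toy case the resulting mask is dominated by the mask of the more similar auxiliary class. A lower `temperature` would sharpen the weights toward the single closest class.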

