CV, Dec 19, 2022

SrTR: Self-reasoning Transformer with Visual-linguistic Knowledge for Scene Graph Generation

arXiv:2212.09329v1, 1 citation, h-index: 19
Originality: Incremental advance
AI Analysis

This work improves scene graph generation for computer vision applications, but it is incremental as it builds on existing transformer and visual-linguistic methods.

The paper addresses two weaknesses of one-stage scene graph generation methods, their lack of self-reasoning ability and their neglect of the linguistic modality, by proposing SrTR, which the authors report achieves superior performance and fast inference on the Visual Genome dataset.

Objects in a scene are not always related. One-stage scene graph generation approaches run efficiently: they infer the effective relations between entity pairs using sparse proposal sets and a few queries. However, they focus only on the relation between subject and object in the triplet (subject entity, predicate entity, object entity), ignoring the relations between subject and predicate and between predicate and object, so the model lacks self-reasoning ability. In addition, the linguistic modality has been neglected in one-stage methods, and mining linguistic knowledge is necessary to improve the model's reasoning ability. To address these shortcomings, a Self-reasoning Transformer with Visual-linguistic Knowledge (SrTR) is proposed to add flexible self-reasoning ability to the model. SrTR adopts an encoder-decoder architecture, and a self-reasoning decoder is developed to complete three inferences over the triplet: s+o→p, s+p→o and p+o→s. Inspired by large-scale pre-trained image-text foundation models, visual-linguistic prior knowledge is introduced, and a visual-linguistic alignment strategy is designed to project visual representations into semantic spaces with prior knowledge to aid relational reasoning. Experiments on the Visual Genome dataset demonstrate the superiority and fast inference ability of the proposed method.
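To make the two ideas in the abstract concrete, here is a toy sketch of the three inference directions (two known triplet roles, one inferred role) combined with a simple alignment step that projects a visual feature into a label-embedding space. It is not the SrTR decoder: the averaging fusion, the linear projection, the cosine scoring, and every name in the block (`INFERENCE_MODES`, `infer_missing`, `TEXT_EMBEDDINGS`, the 2-dimensional features) are assumptions chosen for illustration, standing in for the paper's transformer attention and pre-trained image-text embeddings.

```python
import math

# The three inference directions from the abstract, written here as
# "known roles -> inferred role". The dictionary layout is hypothetical.
INFERENCE_MODES = {
    "s+o->p": (("subject", "object"), "predicate"),
    "s+p->o": (("subject", "predicate"), "object"),
    "p+o->s": (("predicate", "object"), "subject"),
}

# Toy label embeddings standing in for linguistic prior knowledge.
# Real systems would take these from a pre-trained text encoder (assumption).
TEXT_EMBEDDINGS = {
    "subject": {"person": [1.0, 0.0], "cup": [0.0, 1.0]},
    "predicate": {"riding": [1.0, 0.0], "under": [0.0, 1.0]},
    "object": {"horse": [1.0, 0.0], "table": [0.0, 1.0]},
}

# Identity projection: a placeholder for a learned visual-to-semantic mapping.
PROJECTION = [[1.0, 0.0], [0.0, 1.0]]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def project(matrix, vec):
    """Apply a linear projection (matrix given as a list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def infer_missing(mode, known, projection=PROJECTION, text_embeddings=TEXT_EMBEDDINGS):
    """Infer the missing triplet role from the two known roles.

    known maps each known role name to its visual feature vector.
    The two features are averaged (an assumed, simplistic fusion),
    projected into the semantic space, and matched to the closest
    label embedding of the target role.
    """
    known_roles, target = INFERENCE_MODES[mode]
    feats = [known[role] for role in known_roles]
    fused = [sum(vals) / len(vals) for vals in zip(*feats)]
    query = project(projection, fused)
    scores = {label: cosine(query, emb) for label, emb in text_embeddings[target].items()}
    return target, max(scores, key=scores.get)


if __name__ == "__main__":
    # Subject and object features both lean toward the first axis,
    # so the inferred predicate is the label embedded on that axis.
    print(infer_missing("s+o->p", {"subject": [0.9, 0.1], "object": [0.7, 0.3]}))
    # prints ('predicate', 'riding')
```

The same function covers all three directions by changing only the mode string, which mirrors the abstract's point that the subject-predicate and predicate-object relations carry information that a subject-object-only model ignores.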

