VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection
This work addresses a specific bottleneck in VRD for computer vision applications, offering incremental improvements over existing methods.
The paper tackles redundant and incorrect predicate predictions in Visual Relationship Detection (VRD) by proposing VReBERT, a BERT-like transformer model that jointly processes visual and semantic features. It reports state-of-the-art performance in predicate prediction and improves zero-shot predicate prediction by +8.49 R@50 and +8.99 R@100.
Visual Relationship Detection (VRD) requires a computer vision model to 'see' beyond an individual object instance and 'understand' how different objects in a scene are related. The traditional approach to VRD first detects objects in an image and then separately predicts the relationship between the detected object instances. Such a disjoint approach is prone to predicting redundant relationship tags (i.e., predicates) with similar semantic meaning for the same object pair, or tags that resemble the ground truth in meaning but are nonetheless semantically incorrect. To remedy this, we propose to jointly train a VRD model with visual object features and semantic relationship features. To this end, we introduce VReBERT, a BERT-like transformer model for Visual Relationship Detection with a multi-stage training strategy to jointly process visual and semantic features. We show that our simple BERT-like model outperforms state-of-the-art VRD models in predicate prediction. Furthermore, we show that using the pre-trained VReBERT model advances the state of the art in zero-shot predicate prediction by a significant margin (+8.49 R@50 and +8.99 R@100).
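For readers unfamiliar with the R@50 and R@100 figures quoted above, the following is a minimal sketch of how Recall@K is commonly computed for relationship triplets. It is an illustration under stated assumptions, not the paper's evaluation code: the function name `recall_at_k`, the `(subject, predicate, object)` string representation, and the per-image formulation are all hypothetical choices made here, and the abstract does not specify the exact protocol used.

```python
from typing import List, Set, Tuple

# Hypothetical representation: a relationship is a (subject, predicate, object) triplet.
Triplet = Tuple[str, str, str]


def recall_at_k(ranked_predictions: List[Triplet],
                ground_truth: Set[Triplet],
                k: int) -> float:
    """Fraction of ground-truth triplets that appear among the top-k predictions.

    `ranked_predictions` is assumed to be sorted by descending confidence.
    """
    if not ground_truth:
        return 0.0  # convention chosen for this sketch; nothing to recall
    top_k = set(ranked_predictions[:k])
    return len(top_k & ground_truth) / len(ground_truth)


# Toy example for a single image (made-up labels, not data from the paper).
preds = [
    ("person", "ride", "horse"),
    ("person", "on", "horse"),    # redundant predicate for the same object pair
    ("horse", "on", "grass"),
    ("person", "wear", "hat"),
]
gt = {("person", "ride", "horse"), ("person", "wear", "hat")}

r_at_2 = recall_at_k(preds, gt, 2)
r_at_4 = recall_at_k(preds, gt, 4)
```

In this toy case the redundant "on" prediction occupies one of the top-2 slots, so only one of the two ground-truth triplets is recovered at K=2, while both are recovered at K=4. Reported figures such as +8.49 R@50 are typically expressed as percentage points aggregated over a test set.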