Self-Supervised Learning for Visual Relationship Detection through Masked Bounding Box Reconstruction
This work addresses the challenge of learning robust visual representations for VRD, which is important for applications like scene understanding, but it is incremental as it adapts existing masked modeling techniques to a specific domain.
The paper tackles the problem of Visual Relationship Detection (VRD) by proposing a self-supervised method called Masked Bounding Box Reconstruction (MBBR), which learns context-aware representations through object-level masked modeling, and it surpasses state-of-the-art methods on the Predicate Detection setting using only a few annotated samples.
We present a novel self-supervised approach for representation learning, particularly for the task of Visual Relationship Detection (VRD). Motivated by the effectiveness of Masked Image Modeling (MIM), we propose Masked Bounding Box Reconstruction (MBBR), a variation of MIM where a percentage of the entities/objects within a scene are masked and subsequently reconstructed based on the unmasked objects. The core idea is that, through object-level masked modeling, the network learns context-aware representations that capture the interaction of objects within a scene and thus are highly predictive of visual object relationships. We extensively evaluate learned representations, both qualitatively and quantitatively, in a few-shot setting and demonstrate the efficacy of MBBR for learning robust visual representations, particularly tailored for VRD. The proposed method is able to surpass state-of-the-art VRD methods on the Predicate Detection (PredDet) evaluation setting, using only a few annotated samples. We make our code available at https://github.com/deeplab-ai/SelfSupervisedVRD.