Not All Relations are Equal: Mining Informative Labels for Scene Graph Generation
This work addresses training-data biases in scene graph generation, a computer vision task, by exploiting relation labels according to how informative they are.
The paper tackles scene graph generation (SGG), where biases in the training data lead to poor generalization. It proposes a training framework that imputes missing informative relations for less informative samples. The authors report significant performance gains on the Visual Genome benchmark, including in a zero-shot setting with unseen triplets.
Scene graph generation (SGG) aims to capture a wide variety of interactions between pairs of objects, which is essential for full scene understanding. Existing SGG methods trained on the entire set of relations fail to acquire complex reasoning about visual and textual correlations due to various biases in the training data. Learning on trivial relations that indicate a generic spatial configuration, like 'on', instead of informative relations such as 'parked on' does not enforce this complex reasoning and harms generalization. To address this problem, we propose a novel framework for SGG training that exploits relation labels based on their informativeness. Our model-agnostic training procedure imputes missing informative relations for less informative samples in the training data and trains an SGG model on the imputed labels along with the existing annotations. We show that this approach can be used successfully in conjunction with state-of-the-art SGG methods and improves their performance significantly on multiple metrics on the standard Visual Genome benchmark. Furthermore, we obtain considerable improvements for unseen triplets in a more challenging zero-shot setting.
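The imputation step described above can be illustrated with a minimal sketch. The abstract does not say how informative relations are identified or how the imputed labels are chosen, so everything here is an assumption made for illustration: the function name `impute_labels`, the sample fields (`predicate`, `scores`), the hand-specified `informative` set, and the confidence `threshold` are all hypothetical and are not the authors' procedure.

```python
from typing import Any, Dict, List, Set


def impute_labels(
    samples: List[Dict[str, Any]],
    informative: Set[str],
    threshold: float = 0.5,
) -> List[Dict[str, Any]]:
    """Return copies of `samples` with less informative labels imputed.

    Assumptions (not from the paper): each sample carries an annotated
    `predicate` and a `scores` dict of per-predicate confidences from
    some pretrained SGG model; `informative` is a given set of predicates.
    """
    result = []
    for sample in samples:
        updated = dict(sample)  # shallow copy so the input is left unchanged
        updated["imputed"] = False
        if sample["predicate"] not in informative:
            # Consider only informative predicates as replacement candidates.
            candidates = {
                p: s for p, s in sample["scores"].items() if p in informative
            }
            if candidates:
                best = max(candidates, key=candidates.get)
                # Replace the trivial label only when the model is confident.
                if candidates[best] >= threshold:
                    updated["predicate"] = best
                    updated["imputed"] = True
        result.append(updated)
    return result
```

Under these assumptions, a sample annotated 'on' whose model scores favor 'parked on' above the threshold would be relabeled, while samples that already carry an informative label, or that have no confident informative candidate, keep their original annotation. The returned list therefore mixes imputed labels with existing annotations, which is the kind of training set the abstract describes.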