Visual Commonsense Driven Knowledge Refinements for Scene Graph Generation
For scene graph generation researchers, this work offers a practical plug-in to improve rare relation predictions, though it is an incremental enhancement to existing methods.
Scene Graph Generation models struggle with rare relation types due to annotation sparsity. The authors propose a model-agnostic framework that mines commonsense constraints from data and uses reasoning to refine predictions, achieving consistent improvements across three benchmarks without retraining.
Learning-driven Scene Graph Generation (SGG) models excel on frequent relation types but degrade sharply under annotation sparsity, failing to capture reliable visual commonsense knowledge. We propose a model-agnostic, semantically-guided knowledge refinement framework that systematically mines commonsense-grounded constraints from training data - capturing spatial, functional, and qualitative relational regularities - and uses general declarative commonsense reasoning to correct and refine ranked SGG predictions at inference time. The framework requires no manual rule authoring, no model retraining, and transfers across datasets and architectures. On three standard benchmarks, we obtain consistent improvements over strong baselines, demonstrating that structured visual commonsense reasoning over deep scene semantics is a practical and effective complement to purely learning-based scene graph generation.