PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks
This addresses the issue of poor predicate diversity and degraded downstream performance in SGG for applications like image-based reasoning, though it is incremental as it builds on existing foundation models.
The paper tackles the problem of training bias and limited predicate diversity in Scene Graph Generation (SGG) by introducing PRISM-0, a zero-shot open-vocabulary framework that leverages foundation models to generate diverse predicates, achieving performance on par with state-of-the-art weakly-supervised and supervised methods on benchmarks like Visual Genome.
In Scene Graph Generation (SGG), structured representations are extracted from visual inputs as object nodes and connecting predicates, enabling image-based reasoning for diverse downstream tasks. While fully supervised SGG has improved steadily, it suffers from training bias due to limited curated data and long-tail predicate distributions, leading to poor predicate diversity and degraded downstream performance. We present PRISM-0, a zero-shot open-vocabulary SGG framework that leverages foundation models in a bottom-up pipeline to capture a broad spectrum of predicates. Detected object pairs are filtered, described via a Vision-Language Model (VLM), and processed by a Large Language Model (LLM) to generate fine- and coarse-grained predicates, which are then validated by a Visual Question Answering (VQA) model. PRISM-0 modular, dataset-independent design enriches existing SGG datasets such as Visual Genome and produces diverse, unbiased graphs. While operating entirely in a zero-shot setting, PRISM-0 achieves performance on par with state-of-the-art weakly-supervised models on SGG benchmarks and even state-of-the-art supervised methods in tasks such as Sentence-to-Graph Retrieval.