CV | LG | Apr 1, 2025

PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks

arXiv:2504.00844v25 citations | h-index: 58
Originality: Incremental advance
AI Analysis

This paper addresses poor predicate diversity and degraded downstream performance in SGG for applications such as image-based reasoning. The advance is incremental because it builds on existing foundation models.

The paper tackles training bias and limited predicate diversity in Scene Graph Generation (SGG) by introducing PRISM-0, a zero-shot open-vocabulary framework that leverages foundation models to generate diverse predicates. It achieves performance on par with state-of-the-art weakly-supervised methods on SGG benchmarks and with state-of-the-art supervised methods on tasks such as Sentence-to-Graph Retrieval.

In Scene Graph Generation (SGG), structured representations are extracted from visual inputs as object nodes and connecting predicates, enabling image-based reasoning for diverse downstream tasks. While fully supervised SGG has improved steadily, it suffers from training bias due to limited curated data and long-tail predicate distributions, leading to poor predicate diversity and degraded downstream performance. We present PRISM-0, a zero-shot open-vocabulary SGG framework that leverages foundation models in a bottom-up pipeline to capture a broad spectrum of predicates. Detected object pairs are filtered, described via a Vision-Language Model (VLM), and processed by a Large Language Model (LLM) to generate fine- and coarse-grained predicates, which are then validated by a Visual Question Answering (VQA) model. PRISM-0's modular, dataset-independent design enriches existing SGG datasets such as Visual Genome and produces diverse, unbiased graphs. While operating entirely in a zero-shot setting, PRISM-0 achieves performance on par with state-of-the-art weakly-supervised models on SGG benchmarks and even state-of-the-art supervised methods in tasks such as Sentence-to-Graph Retrieval.
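
The bottom-up pipeline described in the abstract (filter object pairs, describe each pair, propose predicates, validate them) might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`Obj`, `Triplet`, `box_gap`, `filter_pairs`, `generate_scene_graph`, and the `toy_*` functions) is hypothetical, the three model stages are plain Python callables standing in for a real VLM, LLM, and VQA model, and the proximity-based pair filter is an assumption, since the abstract says only that pairs are "filtered".

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass(frozen=True)
class Obj:
    label: str
    box: Box


@dataclass(frozen=True)
class Triplet:
    subject: str
    predicate: str
    object: str


def box_gap(a: Box, b: Box) -> float:
    """Euclidean gap between two boxes; 0.0 if they overlap or touch."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5


def filter_pairs(objs: List[Obj], max_gap: float) -> List[Tuple[Obj, Obj]]:
    """Keep ordered (subject, object) pairs whose boxes are close together.
    ASSUMPTION: proximity is only one plausible filtering criterion."""
    return [(s, o) for s, o in permutations(objs, 2)
            if box_gap(s.box, o.box) <= max_gap]


def generate_scene_graph(
    objs: List[Obj],
    describe: Callable[[Obj, Obj], str],             # stand-in for the VLM
    propose: Callable[[str, Obj, Obj], List[str]],   # stand-in for the LLM
    verify: Callable[[Obj, str, Obj], bool],         # stand-in for the VQA model
    max_gap: float = 50.0,
) -> List[Triplet]:
    triplets = []
    for s, o in filter_pairs(objs, max_gap):
        caption = describe(s, o)               # describe the pair
        for pred in propose(caption, s, o):    # fine- and coarse-grained candidates
            if verify(s, pred, o):             # keep only validated predicates
                triplets.append(Triplet(s.label, pred, o.label))
    return triplets


# Toy stand-ins so the sketch runs without any model (all hypothetical).
def toy_describe(s: Obj, o: Obj) -> str:
    return f"a {s.label} next to a {o.label}"


def toy_propose(caption: str, s: Obj, o: Obj) -> List[str]:
    table = {("man", "horse"): ["riding", "on", "feeding"],
             ("horse", "man"): ["carrying"]}
    return table.get((s.label, o.label), [])


def toy_verify(s: Obj, pred: str, o: Obj) -> bool:
    return pred != "feeding"  # pretend the VQA model rejects this candidate


objects = [
    Obj("man", (100, 50, 200, 250)),
    Obj("horse", (90, 150, 300, 350)),
    Obj("tree", (900, 0, 1000, 400)),
]
graph = generate_scene_graph(objects, toy_describe, toy_propose, toy_verify)
```

Passing each stage as a callable mirrors the modular design the abstract emphasizes: any one stand-in could be swapped for a real model without touching the rest of the loop. In this toy run the distant `tree` never reaches the description stage, and the unverified candidate `feeding` is dropped before the graph is returned.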

