CV

Oct 2, 2023

Less is More: Toward Zero-Shot Local Scene Graph Generation via Foundation Models

arXiv:2310.01356v1

5.03 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for advanced comprehension and reasoning in perception systems for downstream AI tasks, though it appears incremental because it builds on existing foundation models.

The paper tackles the problem of extracting structured visual information by proposing a new task, Local Scene Graph Generation, which abstracts a relevant subset of objects and their relationships rather than every element in an image. The result is a zero-shot framework (ELEGANT) that achieves up to a 24.58% performance improvement over prior methods in closed-set evaluation.

Humans inherently recognize objects via selective visual perception, transform specific regions from the visual field into structured symbolic knowledge, and reason about the relationships among regions based on the allocation of limited attention resources in line with their goals. While this is intuitive for humans, contemporary perception systems falter in extracting structural information due to the intricate cognitive abilities and commonsense knowledge required. To fill this gap, we present a new task called Local Scene Graph Generation. Distinct from the conventional scene graph generation task, which involves generating all objects and relationships in an image, our proposed task aims to abstract pertinent structural information with partial objects and their relationships, in order to boost downstream tasks that demand advanced comprehension and reasoning capabilities. Correspondingly, we introduce zEro-shot Local scEne GrAph geNeraTion (ELEGANT), a framework harnessing foundation models renowned for their powerful perception and commonsense reasoning, where collaboration and information communication among foundation models yield superior outcomes and realize zero-shot local scene graph generation without requiring labeled supervision. Furthermore, we propose a novel open-ended evaluation metric, Entity-level CLIPScorE (ECLIPSE), which surpasses previous closed-set evaluation metrics by transcending their limited label space, offering a broader assessment. Experimental results show that our approach markedly outperforms baselines in the open-ended evaluation setting, and it also achieves a significant performance boost of up to 24.58% over prior methods in the closed-set setting, demonstrating the effectiveness and powerful reasoning ability of our proposed framework.
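
To make the difference between a full scene graph and a local one concrete, here is a minimal Python sketch. It is only a toy illustration of the output shape under stated assumptions: the `Triple` class, the `local_scene_graph` helper, and the example entities are all hypothetical and do not come from the paper. ELEGANT itself generates the local graph zero-shot from an image using foundation models; it does not filter a pre-existing full graph as this sketch does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One (subject, predicate, object) edge of a scene graph (hypothetical type)."""
    subject: str
    predicate: str
    object: str


def local_scene_graph(triples, focus):
    """Keep only the triples that involve at least one focus entity.

    This filtering step is an assumption made for illustration; it stands in
    for the idea of abstracting "partial objects and their relationships".
    """
    focus = set(focus)
    return [t for t in triples if t.subject in focus or t.object in focus]


# A conventional (full) scene graph lists every object and relationship.
full_graph = [
    Triple("man", "riding", "horse"),
    Triple("horse", "standing on", "grass"),
    Triple("tree", "behind", "fence"),
    Triple("man", "wearing", "hat"),
]

# A local scene graph keeps only what is relevant to the region of interest.
local_graph = local_scene_graph(full_graph, focus={"man"})

for t in local_graph:
    print(t.subject, t.predicate, t.object)
```

With the focus set to `man`, only the two triples that mention the man survive, while background relationships such as the tree and the fence are dropped.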
