CV, CL · Jan 15, 2025

Generative Visual Commonsense Answering and Explaining with Generative Scene Graph Constructing

arXiv:2501.09041v1 · 1 citation · h-index: 2
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of reliable visual commonsense reasoning by better exploiting the object relationships present in a scene. It appears incremental because it builds on existing scene graph and reasoning methods.

The paper tackles visual commonsense reasoning by proposing a method that uses image patches and LLMs to construct a location-free scene graph, then answers and explains based on that graph. It reports experiments on scene graph construction and on visual commonsense answering and explaining, with results and ablations that support the framework's effectiveness.

Visual Commonsense Reasoning, regarded as a challenging task in the pursuit of advanced visual scene comprehension, has been used to diagnose the reasoning ability of AI systems. However, reliable reasoning requires a good grasp of the scene's details. Existing work fails to effectively exploit the real-world object relationship information present within the scene, and instead relies too heavily on knowledge memorized during training. Based on these observations, we propose a novel scene-graph-enhanced visual commonsense reasoning generation method named G2, which first uses image patches and LLMs to construct a location-free scene graph, and then answers and explains based on the scene graph's information. We also propose automatic scene graph filtering and selection strategies to absorb valuable scene graph information during training. Extensive experiments are conducted on the tasks and datasets of scene graph construction and of visual commonsense answering and explaining. Experimental results and ablation analysis demonstrate the effectiveness of our proposed framework.
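
To make the pipeline described in the abstract more concrete, here is a minimal Python sketch of what a location-free scene graph (relationship triples without bounding boxes) might look like, and how it could be filtered, selected, and turned into text for an LLM. This is an illustration only and not the paper's implementation: the `Triple` class, the duplicate-removal filter, the word-overlap selection heuristic, the prompt wording, and the example data are all assumptions, since the abstract does not specify G2's actual filtering and selection strategies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One relationship in a location-free scene graph: names only, no boxes."""
    subject: str
    predicate: str
    obj: str

    def as_text(self) -> str:
        return f"{self.subject} {self.predicate} {self.obj}"


def filter_triples(triples):
    """Hypothetical filter: drop duplicates and self-relations, keep first-seen order."""
    seen = set()
    kept = []
    for t in triples:
        if t.subject == t.obj or t in seen:
            continue
        seen.add(t)
        kept.append(t)
    return kept


def select_triples(triples, question, top_k=2):
    """Hypothetical selection: rank triples by word overlap with the question."""
    question_words = set(question.lower().replace("?", "").split())

    def overlap(t):
        return len(question_words & set(t.as_text().lower().split()))

    # sorted() is stable, so tied triples keep their original order.
    return sorted(triples, key=overlap, reverse=True)[:top_k]


def build_prompt(triples, question):
    """Serialise the selected triples into text that an LLM could condition on."""
    facts = "; ".join(t.as_text() for t in triples)
    return f"Scene graph: {facts}\nQuestion: {question}\nAnswer and explain:"


# Made-up example data (assumption, not from the paper's datasets).
raw = [
    Triple("man", "holding", "umbrella"),
    Triple("man", "holding", "umbrella"),   # exact duplicate
    Triple("dog", "next to", "dog"),        # self-relation
    Triple("woman", "sitting on", "bench"),
    Triple("umbrella", "above", "man"),
]
question = "Why is the man holding an umbrella?"
selected = select_triples(filter_triples(raw), question)
prompt = build_prompt(selected, question)
print(prompt)
```

In this sketch the graph carries no coordinates at all, which is one plausible reading of "location-free"; a real system would presumably obtain the triples from a model over image patches rather than from a hand-written list.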

