CV · AI · LG · RO · Mar 20, 2023

3D Concept Learning and Reasoning from Multi-View Images

MIT
arXiv:2303.11327v1 · 84 citations · h-index: 137
Originality: Incremental advance
AI Analysis

The paper addresses the problem of enabling AI systems to reason in 3D from multi-view observations. The advance is incremental: it builds on existing methods and contributes a new benchmark and framework.

The paper tackles 3D visual reasoning from multi-view images by introducing a new large-scale benchmark (3DMV-VQA) with approximately 5k scenes, 600k images, and 50k questions. It also proposes the 3D-CLR framework, which outperforms baseline models by a large margin, though the challenge remains largely unsolved.

Humans are able to accurately reason in 3D by gathering multi-view observations of the surrounding world. Inspired by this insight, we introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA). This dataset is collected by an embodied agent actively moving and capturing RGB images in an environment using the Habitat simulator. In total, it consists of approximately 5k scenes, 600k images, paired with 50k questions. We evaluate various state-of-the-art models for visual reasoning on our benchmark and find that they all perform poorly. We suggest that a principled approach for 3D reasoning from multi-view images should be to infer a compact 3D representation of the world from the multi-view images, which is further grounded on open-vocabulary semantic concepts, and then to execute reasoning on these 3D representations. As the first step towards this approach, we propose a novel 3D concept learning and reasoning (3D-CLR) framework that seamlessly combines these components via neural fields, 2D pre-trained vision-language models, and neural reasoning operators. Experimental results suggest that our framework outperforms baseline models by a large margin, but the challenge remains largely unsolved. We further perform an in-depth analysis of the challenges and highlight potential future directions.
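
The three-stage approach outlined in the abstract (a compact 3D representation, grounding in open-vocabulary concepts, then reasoning operators) can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in and not taken from the paper: the voxel dictionary replaces a neural field, the hand-written 3-dimensional vectors replace vision-language embeddings, and the names `filter_concept`, `count_instances`, and `run_program` are invented for illustration. Counting by connected components is also an assumption chosen for simplicity.

```python
import math

# Hypothetical concept embeddings (stand-in for text embeddings from a 2D VLM).
CONCEPTS = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "sofa": [0.0, 0.0, 1.0],
}

# Toy "compact 3D representation": voxel coordinate -> feature vector.
SCENE = {
    (0, 0, 0): [0.9, 0.1, 0.0],
    (1, 0, 0): [0.8, 0.2, 0.1],
    (5, 0, 2): [0.1, 0.9, 0.0],
    (9, 0, 9): [0.95, 0.0, 0.05],
}

# Offsets of the six face-adjacent neighbours of a voxel.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_concept(scene, concept, threshold=0.8):
    """Ground a concept: keep voxels whose feature matches its embedding."""
    target = CONCEPTS[concept]
    return {pos for pos, feat in scene.items() if cosine(feat, target) >= threshold}


def count_instances(voxels):
    """Count groups of face-adjacent voxels (connected components)."""
    remaining = set(voxels)
    count = 0
    while remaining:
        stack = [remaining.pop()]
        count += 1
        while stack:
            x, y, z = stack.pop()
            for dx, dy, dz in NEIGHBOURS:
                neighbour = (x + dx, y + dy, z + dz)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    stack.append(neighbour)
    return count


def run_program(scene, program):
    """Execute a list of (operator, argument) steps over the scene."""
    state = set(scene)
    for op, arg in program:
        if op == "filter":
            state = filter_concept(scene, arg) & state
        elif op == "count":
            return count_instances(state)
        elif op == "exist":
            return len(state) > 0
        else:
            raise ValueError(f"unknown operator: {op}")
    return state


# "How many chairs are there?" expressed as a two-step program.
print(run_program(SCENE, [("filter", "chair"), ("count", None)]))  # prints 2
```

In this toy scene the two adjacent chair-like voxels form one instance and the distant one forms a second, so the program answers 2. A real system would replace each stand-in with a learned component; the sketch only shows how grounding and reasoning can compose over a shared 3D representation.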

