CV | Feb 10, 2025

Visual Agentic AI for Spatial Reasoning with a Dynamic API

arXiv:2502.06787v2 | 37 citations | h-index: 7 | CVPR
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of 3D spatial reasoning for embodied agents and represents an incremental improvement over methods that rely on a static API.

The paper tackles the problem of 3D spatial reasoning in AI by introducing an agentic program synthesis approach where LLM agents generate a dynamic Pythonic API to solve subproblems, outperforming prior zero-shot models for visual reasoning in 3D.

Visual reasoning -- the ability to interpret the visual world -- is crucial for embodied agents that operate within three-dimensional scenes. Progress in AI has led to vision and language models capable of answering questions from images. However, their performance declines when tasked with 3D spatial reasoning. To tackle the complexity of such reasoning problems, we introduce an agentic program synthesis approach where LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. Our method overcomes limitations of prior approaches that rely on a static, human-defined API, allowing it to handle a wider range of queries. To assess AI capabilities for 3D understanding, we introduce a new benchmark of queries involving multiple steps of grounding and inference. We show that our method outperforms prior zero-shot models for visual reasoning in 3D and empirically validate the effectiveness of our agentic framework for 3D spatial reasoning tasks. Project website: https://glab-caltech.github.io/vadar/
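
To make the idea of a dynamic API concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a fixed, hand-written scene in place of real vision modules and hard-coded strings in place of LLM output; the names `DynamicAPI`, `locate`, `distance`, and `SCENE` are all hypothetical. It shows only the core mechanism: generated function source is added to a shared namespace, and a generated program then calls it.

```python
from typing import Callable, Dict, Tuple

# Hypothetical scene: object name -> (x, y, z) position in metres.
SCENE: Dict[str, Tuple[float, float, float]] = {
    "mug": (0.0, 0.0, 2.0),
    "lamp": (3.0, 4.0, 2.0),
}


def locate(name: str) -> Tuple[float, float, float]:
    """Stand-in for a vision module; here it only looks up a fixed scene."""
    return SCENE[name]


class DynamicAPI:
    """A namespace that grows as new function source is registered."""

    def __init__(self, primitives: Dict[str, Callable]):
        self.namespace = dict(primitives)

    def register(self, source: str) -> None:
        # Execute generated source so its functions join the namespace.
        exec(source, self.namespace)

    def run(self, program: str):
        # By convention in this sketch, a program stores its result in `answer`.
        scope = dict(self.namespace)
        exec(program, scope)
        return scope["answer"]


# Hard-coded stand-ins for what LLM agents might generate (assumption).
GENERATED_FUNCTION = '''
def distance(a, b):
    pa, pb = locate(a), locate(b)
    return sum((u - v) ** 2 for u, v in zip(pa, pb)) ** 0.5
'''
GENERATED_PROGRAM = 'answer = distance("mug", "lamp")'

api = DynamicAPI({"locate": locate})
api.register(GENERATED_FUNCTION)   # the API now contains `distance`
result = api.run(GENERATED_PROGRAM)
print(result)
```

In this toy setup, a static API would offer only `locate`; the dynamic variant lets a reusable helper such as `distance` be added once and then shared by later programs. Running untrusted generated code with `exec` is unsafe outside a sandbox.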

