CV · Mar 12

SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation

arXiv:2603.12238v1 · 38.9 · h-index: 19 · Has Code
Predicted impact: top 10% in CV · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses a limitation of existing methods for digital content creation by enabling unconstrained, open-vocabulary 3D scene synthesis, though it appears incremental because it builds on modern 3D object generation models and VLMs.

The paper tackles the problem of open-vocabulary 3D scene generation from natural language by introducing SceneAssistant, a visual-feedback-driven agent that iteratively refines scenes using Vision-Language Models and atomic operations, resulting in diverse, high-quality scenes as demonstrated by qualitative analysis and quantitative human evaluations.

Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages modern 3D object generation models along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results show that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant
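
The feedback loop described in the abstract (render the scene, let the VLM choose an atomic operation, apply it, repeat) can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`Scene`, `SceneObject`, `apply_action`, `render`, `refine`, `scripted_policy`) is hypothetical, the renderer returns a text summary instead of an image, the VLM is replaced by a scripted stand-in, and treating FocusOn as a camera operation is an assumption. Only the operation names Scale, Rotate, and FocusOn come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class SceneObject:
    """Hypothetical scene object with only the state this sketch needs."""
    name: str
    scale: float = 1.0
    yaw_degrees: float = 0.0


@dataclass
class Scene:
    objects: Dict[str, SceneObject] = field(default_factory=dict)
    camera_target: Optional[str] = None  # object the camera looks at, if any


# An action is (operation name, target object name, numeric argument).
Action = Tuple[str, str, float]
# A policy maps (text prompt, rendered feedback) to the next action, or None to stop.
Policy = Callable[[str, str], Optional[Action]]


def apply_action(scene: Scene, action: Action) -> None:
    """Apply one atomic operation to the scene in place."""
    op, target, value = action
    obj = scene.objects[target]
    if op == "Scale":
        obj.scale *= value
    elif op == "Rotate":
        obj.yaw_degrees = (obj.yaw_degrees + value) % 360.0
    elif op == "FocusOn":
        # Assumption: FocusOn points the camera at an object; value is unused.
        scene.camera_target = target
    else:
        raise ValueError(f"unknown operation: {op}")


def render(scene: Scene) -> str:
    """Stand-in for a renderer: returns a text summary, not an image."""
    parts = [
        f"{o.name}(scale={o.scale:.2f}, yaw={o.yaw_degrees:.0f})"
        for o in scene.objects.values()
    ]
    return f"focus={scene.camera_target}; " + ", ".join(parts)


def refine(scene: Scene, prompt: str, policy: Policy, max_steps: int = 10) -> List[Action]:
    """Render, ask the policy for an action, apply it; stop on None or max_steps."""
    history: List[Action] = []
    for _ in range(max_steps):
        feedback = render(scene)
        action = policy(prompt, feedback)
        if action is None:
            break
        apply_action(scene, action)
        history.append(action)
    return history


def scripted_policy(actions: List[Action]) -> Policy:
    """Stand-in for a VLM: replays a fixed action list, then signals stop."""
    remaining = iter(actions)

    def policy(prompt: str, feedback: str) -> Optional[Action]:
        return next(remaining, None)

    return policy
```

To try it, build a `Scene` with one or more `SceneObject` entries and call `refine` with a `scripted_policy`. In a real system the policy would send the prompt and a rendered image to a VLM and parse its reply into an action. The loop itself would keep the same shape.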

