CV · Apr 12

HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models

arXiv:2604.10772 · 31.3 · h-index: 3
AI Analysis

For embodied AI and VR interaction, this work provides a text-driven method for 3D scene synthesis that improves semantic and physical consistency while enabling real-time editing.

HOG-Layout enables text-driven hierarchical 3D scene generation, optimization, and real-time editing using LLMs and VLMs, achieving more reasonable environments than existing baselines.

3D layout generation and editing play a crucial role in embodied AI and immersive VR interaction. However, manual creation is labor-intensive, while data-driven generation often lacks diversity. The emergence of large models opens new possibilities for 3D scene synthesis. We present HOG-Layout, which enables text-driven hierarchical scene generation, optimization, and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves the semantic consistency and plausibility of scenes through retrieval-augmented generation (RAG), incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation to improve inference and optimization, enabling real-time editing. Experimental results show that HOG-Layout produces more reasonable environments than existing baselines while supporting fast and intuitive scene editing.
