CV · Apr 12

HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models

Haiyan Jiang, Deyu Zhang, Dongdong Weng, Weitao Song, Henry Been-Lirn Duh

arXiv:2604.10772 · 31.3 · h-index: 3

AI Analysis

For embodied AI and VR interaction, this work provides a text-driven method for 3D scene synthesis that improves semantic and physical consistency while enabling real-time editing.

HOG-Layout enables text-driven hierarchical 3D scene generation, optimization, and real-time editing using LLMs and VLMs, achieving more reasonable environments than existing baselines.

3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation requires tedious labor, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for 3D scene synthesis. We present HOG-Layout, which enables text-driven hierarchical scene generation, optimization, and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves the semantic consistency and plausibility of scenes through retrieval-augmented generation (RAG), incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation that speeds up inference and optimization, making real-time editing possible. Experimental results demonstrate that HOG-Layout produces more reasonable environments than existing baselines while supporting fast and intuitive scene editing.
