CV · AI · RO · Mar 12

MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks

arXiv:2603.11554v1 · 17.2 · h-index: 5
Predicted impact: top 48% in CV · last 90 days
Originality: Incremental advance
AI Analysis

MANSION addresses the need for complex, multi-floor environments in which to develop and evaluate embodied AI agents. The advance is incremental, because it builds on existing language-to-3D generation methods.

The paper tackles the generation of building-scale, multi-floor 3D environments for long-horizon robotic tasks. It introduces MANSION, a language-driven framework that produces realistic, navigable structures, and releases a dataset of over 1,000 buildings. Benchmarks in these environments show that state-of-the-art agents degrade sharply.

Real-world robotic tasks are long-horizon and often span multiple floors, demanding rich spatial reasoning. However, existing embodied benchmarks are largely confined to single-floor in-house environments, failing to reflect the complexity of real-world tasks. We introduce MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments. Being aware of vertical structural constraints, MANSION generates realistic, navigable whole-building structures with diverse, human-friendly scenes, enabling the development and evaluation of cross-floor long-horizon tasks. Building on this framework, we release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, alongside a Task-Semantic Scene Editing Agent that customizes these environments using open-vocabulary commands to meet specific user needs. Benchmarking reveals that state-of-the-art agents degrade sharply in our settings, establishing MANSION as a critical testbed for the next generation of spatial reasoning and planning.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
