From Prompts to Worlds: How Users Iterate, Explore, and Make Sense of AI-Generated 3D Environments
For designers of text-to-3D systems, this work identifies key usability challenges and proposes design implications for hybrid input and transparent feedback.
This paper presents the first empirical study of a commercial text-to-3D platform, finding that users can convey semantic intent but struggle with spatial specification, leading to episodic presence and iterative breakdowns due to interaction barriers rather than user limitations.
Text-to-3D generative AI systems create navigable environments from natural language prompts, but unlike text-to-image generation, evaluation requires embodied exploration of spatial coherence, scale, and navigability. We present the first empirical study of a commercial text-to-3D platform, combining think-aloud protocols, behavioral observation, and validated measures of usability, presence, and engagement. We report three findings. First, asymmetric expressibility: users readily convey semantic intent (themes, atmosphere) but struggle to specify spatial structure (layout, scale), reflecting a language-to-space limitation rather than a skill deficit. Second, episodic presence: immersion arises when expectations align with outputs but does not accumulate into sustained place illusion. Third, structural iteration breakdowns: refinement fails due to interaction barriers - poor discoverability, opaque feedback, and high temporal costs - rather than user limitations. Together, these dynamics form a reinforcing cycle in which spatial mismatches persist, producing episodic presence and ongoing sensemaking. We reframe text-to-3D interaction as negotiated meaning-making rather than linear prompting, and argue that effective systems require hybrid input modalities, transparent feedback, and low-cost iteration.