RO · CV · Apr 9, 2025

ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis

arXiv:2504.06553v3 · 8 citations · h-index: 9 · CVPR
Originality: Highly original
AI Analysis

This paper addresses the problem of enabling AI systems to interpret and execute complex instructions in physical environments. It builds incrementally on existing work in scene reconstruction and understanding.

The paper tackles the challenge of grounding abstract, high-level instructions to 3D scenes. It proposes ASHiTA, a framework that generates task hierarchies grounded in 3D scene graphs. ASHiTA performs significantly better than LLM baselines at task breakdown and achieves grounding performance comparable to state-of-the-art methods.

While recent work in scene reconstruction and understanding has made strides in grounding natural language to physical 3D environments, it is still challenging to ground abstract, high-level instructions to a 3D scene. High-level instructions might not explicitly invoke semantic elements in the scene, and even the process of breaking a high-level task into a set of more concrete subtasks, a process called hierarchical task analysis, is environment-dependent. In this work, we propose ASHiTA, the first framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks. ASHiTA alternates LLM-assisted hierarchical task analysis, to generate the task breakdown, with task-driven 3D scene graph construction to generate a suitable representation of the environment. Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks and is additionally able to achieve grounding performance comparable to state-of-the-art methods.
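The alternation described in the abstract can be illustrated with a toy sketch. This is not ASHiTA's implementation: the names `SceneNode`, `Subtask`, `TaskHierarchy`, `propose_subtasks`, `ground`, and `build_hierarchy` are all hypothetical, the LLM step is replaced by a hard-coded lookup table, and grounding is reduced to substring matching on object labels. It only shows the shape of the loop, in which task breakdown and scene representation refine each other.

```python
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    """One object in a (flattened) scene graph. Hypothetical structure."""
    node_id: str
    label: str


@dataclass
class Subtask:
    """A concrete subtask and the scene nodes it is grounded to."""
    description: str
    grounded_node_ids: list = field(default_factory=list)


@dataclass
class TaskHierarchy:
    """A high-level task broken into grounded subtasks."""
    task: str
    subtasks: list = field(default_factory=list)


def propose_subtasks(task, visible_labels):
    """Stand-in for the LLM-assisted task analysis step.

    Assumption: a fixed rule table replaces the model call. Candidates that
    mention no object present in the scene are dropped, which makes the
    breakdown environment-dependent.
    """
    rules = {"make coffee": ["get mug", "use coffee machine", "grind beans"]}
    candidates = rules.get(task, [task])
    return [c for c in candidates if any(lbl in c for lbl in visible_labels)]


def ground(subtask_text, nodes):
    """Stand-in for grounding: match object labels inside the subtask text."""
    return [n.node_id for n in nodes if n.label in subtask_text]


def build_hierarchy(task, nodes, rounds=2):
    """Alternate between task breakdown and task-driven scene pruning."""
    active = list(nodes)
    hierarchy = TaskHierarchy(task)
    for _ in range(rounds):
        labels = [n.label for n in active]
        texts = propose_subtasks(task, labels)
        hierarchy.subtasks = [Subtask(t, ground(t, active)) for t in texts]
        # Keep only the nodes that some subtask actually uses.
        used = {i for s in hierarchy.subtasks for i in s.grounded_node_ids}
        active = [n for n in active if n.node_id in used]
    return hierarchy


nodes = [
    SceneNode("n1", "mug"),
    SceneNode("n2", "coffee machine"),
    SceneNode("n3", "sofa"),
]
hierarchy = build_hierarchy("make coffee", nodes)
for subtask in hierarchy.subtasks:
    print(subtask.description, subtask.grounded_node_ids)
```

In this toy scene the candidate "grind beans" is dropped because no matching object exists, and the sofa is pruned because no subtask refers to it. That mirrors, in a very reduced form, the idea that the task breakdown depends on the environment and the scene representation depends on the task.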
