RO · AI · Sep 19, 2025

Compose by Focus: Scene Graph-based Atomic Skills

arXiv:2509.16053v1 · 3 citations · h-index: 3
Originality: Highly original
AI Analysis

This work addresses the problem of compositional generalization for generalist robots, offering an incremental improvement by mitigating sensitivity to irrelevant scene variations.

The paper tackles robust execution of atomic robot skills under distribution shifts caused by scene composition. It introduces a scene graph-based representation and a skill learning framework that integrates graph neural networks with diffusion-based imitation learning, and it reports substantially higher success rates than state-of-the-art baselines in both simulation and real-world manipulation tasks.

A key requirement for generalist robots is compositional generalization - the ability to combine atomic skills to solve complex, long-horizon tasks. While prior work has primarily focused on synthesizing a planner that sequences pre-learned skills, robust execution of the individual skills themselves remains challenging, as visuomotor policies often fail under distribution shifts induced by scene composition. To address this, we introduce a scene graph-based representation that focuses on task-relevant objects and relations, thereby mitigating sensitivity to irrelevant variation. Building on this idea, we develop a scene-graph skill learning framework that integrates graph neural networks with diffusion-based imitation learning, and further combine "focused" scene-graph skills with a vision-language model (VLM) based task planner. Experiments in both simulation and real-world manipulation tasks demonstrate substantially higher success rates than state-of-the-art baselines, highlighting improved robustness and compositional generalization in long-horizon tasks.
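The idea of a representation that "focuses on task-relevant objects and relations" can be illustrated with a minimal sketch. This is not the paper's implementation: the graph neural network encoder, the diffusion policy, and the VLM planner are all omitted, and every name here (`Relation`, `SceneGraph`, `focus`, the object and predicate labels) is a hypothetical chosen for illustration. It only shows, under those assumptions, why pruning a scene graph to the task-relevant subgraph could make a skill's input insensitive to distractor objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    subject: str    # name of the first object
    predicate: str  # illustrative label such as "on" or "left_of"
    target: str     # name of the second object


@dataclass(frozen=True)
class SceneGraph:
    objects: frozenset    # object names (graph nodes)
    relations: frozenset  # Relation instances (graph edges)


def focus(graph: SceneGraph, relevant: set) -> SceneGraph:
    """Keep only task-relevant objects and the relations among them."""
    kept_objects = frozenset(o for o in graph.objects if o in relevant)
    kept_relations = frozenset(
        r for r in graph.relations
        if r.subject in kept_objects and r.target in kept_objects
    )
    return SceneGraph(kept_objects, kept_relations)


# Two scenes that differ only in distractor objects.
clean = SceneGraph(
    objects=frozenset({"cup", "plate"}),
    relations=frozenset({Relation("cup", "on", "plate")}),
)
cluttered = SceneGraph(
    objects=frozenset({"cup", "plate", "banana", "stapler"}),
    relations=frozenset({
        Relation("cup", "on", "plate"),
        Relation("banana", "left_of", "cup"),
        Relation("stapler", "on", "plate"),
    }),
)

# Assumed to come from a planner or task specification (hypothetical).
task_objects = {"cup", "plate"}

# After focusing, both scenes yield the same graph for the skill to consume.
same_input = focus(clean, task_objects) == focus(cluttered, task_objects)
```

In this toy setting the cluttered and clean scenes collapse to the same focused graph, so any downstream policy conditioned on it would see identical input regardless of the distractors. A learned system would additionally need to extract the graph from images and attach features to nodes and edges, which this sketch does not attempt.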

