RO | CV | Sep 23, 2025

Agentic Scene Policies: Unifying Space, Semantics, and Affordances for Robot Action

arXiv:2509.19571v1 | 1 citation | h-index: 3
Originality: Incremental advance
AI Analysis

The paper targets a known weakness of language-conditioned robot policies: complex instructions and previously unseen scenes. It offers a more capable policy for manipulation and navigation tasks, though the advance appears incremental because it builds on existing scene representation methods.

The paper addresses the execution of open-ended natural language queries in robotics. It introduces Agentic Scene Policies (ASP), a framework that queries an explicit scene representation for semantic, spatial, and affordance information. The authors report improved performance over vision-language-action models (VLAs) on tabletop manipulation and room-level navigation tasks.

Executing open-ended natural language queries is a core problem in robotics. While recent advances in imitation learning and vision-language-action models (VLAs) have enabled promising end-to-end policies, these models struggle when faced with complex instructions and new scenes. An alternative is to design an explicit scene representation as a queryable interface between the robot and the world, using query results to guide downstream motion planning. In this work, we present Agentic Scene Policies (ASP), an agentic framework that leverages the advanced semantic, spatial, and affordance-based querying capabilities of modern scene representations to implement a capable language-conditioned robot policy. ASP can execute open-vocabulary queries in a zero-shot manner by explicitly reasoning about object affordances in the case of more complex skills. Through extensive experiments, we compare ASP with VLAs on tabletop manipulation problems and showcase how ASP can tackle room-level queries through affordance-guided navigation and a scaled-up scene representation. (Project page: https://montrealrobotics.ca/agentic-scene-policies.github.io/)
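
To make the idea of a scene representation as a "queryable interface" concrete, the following is a minimal illustrative sketch, not the authors' implementation. Every name in it (`SceneObject`, `SceneRepresentation`, `query_semantic`, `query_spatial`, `query_affordance`, `plan_action`) is hypothetical. Exact label matching stands in for open-vocabulary semantic retrieval, and a plain function stands in for the agent that would decide which queries to issue.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field

Point = tuple  # (x, y, z) in metres; a simplifying assumption


@dataclass
class SceneObject:
    """Hypothetical scene entry: a name, labels, a position, and affordance points."""
    name: str
    labels: set
    position: Point
    affordances: dict = field(default_factory=dict)  # affordance name -> 3D point


class SceneRepresentation:
    """Toy stand-in for an explicit, queryable scene representation."""

    def __init__(self, objects):
        self.objects = list(objects)

    def query_semantic(self, label):
        # Assumption: exact label match replaces open-vocabulary retrieval.
        return [o for o in self.objects if label in o.labels]

    def query_spatial(self, anchor, radius):
        # Objects within `radius` of the anchor object, excluding the anchor itself.
        return [
            o for o in self.objects
            if o is not anchor and math.dist(o.position, anchor.position) <= radius
        ]

    def query_affordance(self, obj, affordance):
        # Returns the interaction point for the affordance, or None if absent.
        return obj.affordances.get(affordance)


def plan_action(scene, target_label, affordance):
    """Chain a semantic query and an affordance query into a motion-planning target."""
    for obj in scene.query_semantic(target_label):
        point = scene.query_affordance(obj, affordance)
        if point is not None:
            return {"object": obj.name, "affordance": affordance, "target_point": point}
    return None  # no object with that label offers the requested affordance


scene = SceneRepresentation([
    SceneObject("drawer_1", {"drawer", "furniture"}, (0.5, 0.0, 0.3), {"pull": (0.45, 0.0, 0.3)}),
    SceneObject("mug_1", {"mug", "cup"}, (0.6, 0.1, 0.4), {"grasp": (0.6, 0.1, 0.45)}),
    SceneObject("lamp_1", {"lamp"}, (3.0, 2.0, 1.0)),
])
result = plan_action(scene, "drawer", "pull")
```

Here `result` holds the target that a downstream motion planner would consume. In a system like the one the abstract describes, a language model would presumably choose and sequence such queries from the instruction, which this sketch leaves out.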

