RO · AI · CV · Feb 18, 2025

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Peking U · Stanford
arXiv:2502.13143v2 · 47 citations · h-index: 18
Originality: Highly original
AI Analysis

This work addresses the limitations of traditional pose representations in robotics and spatial reasoning by enabling more generalizable, semantically grounded orientation representations for object manipulation.

The paper tackles object orientation in 6-DoF fine-grained manipulation by introducing semantic orientation, which uses natural language to define orientations without pre-defined frames, and reports zero-shot success rates of 48.7% on Open6DOR and 74.9% on SIMPLER-Env.

While spatial reasoning has made progress in object localization relationships, it often overlooks object orientation, a key factor in 6-DoF fine-grained manipulation. Traditional pose representations rely on pre-defined frames or templates, limiting generalization and semantic grounding. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a cup). To support this, we construct OrienText300K, a large-scale dataset of 3D objects annotated with semantic orientations, and develop PointSO, a general model for zero-shot semantic orientation prediction. By integrating semantic orientation into VLM agents, our SoFar framework enables 6-DoF spatial reasoning and generates robotic actions. Extensive experiments demonstrate the effectiveness and generalization of SoFar, e.g., a zero-shot success rate of 48.7% on Open6DOR and a zero-shot success rate of 74.9% on SIMPLER-Env.
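To make the idea of a semantic orientation concrete, here is a minimal sketch that assumes a semantic orientation can be treated as a unit direction vector attached to a language phrase (for example, "handle"). It is not the paper's PointSO model or the SoFar pipeline. The function names, the example direction, and the target direction are all hypothetical and chosen only for illustration. Given a current direction and a desired one, the code computes a rotation that aligns them using Rodrigues' formula.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("zero-length direction")
    return tuple(x / n for x in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def apply(R, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return tuple(dot(row, v) for row in R)

def rotation_aligning(src, dst):
    """Return a 3x3 rotation matrix that maps direction src onto direction dst."""
    a, b = normalize(src), normalize(dst)
    c = dot(a, b)            # cosine of the angle between a and b
    v = cross(a, b)          # rotation axis scaled by the sine of the angle
    if dot(v, v) < 1e-12:    # directions are parallel or antiparallel
        if c > 0:
            return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
        # Antiparallel: rotate by pi about any axis k perpendicular to a.
        helper = (1.0, 0.0, 0.0) if abs(a[0]) < 0.9 else (0.0, 1.0, 0.0)
        k = normalize(cross(a, helper))
        return [[2.0 * k[i] * k[j] - (1.0 if i == j else 0.0)
                 for j in range(3)] for i in range(3)]
    # Rodrigues: R = I + K + K^2 / (1 + c), where K is the skew matrix of v.
    K = [[0.0, -v[2], v[1]],
         [v[2], 0.0, -v[0]],
         [-v[1], v[0], 0.0]]
    K2 = [[sum(K[i][m] * K[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    f = 1.0 / (1.0 + c)
    return [[(1.0 if i == j else 0.0) + K[i][j] + f * K2[i][j]
             for j in range(3)] for i in range(3)]

# Hypothetical example: the "handle" direction of a cup currently points
# along +x, and the instruction asks for it to point along +y.
semantic_orientations = {"handle": (1.0, 0.0, 0.0)}  # made-up value
target_direction = (0.0, 1.0, 0.0)                   # made-up value
R = rotation_aligning(semantic_orientations["handle"], target_direction)
rotated_handle = apply(R, semantic_orientations["handle"])
```

Aligning a single direction fixes only two of the three rotational degrees of freedom; the spin about that direction stays free, so a full orientation target would need at least one more constraint.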

Code Implementations: 2 repos
