RO · LG · Jun 1

The Lie We Tell: Correcting the Euclidean Fallacy in Vision Language Action Policies via Score Matching on Tangent Space

arXiv:2606.01847
84.5
AI Analysis

This work addresses a fundamental geometric error in robotic manipulation policies, improving performance for tasks requiring precise SE(3) pose estimation.

Diffusion-based Vision-Language-Action policies suffer from the Euclidean Fallacy: they represent SE(3) poses as flat vectors, which causes manifold drift and broken equivariance. The proposed Lie Diffuser Actor operates intrinsically on SE(3), achieving a 7.3% improvement in average task length on CALVIN ABC→D and outperforming the baseline on the majority of real-robot tasks.

Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the $\textbf{Euclidean Fallacy}$: representing SE(3) poses as flat $\mathbb{R}^{12}$ vectors. This approximation induces (1) manifold drift violating SO(3) constraints, (2) broken equivariance under coordinate transformations, and (3) non-geodesic trajectories with excessive kinematic cost. We introduce $\textbf{Lie Diffuser Actor (LDA)}$, a diffusion framework operating intrinsically on SE(3). Our method injects noise through left-invariant SDEs, predicts scores in the tangent space, and retracts samples via the exponential map. This formulation eliminates manifold drift by construction while guaranteeing coordinate-frame equivariance and geodesic optimality. On CALVIN ABC$\rightarrow$D, LDA improves average task length from $3.27$ to $3.51$ ($+7.3\%$). We further validate our method on a real robot, where it outperforms the baseline on the majority of tasks.
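The manifold-drift claim can be illustrated with a small NumPy sketch. It is not the authors' implementation: it covers only the rotation part SO(3) rather than full SE(3), performs a single noise step rather than an SDE, and the helper names (`hat`, `exp_so3`, `orthogonality_error`) and the noise scale `sigma` are hypothetical choices made for illustration. It compares adding Gaussian noise directly to the matrix entries with sampling noise in the tangent space and retracting through the exponential map.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix in so(3)."""
    x, y, z = w
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def exp_so3(w):
    """Rodrigues' formula: exponential map from so(3) to SO(3)."""
    theta = np.linalg.norm(w)
    K = hat(w)
    if theta < 1e-8:
        # Second-order Taylor expansion near the identity
        return np.eye(3) + K + 0.5 * K @ K
    return (np.eye(3)
            + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta**2 * K @ K)

def orthogonality_error(R):
    """Frobenius norm of R^T R - I; zero for a valid rotation."""
    return np.linalg.norm(R.T @ R - np.eye(3))

rng = np.random.default_rng(0)
sigma = 0.1  # hypothetical noise scale, chosen for illustration
R = exp_so3(np.array([0.3, -0.2, 0.5]))  # arbitrary starting rotation

# Euclidean perturbation: add noise to the 9 matrix entries directly.
R_flat = R + sigma * rng.standard_normal((3, 3))

# Tangent-space perturbation: sample in so(3), retract with exp, and
# compose on the right (the flow of a left-invariant field, R <- R exp(xi)).
xi = sigma * rng.standard_normal(3)
R_lie = R @ exp_so3(xi)

print("flat error:", orthogonality_error(R_flat))
print("lie error: ", orthogonality_error(R_lie))
```

The flat update leaves the rotation group after one step, so a Euclidean pipeline needs a projection back to SO(3), whereas the retracted sample stays orthogonal up to floating-point error.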