CV | Oct 15, 2025

DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning

arXiv:2510.13375v131 citations | h-index: 8 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses spatial reasoning deficiencies in VLA models for robotic manipulation tasks and represents an incremental improvement.

It targets the limited spatial reasoning of Vision-Language-Action models. DepthVLA reports 78.5% vs. 65.0% progress on real-world tasks and 74.8% vs. 58.8% in the Simpler simulator.
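To make the size of these gaps concrete, the short sketch below converts each reported pair into an absolute gain (percentage points) and a relative gain. The helper name `gain` and the `results` dictionary are illustrative, not from the paper; only the four percentages come from the summary above.

```python
def gain(ours, baseline):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = round(ours - baseline, 1)
    relative = round(100.0 * (ours - baseline) / baseline, 1)
    return absolute, relative


# (DepthVLA, compared baseline) pairs as reported in the summary.
results = {
    "real-world": (78.5, 65.0),
    "Simpler": (74.8, 58.8),
}

for name, (ours, base) in results.items():
    abs_pp, rel = gain(ours, base)
    print(f"{name}: +{abs_pp} points ({rel}% relative)")
```

The relative figure is computed against the baseline score, so it overstates nothing about absolute task success; it only shows how large the gap is compared with where the baseline started.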

Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning because of the limited spatial understanding they inherit from Vision-Language Models (VLMs). Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attention, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% progress in real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator. Our code will be made publicly available.
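The mixture-of-transformers idea in the abstract (separate experts, shared attention) can be illustrated with a toy layer. This is a minimal sketch under loud assumptions, not the authors' implementation: it uses a single attention head, no masking, no residuals, normalization, or feed-forward blocks, random weights, and made-up sizes. The expert names `vlm`, `depth`, and `action` and every function name are hypothetical. What it shows is only the routing pattern: each expert projects its own tokens with its own weights, attention runs once over the concatenated sequence, and each output row is projected back with its owning expert's weights.

```python
import math
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def shared_attention_layer(streams, weights):
    """One toy attention layer with per-expert weights and joint attention.

    streams: expert name -> token matrix (n_tokens x dim)
    weights: expert name -> {"q", "k", "v", "o"} matrices (dim x dim)
    """
    q, k, v, owner = [], [], [], []
    for name, tokens in streams.items():
        w = weights[name]
        q += matmul(tokens, w["q"])  # expert-specific projections
        k += matmul(tokens, w["k"])
        v += matmul(tokens, w["v"])
        owner += [name] * len(tokens)

    dim = len(q[0])
    # Every token attends to every token of every expert (assumption: no mask).
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim) for kj in k]
              for qi in q]
    probs = [softmax(row) for row in scores]
    mixed = matmul(probs, v)

    # Route each mixed row back through its own expert's output projection.
    out = {name: [] for name in streams}
    for row, name in zip(mixed, owner):
        out[name].append(matmul([row], weights[name]["o"])[0])
    return out, probs


rng = random.Random(0)
dim = 4
experts = ("vlm", "depth", "action")
weights = {e: {p: random_matrix(dim, dim, rng) for p in "qkvo"} for e in experts}
streams = {
    "vlm": random_matrix(3, dim, rng),     # e.g. image/text tokens
    "depth": random_matrix(2, dim, rng),   # e.g. depth tokens
    "action": random_matrix(1, dim, rng),  # e.g. an action query token
}
out, probs = shared_attention_layer(streams, weights)
print({name: len(rows) for name, rows in out.items()})
```

In this toy version the action token's attention row spans all six tokens, so depth tokens can influence the action output directly, which is the intuition behind sharing attention across the three experts. Whether the actual model restricts attention between experts is not stated in the abstract.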

