AIRO
Sep 6, 2025

OccVLA: Vision-Language-Action Model with Implicit 3D Occupancy Supervision

arXiv:2509.05578v1
13 citations
h-index: 5
Originality: Highly original
AI Analysis

This paper addresses the problem of enabling scalable and interpretable 3D reasoning for autonomous driving systems without expensive manual annotations.

The paper tackles the lack of robust 3D spatial understanding in multimodal large language models for autonomous driving. It proposes OccVLA, which uses 3D occupancy representations as implicit supervision, and reports state-of-the-art trajectory planning results on the nuScenes benchmark along with superior performance on 3D visual question-answering tasks.

Multimodal large language models (MLLMs) have shown strong vision-language reasoning abilities but still lack robust 3D spatial understanding, which is critical for autonomous driving. This limitation stems from two key challenges: (1) the difficulty of constructing accessible yet effective 3D representations without expensive manual annotations, and (2) the loss of fine-grained spatial details in VLMs due to the absence of large-scale 3D vision-language pretraining. To address these challenges, we propose OccVLA, a novel framework that integrates 3D occupancy representations into a unified multimodal reasoning process. Unlike prior approaches that rely on explicit 3D inputs, OccVLA treats dense 3D occupancy as both a predictive output and a supervisory signal, enabling the model to learn fine-grained spatial structures directly from 2D visual inputs. The occupancy predictions are regarded as implicit reasoning processes and can be skipped during inference without performance degradation, thereby adding no extra computational overhead. OccVLA achieves state-of-the-art results on the nuScenes benchmark for trajectory planning and demonstrates superior performance on 3D visual question-answering tasks, offering a scalable, interpretable, and fully vision-based solution for autonomous driving.
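
The idea of treating occupancy as an auxiliary output that supervises training but is skipped at inference can be illustrated with a toy sketch. This is not the authors' architecture. The class `ToyOccSupervisedPlanner`, the `training_loss` function, the `occ_weight` parameter, and all dimensions are hypothetical names and values chosen for illustration, and the sketch omits gradient updates entirely. It only shows the structural point: the planning head reads shared features, the occupancy head is a separate branch scored during training, and leaving that branch out at inference does not change the planning output.

```python
import math
import random


class ToyOccSupervisedPlanner:
    """Hypothetical toy model: shared features, a planning head, and an
    auxiliary occupancy head. Not the OccVLA architecture."""

    def __init__(self, in_dim=4, feat_dim=3, n_voxels=5, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w_enc = rand(feat_dim, in_dim)    # shared encoder weights
        self.w_plan = rand(2, feat_dim)        # planning head: one (x, y) waypoint
        self.w_occ = rand(n_voxels, feat_dim)  # auxiliary head: per-voxel logits

    @staticmethod
    def _matvec(w, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

    def forward(self, image_feats, predict_occupancy=False):
        # Shared features computed from 2D visual inputs only.
        feats = [math.tanh(v) for v in self._matvec(self.w_enc, image_feats)]
        plan = self._matvec(self.w_plan, feats)
        occ = None
        if predict_occupancy:
            # Auxiliary branch: occupancy probabilities, needed only for the loss.
            occ = [1.0 / (1.0 + math.exp(-z)) for z in self._matvec(self.w_occ, feats)]
        return plan, occ


def training_loss(model, image_feats, gt_plan, gt_occ, occ_weight=0.5):
    """Planning loss plus a weighted occupancy loss (assumed form: MSE + BCE)."""
    plan, occ = model.forward(image_feats, predict_occupancy=True)
    plan_loss = sum((p - g) ** 2 for p, g in zip(plan, gt_plan)) / len(gt_plan)
    eps = 1e-9
    occ_loss = -sum(
        g * math.log(o + eps) + (1 - g) * math.log(1 - o + eps)
        for o, g in zip(occ, gt_occ)
    ) / len(gt_occ)
    return plan_loss + occ_weight * occ_loss


model = ToyOccSupervisedPlanner()
x = [0.2, -0.1, 0.4, 0.3]

# Training: the occupancy branch is evaluated and contributes to the loss.
loss = training_loss(model, x, gt_plan=[1.0, 0.0], gt_occ=[1, 0, 0, 1, 0])
plan_train, occ = model.forward(x, predict_occupancy=True)

# Inference: the occupancy branch is skipped; the plan is unchanged.
plan_infer, skipped = model.forward(x)
```

In a trained model the occupancy loss would shape the shared features through backpropagation, which is where the spatial supervision would take effect. The sketch does not demonstrate that benefit, only that the planning path has no dependency on the occupancy branch at inference time.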

