CV | AI | RO | Dec 18, 2025

DVGT: Driving Visual Geometry Transformer

arXiv:2512.16919v1 | 10 citations | h-index: 22 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for flexible and accurate 3D scene reconstruction in autonomous driving, though it appears incremental because it applies existing transformer architectures to a specific domain.

The paper tackles dense 3D geometry perception for autonomous driving by proposing DVGT, a transformer-based model that reconstructs metric-scaled 3D point maps from unposed multi-view images without relying on camera parameters. The authors report that it significantly outperforms existing models on a mixture of driving datasets.

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there is still no driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate system of the first frame and the ego pose for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.
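The alternating attention pattern named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names (`attention_groups`, `self_attention`, `alternating_block`) are hypothetical, the attention uses identity projections instead of learned weights, and the exact token grouping is an assumption. Here intra-view attention mixes the patches of one image, cross-view attention mixes all tokens of one frame, and cross-frame attention mixes the same view and patch position over time.

```python
import math
from itertools import product


def attention_groups(T, V, N, mode):
    """Return groups of (frame, view, patch) ids that attend to each other.

    The grouping per mode is an assumption made for illustration only.
    """
    if mode == "intra_view":   # patches inside a single image
        return [[(t, v, n) for n in range(N)]
                for t, v in product(range(T), range(V))]
    if mode == "cross_view":   # every token of every view within one frame
        return [[(t, v, n) for v, n in product(range(V), range(N))]
                for t in range(T)]
    if mode == "cross_frame":  # same view and patch position across frames
        return [[(t, v, n) for t in range(T)]
                for v, n in product(range(V), range(N))]
    raise ValueError(f"unknown mode: {mode}")


def self_attention(vectors):
    """Softmax self-attention with identity Q/K/V projections (no learned weights)."""
    dim = len(vectors[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in vectors:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in vectors]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[c] for w, value in zip(weights, vectors)) / total
            for c in range(dim)
        ])
    return outputs


def alternating_block(tokens, T, V, N):
    """Apply the three attention types in sequence, each with a residual connection."""
    for mode in ("intra_view", "cross_view", "cross_frame"):
        updated = {}
        for group in attention_groups(T, V, N, mode):
            attended = self_attention([tokens[key] for key in group])
            for key, vec in zip(group, attended):
                updated[key] = [x + y for x, y in zip(tokens[key], vec)]
        tokens = updated
    return tokens


# Toy input: 2 frames, 3 cameras, 4 patch tokens per image, 8 channels.
T, V, N, C = 2, 3, 4, 8
tokens = {
    (t, v, n): [math.sin(t + 2 * v + 3 * n + c) for c in range(C)]
    for t, v, n in product(range(T), range(V), range(N))
}
out = alternating_block(tokens, T, V, N)
```

With these toy sizes, the intra-view step forms 6 groups of 4 tokens, the cross-view step 2 groups of 12, and the cross-frame step 12 groups of 2. Factorizing attention this way keeps each attention call small compared with attending over all T·V·N tokens at once, which is one plausible motivation for the alternating design.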

