CV · Oct 2, 2025

Visual Odometry with Transformers

arXiv:2510.03348v2 · 1 citation · h-index: 67
AI Analysis

This work addresses speed and scalability issues in visual odometry for robotics and autonomous systems. It is an incremental improvement that streamlines the pipeline with an end-to-end transformer-based method.

The paper tackles the slow, hard-to-scale nature of classical visual odometry by introducing Visual Odometry Transformer (VoT), which formulates monocular visual odometry as a direct relative pose regression problem. VoT is reported to run up to 4 times faster than traditional approaches and 10 times faster than recent 3D foundation models, with competitive or better performance.

Despite the rapid development of large 3D models, classical optimization-based approaches dominate the field of visual odometry (VO). Thus, current approaches to VO rely heavily on camera parameters and many handcrafted components, most of which involve complex bundle adjustment and feature-matching processes. Although this reliance is largely disregarded in the literature, we find it problematic in terms of both (1) speed, as performing bundle adjustment requires a significant amount of time, and (2) scalability, as handcrafted components struggle to learn from large-scale training data. In this work, we introduce a simple yet efficient architecture, Visual Odometry Transformer (VoT), that formulates monocular visual odometry as a direct relative pose regression problem. Our approach streamlines the monocular visual odometry pipeline in an end-to-end manner, effectively eliminating the need for handcrafted components such as bundle adjustment, feature matching, or camera calibration. We show that VoT is up to 4 times faster than traditional approaches, yet with competitive or better performance. Compared to recent 3D foundation models, VoT runs 10 times faster with strong scaling behavior in terms of both model sizes and training data. Moreover, VoT generalizes well in both low-data regimes and previously unseen scenarios, reducing the gap between optimization-based and end-to-end approaches.
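To make "direct relative pose regression" concrete, the sketch below shows how per-frame relative poses could be chained into a camera trajectory. It is a minimal illustration, not the paper's implementation. The abstract does not specify the pose parameterization or frame convention, so the 4x4 homogeneous matrices, the planar (yaw-only) rotation, and the helper names `make_pose`, `accumulate`, and `position` are all assumptions made for this example. The regressor itself is not shown; hard-coded relative poses stand in for its outputs.

```python
import math


def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def make_pose(yaw, tx, ty, tz):
    """Hypothetical helper: 4x4 rigid transform with a rotation of `yaw`
    radians about the z-axis and translation (tx, ty, tz)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def accumulate(relative_poses):
    """Chain relative poses into absolute poses, starting from identity.

    Assumed convention: each relative pose maps points from the current
    camera frame into the previous camera frame, so the absolute pose is
    updated by right-multiplication.
    """
    pose = make_pose(0.0, 0.0, 0.0, 0.0)  # identity: first frame is the origin
    trajectory = [pose]
    for rel in relative_poses:
        pose = matmul4(pose, rel)
        trajectory.append(pose)
    return trajectory


def position(pose):
    """Extract the camera position (translation column) from a 4x4 pose."""
    return (pose[0][3], pose[1][3], pose[2][3])


# Stand-in for regressed outputs: four identical steps, each moving one unit
# along the local x-axis and turning 90 degrees to the left.
relative_poses = [make_pose(math.pi / 2, 1.0, 0.0, 0.0)] * 4
for pose in accumulate(relative_poses):
    print(tuple(round(v, 3) for v in position(pose)))
```

The four steps trace a unit square, so the final position coincides with the start up to floating-point error. Because each absolute pose is the product of all earlier relative poses, any error in a single regressed pose carries forward into every later frame.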
