CV, LG, RO | Jun 13, 2025

Efficient Multi-Camera Tokenization with Triplanes for End-to-End Driving

arXiv:2506.12251v2 | 5 citations | h-index: 32 | IEEE Robot Autom Lett
Originality: Incremental advance
AI Analysis

The paper addresses the need to run transformer-based driving policies in real time on embedded hardware in autonomous vehicles. The advance is incremental, since it builds on existing triplane and tokenization methods.

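The paper tackles the problem of efficiently tokenizing multi-camera sensor data for end-to-end autonomous driving policies. Compared with image patch-based tokenization, it reports up to 72% fewer tokens and up to 50% faster policy inference, with the same open-loop motion planning accuracy and improved offroad rates in closed-loop simulation.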

Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures, owing to their scalability and potential to leverage internet-scale pretraining for generalization. Accordingly, tokenizing sensor data efficiently is paramount to ensuring the real-time feasibility of such architectures on embedded hardware. To this end, we present an efficient triplane-based multi-camera tokenization strategy that leverages recent advances in 3D neural reconstruction and rendering to produce sensor tokens that are agnostic to the number of input cameras and their resolution, while explicitly accounting for their geometry around an AV. Experiments on a large-scale AV dataset and state-of-the-art neural simulator demonstrate that our approach yields significant savings over current image patch-based tokenization strategies, producing up to 72% fewer tokens, resulting in up to 50% faster policy inference while achieving the same open-loop motion planning accuracy and improved offroad rates in closed-loop driving simulations.
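
To make the scaling claim in the abstract concrete, the following is a minimal sketch of how token counts could behave under the two strategies. It is not the paper's implementation: the function names, image sizes, patch sizes, and triplane grid dimensions are all hypothetical values chosen for illustration, and the resulting numbers are not the paper's reported figures. The only property it is meant to show is that a patch-based count grows with the number of cameras and their resolution, while a triplane count depends only on the size of the three feature planes.

```python
def patch_token_count(num_cameras: int, height: int, width: int, patch: int) -> int:
    """Tokens from cutting every camera image into non-overlapping patch x patch tiles."""
    return num_cameras * (height // patch) * (width // patch)


def triplane_token_count(size_x: int, size_y: int, size_z: int, patch: int) -> int:
    """Tokens from patchifying three axis-aligned feature planes (xy, xz, yz).

    Assumption: the planes are tokenized with the same non-overlapping
    patching as images. The count has no dependence on camera count or
    image resolution, mirroring the property stated in the abstract.
    """
    planes = [(size_x, size_y), (size_x, size_z), (size_y, size_z)]
    return sum((a // patch) * (b // patch) for a, b in planes)


if __name__ == "__main__":
    # Hypothetical camera rigs: 224x224 images, 16x16 patches.
    for cams in (4, 7):
        print(cams, "cameras, patch-based:", patch_token_count(cams, 224, 224, 16))
    # Hypothetical triplane grid: 64 x 64 x 16 cells, 8x8 patches.
    print("triplane:", triplane_token_count(64, 64, 16, 8))
```

With these illustrative settings, adding cameras raises the patch-based count while the triplane count stays fixed. The actual savings depend on the plane resolution and patching the authors use, which this listing does not specify.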

