CV · RO · Nov 30, 2025

TrajDiff: End-to-end Autonomous Driving without Perception Annotation

arXiv:2512.00723v1 · 5 citations · h-index: 10
Originality Highly original
AI Analysis

This work addresses the cost and scalability of building autonomous driving systems: it offers developers and researchers a perception annotation-free method that is competitive with perception-based approaches.

The paper tackles the high cost of manual perception annotation in end-to-end autonomous driving. It proposes TrajDiff, a trajectory-oriented diffusion framework that eliminates the need for perception annotation. TrajDiff achieves 87.5 PDMS on the NAVSIM benchmark and improves to 88.5 PDMS with data scaling.

End-to-end autonomous driving systems directly generate driving policies from raw sensor inputs. While auxiliary perception tasks help these systems extract effective environmental features for planning, the high cost of manual perception annotation makes perception annotation-free planning paradigms increasingly critical. In this work, we propose TrajDiff, a Trajectory-oriented BEV Conditioned Diffusion framework that establishes a fully perception annotation-free generative method for end-to-end autonomous driving. TrajDiff requires only raw sensor inputs and the future trajectory, from which it constructs Gaussian BEV heatmap targets that inherently capture driving modalities. We design a simple yet effective trajectory-oriented BEV encoder to extract the TrajBEV feature without perceptual supervision. Furthermore, we introduce the Trajectory-oriented BEV Diffusion Transformer (TB-DiT), which leverages ego-state information and the predicted TrajBEV features to directly generate diverse yet plausible trajectories, eliminating the need for handcrafted motion priors. Beyond architectural innovations, TrajDiff enables exploration of data scaling benefits in the annotation-free setting. Evaluated on the NAVSIM benchmark, TrajDiff achieves 87.5 PDMS, establishing state-of-the-art performance among all annotation-free methods. With data scaling, it further improves to 88.5 PDMS, which is comparable to advanced perception-based approaches. Our code and model will be made publicly available.
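
The abstract states that Gaussian BEV heatmap targets are built from the future trajectory alone. As a rough illustration of what such a target could look like, the sketch below places an isotropic Gaussian at each future waypoint on an ego-centered BEV grid. The function name `trajectory_to_bev_heatmap`, the grid size, the BEV range, `sigma`, and the max-combination of overlapping Gaussians are all assumptions made for this example; the abstract does not give the construction actually used in TrajDiff.

```python
import numpy as np


def trajectory_to_bev_heatmap(waypoints, grid_size=64, bev_range=32.0, sigma=1.0):
    """Rasterize future ego waypoints into a Gaussian BEV heatmap.

    Hypothetical helper, not taken from the paper.
    waypoints: (T, 2) array of (x, y) positions in meters, ego frame.
    Returns a (grid_size, grid_size) float32 array with values in [0, 1].
    """
    waypoints = np.asarray(waypoints, dtype=np.float32).reshape(-1, 2)

    # Cell centers covering [-bev_range, bev_range] along each axis.
    cell = 2.0 * bev_range / grid_size
    centers = -bev_range + cell * (np.arange(grid_size) + 0.5)
    xs, ys = np.meshgrid(centers, centers, indexing="ij")  # xs[i, j] = centers[i]

    heatmap = np.zeros((grid_size, grid_size), dtype=np.float64)
    for wx, wy in waypoints:
        # Squared distance from every cell center to this waypoint.
        d2 = (xs - wx) ** 2 + (ys - wy) ** 2
        # Element-wise max keeps each peak at 1 where Gaussians overlap (assumed choice).
        heatmap = np.maximum(heatmap, np.exp(-d2 / (2.0 * sigma ** 2)))
    return heatmap.astype(np.float32)


# Example: a short straight-ahead trajectory sampled every 4 m.
target = trajectory_to_bev_heatmap([[0.5, 0.5], [4.5, 0.5], [8.5, 0.5]])
```

Under these assumptions, a target of this kind needs no object boxes or map labels: the only supervision signal is where the ego vehicle actually drove.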

