Categorical Traffic Transformer: Interpretable and Diverse Behavior Prediction with Tokenized Latent
This work addresses the need for accurate, multimodal, and interpretable traffic models in autonomous vehicle planning and simulation, with incremental improvements in interpretability and LLM compatibility.
The paper tackles the problem of predicting diverse and interpretable traffic behaviors for autonomous vehicles by introducing the Categorical Traffic Transformer (CTT), which outputs both continuous trajectories and tokenized categorical predictions, achieving state-of-the-art accuracy and avoiding mode collapse.
Adept traffic models are critical to both planning and closed-loop simulation for autonomous vehicles (AVs), and key design objectives include accuracy, diverse multimodal behaviors, interpretability, and downstream compatibility. Recently, with the advent of large language models (LLMs), an additional desirable feature for traffic models is LLM compatibility. We present Categorical Traffic Transformer (CTT), a traffic model that outputs both continuous trajectory predictions and tokenized categorical predictions (lane modes, homotopies, etc.). The most distinctive feature of CTT is its fully interpretable latent space, which enables direct supervision of the latent variable from the ground truth during training and avoids mode collapse completely. As a result, CTT can generate diverse behaviors conditioned on different latent modes with semantic meanings while outperforming the state of the art (SOTA) on prediction accuracy. In addition, CTT's ability to input and output tokens enables integration with LLMs for common-sense reasoning and zero-shot generalization.
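
As a rough illustration of what direct supervision of an interpretable categorical latent can look like, the sketch below labels a ground-truth trajectory with a lane mode and scores the predicted mode distribution with a cross-entropy loss. It is not CTT's actual implementation: the mode vocabulary `LANE_MODES`, the labeling rule `label_lane_mode`, and the assumption that lanes are indexed left to right are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical lane-mode vocabulary (an assumption; the abstract does not list CTT's tokens).
LANE_MODES = ["keep_lane", "change_left", "change_right"]


def label_lane_mode(start_lane: int, end_lane: int) -> int:
    """Toy labeling rule: derive the mode index from the ground-truth trajectory.

    Assumes lanes are indexed left to right, so a smaller end index means a
    move to the left. A real model would use map geometry instead.
    """
    if end_lane < start_lane:
        return LANE_MODES.index("change_left")
    if end_lane > start_lane:
        return LANE_MODES.index("change_right")
    return LANE_MODES.index("keep_lane")


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def latent_cross_entropy(logits: list[float], gt_mode: int) -> float:
    """Cross-entropy between the predicted mode distribution and the ground-truth mode."""
    return -math.log(softmax(logits)[gt_mode])


# Example: the agent starts in lane 2 and ends in lane 1, so the label is "change_left".
gt_mode = label_lane_mode(start_lane=2, end_lane=1)
loss = latent_cross_entropy([0.5, 1.5, -0.3], gt_mode)
print(LANE_MODES[gt_mode], round(loss, 4))
```

Because every latent value corresponds to a labeled semantic mode, each mode receives an explicit training signal whenever it occurs in the data, which is the intuition behind the claim that supervising the latent avoids mode collapse.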