Controllable Diverse Sampling for Diffusion Based Motion Behavior Forecasting
This paper addresses trajectory prediction for autonomous vehicles in complex urban environments, offering an incremental improvement over existing methods by mitigating mode averaging and mode collapse.
The paper tackles the problem of generating diverse and scene-compliant trajectories for autonomous driving by introducing the Controllable Diffusion Trajectory (CDT) model, which integrates map and social information into a diffusion process with behavioral tokens, achieving strong performance on the Argoverse 2 benchmark.
In autonomous driving tasks, trajectory prediction in complex traffic environments requires adherence to real-world context conditions and must capture multimodal behavior. Existing methods predominantly rely on prior assumptions or generative models trained on curated data to learn road agents' stochastic behavior bounded by scene constraints. However, they often face mode averaging issues due to data imbalance and simplistic priors, and can even suffer from mode collapse due to unstable training and single ground truth supervision. As a result, existing methods lose predictive diversity and fail to adhere to scene constraints. To address these challenges, we introduce a novel trajectory generator named Controllable Diffusion Trajectory (CDT), which integrates map information and social interactions into a Transformer-based conditional denoising diffusion model to guide the prediction of future trajectories. To ensure multimodality, we incorporate behavioral tokens to direct the trajectory's modes, such as going straight, turning left, or turning right. Moreover, we incorporate the predicted endpoints as an alternative behavioral token into the CDT model to facilitate the prediction of accurate trajectories. Extensive experiments on the Argoverse 2 benchmark demonstrate that CDT excels in generating diverse and scene-compliant trajectories in complex urban settings.
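To make the idea of token-conditioned denoising concrete, the following is a minimal sketch, not the CDT implementation. It assumes a standard DDPM-style ancestral sampling update with a linear noise schedule, which the abstract does not specify. The token names, the `template` trajectories, and the `oracle_noise` function are all hypothetical: the oracle stands in for the trained Transformer denoiser (which in CDT would also receive map and social context) and simply knows a clean trajectory for each behavior, so that the effect of the token on the sampled mode is visible.

```python
import numpy as np

# Hypothetical behavioral tokens; the abstract names these three modes as examples.
BEHAVIORS = ("straight", "left", "right")


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumption, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def template(token, horizon=6):
    """Hypothetical clean (x, y) trajectory for each behavioral token."""
    s = np.linspace(0.0, 1.0, horizon)
    lateral = {"straight": 0.0, "left": 1.0, "right": -1.0}[token]
    return np.stack([s, lateral * s**2], axis=1)


def oracle_noise(x_t, alpha_bar_t, token):
    """Stand-in for a learned noise predictor eps_theta(x_t, t, token, context).

    It recovers the noise exactly by assuming the clean trajectory is the
    token's template; a trained network would have to estimate this instead.
    """
    x0 = template(token, horizon=x_t.shape[0])
    return (x_t - np.sqrt(alpha_bar_t) * x0) / np.sqrt(1.0 - alpha_bar_t)


def sample(token, num_steps=50, horizon=6, seed=0):
    """Reverse diffusion: start from Gaussian noise and denoise step by step."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal((horizon, 2))
    for t in reversed(range(num_steps)):
        eps = oracle_noise(x, alpha_bars[t], token)
        # DDPM posterior mean given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added at the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x


if __name__ == "__main__":
    for token in BEHAVIORS:
        trajectory = sample(token)
        print(token, "final lateral offset:", round(float(trajectory[-1, 1]), 3))
```

Because the oracle is exact, every sample for a given token lands on that token's template. A learned denoiser would instead yield a distribution of trajectories per token, which is where the diversity described above would come from.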