RO CVMay 12

Driving Intents Amplify Planning-Oriented Reinforcement Learning

Hengtong Lu, Victor Shea-Jay Huang, Chengmin Yang, Pengfei Jing, Jifeng Dai, Yan Xie, Benjin Zhu

arXiv:2605.1262596.7

Predicted impact top 4% in RO · last 90 daysOriginality Incremental advance

AI Analysis

For autonomous driving, this work solves the problem of preference-aligned policy optimization under mode collapse, achieving the first super-human performance on WOD-E2E.

The paper addresses mode collapse in continuous-action driving policies trained from single demonstrations, where samples cluster around the demonstrated maneuver. DIAL, a two-stage framework using intent-conditioned flow matching and multi-intent GRPO, achieves a best-of-128 RFS of 9.14, surpassing the human demonstration (8.13) and prior best (8.5), and improves held-out RFS from 7.681 to 8.211.

Continuous-action policies trained on a single demonstrated trajectory per scene suffer from mode collapse: samples cluster around the demonstrated maneuver and the policy cannot represent semantically distinct alternatives. Under preference-based evaluation, this caps best-of-N performance -- even oracle selection cannot recover what the sampling distribution does not contain. We introduce DIAL, a two-stage Driving-Intent-Amplified reinforcement Learning framework for preference-aligned continuous-action driving policies. In the first stage, DIAL conditions the flow-matching action head on a discrete intent label with classifier-free guidance (CFG), which expands the sampling distribution along distinct maneuver modes and breaks single-demonstration mode collapse. In the second stage, DIAL carries this expanded distribution into preference RL through multi-intent GRPO, which spans all intent classes within every preference group and prevents fine-tuning from re-collapsing around the currently preferred mode. Instantiated for end-to-end driving with eight rule-derived intents and evaluated on WOD-E2E: competitive Vision-to-Action (VA) and Vision-Language-Action (VLA) Supervised Finetuning (SFT) baselines plateau below the human-driven demonstration at best-of-128, with the strongest prior (RAP) capping at Rater Feedback Score (RFS) 8.5 even with best-of-64; intent-CFG sampling lifts this ceiling to RFS 9.14 at best-of-128, surpassing both the prior best (RAP 8.5) and the human-driven demonstration (8.13) for the first time; and multi-intent GRPO improves held-out RFS from 7.681 to 8.211, while every single-intent baseline peaks lower and degrades by training end. These results suggest that the bottleneck of preference RL on continuous-action policies trained from demonstrations is not only how to update the policy, but to expand and preserve the sampling distribution being optimized.

View on arXiv PDF

Similar