Optimization-Guided Diffusion for Interactive Scene Generation
This work addresses the need for safety-critical scenarios in autonomous vehicle testing by offering a controllable and realistic scene generation method. It is incremental, however, in that it builds on existing diffusion models and adds optimization-based guidance.
The paper tackles the problem of generating realistic and diverse multi-agent driving scenes for autonomous vehicle evaluation. It introduces OMEGA, an optimization-guided diffusion framework that enforces physical and social constraints during sampling. Reported improvements include raising the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% and generating 5× more near-collision frames.
Realistic and diverse multi-agent driving scenes are crucial for evaluating autonomous vehicles, but safety-critical events, which are essential for this task, are rare and underrepresented in driving datasets. Data-driven scene generation offers a low-cost alternative by synthesizing complex traffic behaviors from existing driving logs. However, existing models often lack controllability or yield samples that violate physical or social constraints, limiting their usability. We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling from a scene generation model. OMEGA re-anchors each reverse diffusion step via constrained optimization, steering the generation towards physically plausible and behaviorally coherent trajectories. Building on this framework, we formulate ego-attacker interactions as a game-theoretic optimization in the distribution space, approximating Nash equilibria to generate realistic, safety-critical adversarial scenarios. Experiments on nuPlan and Waymo show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% for free exploration capabilities, and from 11% to 80% for controllability-focused generation. Our approach can also generate 5× more near-collision frames with a time-to-collision under three seconds while maintaining the overall scene realism.
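
The near-collision criterion above (time-to-collision under three seconds) can be illustrated with a small sketch. This is not the paper's evaluation code: it assumes each agent is a disc moving at constant velocity, and the function names, the combined radius of 2.0 m, and the example positions are all hypothetical choices made for illustration.

```python
import math


def time_to_collision(p_a, v_a, p_b, v_b, radius=2.0):
    """Earliest t >= 0 at which two agents come within `radius` of each other.

    Assumption (not from the paper): agents are discs with a combined
    radius `radius` and keep their current velocity. Returns math.inf
    if they never get that close.
    """
    # Relative position and velocity of agent b with respect to agent a.
    rx, ry = p_b[0] - p_a[0], p_b[1] - p_a[1]
    vx, vy = v_b[0] - v_a[0], v_b[1] - v_a[1]

    # Solve |r + v t|^2 = radius^2, i.e. a t^2 + b t + c = 0.
    a = vx * vx + vy * vy
    b = 2.0 * (rx * vx + ry * vy)
    c = rx * rx + ry * ry - radius * radius

    if c <= 0.0:
        return 0.0  # already overlapping
    if a == 0.0 or b >= 0.0:
        return math.inf  # no relative motion, or moving apart
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return math.inf  # paths pass without touching
    return (-b - math.sqrt(disc)) / (2.0 * a)


def is_near_collision(ttc, threshold_s=3.0):
    """Flag a frame as near-collision when TTC is below the threshold."""
    return ttc < threshold_s


# Hypothetical frame: ego drives at 10 m/s toward a stopped agent 22 m ahead.
ttc = time_to_collision((0.0, 0.0), (10.0, 0.0), (22.0, 0.0), (0.0, 0.0))
print(ttc, is_near_collision(ttc))
```

Counting the frames for which `is_near_collision` holds, over all agent pairs in a generated scene, gives one plausible way to tally near-collision frames; the paper may define the agent geometry and the pairing differently.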