Safe2Drive: Evaluating Safe Driving Behaviors of E2E Autonomous Driving Models
For researchers developing end-to-end autonomous driving systems, this work highlights that current models lack robust safe-driving reasoning in common hazardous scenarios, despite high performance on standard benchmarks.
Safe2Drive introduces 100 safety-critical driving scenarios and a new metric, SafeDriving Score (SDS), to evaluate end-to-end autonomous driving policies. Two state-of-the-art models (LEAD and SimLingo) show sharp drops in driving score (e.g., LEAD from 94.70 DS on Bench2Drive to 39.95 DS on Safe2Drive) and low SDS (11.85 and 15.27, respectively), revealing brittle safe-driving behaviors.
Recent end-to-end (E2E) autonomous driving policies achieve high driving scores in closed-loop simulations. Yet it remains unclear whether these policies handle common safety-critical scenarios. We present Safe2Drive (S2D), a set of Bench2Drive-aligned scenario extensions focused on three frequent families of road hazards: work zones, pedestrian jaywalking, and occluded vulnerable road users (VRUs). Safe2Drive adds 100 common but challenging scenarios and introduces SafeDriving Score (SDS), a safety-centric metric that augments prior evaluators with checks for pre-crash braking, contact with work-zone objects, lane centering, and smoothness. Evaluating two state-of-the-art policies (LEAD and SimLingo) on S2D, we find that their driving scores drop sharply relative to their reported Bench2Drive baselines (LEAD: from 94.70 DS on Bench2Drive to 39.95 DS on S2D; SimLingo: from 85.07 DS on Bench2Drive to 41.00 DS on S2D) and that SDS on S2D is low (11.85 for LEAD and 15.27 for SimLingo). These results are consistent with brittle safe-driving behaviors such as poor work-zone understanding, red-light violations, and late or absent braking for pedestrians. This study highlights a lack of safe behavioral reasoning in E2E models even when they are tested on CARLA towns that are part of the training set. We plan to release the code and videos for all 100 S2D scenarios.
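To put the reported drops on a common footing, the short sketch below computes the absolute and relative decrease in driving score for each model, using only the four DS values quoted above. The names `REPORTED_DS` and `ds_drop` are hypothetical helpers introduced for illustration; they are not part of the Safe2Drive code, and the sketch does not attempt to reproduce how DS or SDS is computed.

```python
# Driving scores (DS) quoted in the abstract: (Bench2Drive baseline, S2D).
# REPORTED_DS and ds_drop are illustrative names, not Safe2Drive APIs.
REPORTED_DS = {
    "LEAD": (94.70, 39.95),
    "SimLingo": (85.07, 41.00),
}


def ds_drop(baseline, s2d):
    """Return (absolute drop in DS points, relative drop in percent)."""
    absolute = baseline - s2d
    relative = 100.0 * absolute / baseline
    return absolute, relative


for model, (baseline, s2d) in REPORTED_DS.items():
    absolute, relative = ds_drop(baseline, s2d)
    print(f"{model}: -{absolute:.2f} DS ({relative:.1f}% of baseline)")
```

Under these numbers, LEAD loses 54.75 DS points (about 57.8% of its Bench2Drive score) and SimLingo loses 44.07 points (about 51.8%), so both policies give up more than half of their reported baseline score on S2D.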