CV AI RO | Jun 21, 2025

DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving

Mihir Godbole, Xiangbo Gao, Zhengzhong Tu

arXiv:2506.17590v2 | 10.25 citations | h-index: 7 | Has Code | 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Originality: Incremental advance

AI Analysis

This work addresses safety-critical decision-making for autonomous vehicles in urban scenarios, but the advance is incremental because it builds on existing vision-language models and datasets.

The paper tackles the problem of fine-grained intent prediction and risk reasoning for vulnerable road users in autonomous driving by introducing DRAMA-X, a benchmark with 5,686 frames and nine-class intent labels, and shows that scene-graph-based reasoning improves intent prediction and risk assessment.

Understanding the short-term motion of vulnerable road users (VRUs) like pedestrians and cyclists is critical for safe autonomous driving, especially in urban scenarios with ambiguous or high-risk behaviors. While vision-language models (VLMs) have enabled open-vocabulary perception, their utility for fine-grained intent reasoning remains underexplored. Notably, no existing benchmark evaluates multi-class intent prediction in safety-critical situations. To address this gap, we introduce DRAMA-X, a fine-grained benchmark constructed from the DRAMA dataset via an automated annotation pipeline. DRAMA-X contains 5,686 accident-prone frames labeled with object bounding boxes, a nine-class directional intent taxonomy, binary risk scores, expert-generated action suggestions for the ego vehicle, and descriptive motion summaries. These annotations enable a structured evaluation of four interrelated tasks central to autonomous decision-making: object detection, intent prediction, risk assessment, and action suggestion. As a reference baseline, we propose SGG-Intent, a lightweight, training-free framework that mirrors the ego vehicle's reasoning pipeline. It sequentially generates a scene graph from visual input using VLM-backed detectors, infers intent, assesses risk, and recommends an action using a compositional reasoning stage powered by a large language model. We evaluate a range of recent VLMs, comparing performance across all four DRAMA-X tasks. Our experiments demonstrate that scene-graph-based reasoning enhances intent prediction and risk assessment, especially when contextual cues are explicitly modeled.
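
The sequential structure described in the abstract (scene graph, then intent, then risk, then action) can be illustrated with a minimal Python sketch. This is not the authors' SGG-Intent implementation: every class name, function name, field, and label below (including the `"moving_toward_ego"` intent and the `"slow down"` action) is a hypothetical placeholder, and the stub stages stand in for the VLM-backed detectors and the LLM reasoning stage, whose actual interfaces are not given here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class SceneObject:
    label: str                               # e.g. "pedestrian" (assumed label)
    box: Tuple[float, float, float, float]   # assumed (x1, y1, x2, y2) layout
    intent: str = "unknown"                  # placeholder, not the real nine-class taxonomy


@dataclass
class SceneGraph:
    objects: List[SceneObject] = field(default_factory=list)
    # (subject index, relation name, object index); hypothetical encoding
    relations: List[Tuple[int, str, int]] = field(default_factory=list)


@dataclass
class FrameResult:
    graph: SceneGraph
    risk: bool      # binary risk, as in the benchmark description
    action: str     # suggested ego-vehicle action


def run_pipeline(
    frame: Any,
    build_graph: Callable[[Any], SceneGraph],
    infer_intent: Callable[[SceneObject, SceneGraph], str],
    assess_risk: Callable[[SceneGraph], bool],
    suggest_action: Callable[[SceneGraph, bool], str],
) -> FrameResult:
    """Run the four stages in order; each stage is supplied by the caller."""
    graph = build_graph(frame)
    for obj in graph.objects:
        obj.intent = infer_intent(obj, graph)
    risk = assess_risk(graph)
    action = suggest_action(graph, risk)
    return FrameResult(graph=graph, risk=risk, action=action)


# Stub stages for illustration only; a real system would call models here.
def demo_build_graph(frame: Any) -> SceneGraph:
    return SceneGraph(objects=[SceneObject("pedestrian", (10.0, 20.0, 50.0, 120.0))])


def demo_infer_intent(obj: SceneObject, graph: SceneGraph) -> str:
    return "moving_toward_ego"  # hypothetical intent label


def demo_assess_risk(graph: SceneGraph) -> bool:
    return any(o.intent == "moving_toward_ego" for o in graph.objects)


def demo_suggest_action(graph: SceneGraph, risk: bool) -> str:
    return "slow down" if risk else "continue"


result = run_pipeline(
    None, demo_build_graph, demo_infer_intent, demo_assess_risk, demo_suggest_action
)
print(result.risk, result.action)
```

Passing each stage in as a callable keeps the sketch independent of any particular detector or language model, which loosely matches the training-free, compositional design the abstract describes.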

